About this post
ABOUT: This entry was posted February 2, 2009 at 7:32 p.m. It is 710 words long, which, in case you're curious, translates to about 20 inches.
SUMMARY: From rows and columns to libraries of text.
TAGS: Data mining
Text is the new data
Structured data is so 2006.
Or at least it should be. I look at 2005 as the year when the online database revolution first started to pick up steam. It hadn't quite reached critical mass, but the case that data is valuable in its own right was finally finding some ears.
Now here we are, a few years later, and newspapers have embraced the database. You can argue that they haven't made the best use of data, or that data ghettos and salacious Caspio databases aren't the way to go, but at least reporters and editors know what databases are -- and more important, why they are valuable.
But looking ahead, we have a new opportunity: unstructured data -- all that stuff stored in documents, speeches, interviews and our own archives. In terms of utility, moneymaking potential, and our role in shaping democracy, text is the new data.
Gathering momentum
Am I crazy? Long before databases were in vogue, journalism had a dedicated community of practitioners -- NICAR -- that not only understood and advocated for their value, but knew how to manipulate them. No such community has formed around unstructured data, but the foundations are being set.
At its annual conference in Indianapolis, NICAR this year will for the first time hold several panels on honest-to-goodness text mining. The topic has been explored in conferences past, but never with the gusto planned for 2009.
In addition, I have spoken with at least one large newspaper company that is laying infrastructure at the corporate level that could facilitate large-scale text-mining of news content. Whether and how they will use it, I have no idea. But the prospect is exciting.
And let's not forget The New York Times, which has used text-mining tools internally for years. Anyone who watched them demonstrate FAST-based tools several years ago knows how impressive their operation is. The work their online group has done visualizing speeches in recent months has been just as strong, and the proposed NYT/ProPublica-backed DocumentCloud project aims to find a standards-based solution to the longtime problem of document collection.
The OpenCalais project, which performs the critical but complex task of entity extraction as a Web service, is being driven by another media company: Thomson Reuters. And exciting tools that could simplify high-end text processing are springing forth from academic disciplines like the digital humanities and library science. Even data entry and document processing as human services have gotten cheaper.
New opportunities
What this means for media companies is that new opportunities for innovation are opening up, with new potential to make money and serve the public good. Anyone with a solid grasp of technology and a desire to stay ahead of the curve should take note.
Some of the questions worth asking: Is there data within our archives that could be useful for holding public officials accountable? Maybe promises they've made? Quotes and public statements? What about advertisers: Is there anything buried in our archives that could be used to gauge trends, show useful relationships within a community, or in other ways help them -- and us -- make money?
From a reporting standpoint, what is the most effective way to harvest all the documents we collect on a daily basis? Is there a way to seamlessly integrate document, or even interview, digitization into the daily workflow? How would we go about finding patterns in those large document storehouses? How could the public make use of our source material? Could a sufficiently large and intelligently mined warehouse of documents, stories, interviews and other material collectively augment the institutional memory of a news organization?
Granted, these questions are a bit bigger than "How do we drive traffic with our new whiz-bang database?" but that's the point. We're sitting on tremendous amounts of data -- archives, documents, interviews -- that we aren't even trying to use. The conversation about how we could use unstructured data has begun in fits and starts, but I hope it soon builds the same momentum that the structured data push did. I hate thinking we might use more of our innovation muscle on searchable databases than on exploring rich topics like this.
If we don't do it, someone else will.
