About this post
ABOUT: This entry was posted January 25, 2009 at 9:26 p.m. It is 765 words long, which, in case you're curious, translates to about 21 inches.
SUMMARY: Making the case for applying emerging technologies.
TAGS: Data mining
Text mining: more than just a neat new toy
It's not lost on me that I sometimes think backwards. I find a new toy and then justify reasons to use it. This text mining thing could look like another case of the same: I went to a conference, saw a cool new hammer, and now everything looks like a nail.
But this is different. And I'll tell you why.
The currency of journalism is text. We take notes, read documents, write stories. Almost everything we do, everything we work with and everything we produce comes in the form of unstructured text. Sure, databases and multimedia make up a growing share of our content, but we remain -- for better or worse -- a primarily text-driven enterprise.
As plenty of people have argued before, most of the information within our text collections never gets used. That goes for both the reporting and research end and our final product. Tragic, right?
Think about our reporting process: We go out, harvest a bunch of information, pick out the best bits and distill them into a story. At best, the rest gets stored in a reporter's memory -- a valuable but limited storage device that is prone to a) forgetting and b) being laid off. At worst, that information gets discarded altogether.
Contained within all those documents, all those interviews, is useful data. Duh. But it's not just the contents of the documents or the words said in the interviews; it's metadata as well: The names of people mentioned in a report, connected to a topic, to organizations, and to each other. Or the mood and tone of an interview as related to an event, an area of town, or even the weather. These bits of data have value in themselves, but they have even more value together, gathered over weeks, years or decades.
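To make the metadata idea concrete, here is a minimal Python sketch that pulls name candidates out of a pile of notes and records which documents mention each one. Everything in it is an assumption for illustration: the notes, their ids and the capitalized-words rule are made up, and a real project would swap in a proper named-entity tool.

```python
import re
from collections import defaultdict

# Crude heuristic (an assumption, not a real entity extractor):
# two or more capitalized words in a row count as a name candidate.
NAME_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b")

# Hypothetical reporter's notes, keyed by made-up document ids.
NOTES = {
    "budget-memo": "Ann Lee told the council that Bob Ortiz opposed the plan.",
    "interview-07": "Bob Ortiz said the vote was rushed.",
}


def index_names(documents):
    """Map each candidate name to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for name in NAME_PATTERN.findall(text):
            index[name].add(doc_id)
    return index


name_index = index_names(NOTES)
for name, doc_ids in sorted(name_index.items()):
    print(name, "->", sorted(doc_ids))
```

The heuristic will happily flag "City Hall" as a person and miss single-word names, which is exactly the kind of gap the purpose-built tools are meant to close.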
We haven't yet figured out how to collect all that stuff, which is a problem on its own. But just as there are now tools for that sort of thing, the tools for annotating and analyzing this unstructured data have made great strides as well. Most of us just haven't taken note of them. That's what SEASR is all about.
Let's look at another example: Newspaper archives. I'll be writing a lot more about this later, but think, on a very basic level, of what archives represent. If a story is a selective distillation of facts collected during the reporting process, an archive is a newspaper's collective memory -- a storehouse of institutional knowledge that in some cases goes back more than 100 years. It's a selective and imperfect memory, to be sure, but it's the best we have. And the data contained within it is extremely valuable.
It used to be that if we wanted to find all the stories mentioning Mayor Joe Smith, all we could do was run a search. And that's still fine -- it gives us 95 percent of what we want in a format we're used to. But what if we could use that archive to construct a network of the mayor's supporters, or to plot the events that brought him from obscurity to public prominence? These are the kinds of things the best beat reporters know off the top of their heads. But hints of those patterns -- and others we haven't thought of -- are buried in work we've already done. We just need to find them.
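As a rough illustration of the network and timeline ideas, the Python sketch below counts how often pairs of people turn up in the same story and lists the dates on which one person is mentioned. The stories, dates and watch list are invented for the example, and plain substring matching stands in for real name extraction, so treat it as a toy under those assumptions rather than a working archive tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical archive; in practice this would come from the story database.
STORIES = [
    {"date": "1998-03-02",
     "text": "Joe Smith announced his council run with backing from Ann Lee."},
    {"date": "2001-11-07",
     "text": "Mayor Joe Smith thanked Ann Lee and Bob Ortiz on election night."},
    {"date": "2004-06-15",
     "text": "Bob Ortiz criticized the school budget."},
]

# Hypothetical watch list of names to look for.
NAMES = ["Joe Smith", "Ann Lee", "Bob Ortiz"]


def co_mentions(stories, names):
    """Count how often each pair of names appears in the same story."""
    pairs = Counter()
    for story in stories:
        present = sorted(n for n in names if n in story["text"])
        pairs.update(combinations(present, 2))
    return pairs


def timeline(stories, name):
    """Return the dates of stories mentioning `name`, oldest first."""
    return sorted(s["date"] for s in stories if name in s["text"])


network = co_mentions(STORIES, NAMES)
smith_dates = timeline(STORIES, "Joe Smith")
print(network.most_common())
print(smith_dates)
```

A shared byline is not the same as support, of course. Co-mention counts only point at relationships worth a reporter's second look.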
A few years ago, I would have written this off on complexity alone. Data mining and machine learning were too difficult. You didn't need a PhD in computer science to understand the basics, but you probably did need a master's. Either that or a whole lot of free time (which I'm sure we've all been swimming in lately).
But things have changed. And that's what this workshop, to some extent, was about. You can debate the pros and cons of tools like SEASR, but its existence alone proves one critical point: These text and data mining tools are getting simpler. They're becoming more accessible. Those of us like me -- hobbyist geeks with bachelor's degrees -- can start to understand and apply them.
We don't need to invent the next Google, but we need to understand the possibilities created by emerging technologies and be creative in the ways we apply them. For me, that's advancing our public service mission. For you, that might be fixing journalism's broken business model.
Whatever it is, we need to keep our eyes open. Just using the same tools and the same methods will hold us back.
