About this post

ABOUT: This entry was posted January 19, 2009 at 7:29 p.m. It is 703 words long, which, in case you're curious, translates to about 20 inches. There are currently 88 comments on this post.

SUMMARY: Starting a conversation on text mining in journalism.

TAGS: Data mining


SEASR, text mining and journalism

Posted Monday, January 19th, 2009 at 7:29 p.m.

The idea of extracting structured data from unstructured data isn't new, even in the arcane world of journalism. Several years ago, Adrian Holovaty made a strong case for journalism microformats, which unfortunately haven't yet caught on. Others, such as Derek Willis, have made the argument from the Big Ideas level.

Last week, I spent two days at the National Center for Supercomputing Applications at the University of Illinois playing with software that could go a long way toward solving the problem. Somehow, I was lucky enough to join Brant Houston (now a Knight Chair at UI), Jennifer LaFleur (ProPublica), David Donald (The Center for Public Integrity) and Jaimi Dowdell (IRE/NICAR) at a digital humanities workshop focused on a software framework known as SEASR.

It was a blast -- not so much because of the software itself, but as a formal introduction to a research area that has been doing very exciting things with unstructured data. I've been playing with this stuff for years, but the time I spent in Illinois pounded these concepts into my head better than any book I've ever read or any tinkering I've ever done.

Given that, and a couple of inspiring conversations I've had lately, I've been thinking a lot about how these technologies (and others) could help solve some of the problems that keep us journo-geeks up at night.

But first, the basics.

SEASR

SEASR is a framework of language analysis tools designed for humanities and social science researchers. Sitting on top of it is an application known as Meandre, which provides a Web-based visual programming environment for building series of tasks known as "flows". These flows can process both the rows-and-columns data we're used to and globs of unstructured text, such as books, articles and speeches.

Flows are made of components -- modules that perform specific tasks, such as opening a URL, importing a CSV file, tagging parts of speech in a document, or producing a visualization based on standardized data. Together, the components can do simple things, such as produce a tag cloud, or more complex tasks, like analyzing clusters of words and themes in large bodies of text. In programming terms, they're a lot like a chain of functions.
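To make the "chain of functions" comparison concrete, here's a rough Python sketch of a tag-cloud-style flow. This is not SEASR or Meandre code: the component names, the tiny stopword list and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical stopword list; a real flow would use a much longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}

def tokenize(text):
    """Component 1: split raw text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(words):
    """Component 2: drop common words that carry little meaning."""
    return [w for w in words if w not in STOPWORDS]

def count_words(words):
    """Component 3: tally how often each remaining word appears."""
    return Counter(words)

def run_flow(data, components):
    """Feed the output of each component into the next one."""
    return reduce(lambda value, component: component(value), components, data)

# A "flow" here is just an ordered list of components.
tag_cloud_flow = [tokenize, remove_stopwords, count_words]

speech = "The budget is the priority. The council approved the budget for the schools."
counts = run_flow(speech, tag_cloud_flow)
print(counts.most_common(3))
```

Swapping a component in or out (say, adding a step that merges plurals) changes the flow without touching the other steps, which is roughly the appeal of the visual approach.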

The software was built to help scholars in fields such as digital humanities, where researchers use computers to look for patterns and outliers in free text (journalists could learn a lot from these guys). Because it was designed for researchers, not computer scientists, it's probably one of the most accessible toolkits out there for text analysis.

But how can it help journalists?

Being steeped in rows and columns, computer-assisted reporting (CAR) has its limitations. For one, we need our data to be organized in -- you guessed it -- rows and columns. But think of all the data that isn't kept that way: information in speeches or city council agendas, for example.
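As a small illustration of that gap, here's a sketch that pulls rows and columns out of a few lines of agenda text using regular expressions. The agenda, the "Item N:" layout and the vendor names are invented, and real agendas are rarely this tidy, so treat the patterns as assumptions rather than a general-purpose parser.

```python
import re

# Made-up agenda text; the "Item N: ..." layout is an assumption.
AGENDA = """\
Item 4: Approve contract with Acme Paving for $125,000.
Item 5: Authorize payment of $3,400.50 to Smith Consulting.
Item 6: Discuss park hours.
"""

# One agenda item per line: a number, a colon, then the description.
ITEM_PATTERN = re.compile(r"^Item (\d+): (.+)$", re.MULTILINE)
# A dollar amount with optional thousands separators and cents.
MONEY_PATTERN = re.compile(r"\$([\d,]+(?:\.\d{2})?)")

def parse_agenda(text):
    """Turn free-text agenda items into a list of row-like dictionaries."""
    rows = []
    for number, description in ITEM_PATTERN.findall(text):
        match = MONEY_PATTERN.search(description)
        # Items without a dollar figure get None instead of a guess.
        amount = float(match.group(1).replace(",", "")) if match else None
        rows.append({"item": int(number), "description": description, "amount": amount})
    return rows

rows = parse_agenda(AGENDA)
for row in rows:
    print(row["item"], row["amount"], row["description"])
```

Rules like these break as soon as the clerk changes the format, which is the limitation that makes the more flexible text-mining tools attractive.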

Journalists have done some work in this area, building visualizations like tag clouds, and parsing information from unstructured text using regular expressions and other rule-based processors. The NY Times has some fascinating in-house tools that pull names from documents and automatically run them through databases. But think of what more could be possible:

Mixing named entity extraction with language analysis, algorithms could comb through public statements, policy papers and newspaper archives to extract every promise a politician has ever made. Other techniques could measure public sentiment across blog posts, Twitter messages, newspaper comments and more -- a new way of taking the public pulse that might be more accurate than man-on-the-street interviews.
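For a sense of what the crudest version of the promise idea might look like, here's a Python sketch that flags sentences containing a few commitment phrases. It is a rule-based stand-in, not real named entity extraction or language analysis: the cue phrases are a guess, the speakers and quotes are invented, and the speaker is supplied by hand rather than detected.

```python
import re

# Hypothetical cue phrases; a real system would need far more than these.
COMMITMENT_CUES = re.compile(
    r"\b(?:I|we) (?:will|pledge to|promise to)\b", re.IGNORECASE
)

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_promises(statements):
    """Return (speaker, sentence) pairs for sentences that look like promises.

    `statements` is a list of (speaker, text) pairs. Attaching the speaker
    by hand sidesteps the entity-extraction step entirely.
    """
    promises = []
    for speaker, text in statements:
        for sentence in split_sentences(text):
            if COMMITMENT_CUES.search(sentence):
                promises.append((speaker, sentence))
    return promises

# Invented speakers and quotes, for illustration only.
statements = [
    ("Mayor Example", "Crime is down this year. I will hire 50 new officers by June."),
    ("Councilor Sample", "We promise to repave Main Street. The budget is tight."),
]

promises = find_promises(statements)
for speaker, sentence in promises:
    print(f"{speaker}: {sentence}")
```

A sketch like this misses promises phrased any other way and flags plenty of non-promises ("I will not comment"), which is exactly where statistical language analysis would have to take over.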

The group of us who attended the workshop is working on applying the tools to a more conventional investigative project, which we will share more details about in time.

What's next

I could drone on for hours about the possibilities here (and I probably will over the coming weeks), but the takeaway is that tools for these sophisticated analyses are quickly becoming more accessible. Hopefully you'll see some examples at the NICAR conference in Indy.

Take a look at some of the links above and let me know what you think: Got any pie-in-the-sky ideas?
