About this post
ABOUT: This entry was posted January 2, 2007 at 9 p.m. It is 598 words long, which, in case you're curious, translates to about 17 inches.
SUMMARY: Why CAR specialists should consider learning the basics of data-mining.
TAGS: Data mining | Databases
A data mining discussion
Note: This article was originally posted in February 2006
I’ve been meaning to blog about data mining for a long time. The opportunities it presents and the stories it can reveal have convinced me that it should be one of the next big movements in analytic journalism. But except for sparse mentions on NICAR-L, the CAR community doesn’t seem to have mentioned it much. Not that I blame anyone – I slogged through a book on the subject and I still don’t get it – but I can't help but think that journalists will be missing out if this opportunity passes without a more vigorous discussion.
So what is data mining? Hitch a few academic definitions together and you pretty much come up with: the extraction of useful and previously unknown information from large datasets, often using automation (as opposed to querying the data explicitly to, say, test a hypothesis). The tools of the trade are database managers, programming languages, and, ideally, statistical suites like SAS and SPSS. Businesses use it to see who buys what. Cops use it to locate crime hotspots. The government says it has used it to identify terrorists. All the while, journalists have been missing out.
I wonder why. Eggheads practically lump it in with robot building and rocket science in terms of complexity: “It is a new discipline, lying at the intersection of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas,” according to David Hand, Heikki Mannila and Padhraic Smyth, in their book Principles of Data Mining.
Granted, I’m no expert – I understood one out of every three words those people wrote – but I’m convinced that data mining doesn’t have to sound so intimidating. After all, at its most basic level, it is simply the act of telling a computer how to do things we already know how to do. Exploratory data analysis, for example, can be used to automate the calculation of basic descriptive statistics – mean, variance, range – and, without getting too much more complicated, perform correlations and generate stuff like boxplots. It’s nothing many of us haven’t done; the difference is in the implementation.
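To make that a little more concrete, here is a rough sketch in Python of what automating those basics might look like. It is only an illustration, not a recommended toolkit: the function names (describe, pearson, explore) and the sample columns are made up for this example, and it uses only Python's standard library rather than a statistical suite.

```python
import statistics

def describe(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "range": max(values) - min(values),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)

def explore(table):
    """Summarize every column, then correlate every pair of columns.

    `table` is a hypothetical layout: a dict mapping a column name
    to a list of numbers, one entry per record.
    """
    summaries = {name: describe(column) for name, column in table.items()}
    names = sorted(table)
    correlations = {
        (a, b): pearson(table[a], table[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    return summaries, correlations

# Invented sample data, purely for illustration.
table = {
    "burglaries": [2, 4, 6, 8],
    "vacant_lots": [1, 2, 3, 4],
}
summaries, correlations = explore(table)
```

The point is that explore never needs to be told which columns to look at. Hand it a table with fifty columns and it will summarize and cross-check all of them, which is the "implementation" difference in a nutshell.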
Applications abound. In a 2000 post to NICAR-L, Steve Doig raised the question of whether journalists could have uncovered the faulty Ford Explorer tire controversy earlier with the help of data mining. Of more immediate relevance is the upcoming Census, which data mining was practically conceived for. On the local level, say you’re a tech-savvy cops reporter blessed with a consistently updated database of city crime. Write a handful of good algorithms, and presto: the computer is looking for hotspots, flare-ups, and correlations on its own. You save time. You have quick access to larger quantities of meaningful information. The computer might find patterns that are too obscure or complex to find by hand. And bias and error are isolated within the construction of the algorithms.
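For the cops-reporter scenario, one of those "good algorithms" could be as simple as the sketch below. It assumes a made-up data layout (weekly incident counts per police beat, oldest week first) and a made-up rule of thumb: flag a beat when its latest week sits at least two standard deviations above its own history. Real hotspot detection is more involved, so treat this as a toy.

```python
import statistics

def flag_flareups(weekly_counts, threshold=2.0):
    """Return beats whose latest weekly count is unusually high.

    `weekly_counts` is a hypothetical structure: a dict mapping a beat
    name to a list of weekly incident counts, oldest week first.
    The value returned for each flagged beat is how many standard
    deviations the latest week sits above that beat's earlier weeks.
    """
    flagged = {}
    for beat, counts in weekly_counts.items():
        history, latest = counts[:-1], counts[-1]
        if len(history) < 2:
            continue  # not enough history to judge what is normal
        typical = statistics.mean(history)
        # If every earlier week was identical, the deviation is 0;
        # fall back to 1.0 (an arbitrary choice) to avoid dividing by zero.
        spread = statistics.pstdev(history) or 1.0
        score = (latest - typical) / spread
        if score >= threshold:
            flagged[beat] = score
    return flagged

# Invented counts, purely for illustration.
weekly_counts = {
    "Beat 1": [3, 4, 3, 4, 12],
    "Beat 2": [5, 6, 5, 6, 6],
}
flareups = flag_flareups(weekly_counts)
```

Run against a database that updates every week, something like this would surface Beat 1 without anyone having to think to ask about it. Note that the threshold is a judgment call baked into the code, which is exactly where the bias and error mentioned above would live.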
I am clearly oversimplifying things, but I hope to hash out the more complex stuff in the coming months. If anyone is interested, a few Web sites out there offer solid (although still dense) introductions for the uninitiated, and I’ve found them to be amazingly helpful: KDnuggets and StatSoft, to name a couple. I’ll keep trying to figure this stuff out, but in the meantime, I’m very interested in what people think. Is true data mining worth the time it takes to learn? Has anyone already implemented it in the newsroom? Drop me an e-mail or post a comment if you have a thought.
