Digging through documents during investigative reporting projects just got easier thanks to a new version of an open-source project from the Associated Press.
Dubbed Overview, the software can help journalists sift through thousands of PDF files to find patterns obscured by the sheer volume of information. Although a downloadable prototype of Overview has been available since early 2012, developers timed the launch of a more user-friendly Web version to coincide with this year’s Online News Association Conference, which begins Thursday.
How it works
- Ingram, Stephen; Munzner, Tamara; Stray, Jonathan. 2012. “Hierarchical Clustering and Tagging of Mostly Disconnected Data.” University of British Columbia Department of Computer Science Technical Report. TR-2012-01.
Journalists who regularly deal with public record requests often have more documents than they know what to do with, and investigative reporters would benefit from a more efficient way to parse these big blocks of text in search of hidden stories.
Machines are already really good at solving small pieces of this problem. They can find keywords quickly, provided you know what you’re looking for. They can tell you which words or phrases appear most often overall, and can even translate that frequency into the much-maligned word cloud. They can try to extract names and other entities from the text, often poorly. But none of these methods help much when it comes to discovering novel, interesting and newsworthy things.
So two years ago, Jonathan Stray and his team set out to address that problem. Using his research background in computer science, Stray developed a prototype capable of analyzing the text extracted from PDFs and automatically sorting documents into digital piles according to their content. The program works on the assumption that the more words and phrases a group of documents has in common, the more closely those documents are related.
In this way, Overview’s algorithms can quickly construct a map of the major topics contained inside a collection of documents that would take days or weeks to read one by one.
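The sorting idea described above can be illustrated with a minimal sketch. This is not Overview's actual code: it assumes a common text-mining approach (TF-IDF word weights compared with cosine similarity) and a simple greedy grouping rule, and the function names and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Weight each word by how often it appears in a document,
    discounted by how many documents contain it."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n = len(docs)
    return [
        {w: tf * math.log(n / doc_freq[w]) for w, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_documents(docs, threshold=0.1):
    """Greedy grouping (an assumption for illustration): each document
    joins the first pile whose first member is similar enough,
    otherwise it starts a new pile. Returns lists of document indices."""
    vectors = tfidf_vectors(docs)
    piles = []
    for i, vec in enumerate(vectors):
        for pile in piles:
            if cosine(vec, vectors[pile[0]]) >= threshold:
                pile.append(i)
                break
        else:
            piles.append([i])
    return piles


if __name__ == "__main__":
    sample = [
        "police radio network outage reported downtown",
        "budget meeting agenda for city council",
        "radio network outage affects police patrol cars",
        "city council budget vote delayed",
    ]
    print(group_documents(sample))
```

In this toy example, the two documents about the radio outage share several words and land in one pile, while the two budget documents land in another. A real collection would need a far more careful grouping step than this greedy pass.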
“It’s basically a tool to organize a pile of documents. It takes the documents and tries to break them into topic clusters,” Stray said in a phone interview last week. “It tackles the mountain of paper.”
Users can then tag swaths of documents with relevant, contextual phrases for further refinement, something a computer would find all but impossible. Stray used Overview to make sense of about 4,500 pages of declassified Iraq war logs, while the Tulsa World’s Jarrel Wade put it to work analyzing 8,000 emails from the local police department. Wade found the department “purchased millions of dollars of under-powered and under-tested computer hardware, resulting in a multitude of problems.”
The emails included complaints from officers in the field who were unable to get basic police information about dangerous calls while en route to scenes, as well as reports of network dead spots around town that officers were avoiding entirely.
Although the topics are vastly different, Stray said the size of these document collections made them ideal candidates for the software.
“Overview doesn’t get interesting until you have at least a few dozen documents,” he said.
The prototype could handle up to about 20,000 documents, enough for the most common reporting projects.
“We’ve found with most reporters, their actual document sets are in the range of a few thousands,” Stray said. “That’s the sweet spot.”
As powerful as the prototype was, Stray said Overview suffered from its involved setup procedure. Users had to employ the command line and download several files just to get started.
“Having to download a few pieces of software was a huge barrier,” Stray said. “We probably lost three-fourths of our potential users just from that.”
With the help of a $475,000 grant the Knight Foundation awarded in summer 2011, Stray’s been expanding his team to help get Overview on the Web. There, users can register for an account and retrieve files directly from their accounts on DocumentCloud, the IRE-supported document management application now considered a “standard” for this sort of work.
That functionality is a huge plus for DocumentCloud’s 1,200 or so active users, who previously would have needed to use a little scripting, like the rudimentary Python program I wrote, to prepare their material for Overview (the prototype was later updated to parse PDFs and txt files automatically).
Stray said the Web version of Overview has also been modified somewhat based on feedback from users.
Gone is the “dot view” that in the prototype showed a top-down view of each document grouped into topic clusters. In its place is an expanded “tree view,” which allows users to view these topic clusters by group, subdividing more and more as they extend along the branches.
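One way to picture how a tree of ever-smaller topic clusters could be built is sketched below. It is an illustration under stated assumptions rather than Overview's implementation: documents are compared by word overlap (Jaccard similarity), and a group is split into branches whenever a stricter similarity threshold breaks it apart. All function names and threshold values are hypothetical.

```python
import re


def words(text):
    """Return the set of distinct lowercase words in a document."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def components(indices, sets, threshold):
    """Split documents into groups in which every document is linked,
    directly or through others, by similarity >= threshold."""
    remaining = list(indices)
    groups = []
    while remaining:
        stack = [remaining.pop(0)]
        group = []
        while stack:
            i = stack.pop()
            group.append(i)
            linked = [j for j in remaining
                      if jaccard(sets[i], sets[j]) >= threshold]
            for j in linked:
                remaining.remove(j)
            stack.extend(linked)
        groups.append(sorted(group))
    return groups


def build_tree(indices, sets, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9), level=0):
    """Build a nested dict: each node lists its documents, and its
    children are the subgroups that appear at a stricter threshold."""
    node = {"docs": sorted(indices), "children": []}
    while len(indices) > 1 and level < len(thresholds):
        parts = components(indices, sets, thresholds[level])
        level += 1
        if len(parts) > 1:
            node["children"] = [
                build_tree(p, sets, thresholds, level) for p in parts
            ]
            break
    return node


if __name__ == "__main__":
    docs = ["police radio outage", "police radio failure", "council budget vote"]
    sets = [words(d) for d in docs]
    print(build_tree(range(len(docs)), sets))
```

Here the root holds all three documents, its first branch holds the two police radio documents, and that branch subdivides again only when the threshold becomes strict enough to tell them apart.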
“If anything, it’s stripped down from the prototype after determining what features users actually use,” Stray said.
The new version is also limited to about 1,000 documents for now, although Stray said he expects to increase that limit soon.
“We are eventually targeting document set sizes in the millions,” Stray said in an email.
Stray said his group is only a few months into its grant funding cycle, meaning there is plenty of development to come (aside from fixes for the basic bugs that Stray said users are likely to encounter at first).
At the top of the list: improved integration with DocumentCloud, like the ability to automatically import the tags created in Overview. The team is also working on upload options for non-DocumentCloud users and more visualizations to help users analyze their document sets in different ways.
What users will see next, Stray said, depends largely on how they use the service and what journalists think will be most valuable during the reporting process.
“It depends a lot on the features our users need,” Stray said. “This is really just the beginning.”