
The Chicago Tribune's Brian Boyer removes his costume panda head before demonstrating PANDA, an open-source database system developed by Investigative Reporters and Editors. | Photo by Tyler Dukes
It was the lightning talks that set my head spinning.
I’m a bit new to the world of data journalism, and up until that point, my first time at the annual Computer-Assisted Reporting conference had stopped just short of overwhelming. But after a few mile-a-minute presentations in this particular session, I had only one clear thought: There’s so much to know, and not enough time to learn it.
Luckily, PolitiFact developer and University of Nebraska-Lincoln journalism professor Matt Waite gave the audience a little bit of reassuring advice gleaned from Zen Buddhism:
It’s bad to be an expert. If you know something and you think you know something, you ignore a lot.
That seems like the perfect mantra for attending this conference. So many of the journalists and developers who spoke encouraged attendees to treat what they learned as a jumping-off point, and to follow up as soon as possible with data exploration, investigative reporting and programming of their own.
In the spirit of that advice, here’s a collection of the top tools I picked up from the conference. Most I know nothing (or at least very little) about. In the coming weeks and months, I’m hoping to learn a little more about each of them, whether it’s in the form of a formal review or an informal exploration on the blog.
If you’re looking for an even longer list, check out Chrys Wu’s collection on her blog. And for conference attendees who have suggestions of their own, feel free to share them in the comments.
Evernote // Project tracking
Until one of the conference sessions, I dismissed Evernote as a simple note-taking service. But one of the speakers pointed out that it’s excellent for tracking progress in a story and sharing it with editors.
Google Spreadsheets // Web scraping
Using the importHTML() and importData() functions in Google’s spreadsheet application is a great way to scrape some Web pages without programming. It doesn’t always work, but it’s a good starting point. More on the process from session speakers Chris Keller and Michelle Minkoff.
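When the spreadsheet import functions don’t cooperate, the same idea can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not what the session speakers presented: the TableScraper class, the scrape_tables() helper and the sample page are all made up for this example, and it only handles simple tables that aren’t nested inside one another.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the rows of every <table> in a page as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def scrape_tables(html):
    """Return every table found in an HTML string (hypothetical helper)."""
    parser = TableScraper()
    parser.feed(html)
    return parser.tables


# Made-up sample page; a real one would be downloaded first.
SAMPLE_PAGE = """
<table>
  <tr><th>County</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>100</td></tr>
  <tr><td>Beta</td><td>250</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_tables(SAMPLE_PAGE)[0]:
        print(row)
```

Each table comes back as a list of rows, which is easy to hand to Python’s csv module and open in a spreadsheet.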
CrocTail // Corporation tracker
CrocTail gathers and indexes information about corporations and their subsidiaries, based on their 10-K forms from the U.S. Securities and Exchange Commission.
Pipes // News feed management
I’ve always wanted to play more with Yahoo Pipes, which provides a way to mash up and customize RSS feeds just the way you want them. The visual interface is easy to manage too, so no programming is really required.
Refine // Data cleaning
Google’s downloadable application can quickly clean up messy data by clustering and fixing similar entries like names, places and groups for easier analysis. It can also export data into other formats.
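To get a feel for what clustering similar entries means, here’s a simplified Python sketch of one common approach: reduce each entry to a “fingerprint” (lowercase, punctuation stripped, words sorted) and group the entries whose fingerprints match. It illustrates the general idea only, not Refine’s own code, and the fingerprint() and cluster() names and the sample names are made up for this example.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize an entry so trivially different spellings collide."""
    cleaned = value.strip().lower()
    # Drop punctuation such as the comma in "Smith, John".
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so word order no longer matters.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group entries by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    # Made-up sample entries.
    names = ["Smith, John", "john smith", "John  Smith", "Jane Doe"]
    print(cluster(names))
```

A reporter would still want to eyeball each group before merging it, since two different people can share a fingerprint.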
Qlikview // Data visualization
Available in a full, free version, Qlikview is built for business intelligence, although journalists can put it to work for data analysis through visualization. There are also sharing and discussion tools to collaborate with others in the newsroom.
Junar // Data extraction
Junar allows users to collect, track and embed data from the Web just by submitting a webpage containing tables (file submission is also possible). A simple interface lets reporters select the table to process the data, which is then shared publicly on the site.
NodeXL // Relationship analysis
NodeXL is an Excel template for performing network analysis, or exploring how things are connected. As session speaker Peter Aldhous put it, think about it in terms of the game Six Degrees of Kevin Bacon.
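The Kevin Bacon comparison boils down to counting the fewest links between two people in a network. Here’s a rough Python sketch of that calculation using a breadth-first search; it isn’t how NodeXL works under the hood, and the degrees_of_separation() function and the tiny network of names are invented for illustration.

```python
from collections import deque


def degrees_of_separation(graph, start, goal):
    """Return the fewest links between two people, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbor in graph.get(person, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Made-up network: each person maps to the people they are linked to.
NETWORK = {
    "Ana": ["Ben"],
    "Ben": ["Ana", "Cal"],
    "Cal": ["Ben", "Dee"],
    "Dee": ["Cal"],
    "Eve": [],
}

if __name__ == "__main__":
    print(degrees_of_separation(NETWORK, "Ana", "Dee"))
```

In a reporting context, the people could be board members, donors or companies, and the links could be shared boards, contributions or contracts.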
Gephi // Relationship analysis
Another open-source tool for network analysis, Gephi is a standalone piece of software for Windows, Linux and Mac.
Scraper // Web scraping
Scraper is a free Google Chrome extension that can help users quickly collect data from simple Web tables and export the information to a spreadsheet.
iMacros // Web scraping
A Firefox add-on for recording tasks in your browser, iMacros can be used to scrape websites by defining a set of actions once, then automatically repeating it numerous times.
QGIS // Mapping
Quantum GIS is a free, open-source mapping system that can help journalists analyze geographic data. In addition to a user-friendly interface, it’s also supported by a large developer community that can help troubleshoot problems and find solutions.
ReVerb // Information extraction
An open-source project from the University of Washington’s Department of Computer Science and Engineering, ReVerb detects relationships between terms on the Web, using them to answer questions automatically.
Overview // Document set analysis
Overview is a visualization tool specifically designed to help journalists find stories hidden inside large document sets using information extraction and other techniques.
Columbia Newsblaster // News topic clustering
Newsblaster uses natural language processing to cluster news by topic and summarize what’s happening using multiple documents. It’s been around for a while and has seen some academic study, and it’s something in which we’re particularly interested at the lab.