By Charles C. Duncan Pardo
In about six weeks, Raleigh Public Record, an online nonprofit news organization based in Raleigh, N.C., will release a new open-source program to help journalists turn PDF files into structured data.
The new software will enable reporters to take an image containing data, say a scanned campaign finance return, and turn it into a spreadsheet.
This is a problem we at the Record have been trying to overcome for more than two years. The story started with Wake County campaign finance returns. The returns are filed on paper, and staff at the Wake County Board of Elections scan them and put the images online. The problem is that the only way to view the data is page by page, and the only way to analyze it is to enter it by hand into a spreadsheet, one row at a time.
We’re a small news organization; we don’t have the staff to do data entry for hundreds of pages of campaign finance information. We also don’t have the budget to hire some unfortunate college students to do it for us.
Edward Duncan, my brother and a full-time programmer, and I have been thinking about how to tackle this problem since 2010. We had been kicking ideas back and forth until Edward stumbled across this solution last summer.
The new program, called DocHive, aims to pull data from the documents and put everything into a spreadsheet.
Here’s how it works: the program converts the PDF into an image file using ImageMagick, then uses a template to break a page up into smaller sections.
For example, in the campaign finance documents, DocHive will make separate sections for donor name, occupation, donation amount and all the other fields. Then, the program will take each of those sections and turn it into a separate image file.
The software then takes each small image and uses optical character recognition (OCR) technology to read the words or numbers and insert them into a CSV file.
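The steps above can be sketched in a few lines of Python. This is only an illustration of the approach, not DocHive's actual code: the `Region` class, the function names and the pixel coordinates are all hypothetical, and the choice of Tesseract as the OCR engine is an assumption (the article does not name one). The ImageMagick calls use the classic `convert` command; newer installs may prefer `magick`.

```python
import csv
import io
import subprocess
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class Region:
    """One template field: a named rectangle on the page, in pixels (hypothetical)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def pdf_to_images_command(pdf_path: str, out_pattern: str, dpi: int = 300) -> List[str]:
    # ImageMagick rasterizes each PDF page; -density sets the resolution.
    # An out_pattern such as "page-%03d.png" yields one image per page.
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def crop_command(page_image: str, region: Region, out_path: str) -> List[str]:
    # -crop takes WIDTHxHEIGHT+X+Y; +repage resets the canvas offset afterward.
    geometry = f"{region.width}x{region.height}+{region.x}+{region.y}"
    return ["convert", page_image, "-crop", geometry, "+repage", out_path]


def run_command(command: Sequence[str]) -> None:
    # Runs an external command and raises if it fails.
    subprocess.run(list(command), check=True)


def tesseract_ocr(image_path: str) -> str:
    # Assumption: Tesseract is installed. The "stdout" output base makes it
    # print the recognized text instead of writing a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def page_to_row(
    page_image: str,
    regions: Iterable[Region],
    ocr: Callable[[str], str] = tesseract_ocr,
    run: Callable[[Sequence[str]], None] = run_command,
) -> Dict[str, str]:
    # Crop every template region to its own small image, then OCR each one.
    row = {}
    for region in regions:
        snippet = f"{page_image}.{region.name}.png"
        run(crop_command(page_image, region, snippet))
        row[region.name] = ocr(snippet).strip()
    return row


def rows_to_csv(rows: Iterable[Dict[str, str]], fieldnames: List[str]) -> str:
    # One CSV row per page (or per record), one column per template field.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because `ocr` and `run` are passed in as parameters, the cropping and CSV logic can be tried out with stand-in functions before ImageMagick or an OCR engine is installed.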
This method works with county-level campaign finance returns in North Carolina, but it can also work with almost any other standardized document format.
Want to go? Feb. 28 – March 3. “Those dastardly PDFs,” Saturday, March 2, 11 a.m.
The new program works so well because it breaks the page down into its component parts and runs OCR on each much smaller image. A single page can be broken into as many as 200 smaller images to be processed into a spreadsheet.
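To see how one page can yield that many images, consider a form laid out as a regular table. The sketch below assumes a hypothetical layout of 8 columns and 25 rows with made-up pixel positions; real forms may not be this regular, and this is not how DocHive necessarily builds its regions.

```python
from typing import Dict, List, Tuple


def grid_regions(
    columns: List[Tuple[str, int, int]],  # (field name, x offset, width) in pixels
    top: int,
    row_height: int,
    row_count: int,
) -> List[Dict[str, object]]:
    # Produce one crop rectangle per cell: every column repeated for every row.
    regions = []
    for row in range(row_count):
        y = top + row * row_height
        for name, x, width in columns:
            regions.append(
                {"row": row, "field": name, "x": x, "y": y,
                 "width": width, "height": row_height}
            )
    return regions


# Hypothetical layout: 8 equal-width columns, 25 rows of 40 pixels each.
columns = [(f"field_{i}", 100 + i * 200, 200) for i in range(8)]
cells = grid_regions(columns, top=400, row_height=40, row_count=25)
print(len(cells))  # 8 columns x 25 rows = 200 small images
```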
We are finishing up the core functions of the program and creating a user interface so anybody can create a template. Right now, templates have to be hard-coded in XML.
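For a sense of what a hand-written XML template might involve, here is a small sketch. The schema is invented for illustration: the element names, attribute names and coordinates are assumptions, not DocHive's actual template format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical template: each <field> names a rectangle on the page, in pixels.
EXAMPLE_TEMPLATE = """\
<template name="campaign-finance-receipts">
  <field name="donor_name" x="100" y="400" width="600" height="40"/>
  <field name="occupation" x="720" y="400" width="400" height="40"/>
  <field name="amount" x="1500" y="400" width="200" height="40"/>
</template>
"""


def load_template(xml_text: str) -> List[Dict[str, object]]:
    # Parse the XML and return one dictionary per field, with integer geometry.
    root = ET.fromstring(xml_text)
    fields = []
    for node in root.findall("field"):
        fields.append(
            {
                "name": node.attrib["name"],
                "x": int(node.attrib["x"]),
                "y": int(node.attrib["y"]),
                "width": int(node.attrib["width"]),
                "height": int(node.attrib["height"]),
            }
        )
    return fields


fields = load_template(EXAMPLE_TEMPLATE)
print([field["name"] for field in fields])
```

Writing rectangles like these by hand means measuring pixel positions on a scanned page, which is the kind of work a template-building interface would take over.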
The Record will release the beta version of DocHive at the NICAR conference Feb. 28 in Louisville, Ky. Development has been made possible by a grant from Raleigh’s own Beehive Collective (hence the DocHive name) and the kind folks at Reporters’ Lab.
Let us know if you’ve got any tricky document sets we can use to test DocHive or want to help test or prepare the new program for release. You can reach the development team at firstname.lastname@example.org.
Charles C. Duncan Pardo is the founding editor of Raleigh Public Record, a nonprofit online-only news organization dedicated to public service and watchdog journalism in Raleigh. Duncan is also a part-time graduate student at Duke University, where he’s creating his own journalism program. He lives with his wife and 100-pound lapdog in East Raleigh.