It’s amazing how people can come together around a common set of problems.
Last weekend, I joined about 450 people at the Sunlight Foundation’s Transparency Camp, an “unconference” focused on open government. Attendees included data wranglers from the federal government, developers both professional and amateur, and a smattering of journalists focused on public affairs coverage. There were even a few representatives from Occupy D.C.
In true unconference format, the crowd itself determined most of the sessions in an effort to best reflect the interests of the open government crowd. That means each morning, dozens crowded around the “Wall” to pin their session ideas and recruit collaborators.
But one underlying topic kept popping up again and again in sessions throughout the weekend: How to use technology to make more sense of electronic text.
As it turns out, journalists aren’t the only ones frustrated and stymied by badly scanned records, massive document dumps and hard-to-parse legislation — developers are too. And at TCamp, they talked about how to join forces to fix it.
Hard to compare
Star-Ledger statehouse reporter Sal Rizzo had a hunch about the bills sailing through the New Jersey legislature to Gov. Chris Christie’s desk.
Some of the Republican governor’s major proposals — tenure reform, for example — bore a remarkable resemblance to so-called model bills drafted by the fiscally conservative American Legislative Exchange Council, which provides such mock-ups to paying members like lawmakers.
Identifying those similarities in actual bills, and confirming real connections to the models that inspired them, could quantify what government watchdogs Rizzo spoke with already suspected: that the use of model bills was on the rise nationwide. That may be a problem, these watchdogs say, since ALEC and groups like it aren’t registered to lobby and face none of the associated regulations.
During his session at TCamp, Rizzo explained that comparing the legislature’s bills with ALEC’s models wasn’t easy. Slight differences in the text threw off plagiarism-checking software like DOC Cop and Beyond Compare, tools typically designed for academic use.
Rizzo and his colleagues at The Star-Ledger eventually settled on reviewing the documents manually, searching for shared numbers, sentences and ideas. They developed benchmarks to decide which bills were tied to ALEC models, eventually publishing their findings in a story April 1.
The investigation took months.
One possible shortcut is Superfastmatch, an application that allows fine-grained comparisons among large numbers of documents. Users can sort matching documents by the number of similar fragments and jump straight to the copied segments.
The project is open source and available on GitHub now. Developers also plan to use Superfastmatch to power a U.S. version of the U.K.’s Churnalism.com, which detects similarities between news stories and press releases.
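The core idea behind this kind of tool, breaking documents into overlapping word sequences ("shingles") and looking for fragments two texts share, can be sketched in a few lines of Python. This is an illustration of the general technique, not Superfastmatch's actual algorithm, and the sample bill language below is invented:

```python
def shingles(text, n=6):
    """Return the set of n-word sequences ("shingles") in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_fragments(doc_a, doc_b, n=6):
    """Return the shingles two documents have in common, sorted."""
    return sorted(shingles(doc_a, n) & shingles(doc_b, n))

# Invented example: a bill and a "model bill" that differ in a few words.
bill = ("An employee shall be granted tenure after four consecutive "
        "years of satisfactory annual evaluations in the district")
model = ("A teacher shall be granted tenure after four consecutive "
         "years of satisfactory annual evaluations in any district")

# Every six-word run the two texts share, despite the small edits.
matches = shared_fragments(bill, model)
```

Even with the swapped words at the start and end, the long identical middle run produces several overlapping six-word matches, which is how phrase-level comparison survives the light rewording that defeats a whole-document diff.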
Rizzo told me in an email conversation that Superfastmatch is already a better solution than the software he tried, since it can detect similar phrases as short as six words. But he did have a few caveats and suggestions.
“Bill language is frequently borrowed among the states just as a matter of course, though, so there could be many cases when you get flooded with procedural phrases,” Rizzo said. “It might be easier to have something that gives priority to numbers and dates, and words that are not frequently used.”
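Rizzo's suggestion could be approximated by scoring each shared phrase by how rare its words are, with a bonus for numbers. The sketch below is one hypothetical weighting scheme, not anything Superfastmatch implements, and the tiny corpus is invented for demonstration:

```python
from collections import Counter
import math
import re

def rarity_score(phrase, word_counts, total):
    """Average inverse-frequency of a phrase's words; numbers get a bonus."""
    words = phrase.lower().split()
    score = 0.0
    for w in words:
        if re.fullmatch(r"[\d.,$%]+", w):
            score += 5.0  # dates, dollar amounts and figures are strong signals
        else:
            freq = word_counts.get(w, 0) + 1  # +1 so unseen words don't divide by zero
            score += math.log(total / freq)
    return score / len(words)

# Invented mini-corpus standing in for a pile of legislative text.
corpus = ("be it enacted by the legislature of the state be it enacted "
          "by the legislature tenure reform evaluations").split()
counts = Counter(corpus)
total = len(corpus)

boilerplate = "be it enacted by the legislature"
distinctive = "tenure after 4 consecutive satisfactory evaluations"
```

Scored this way, the routine procedural phrase ranks well below the phrase built from rare words and a number, which is the kind of prioritization Rizzo describes.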
Toward better text recognition
Other developers at Transparency Camp were interested in a more basic problem: even when government agencies release public records, they often exist only as scanned images without searchable text. And if a computer can’t read a document, users can’t automate any analysis of it.
Technology called optical character recognition, which automatically detects text in such images, is pretty widespread. By comparing the images of text characters with a training set stored in the program, OCR software makes guesses about the letters and words it sees on the page, outputting the result in a text file. As we’ve found in our own testing, the best ones are often pricey.
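That matching process can be shown with a toy example: store a small training set of character bitmaps and pick the stored letter closest to a scanned glyph. Real OCR engines such as Tesseract are far more sophisticated; the tiny 5×3 bitmaps here are invented purely for illustration:

```python
# Toy "training set": each letter is a 5-row, 3-column bitmap
# ('#' = ink, '.' = blank). Invented for demonstration.
TRAINING_SET = {
    "O": ("###", "#.#", "#.#", "#.#", "###"),
    "C": ("###", "#..", "#..", "#..", "###"),
    "R": ("##.", "#.#", "##.", "#.#", "#.#"),
}

def similarity(a, b):
    """Count matching pixels between two same-sized glyph bitmaps."""
    return sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognize(glyph):
    """Return the training-set letter whose bitmap best matches the glyph."""
    return max(TRAINING_SET, key=lambda letter: similarity(TRAINING_SET[letter], glyph))

# A "noisy scan" of the letter O, with one pixel flipped by scanning artifacts.
scanned = ("###", "#.#", "#.#", "#..", "###")
```

Despite the flipped pixel, the glyph still sits closer to "O" than to "C" or "R", which is the guess-by-comparison behavior described above; the hard part in practice is handling skew, noise and unusual fonts, where cheap comparisons break down.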
While the group has only begun its discussion of how to create an open-source OCR solution with “non-terrible output,” session organizer Derek Dohler is keeping the conversation going online with a collaborative Google Doc.
Text analysis of public records is a problem massive in scope. It won’t be solved quickly or easily, as TCamp’s five years of existence can attest. But uniting 450 people interested in working toward an answer seems like a good way to try.