Text documents

Reporters collect documents. Lots of documents. They might be printouts of internal emails, hearing transcripts or letters and memos. Each collection has its eccentricities, which we describe in the documentation.

The current state of the art in a newsroom is keyword search. Reporters have heard of named-entity and information-extraction techniques but have not used them. They usually create paper piles of interesting documents. Is there another way? We’re also curious: Could we formalize a reporter’s definition of “newsworthy”? Could we train a system to copy his or her news sense?

These are just a few of the documents we’ve collected. We’ll add more as we upload and document them, but contact us if you are looking for something different. Collections we haven’t processed yet include all of the earmarks requested of the House Appropriations Committee just before it ended the practice; emails related to defective Chinese drywall from Florida, with minimal redactions; semi-structured fact sheets created for the EPA Brownfields program; and quasi-XML collections of procurement announcements from the General Services Administration.

Transcripts

Combatant Status Review Tribunal

Downloaded from the Defense Department’s website, these are the transcripts of the high-value detainee hearings at Guantanamo Bay. We’ve already taken a sample of these documents and created an accurate text version of those pages for the purpose of testing OCR software.
Readme | Data (28 MB zipped)
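
An accurate text version lets you score OCR software by comparing its output with the known-good text, page by page. The sketch below is one common way to do that, not necessarily how this sample was built or used: it computes a character error rate from the Levenshtein edit distance. The function names and the sample strings are hypothetical.

```python
def edit_distance(reference: str, candidate: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn one string
    into the other."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if ref_char == cand_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a candidate character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the known-good text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Made-up strings for illustration only; real input would be one page of
# the accurate text and the OCR output for the same page.
rate = character_error_rate("tribunal", "tribuna1")
print(f"{rate:.3f}")
```

A lower rate means the OCR output is closer to the accurate text. Whitespace and line breaks count as characters here, so you may want to normalize both strings before comparing them.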

E-mail and correspondence collections

Interior Department on the 2010 oil spill

About 2,000 internal emails from the Interior Department’s Minerals Management Service concerning the Deepwater Horizon oil spill of 2010. This set came from The Washington Post’s Freedom of Information Act request; other newsrooms received somewhat different records depending on the timing and wording of their requests. (We don’t know why this file is so big for so few records. It may have to do with the form of the PDF.) We are still working on coding a sample of this dataset for testing named entity extraction tools.
Data (607 MB zipped)

Wisconsin governor e-mails

When Wisconsin Gov. Scott Walker tried to curtail union rights as part of a budget bill in 2011, he claimed the public was behind him. Wisconsin Watch reporter Kate Golden analyzed a sample of his e-mails and hand-coded whether they expressed support or opposition. Her reporting found he was right — most of his correspondence agreed with him. Kate donated her work to the Reporters’ Lab, but placed a reasonable restriction on its use — she felt the people who wrote the governor probably did not expect their email addresses and names to show up on the Internet as a result. While nothing would stop us from publishing these — they are public record — we agree. Contact us if you’d like to work with these records.

Your Seat at the Table

In 2009, the Obama transition team did something unprecedented: it made public all of the letters recommending policies and priorities for the new president from special interest groups, think tanks and others. We’ve scraped those letters and sent them through a first pass of OCR. We’ve also sampled them to pull out organization names to test entity extraction. We’re linking to the raw documents. The recognized ones are public on DocumentCloud.
Data
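
A hand-coded sample of organization names can serve as an answer key for an entity extraction tool. The sketch below is a minimal illustration, assuming you have one list of hand-coded names and one list of names the tool returned for the same letters. It scores the tool with precision and recall after a simple normalization. The function names and the organization names are made up for the example and are not drawn from the collection.

```python
def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace so that
    trivial formatting differences do not count as misses."""
    return " ".join(name.lower().split())


def score_entities(hand_coded, extracted):
    """Return (precision, recall) for extracted names against the
    hand-coded names, using exact matches after normalization."""
    gold = {normalize(name) for name in hand_coded}
    found = {normalize(name) for name in extracted}
    true_positives = len(gold & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical names for illustration only.
hand_coded = ["Example Policy Institute", "Sample Trade Association", "Demo Coalition"]
extracted = ["example policy  institute", "Demo Coalition", "Washington"]

precision, recall = score_entities(hand_coded, extracted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: an abbreviation or a partial name counts as a miss, so a real evaluation would probably need a more forgiving matching rule.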

Press releases, etc.

Securities and Exchange Commission actions

A compilation of 1,411 actions taken by the SEC in 2009, downloaded in January 2010. Because of the timing, the records contain some of the fallout of the Bernard Madoff case and include the initial complaint against R. Allen Stanford.
Readme | Data (785 MB zipped)

We have other documents not listed here — ask us if you don’t see the kind of documents your project needs.