Analyzing documents with the help of the crowd

[Editor's note: This is Part I of a two-part series exploring the uses of Amazon's Mechanical Turk in public affairs journalism. Check out Part II for tips and tricks on using the service for inexpensive, accurate transcription of audio and video.]

Tedious parts of an investigative journalism project are often ideal tasks for a computer program — but this isn’t always the case.

A few of the most repetitive parts of the job, like separating and tagging documents from massive public record dumps, still need humans. And that’s when it’s good to get a little help from the crowd.

By tapping into Amazon’s Mechanical Turk service, reporters can shop out simple, rote tasks to a horde of unskilled workers who are paid for completing work submitted by others. There’s a learning curve involved in setting up jobs that “mTurk” workers will complete quickly, thoroughly and cost-effectively. But a growing body of research on the service and a tool under development at Duke University may soon help journalists make the most of the Internet collective during the reporting process.

‘I’ll know it when I see it’

This copper engraving depicts Mechanical Turk's namesake, a supposed chess-playing automaton that defeated a number of notable challengers. A master chess player was actually underneath the cabinet "pulling its strings."

MTurk went online in late 2005, smack in the middle of a two-year period that saw the launch of Facebook and Twitter. The social networks would go on to help inject “crowdsourcing” into the lexicon of new media.

As Katharine Mieszkowski explained in Salon, Amazon created the service to solve the kinds of problems not even its clever algorithms were capable of tackling.

[Mechanical Turk] celebrates the fact that we have become part of the machine. For fees ranging from dollars to single pennies per task, workers, who cheekily call themselves “turkers,” do tasks that may be rote, like matching a color to a photograph, but they can confound a computer.

These types of tasks have a common theme: they lack clear definition or parameters, which a program needs to do its job. While humans may “know it when they see it,” a computer can’t yet match that intuition.

Here’s how mTurk works: Requesters write up instructions for discrete “HITs,” or human intelligence tasks in mTurk’s parlance. They specify how many different workers they need to complete the HIT (it’s sometimes more than one to account for error) and the payment per HIT. Big projects are often split into many smaller HITs, grouped into batches, that are then posted to the market.

Workers select the jobs they want to take on and work on them within a requester-specified time limit, then submit their results. Requesters approve the work (and pay workers) only if it’s acceptable.
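To make that workflow concrete, here is a minimal Python sketch of a single HIT and of one common way to reconcile answers when several workers complete the same task. It does not call Amazon's service. The `Hit` class, its field names and the `majority_answer` helper are hypothetical, invented for illustration, and the 5-cent reward and three-worker setup are assumed example values. Majority voting is only one possible way to "account for error"; nothing here describes how any particular requester actually does it.

```python
from collections import Counter
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for one HIT; not Amazon's actual data model."""
    instructions: str
    reward: Decimal          # payment per approved assignment, in dollars
    max_assignments: int     # how many different workers should complete it
    time_limit_minutes: int  # requester-specified time limit

    def max_cost(self) -> Decimal:
        # Worst-case worker payments if every assignment is approved.
        # Any fee the marketplace charges is deliberately left out.
        return self.reward * self.max_assignments


def majority_answer(answers):
    """Return the most common answer, or None if there is no clear winner."""
    top_two = Counter(answers).most_common(2)
    if not top_two:
        return None
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return None  # tie: flag the item for a human editor to review
    return top_two[0][0]


# Assumed example: three workers, 5 cents each, 10 minutes to respond.
hit = Hit("Is this page an email? Answer yes or no.", Decimal("0.05"), 3, 10)
print(hit.max_cost())                          # 0.15
print(majority_answer(["yes", "yes", "no"]))   # yes
```

Asking three workers instead of one triples the payout for that HIT, which is the trade-off requesters weigh when they ask for redundancy.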

That payment is pretty small. Research by New York University Professor Panos Ipeirotis in the journal XRDS: Crossroads found 90 percent of the HITs paid out less than 10 cents, with 70 percent paying out 5 cents or less.

This analysis confirms the common feeling that most of the tasks on Mechanical Turk have tiny rewards.

Despite the small price tag, the jobs that show up most often in the mTurk catalogue would be familiar to beat reporters. Ipeirotis’ research shows transcription and categorization tasks appear most commonly in lists of keywords.

You can also find evidence of mTurk effort in the real world. The photo agency Magnum harnessed the workers to tag its own archive, leading to previously undiscovered photos from the American Graffiti set. ProPublica used it to scrape stimulus data locked behind a search-only government database.

A transcription service called CastingWords has even built a business right on top of mTurk’s infrastructure by charging users a little more for increased accuracy and ease of use.

Crowd-sourcing the first pass

CastingWords’ model isn’t unique.

At Duke University, a group of graduate students led by computer science professor Jun Yang spent a semester creating a Web-based interface that would allow journalists to effortlessly harness the power of mTurk (as well as social networking sites like Facebook) to split up massive public record dumps into individual documents. The resulting files would then be much easier to analyze with programs like DocumentCloud or Overview.

[Full disclosure: The Reporters' Lab is working with Yang on the project and I'm listed as a member of the team in a proposal to the Knight Prototype Fund.]

Called FirstPass, the system helps users skip all the complicated parts of submitting HITs. Reporters just upload their files to the site, enter how much they’d like to pay and when they’d like the job done. They can even submit a few questions for workers to help tag and classify the documents.

“You use our website as a way to facilitate crowdsourcing,” Mohammad Mottaghi, a member of the team and student in Yang’s computational journalism class, said. “We generate the results for you and you take care of payment.”

Before building the proof of concept, the group developed a few best practices by submitting the jobs to mTurk workers in different ways. That research helped them refine the structure of the HITs to keep costs down and accuracy high.

“It just turned out to be a very interesting exercise, because it’s not at all trivial to figure out how to present tasks to the user in a way that would discourage just clicking through things in a random fashion to accomplish task and earn money,” Yang told me in a May 4 interview.

When they tested their prototype on a 1,500-page dump of emails related to the Deepwater Horizon oil spill, the group averaged costs of about $3.50 per 100 pages with 88 percent accuracy. The entire job took about six-and-a-half hours to finish.
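For a sense of scale, the reported averages can be turned into a back-of-the-envelope estimate. The Python sketch below uses only the three figures from the test (1,500 pages, about $3.50 per 100 pages, about six-and-a-half hours). The `estimate_cost` function is a hypothetical helper, and it assumes cost scales linearly with page count, which the group's single test does not establish.

```python
from decimal import Decimal

PAGES = 1500                          # size of the test dump
COST_PER_100_PAGES = Decimal("3.50")  # average cost reported by the group
HOURS = Decimal("6.5")                # reported turnaround time


def estimate_cost(pages, cost_per_100=COST_PER_100_PAGES):
    """Scale the per-100-page average linearly (an assumption) to `pages`."""
    return (Decimal(pages) / 100 * cost_per_100).quantize(Decimal("0.01"))


total = estimate_cost(PAGES)
pages_per_hour = (Decimal(PAGES) / HOURS).quantize(Decimal("1"))

print(f"Estimated total: ${total}")                          # $52.50
print(f"Throughput: about {pages_per_hour} pages per hour")  # about 231
```

At those rates the whole test dump works out to roughly $52.50 in worker payments, processed at about 231 pages an hour.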

While FirstPass isn’t ready for prime time yet, the goal is to bring down prices and turnaround time with more fine-tuning. The group is also looking to add some computer assistance — machine learning specifically — to help workers complete tasks effectively and issue accuracy scores for the requester.

But the main focus remains usability. By cutting out the complexity and expertise that form a barrier to entry for first-time mTurk users, Yang is hoping to make FirstPass a no-brainer for journalists looking to take the initial step toward analyzing large, raw document sets.

“My hunch is where this system is going to have a nice niche market is for people who don’t want to deal with any of the setup,” Yang said. “It’s just very simple to use.”

About Tyler Dukes

Tyler Dukes is the managing editor for Reporters' Lab, a project through Duke University's DeWitt Wallace Center for Media and Democracy. Follow him on Twitter as @mtdukes.