Needlebase and the future of Web scraping tools

On June 1, Needlebase will cease to be. | Image courtesy of Jo Naylor

The computer-assisted reporting community got some bad news at the end of 2011: Needlebase, an effective and popular option for scraping data from Web databases, was on its way out.

As we pointed out in our review of the product, Needlebase is a valuable tool for journalists despite its steep price point. But after Google received final approval from a federal judge for the purchase of Needlebase owner ITA Software, the guts of the programming are now destined for Google apps like Refine and Fusion Tables.

Although Needlebase is still fair game for the next few months, Google’s Justin Boyan said in an email that if its Web scraping tools do resurface in a Google product, “they will look very different.”

So what’s left?

Well, OutWit Hub, for one. It fared OK on basic jobs in our review, although it didn’t come close to Needlebase in our testing. It’s also cheap, which is a big plus if your newsroom is strapped for cash.

Plus, there are signs OutWit may even get better in the future. In response to our review critiques, OutWit’s Jean-Christophe Combaz said version 2.5 will feature interface updates and more advanced scraping options, as well as wizards and tutorials to walk users through some of the more in-depth features.

Building something comprehensive is a tough task, which is why Combaz said his product is always a work in progress.

“The Internet is a large, complex and ever-evolving beast,” he said. “Giving people a way to harvest it without writing a line of code, in a flexible and consistent way, is quite a mission. We like it.”

There are other solutions as well, though programs like Kapow, Mozenda, Automation Anywhere and Visual Web Ripper are expensive (all of them are on our testing and review list, so stay tuned). According to Nils Mulvad, a scraper expert who regularly trains journalists in data mining, Kapow once offered a free solution called OpenKapow that’s now gone the way of Needlebase.

The big question for us at the lab is why there aren’t more competitors in the user-friendly Web scraping space — especially at reasonable price points. Many data journalists are so desperate for solutions they’ll just tap into the power of Python to make their own scrapers. The International Center for Journalists sponsored a program for Hacks/Hackers D.C. in late January to teach attendees how to do it. There’s also ScraperWiki, which leverages a community of coders to help build better scrapers.
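For readers curious what a homemade scraper involves, here is a minimal sketch that uses only Python's standard library. It pulls the rows out of an HTML table held in a string. The sample HTML, the class name TableScraper and the function scrape_table are all hypothetical, made up for illustration; a real scraper would also have to fetch pages from the Web and cope with far messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page, invented for this example.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Amount</th></tr>
  <tr><td>Alpha County</td><td>1,200</td></tr>
  <tr><td>Beta County</td><td>950</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Even this toy version hints at why point-and-click tools are appealing: the code has to track where each row and cell begins and ends, and every site structures its pages differently.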

“When it comes to using scrapers for production of high-quality news stories updated regularly, there is no way out of either programming or expensive software right now, as I see it,” Mulvad said. “Hopefully it will change.” (If you’re attending NICAR 2012, you can get training from Mulvad firsthand at two of his scheduled sessions.)

It’s a change that might be easier to make if we know what the complications are, so we’re hoping to identify some of the supply-side problems of building a robust, open-source Web scraper here at the lab.

Think you’ve got a good answer? Know of some good scraping solutions we didn’t mention? Let us know in the comments.

About Tyler Dukes

Tyler Dukes is the managing editor for Reporters' Lab, a project through Duke University's DeWitt Wallace Center for Media and Democracy. Follow him on Twitter as @mtdukes.