Lab news

Needlebase is dead; long live Needlebase

Needlebase was by far our top-performing Web scraper. With its death on June 1, what will rise to take its place?

Early this morning, Google deactivated a highly regarded Web scraping tool that allowed users to quickly and easily mine information locked away in online databases — all without writing a line of code.

Needlebase, originally created by ITA Software, used a visual representation of Web pages to guide users through the Web scraping process. There was a learning curve and it wasn’t always cheap, but the software worked extremely well, making it one of the most popular solutions for journalists looking to easily extract data for stories.

Now it’s gone. And what’s worse: there’s nothing out there to take its place.

Mum’s the word

Google initially snapped up ITA Software for the company’s ability to gather travel information. That acquisition was finally OK’d by a federal judge in October after an eight-month review process that left the future of Needlebase in limbo.

A January announcement by Needlebase chief Justin Boyan ended that uncertainty: on June 1, Needlebase would be put out to pasture.

But Google’s not exactly known for trashing useful functionality. As Boyan pointed out in a blog post announcing the retirement (the link is now dead), his team was “hard at work planning how to best integrate Needlebase’s technology with Google’s portfolio, which includes structured-data initiatives like Fusion Tables, Google Refine, Public Data Explorer, and Freebase.”

But after six months, Google’s still not ready to reveal how Needlebase’s Web scraping prowess might end up in its other products.

“I don’t have any news I can share, except to tell you that a variety of structured-data pursuits are alive and well at Google,” Boyan told me in an email this week.

Mind the gap

Browser plug-ins

Dafizilla Table2Clipboard (Firefox)
This extension adds two menu items in Firefox: one in the Edit menu and one in the context menu. The context-menu item is visible only when table cells are selected; to select cells, hold down the Control key and click the cells you want to copy. The extension is useful because, although Firefox lets you select rows and columns from a table and copy the selection to the clipboard, the original table structure is lost when you paste the text. Table2Clipboard, however, allows you to paste data into Microsoft Excel or OpenOffice Calc with the correct structure.

Table Capture (Chrome)
Inspired by Dafizilla Table2Clipboard, this extension lets you copy HTML tables to the clipboard for use in a spreadsheet, whether you’re using Microsoft Excel, OpenOffice or Google Docs.

Scraper (Chrome)
This is a simple data mining extension useful when you need to quickly analyze Web data in spreadsheet form (starting in Google Docs). It is a work in progress (there are bugs) but is easy to use.

  1. Find a web page containing data you want to scrape.
  2. Highlight some data on the page similar to what you want. For example, select a row of a table if you want to scrape all rows.
  3. Right-click on your selection and select the “Scrape similar…” option.
  4. Press “Scrape” to update the table based on your current options.
  5. Once the table shows the data you want, press “Export to Google Docs…” to save it in a new spreadsheet.

Data Toolbar (Internet Explorer, standalone version)
DataTool Services calls its product “The world [sic] easiest data scraping tool,” and it is very easy to use. The free version’s output is limited to 100 rows. The full version, with no limit on the number of rows of data, currently sells for $24 (one license per computer).

  1. Click the data fields and images you want to collect.
  2. Add fields from the “details” page if appropriate.
  3. Mark the “NEXT” page option.
  4. Select “Get Data” to download the images and data.
  5. Save the results in a spreadsheet.

– Ed Ramthun, AFSCME

For now, journalists and others looking for user-friendly Web scraping solutions will need to turn to other companies, whose products vary in sophistication and cost.

Outwit is one of these contenders, and although it was an average performer in our tests, the development team behind the software is constantly refining it.

There’s also an entire batch of easy-to-install browser extensions that can help in some situations, although in most cases the data you’re looking for has to be formatted in specific ways. Ed Ramthun, with the American Federation of State, County and Municipal Employees, suggested using Table Capture, Scraper, Data Toolbar or Dafizilla Table2Clipboard (see the sidebar). All are free, although Data Toolbar is limited to 100 rows of data unless users shell out $24 for the full version.

There’s one more option for code-averse reporters looking to mine data for their stories — just learn to code.

Becoming proficient enough to code a simple Web scraper does take time, but to Francis Irving, it’s a worthy investment if scraping is a task you perform often.

He says that’s because user-friendly, off-the-shelf tools like Needlebase and competitors like Kapow “always reach a barrier” when dealing with different kinds of scraping tasks.

“They’ve had to add loads of bolt-on features,” said Irving, the CEO of the data hub site ScraperWiki. “These tools end up very complicated in the end.”

Irving argues that Web scraping tasks provide the perfect opportunity for people to learn programming.

“There are people who pretend they can’t learn to code. But it’s a basic literacy you need in this century,” Irving said.

ScraperWiki provides a few different options for achieving that literacy. First, you can search for scraper programs already crafted by others, say a simple Twitter script. You can run that code yourself or check out the associated data to get a sense of how it works.

If you can’t find what you’re looking for, you can write your own code in Python, Ruby or PHP directly on the site, where you can run your scripts and consult the documentation if you’re stuck. There’s even a public email list where other ScraperWiki users can help with questions and contribute code examples.
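
If you’re wondering what such a script actually looks like, below is a minimal sketch of a table scraper in Python. It uses only the standard library (rather than ScraperWiki’s own helper functions), and the URL and output filename are placeholders standing in for your real target. The script fetches a page, pulls the rows out of its HTML tables and writes them to a CSV file:

```python
# A minimal table scraper: fetch a page, collect every <tr> row from its
# HTML tables, and write the rows to a CSV file. The URL below is a
# placeholder: point it at a page you have permission to scrape.
import csv
import urllib.request
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Accumulates the text of each <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows, each a list of cell strings
        self._row = None       # cells of the row currently being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._row.append("")  # start a new, empty cell
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data  # cell text may arrive in several chunks

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None

url = "http://example.com/table.html"  # placeholder URL
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

parser = TableParser()
parser.feed(html)

with open("scraped.csv", "w", newline="") as f:
    csv.writer(f).writerows(parser.rows)
print("Wrote %d rows to scraped.csv" % len(parser.rows))
```

Real pages are messier than this, of course, but the basic fetch-parse-save pattern rarely changes.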

“We’ve got quite a good community of people who, particularly for journalism, will help with projects in the public interest for fun,” Irving said. “They’re mad — in a good way.”

This basic functionality is free, but everything you write and scrape will be public. Premium accounts, which allow users to make content private and schedule code to run regularly, are $9 a month for individuals, $29 a month for businesses and $299 a month for corporations.

If your project is really complicated (and your pockets are a bit deeper), you can also have ScraperWiki’s own data scientists do the job. An evaluation of the task costs $150, but the job itself may range from $700 to $15,000, according to Irving.

So far, his team has worked with news organizations like Channel 4 in the U.K. (where ScraperWiki is based) and on large business-related journalism projects. Plenty of data journalists are also members of the ScraperWiki community.

Keep calm and carry on

Another plus of ScraperWiki: It’s not likely to go anywhere any time soon. Although the Needlebase team gave plenty of notice so its users could migrate their data, Irving said the structure of companies with proprietary technology like ITA can change quickly with acquisitions or shifts in business strategy, leaving users in the lurch.

Irving said features like open source code, an API and an active community make companies like his a more stable bet.

“If there’s a community built around it and it’s an open product, it’s a better bet,” he said.

Becoming a programmer is far from the only option for reporters looking to mine data in the wake of Needlebase’s demise. But the lack of a go-to solution does mean more experimentation is required to get the job done.

Whether that means learning a bit of Python or trying out more user-friendly options like Outwit is up to you.

About Tyler Dukes

Tyler Dukes is the managing editor of Reporters' Lab, a project of Duke University's DeWitt Wallace Center for Media and Democracy. Follow him on Twitter at @mtdukes.