Scraping information from the Web can be a complicated affair.
Although there are some off-the-shelf options for performing an online data grab, sometimes you just need something cheap and quick for an exceedingly simple job.
That’s why we’ve tested and reviewed four Web scraping browser extensions, whittling them down to our top two. These applications are primarily designed to copy and paste HTML tables. But despite their limitations, their relative ease of use and inexpensive (or free) pricing make them valuable time-savers for small jobs.
Want a more full-featured Web scraper? Check out Reporters’ Lab Reviews for the entire list.
What won’t work (ever)
There’s one thing I should point out before getting started: In each set of tests, these applications were unable to extract data from anything other than an HTML table.
Data organized in any other format (<div> tags, for example) just won’t work. If you’re not sure how to spot an HTML table, take a look at the one below. Viewing the source code of this page would show the information surrounded by the <table> tag, which is further organized by tags indicating headers (<th>), rows (<tr>) and individual cells (<td>), much like a typical spreadsheet.
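To make the tag structure concrete, here is a minimal sketch in Python that reads a small table using only the standard library. The sample markup, the TableParser class and the parse_table helper are made up for illustration; none of the extensions reviewed here are known to work this way.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a header or data cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup: one header row and two data rows.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Client</th></tr>
  <tr><td>Jane Doe</td><td>Acme Corp.</td></tr>
  <tr><td>John Roe</td><td>Widget Co.</td></tr>
</table>
"""

rows = parse_table(SAMPLE)
```

Each inner list corresponds to one <tr>, which is why table markup maps so neatly onto spreadsheet rows and columns. Data laid out with <div> tags carries no such row and cell structure for a parser like this to follow.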
Additionally, none of the browser extensions could take on the more complex tasks traditionally reserved for more robust scrapers. Details that sit on linked detail pages, which users would normally have to click through to view, are out of the question. If your target database requires you to enter a search term before viewing results, these extensions won’t be good options because they can’t automate that step.
If your scraping job meets any of the above criteria, look for another solution.
[Comparison table: Data Toolbar, Scraper, Table2Clipboard and Table Capture, rated on extracting data from detail pages and on search results.]
What does work: scraping multiple pages
One of the most maddening parts of manually grabbing online data is clicking through multiple pages. But the repetitive nature of that task makes it an ideal candidate for automation, which is exactly what Data Toolbar ($24) does (read the full review).
The Internet Explorer add-on had no problem nabbing the information from a 406-page listing of lobbyist information, even distinguishing and omitting duplicate header information.
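For a sense of what omitting duplicate headers involves, here is a minimal Python sketch. It assumes each page has already been reduced to a list of rows and that the first row seen is the header. The merge_pages name and the sample data are hypothetical, and this is not a description of how Data Toolbar works internally.

```python
def merge_pages(pages):
    """Combine per-page row lists, keeping the header row only once.

    Assumes the first row encountered is the header and that later
    pages repeat it verbatim.
    """
    merged = []
    header = None
    for rows in pages:
        for row in rows:
            if header is None:
                header = row        # remember the first row as the header
                merged.append(row)
            elif row != header:     # skip repeats of the header on later pages
                merged.append(row)
    return merged


# Hypothetical two-page listing where each page repeats the header row.
page_one = [["Name", "Client"], ["Jane Doe", "Acme Corp."]]
page_two = [["Name", "Client"], ["John Roe", "Widget Co."]]

combined = merge_pages([page_one, page_two])
```

The comparison is deliberately strict: a header that differs slightly from page to page would need a looser match.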
We also found that even with searchable databases that allowed wildcards (for example, returning all names that start with “a,” “b,” “c,” etc.), Data Toolbar saved some time by gobbling up the results of each search. It’s not full automation, but it’s a start.
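The wildcard approach amounts to running one search per letter. As a rough sketch, the Python below builds that list of search URLs. The base URL, the "name" parameter and the "*" wildcard character are all assumptions made for illustration; a real database will have its own address and query syntax.

```python
import string
from urllib.parse import urlencode

# Hypothetical search endpoint; substitute the real database's address.
BASE_URL = "https://example.org/search"


def wildcard_urls(base_url=BASE_URL, param="name", wildcard="*"):
    """Build one search URL per letter, e.g. a*, b*, c* ... z*.

    The parameter name and wildcard character are assumptions.
    """
    urls = []
    for letter in string.ascii_lowercase:
        query = urlencode({param: letter + wildcard})  # percent-encodes as needed
        urls.append(f"{base_url}?{query}")
    return urls


urls = wildcard_urls()
```

Each URL would still need to be opened and its results captured, which is the part the extension handles.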
If simple, multiple-page databases are a common occurrence for you, $24 will be well spent.
Copying tables, simplified
By adding a “Copy all tables” option to the “edit” menu, the free app allows you to paste information directly into a spreadsheet, preserving the tabular formatting. It can also capture the HTML links embedded in the data.
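Tabular formatting usually survives a paste when the clipboard text is tab-separated, since spreadsheets generally split such text into columns. The Python sketch below shows that conversion under that assumption. The rows_to_tsv name and the sample rows are hypothetical, and the add-on's actual clipboard format may differ.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text, one table row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows; the second column holds a link captured from the cell.
rows = [
    ["Name", "Profile link"],
    ["Jane Doe", "https://example.org/lobbyists/1"],
]

tsv = rows_to_tsv(rows)
```

Keeping an embedded link in its own column, as in the sample, is one simple way to carry it into the spreadsheet alongside the visible text.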
These add-ons may not do it all, but they’re a painless addition to any journalist’s arsenal and might just save time on those one-off projects.