Many reporters have expressed concerns about social media being too ephemeral. Between deleted tweets and posts that disappear into the timeline void, it’s hard to keep track of and find information more than a few days old, never mind information from a specific day a year back.
There are a number of existing solutions that come close. DataSift and Gnip are enterprise-level solutions for monitoring Twitter streams. The Archivist provides aggregated information for Twitter searches. Topsy lets you search old tweets, but is hit or miss. Siftee purports to archive your streams and let you add tags, but it is still in beta and getting access has proven to be difficult.
So we decided to try to prototype our own Twitter Archiver, something reporters would want to use. In the process, we found out why someone else hasn’t beaten us to it.
Decisions, decisions
Download the prototype
We’ve made our first attempt at a Twitter Archiver available and released its code as open source.
Download the zip file and run the installer to get started. For more information, check out the README file.
There were a few things we felt a basic archiver had to accomplish. It had to:
- Download and save the user’s “home timeline” (what you see when you’re logged in and viewing Twitter’s homepage, not your individual user timeline) into a database.
- Produce an archive that is searchable and taggable and that provides aggregate information.
- Be extensible so Facebook or RSS feeds could be added in the future.
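To make these three requirements concrete, here is a minimal sketch of a database layout that would satisfy them. It is not the prototype's actual schema: the table names, column names and helper functions are all hypothetical, and it uses Python's built-in `sqlite3` module rather than AIR.

```python
import sqlite3

def open_archive(path=":memory:"):
    """Create the (hypothetical) archive tables if they do not exist yet."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS items (
            source     TEXT NOT NULL,   -- 'twitter' today; 'facebook' or 'rss' later
            item_id    TEXT NOT NULL,   -- the ID assigned by the source
            author     TEXT NOT NULL,
            body       TEXT NOT NULL,
            created_at TEXT NOT NULL,   -- ISO 8601 timestamp, UTC
            PRIMARY KEY (source, item_id)
        );
        CREATE TABLE IF NOT EXISTS tags (
            source  TEXT NOT NULL,
            item_id TEXT NOT NULL,
            tag     TEXT NOT NULL,
            PRIMARY KEY (source, item_id, tag)
        );
    """)
    return db

def save_item(db, source, item_id, author, body, created_at):
    # INSERT OR IGNORE makes downloading the same item twice harmless.
    with db:
        db.execute("INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?, ?)",
                   (source, item_id, author, body, created_at))

def tag_item(db, source, item_id, tag):
    with db:
        db.execute("INSERT OR IGNORE INTO tags VALUES (?, ?, ?)",
                   (source, item_id, tag))

def search(db, text, day=None):
    """Substring search, optionally limited to one day ('YYYY-MM-DD')."""
    sql = "SELECT author, body FROM items WHERE body LIKE ?"
    args = ["%" + text + "%"]
    if day is not None:
        sql += " AND substr(created_at, 1, 10) = ?"
        args.append(day)
    return db.execute(sql + " ORDER BY created_at", args).fetchall()

def items_per_author(db):
    """Aggregate view: how many archived items each author produced."""
    return db.execute(
        "SELECT author, COUNT(*) FROM items "
        "GROUP BY author ORDER BY COUNT(*) DESC, author"
    ).fetchall()
```

Keying every row on a `source` column as well as the item's ID is one way to leave room for Facebook or RSS later without changing the tables.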
There were several ways we could go about making this system. We could have made it server-based and let people log into a website and have us manage everything. The problem for us is that we do not have the technical capacity necessary to deal with a large hosted solution. It would have to be secure and have constant uptime, and we couldn’t deliver either of those things without a lot of work.
The other way would be to create a desktop application users could run on their computers. We knew there could be problems here as well – the computer would have to be connected to the Internet to download data – but we didn’t anticipate how severe Twitter’s restrictions would be.
Our decision was to try to build a desktop application using Adobe AIR, mostly because it was cross-platform and I have some experience with it. AIR applications can easily save data into an SQLite database and Twitter API connectors already exist in ActionScript, AIR’s programming language. Using the Twitter ActionScript API and a framework called RobotLegs to make coding easier, I got to work.
Working with Twitter’s limits
The first step was to improve the Twitter ActionScript API to take advantage of what Twitter calls “streams.” Twitter basically provides two APIs: a REST API, which you access by sending a separate request to the server for every piece of data you want, and a streaming API, where you open a socket to Twitter and are pushed notifications (tweets, user events like gaining a follower, etc.) in real time.
The REST API has limitations on how often you can request information, whereas the streaming API will let you stay connected indefinitely. The problem with the streaming API is that you can’t access past data, so it’s only good for grabbing data in real-time. You still have to rely on the REST API to get older data.
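In practice this means an archiver has to combine the two: catch up over REST when it starts, then listen to the stream. The catch-up half can be sketched as below. This is an illustration in Python, not the prototype's ActionScript code. `fetch_page` is a hypothetical stand-in for a single REST request; its `since_id`, `max_id` and `count` arguments mirror the paging parameters Twitter's timeline calls accept, and the page size of 200 is an assumption.

```python
def backfill(fetch_page, newest_stored_id, page_size=200):
    """Collect tweets newer than newest_stored_id by paging backward in time.

    fetch_page(since_id, max_id, count) is a hypothetical callable standing in
    for one REST request. It must return tweets (dicts with an integer "id"),
    newest first, where since_id < id and, if max_id is given, id <= max_id.
    """
    collected = []
    max_id = None                      # None means "start from the newest tweet"
    while True:
        page = fetch_page(since_id=newest_stored_id, max_id=max_id, count=page_size)
        if not page:
            break                      # nothing older is available
        collected.extend(page)
        max_id = page[-1]["id"] - 1    # next page starts just below the oldest seen
    return collected
```

The loop stops as soon as a request comes back empty, which happens either when it reaches the newest tweet already in the archive or when it runs into the limit on how far back the REST API will go.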
But the REST API has limits on how far into the past you can request tweets. Individual users can request 3,200 of their own tweets, but they can only get the 800 most recent tweets from their home timelines. This limitation means that most users wouldn’t be able to turn the application off when they leave work for the night, because by morning they will probably have received more than 800 tweets. In those cases, the archiver could not go back and get all of the tweets it missed.
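A back-of-the-envelope sketch puts numbers on that overnight gap. The 800-tweet window is the limit described above; the tweet rates and hours offline are made-up figures, and the function names are hypothetical.

```python
def tweets_lost(tweets_per_hour, hours_offline, window=800):
    """Tweets that arrive while offline and fall outside the retrievable window."""
    arrived = tweets_per_hour * hours_offline
    return max(0, arrived - window)

def max_safe_hours_offline(tweets_per_hour, window=800):
    """Longest gap that can still be fully backfilled from the home timeline."""
    return window / tweets_per_hour
```

For a timeline that averages 60 tweets an hour, `tweets_lost(60, 16)` returns 160 for a 16-hour overnight gap, and `max_safe_hours_offline(50)` returns 16.0 for a slightly quieter one.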
On weekends, you could miss hundreds or even thousands of tweets, with no way to get them except to go through all of the users you follow one by one and download every tweet you’re missing. Doing this would quickly hit the “rate limit” on Twitter and the API would refuse your requests for more data. This makes it essentially impossible to get any real amount of past data if you’re following active tweeters.
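The same kind of rough arithmetic shows why the user-by-user approach runs into the rate limit. In this sketch the page size of 200 tweets per request and the hourly request allowance are assumptions passed in as parameters, not figures from our testing.

```python
import math

def requests_needed(missed_per_user, page_size=200):
    """One request per page of missed tweets, and at least one per user to check."""
    return sum(max(1, math.ceil(missed / page_size)) for missed in missed_per_user)

def hours_to_backfill(missed_per_user, requests_per_hour, page_size=200):
    """Time spent waiting on the rate limit alone, ignoring network delays."""
    return requests_needed(missed_per_user, page_size) / requests_per_hour
```

If you follow 700 accounts that each tweeted 10 times while you were away, and assume for illustration an allowance of 350 requests an hour, `hours_to_backfill([10] * 700, 350)` returns 2.0: two hours of requests during which the application can do nothing else with the REST API.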
Another problem we had with the streams is that sometimes tweets come in very, very quickly. One of the options you have at your disposal when opening a stream is specifying up to 1,000 search terms (things like usernames or hashtags) to monitor. This is incredibly useful for following events or hashtags relevant to your job, even if they appear outside your timeline. Unfortunately, if you follow popular terms (like #olympics) in our prototype, the stream of tweets comes in so fast that the application can’t keep up in its present state.
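One common way to cope with bursts like this is to buffer incoming tweets and write them to the database in batches rather than one at a time. The sketch below assumes per-tweet database writes are the bottleneck, which we have not measured in the prototype. It is written in Python with `sqlite3`, and the `BatchWriter` class and `tweets` table are hypothetical.

```python
import sqlite3

class BatchWriter:
    """Buffer incoming tweets and write them to SQLite one batch per transaction."""

    def __init__(self, db, batch_size=500):
        self.db = db
        self.batch_size = batch_size
        self.buffer = []
        db.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, body TEXT)")

    def add(self, tweet_id, body):
        self.buffer.append((tweet_id, body))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # "with db" commits once for the whole batch and rolls back on error.
        with self.db:
            self.db.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?)", self.buffer)
        self.buffer = []
```

A real application would also need to call `flush()` on a timer and at shutdown so a partly filled buffer is not lost.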
Lessons learned
This left us up a creek.
The prototype works fine when connected to the Internet, happily opening a streaming connection and saving tweets to a database as they arrive, as long as it’s not overwhelmed. But as soon as you close your laptop or quit the application, you risk missing a lot of data that you simply can’t get within Twitter’s existing framework.
Throughout the process, it became clear that a server-based solution would have been a better path. But rather than us hosting it, it would have to be something reporters could install onto their own servers (perhaps using something like an EC2 disk image) and run for their newsroom. Because it’d be operating on a server, you could expect it to almost always be running and connected to the Internet. You could also better separate the display code and the database code so the application would continue to function while downloading many tweets a second.
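The separation described above can be sketched as a queue between the code that receives tweets and the code that stores them. This is an illustration in Python, not a design we have built: the function names, the `tweets` table and the queue size are all hypothetical.

```python
import queue
import sqlite3
import threading

STOP = object()   # sentinel telling the writer thread to finish

def writer(path, inbox):
    """Runs in its own thread and drains the queue into SQLite."""
    db = sqlite3.connect(path)   # the writer thread owns its own connection
    db.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, body TEXT)")
    while True:
        item = inbox.get()       # blocks until the receiver hands over a tweet
        if item is STOP:
            break
        with db:
            db.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?)", item)
    db.close()

def start_writer(path):
    """Start the storage thread and return the queue that feeds it."""
    inbox = queue.Queue(maxsize=10000)   # bounded, so memory use stays capped
    thread = threading.Thread(target=writer, args=(path, inbox), daemon=True)
    thread.start()
    return inbox, thread
```

The receiving side only ever calls `inbox.put((tweet_id, body))`, so a slow disk or a busy display never stalls the connection to Twitter unless the queue fills up completely.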
Twitter also provides an API newsrooms would be smart to look at. “Site streams” allow you to open a stream for hundreds of users at once. This means rather than getting the home timeline for just one logged-in user, you can get the timelines for up to 1,000 users. Add that to monitoring a number of terms and your entire newsroom can rely on one server and one stream to get all of the tweets reporters could ever want. Unfortunately, site streams are still in beta and have been for quite some time.
This was a good learning experience for us. It basically showed why something like this doesn’t already exist and let us explore dealing with APIs reporters may be interested in. Hopefully someday, someone will take a crack at building a server-based solution that doesn’t cost an arm and a leg and that newsrooms can use.
