The tip wasn’t great to begin with, but its supporting evidence was thorough.
Thanks to his source, T. Christian Miller found himself in possession of an organization’s entire email archive. Hidden inside tens of thousands of emails, the source said, was a story worth uncovering. All Miller needed was time.
Even for a seasoned investigative reporter at ProPublica, that’s a resource in short supply — and this story wasn’t worth it. Miller shelved the project, which would have required days of sifting through mostly irrelevant material to find what he wanted. It wasn’t the first tip to miss the cut, and it certainly won’t be the last.
“There are lots of stories left untold in big huge dumps of PDF files and email files because reporters don’t have the tools to go through [them],” he said in a phone interview in early July.
That problem was far from Miller’s mind while attending a poetry course at Stanford on a Knight Journalism Fellowship. But when the class sat in on a demonstration of a new piece of software helping librarians analyze the email archive of modern poet Robert Creeley, something just clicked.
“Suddenly, the lightbulb goes off in my head and I think, ‘This could be an incredible tool for investigative reporters,’” Miller said. “It was just this moment of synergy between two completely different ideas.”
That chance encounter prompted Miller to contact Stanford doctoral candidate Sudheendra Hangal, who created the downloadable, browser-based tool called Muse to analyze email archives. Although he didn’t set out to help reporters, Hangal quickly learned his original idea could have a much broader impact. He even gave a presentation at the Investigative Reporters and Editors Conference to connect more journalists with his work.
“By lucky coincidence, it happens there’s a lot of need for people to look at other people’s correspondence as well,” Hangal said. “Journalists are an obvious example.”
An information treasure trove
Give it a try
- Download Muse
Because early users were wary about uploading their email archives to external servers, Muse was designed to run on users’ personal computers.
- Load your email
Sign in using your Gmail, Yahoo, Hotmail or Stanford account to begin loading your email into Muse. You can also load up an email file in mbox format by clicking “Other email accounts.”
- Find interesting patterns
Examine graphs of sentiment, groups or individuals over time. Spikes in the graph may indicate potentially interesting email messages.
- Skim and locate emails
Click on any point in the graph to pull up the text of the corresponding emails. Use a jog dial or the arrow keys to navigate through each message.
Email can be a valuable component of any document-driven story, as long as reporters can unlock the information inside.
When Miller requests email records through the FOIA process, he typically receives a huge, heavily redacted paper file. He can scan it and use optical character recognition to make searching for keywords easier, but these tools aren’t able to capitalize on the additional layers of information email can provide.
“You read them for content and I think you miss a lot of other data inherent in an email file,” Miller said. “You can’t capture that header information. You can’t capture chronology.”
These headers can help investigative reporters put together a timeline of events or build a social network map to better understand a story.
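To make the point about headers concrete, here is a minimal sketch in Python of pulling a timeline and a who-wrote-to-whom count out of header fields alone. It is not how Muse works internally; the sample messages, the addresses and the `header_summary` function are all made up for illustration, and only Python's standard `email` module is used.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parseaddr, parsedate_to_datetime

# Two invented messages standing in for an archive (hypothetical data).
RAW_MESSAGES = [
    "From: Alice <alice@example.org>\n"
    "To: Bob <bob@example.org>, carol@example.org\n"
    "Date: Tue, 05 Jul 2011 09:30:00 +0000\n"
    "Subject: Budget draft\n\nSee attached.\n",
    "From: bob@example.org\n"
    "To: alice@example.org\n"
    "Date: Mon, 04 Jul 2011 17:00:00 +0000\n"
    "Subject: Quick question\n\nAre we on track?\n",
]


def header_summary(raw_messages):
    """Return (timeline, edges) using only From, To, Cc, Date and Subject."""
    timeline = []      # (sent time, sender, subject), later sorted by time
    edges = Counter()  # (sender, recipient) -> number of messages
    for raw in raw_messages:
        msg = message_from_string(raw)
        sent = parsedate_to_datetime(msg["Date"])
        sender = parseaddr(msg.get("From", ""))[1].lower()
        fields = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = [addr.lower() for _, addr in getaddresses(fields)]
        timeline.append((sent, sender, msg.get("Subject", "")))
        for recipient in recipients:
            edges[(sender, recipient)] += 1
    timeline.sort(key=lambda entry: entry[0])
    return timeline, edges


timeline, edges = header_summary(RAW_MESSAGES)
for sent, sender, subject in timeline:
    print(sent.isoformat(), sender, subject)
for (sender, recipient), count in edges.items():
    print(sender, "->", recipient, count)
```

The sorted list is the chronology Miller describes, and the sender-recipient counts are the raw material for a network map. None of it survives when email arrives as scanned paper.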
And while reading through each email in an archive may be possible with smaller dumps, very few reporters have that option with massive collections.
“You cannot get a sense of what a conversation is about if you’re faced with an archive that’s maybe 20,000,” Hangal said.
Hangal designed Muse with these problems in mind, although he was inspired not by email, but by his great-grandfather-in-law’s 100-year-old diaries.
“I was just looking at how much they reflect about the time and how interesting it is to look back at history in a very personal way,” Hangal said. “I was thinking about what this meant for our great-grandchildren, who might someday want to understand life in the early 21st century. It would be great if we could bequeath to them some of our daily correspondence or communication.”
To make sense of that daily correspondence, Hangal built Muse to automatically analyze patterns. After downloading and processing your email archive, Muse extracts names and email addresses and uses its algorithms to cluster contacts, graphing the volume of the conversation. Users can refine the groups or view the graph by individual email addresses. These patterns are represented in a color-coded stacked graph, which users can click to pull up corresponding emails. Using arrow keys or a jog dial, users can quickly skim through sections of email to find what prompted interesting changes.
“You often get a clue because there’s often a spike in communication, either through the entire corpus or with a particular person, because there’s a particular event happening,” Hangal said. “Any deviation from the normal tends to be interesting.”
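The idea of a "deviation from the normal" can be illustrated with a small Python sketch. The article does not say how Muse flags spikes, so the rule used here (a month counts as a spike when its volume exceeds twice the median month) is purely an assumption, as are the function names and the sample dates.

```python
from collections import Counter
from datetime import date
from statistics import median


def monthly_volume(dates):
    """Count messages per (year, month), in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))


def find_spikes(volume, factor=2.0):
    """Return months whose count exceeds `factor` times the median month.

    Assumed rule for illustration only. Months with no messages at all
    are not in `volume`, so they do not pull the median down.
    """
    baseline = median(volume.values())
    return [month for month, n in volume.items() if n > factor * baseline]


# Invented send dates: a quiet spring with a burst of messages in May.
SAMPLE_DATES = (
    [date(2011, 3, day) for day in range(1, 5)]     # 4 messages
    + [date(2011, 4, day) for day in range(1, 6)]   # 5 messages
    + [date(2011, 5, day) for day in range(1, 21)]  # 20 messages
    + [date(2011, 6, day) for day in range(1, 4)]   # 3 messages
)

volume = monthly_volume(SAMPLE_DATES)
print(find_spikes(volume))  # [(2011, 5)]
```

Running the same count for a single correspondent instead of the whole archive gives the per-person view Hangal mentions.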
Muse also performs sentiment analysis on email text in an attempt to track patterns in emotion.
“That’s another way you can enter messages you think might be reflecting confidential information, might be reflecting exuberance, joy — whatever it is. And you can customize these word lists so you can put in whatever you’re interested in,” Hangal said.
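A customizable word list of the kind Hangal describes can be sketched in a few lines of Python. This is a simplified stand-in, not Muse's actual sentiment analysis: the category names, the words in each list and the `tag_message` function are hypothetical, and the matching is plain whole-word lookup.

```python
import re

# Hypothetical categories and word lists; a reporter would substitute
# terms that matter for the story at hand.
LEXICON = {
    "confidential": {"confidential", "secret", "privileged"},
    "joy": {"thrilled", "delighted", "congratulations"},
}


def tag_message(text, lexicon=LEXICON):
    """Return the sorted categories whose word list appears in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(cat for cat, terms in lexicon.items() if words & terms)


print(tag_message("Please keep this CONFIDENTIAL. I'm thrilled!"))
# ['confidential', 'joy']
print(tag_message("Lunch at noon?"))
# []
```

Counting how many messages fall into each category per month would give the kind of sentiment-over-time graph described in the sidebar.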
Muse even narrows down the most discussed topics month by month, surfacing themes that users could not find with a keyword search alone.
One of Muse’s major limitations right now is that it can only process archives in digital email formats, such as mbox or personal storage table (.pst) files. Responses to public records requests rarely arrive in those formats.
“At least for the five or so dumps I’ve seen, every dump has been different,” Hangal said. “We need to build a suite of tools that can do a data cleaning pass or data wrangling pass with the input.”
He’s also hoping his dialogue with journalists will help encourage them to request more email records in these electronic formats where possible. That lesson hit home for one journalist at the IRE conference, who spoke with Hangal after his presentation.
“Her attitude was, ‘Hey, I’m going to have to sit and read the whole thing anyway. What difference does it make whether it’s PDF or anything else?’” he said. “There’s a lot more you can do with the data when it’s in these proper formats.”
Right now, Muse and other projects associated with Stanford’s social computing initiative are funded by a National Science Foundation grant. Hangal’s on the hunt for additional funding sources that could help him build a production-quality version of Muse or integrate it into existing tools like DocumentCloud.
Over the summer, Muse got a boost from Stanford’s library, which funded a couple of developers to expand its capabilities for the archivists. With continued support like that, Hangal said he’d love to see his software make an impact far beyond its original intent.
“Muse is primarily my baby at this point. It’s been just one developer over one or two years,” Hangal said. “But if we put a team behind it, even two or three developers, [we could] continue to build out Muse and have it used in more situations — help journalists, archivists, whoever else.”