
‘Hypothesis generator’ helps mine big data

Mining for relationships between pairs of variables in large databases can take time and statistical expertise.

A tool created through a collaboration with Harvard and MIT could soon help journalists find relationships in massive amounts of data — even if they don’t know what they’re looking for.

The MINE application, whose name stands for maximal information-based nonparametric exploration, uses a series of statistical calculations to test the strength of the relationship between every pair of variables. Input a spreadsheet, and out comes a score describing how strongly each pair is connected. Such a “hypothesis generator,” as its creators have called it, could help journalists make sense of massive data sets by pointing out links they wouldn’t otherwise have discovered, paving the way for follow-up questions.

“It’s an attempt to capture relationships in data without having to know what those relationships look like,” said David Reshef, one of the lead authors of a paper describing the application published in the Dec. 16 edition of Science.

While it’s only licensed for research at the moment, Reshef said they’re working to expand the license in a few months. That would enable journalists to put it to use.

‘Casting a wide net’

Discovering correlations between two different factors isn’t difficult when you only have a few of them. Suppose you suspect two variables — let’s say IQ level and test scores — may be related in some way. By plotting them on a simple grid, you can see evidence of a trend: the higher the IQ, the higher the test score. You can quantify that pattern using techniques like linear regression, calculating a trend line based on a “best fit” of the data points.
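For readers who want to see what that looks like in practice, here is a minimal Python sketch of a best-fit trend line. The IQ and test-score numbers are made up purely for illustration, and the variable names are hypothetical; only the NumPy calls are real.

```python
import numpy as np

# Hypothetical, made-up numbers for illustration only (not real data).
iq = np.array([85, 90, 100, 105, 110, 120, 130], dtype=float)
score = np.array([60, 64, 70, 75, 78, 85, 93], dtype=float)

# Least-squares straight line: score is approximately slope * iq + intercept.
slope, intercept = np.polyfit(iq, score, deg=1)

# Pearson's r measures how closely the points follow a straight line (-1 to 1).
r = np.corrcoef(iq, score)[0, 1]

print(f"trend line: score = {slope:.2f} * iq + {intercept:.2f}")
print(f"Pearson r = {r:.3f}")
```

A positive slope and an r close to 1 would indicate the upward trend described above. This only measures straight-line relationships, which is the limitation the rest of the article turns on.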

School data

To see how the tool works in practice, I dipped into a sample dataset (log-in required) from Investigative Reporters & Editors containing elementary school demographic and performance figures. Because the tool can only handle numerical information, the spreadsheet effectively had 14 different variables (columns) for each school, ranging from number of students to average teacher salary.

Using MIC, the three pairs with the strongest relationships were:

  • Advanced degrees to number of students
  • Number of teachers to poverty rate
  • Spending per student to English as second language

When testing the relationship strength according to the more traditional linear regression, which the tool also calculates, the third pair wasn’t ranked very high.

Are these relationships meaningful? What do they look like? That part’s up to reporters to answer.

But if you have hundreds of different factors spread across the columns of a spreadsheet and no idea how they’re all connected, analysis is a lot more complicated.
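To give a rough sense of scale, every pair of columns is a candidate relationship, so the number of comparisons grows roughly with the square of the number of columns. The short Python sketch below counts the pairs; the column names are hypothetical placeholders.

```python
from itertools import combinations
from math import comb


def count_pairs(n_columns: int) -> int:
    """Number of distinct variable pairs among n_columns columns."""
    return comb(n_columns, 2)


# Hypothetical column names standing in for a 14-column spreadsheet.
columns = [f"var_{i}" for i in range(14)]
pairs = list(combinations(columns, 2))
print(len(pairs), "pairs for", len(columns), "columns")

# The pair count climbs quickly as columns are added.
for n in (14, 100, 500):
    print(n, "columns ->", count_pairs(n), "pairs")
```

Checking each of those pairs by eye on a scatterplot stops being practical long before a spreadsheet reaches hundreds of columns.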

“There are a lot of different tests for finding these patterns,” Reshef said. “We were interested in a test that could capture as many as possible.”

Using a technique called the maximal information coefficient, or MIC, the tool Reshef and his brother Yakir created projects every variable pair onto a grid, then mathematically chops that grid up into more and more pieces to make any existing relationships stand out.

“If there’s a pattern present between two variables in your dataset, you should be able to take that scatterplot and be able to draw a grid on it that shows the pattern,” Yakir Reshef said.

The strongest relationship determines the MIC score, on a scale between zero and one. Ranking the variable pairs in this way means MIC is pattern agnostic — it doesn’t care how two factors are related or what that relationship looks like, just how strongly they’re connected.
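The grid idea can be sketched in a few lines of Python. To be clear, this is not the Reshefs' algorithm or the MINE tool's code: it is a simplified stand-in that only tries evenly spaced grids up to an assumed maximum size, whereas the published method also searches over where to place the grid lines. The function name `grid_score` and the `max_bins` parameter are made up for this illustration, and its scores will generally differ from the tool's.

```python
import numpy as np


def grid_score(x, y, max_bins=5):
    """Simplified, MIC-like score between 0 and 1 (NOT the published MIC).

    For each evenly spaced nx-by-ny grid, estimate the mutual information
    of the binned points, normalize it by log(min(nx, ny)), and keep the
    largest value found.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = 0.0
    for nx in range(2, max_bins + 1):
        for ny in range(2, max_bins + 1):
            # Count how many points land in each cell of the grid.
            counts, _, _ = np.histogram2d(x, y, bins=(nx, ny))
            p = counts / counts.sum()            # joint cell probabilities
            px = p.sum(axis=1, keepdims=True)    # marginal over x bins
            py = p.sum(axis=0, keepdims=True)    # marginal over y bins
            nz = p > 0                           # skip empty cells (0 * log 0)
            mi = np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz]))
            best = max(best, mi / np.log(min(nx, ny)))
    return best


# Synthetic data: a U-shaped relationship and pure noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
parabola = x ** 2
noise = rng.uniform(-1, 1, 2000)

print("grid score, parabola:", grid_score(x, parabola))
print("Pearson r, parabola: ", np.corrcoef(x, parabola)[0, 1])
print("grid score, noise:   ", grid_score(x, noise))
```

The point of the comparison is that a straight-line measure like Pearson's r can sit near zero for a U-shaped pattern that a grid-based score still registers, while unrelated noise scores low on both.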

“Using a more general method like this, you’re casting a wider net,” David Reshef said. “You don’t have to know the pattern to find it.”

Although that can uncover relationships you might never have thought to look for, it does have downsides.

“If you’re running something based on a specific model, an advantage is that the number that comes out is more easily interpretable,” Yakir Reshef said.

The MINE tool does, however, offer a few additional calculations to give users more information on the nature of the connection between two variables. MAS, a measure of non-monotonicity, describes how much a relationship between two variables fluctuates up and down. There’s also a complexity rating, which tells you how large a grid was needed to calculate the relationship strength, in effect how difficult the relationship is to describe.

“If your pattern can be captured by a 2×2 grid, it’s not that complex. But if it has to be captured by a 5×5 grid, there’s something more subtle going on,” Yakir Reshef said.

The MINE tool even ranks relationships based on linear regression — the best-fit trend line of the data — for comparison.

Correlation, not causation

But even with all that information, there’s still an important caveat to remember: Even the strongest relationships, calculated mathematically, may not mean much in reality.

“Correlation does not equal causation,” David Reshef said. “This thing is a hypothesis generator in a sense that it discovers relationships with data. It does not imply causality.”

The data may show, for example, that IQ is a great predictor of test scores. That’s a correlation. But from just those two variables, it’s almost impossible to tell whether high IQ causes high test scores, since any number of other factors could turn out to have a more direct influence on testing performance.

If it’s detailed enough, data can sometimes help a little here — accounting for other factors might reveal identical trends. But it’s reporting that will really add context to the numbers and relationships revealed by this tool.
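A small simulation shows how a hidden third factor can manufacture a strong correlation. Everything here is synthetic: `hidden`, `a` and `b` are hypothetical variables, and removing the hidden factor with a straight-line fit is just one simple way to account for it, not something the MINE tool is described as doing.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# A hypothetical hidden factor drives both a and b; neither causes the other.
hidden = rng.normal(size=n)
a = hidden + 0.5 * rng.normal(size=n)
b = hidden + 0.5 * rng.normal(size=n)

raw_r = np.corrcoef(a, b)[0, 1]


def residuals(y, x):
    """What is left of y after subtracting its best-fit line on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Correlate a and b again after removing the part explained by the hidden factor.
adjusted_r = np.corrcoef(residuals(a, hidden), residuals(b, hidden))[0, 1]

print(f"correlation of a and b:            {raw_r:.2f}")
print(f"after accounting for hidden factor: {adjusted_r:.2f}")
```

In real reporting the hidden factor is usually not sitting in the spreadsheet, which is why the numbers alone cannot settle the question.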

That’s why the MINE application has such great potential for journalism. Using powerful analysis, it can help us focus our attention and our questions on links that would otherwise take too long to find, if we managed to find them at all.

After all, discovering these relationships is the easy part. What’s really hard is revealing to our audiences why they exist in the first place.

About Tyler Dukes

Tyler Dukes is the managing editor for Reporters' Lab, a project through Duke University's DeWitt Wallace Center for Media and Democracy. Follow him on Twitter as @mtdukes.