Uncovering Hidden Data: 5 Ways to Take Data Mining to the Next Level
Somewhere between readily available datasets and complex encrypted code lies a wealth of information that’s not immediately evident. Let’s call it “hidden data.” This information isn’t necessarily hidden on purpose or even hard to find, but it can be extremely valuable. With a little bit of searching, you can find illuminating comments, exact search results and even additional data sets. Here is a list of simple steps that can take your data sleuthing to the next level.
Behind every webpage is code, sometimes bearing juicy tidbits just below the surface. The “inspect element” option (right-click > “Inspect Element” in Chrome) lets you look directly at a page’s underlying code. Why look at code? Programmers often leave notes to themselves in HTML comments—which show up in green text in Chrome’s inspector—along with other illuminating background data such as image file names. For example, Pete Hoekstra’s controversial Super Bowl ad gave even more fire to those who considered it racist when a Twitter user discovered, by reading the page’s code, that the Chinese actress’s image file was named “yellowgirl.”
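This kind of inspection can also be automated. Here’s a minimal sketch using Python’s standard-library HTML parser to pull developer comments and image attributes out of a page’s markup; the sample HTML below is invented for illustration.

```python
from html.parser import HTMLParser

class CommentAndImageScanner(HTMLParser):
    """Collects HTML comments and image attributes from a page's markup."""
    def __init__(self):
        super().__init__()
        self.comments = []
        self.images = []

    def handle_comment(self, data):
        # Notes programmers leave to themselves live in <!-- ... --> comments.
        self.comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        # Image file names, alt text and titles can be revealing, too.
        if tag == "img":
            attrs = dict(attrs)
            self.images.append({k: attrs.get(k) for k in ("src", "alt", "title")})

# Hand-made sample markup; in practice you'd feed in a downloaded page.
sample = """
<html><body>
<!-- TODO: replace placeholder headshot before launch -->
<img src="images/yellowgirl.jpg" alt="actress on bicycle">
</body></html>
"""

scanner = CommentAndImageScanner()
scanner.feed(sample)
print(scanner.comments)  # developer notes left in the markup
print(scanner.images)    # image filenames and attributes
```

Run against a saved copy of a page, the same scanner surfaces every comment and image name at once instead of one inspector click at a time.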
Ever tried to look for an old post on Facebook or Twitter by scrolling? Forget it. Finding past posts that way is time-consuming, and they are easy to miss. Using a site’s API, or Application Programming Interface, you can instead query its information within certain parameters, including keywords and dates. Many social networks and data sets have APIs that make searching them relatively easy: APIs are what allow apps like TweetDeck to sift through Twitter, and they are what Visual.ly uses to generate social media data visualizations. Use the Facebook Graph API to find posts from particular people, days, topics and more, and you’ll never again have to deal with the rage of continually clicking on “more stories.” We recently published a list of APIs and other open data sources to get you started, and The New York Times features another excellent list of APIs to jumpstart any data journalism query.
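Under the hood, most API queries are just URLs with search parameters attached, and most APIs answer in JSON. The sketch below shows the general shape; the endpoint, parameter names and response are hypothetical, for illustration only—check each API’s own documentation for the real ones.

```python
import json
from urllib.parse import urlencode

def build_search_url(base, **params):
    """Assemble an API query URL from keyword parameters."""
    return base + "?" + urlencode(params)

# Hypothetical endpoint and parameter names, not a real API.
url = build_search_url(
    "https://api.example.com/search",
    q="school budget",
    since="2012-01-01",
    until="2012-03-01",
)

# APIs typically return JSON, which maps directly onto Python objects.
sample_response = '{"results": [{"user": "reporter1", "text": "Budget vote tonight"}]}'
posts = json.loads(sample_response)["results"]
print(url)
print(posts[0]["text"])
```

Once you can build a query like this, restricting a search to a keyword and a date range is a one-line change rather than an afternoon of scrolling.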
While it’s tempting to convert data files to the simple CSV format, the native Excel formats (XLS or XLSX) can sometimes be more useful. Not only can you sort the data in more nuanced ways, you could also find hidden data. Comments embedded in Excel cells give insight into the thinking behind the numbers, and sometimes these documents contain hidden rows of data the source may not have intended you to see. Find them by selecting the whole data set, right-clicking a row header and choosing “Unhide.”
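Those hidden rows are visible in the file format itself: an .xlsx file is a zip archive of XML, and each worksheet (xl/worksheets/sheetN.xml inside the archive) marks hidden rows with a hidden="1" attribute. Here’s a minimal sketch with Python’s standard library, run against a hand-made worksheet fragment for illustration:

```python
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used inside .xlsx worksheet XML.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

# Hand-made fragment; in practice you'd read this from the zip archive
# with zipfile, e.g. zipfile.ZipFile("file.xlsx").read("xl/worksheets/sheet1.xml").
sheet_xml = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1"/>
    <row r="2" hidden="1"/>
    <row r="3"/>
  </sheetData>
</worksheet>"""

root = ET.fromstring(sheet_xml)
hidden_rows = [row.get("r") for row in root.iter(NS + "row")
               if row.get("hidden") == "1"]
print(hidden_rows)  # row numbers the source hid before sharing the file
```

The same zip-of-XML trick works for spotting hidden sheets and cell comments, since they all live as separate XML parts inside the archive.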
As part of The Wall Street Journal’s “What They Know” series, data journalists learned that a company called RapLeaf was installing cookies on common websites to gather in-depth information on users, from income to interests. To figure out what RapLeaf knew, the WSJ had to crack the code. Government data need not be as fraught. If an agency’s information is coded, the agency is responsible for providing a key or codebook, which you can FOIA alongside the data itself. (Just keep in mind you’ll have to be very, very patient: FOIA requests can take many months to complete. The Reporters Committee for Freedom of the Press offers FOIA letter forms you can use to generate and send a request to any federal or state government agency.)
ScraperWiki is an amazing tool for finding data, but it requires a bit more programming knowledge than the suggestions above. If you are familiar with Python, PHP or Ruby, however, you’re in luck. ScraperWiki allows data journalists to collect, or scrape, entire sets of data that might otherwise only be available piecemeal. At a recent U.S. Journalism Data Camp held at The Washington Post, coders and reporters alike “liberated” disparate data, from Polk County, Iowa mugshots to test scores at a school district outside of Washington, D.C.

Rani Molla is a digital media master’s student at Columbia Journalism School. She’s a journalism reader, writer, photographer, videographer, data visualizer and general doer.