As the Big Data beast fattens, will privacy and ethics get gobbled up?
The Michigan Theater is at 603 East Liberty St. in Ann Arbor. Athletes Tom Brady and Cal Ripken have the same body mass index, 27—lower than Dr. Phil’s but higher than Abraham Lincoln’s. Austria’s fertility rate peaked in 1963 and has been falling steadily ever since. Q Lending Inc., of Coral Gables, Florida, received the smallest bailout from the TARP program, at $10,000.

I’m sure you found all of these as fascinating as I did, undoubtedly also wondering where this was going. These facts and a few gazillion others come to you courtesy of Factual, the brainchild of mathematician Gilad Elbaz, who gave us the company that is now Google’s AdSense. In Factual’s 500 terabytes of storage, there’s data from sources governmental and private, on topics broad and narrow, profound and trivial. It’s worth a wander through the website and its featured data sets to see just what it’s been vacuuming up.

A feature article in the March 24 New York Times tells us the company’s plan is “to build the world’s chief reference point for thousands of interconnected supercomputing clouds,” and goes on to describe Factual’s clientele and how they use the product. It also names a few competitors, including Infochimps, Gnip, and of course Wolfram Alpha, which partially powers Siri. Factual, by the way, is hiring; its “data specialist” jobs sound more than a little familiar, even if the page describing them lists 2010–2011 internship opportunities. Oops. I guess bad data can creep in everywhere.

This came hard on the heels of the announcement that the Statistical Abstract of the United States had been saved at the last moment by ProQuest. I’m glad of that; it seemed a shame that the government no longer felt it was worth publishing. I should be clear: I’ve never been a fan of the Abstract. (I’m a World Almanac sort of guy.) While its various elements are valuable and come in handy, the way in which it was organized—particularly the index that gave table numbers rather than pages—seemed stubbornly user-hostile to me. And the web version, consisting of large PDF slabs of tables, has gone from understandably simple to gratingly low-tech. Adding Excel versions was nice, though the whole thing still comes off as antediluvian.

Maybe ProQuest will attend to these shortcomings. In any event, Factual and the Abstract make for a sharp and illustrative counterpoint. One way of thinking about compiling Lots of Data is to organize it by category (which perhaps yields some context and texture) and add some metadata and a search mechanism, all in the service of providing access, so individual people can find a specific fact or set of facts in answer to a question.

Another way, only now feasible, is to mush it all together and see what can be learned. Not by an individual, necessarily, but rather by throwing tons of computing power at it to see what emerges. Both are attempts to somehow wrap our arms and minds around the vertiginous scope and complexity of data being generated and stored every second.

The name “Big Data” gets thrown around a lot, to denote this massive-data-conglomeration phenomenon. We’re told this will be an opportunity for information-focused people to collect, curate, manage, organize. All likely true, and all worth pursuing as extensions of work we’re familiar with.

Go one step further, though. How about professionals who work to humanize this field? Those who think about questions of privacy, authority, quality, authenticity, rationality, and ethicality. Who center these processes in efforts to better the human condition and the lives of individuals. Who build tools to gyre and gimble in the taffeta of data to find just the right thread for a person in need. Somebody like, I don’t know, a reference librarian . . . but that’s another story.

JOE JANES is associate professor at the Information School of the University of Washington.