You Can’t Cook a Cow: The Problem with Raw Data

Bill Hooker is a regular advocate of “open science,” and is currently supporting a new subversive proposal: to make all raw data freely available under some sort of Creative Commons-style license.

It sounds like a perfectly reasonable idea on the face of it, but I have to say, I’m a little dubious about it when I read things like this:

First, note that papers do not usually contain raw (useful, useable) data. They contain, say, graphs made from such data, or bitmapped images of it — as Peter says, the paper offers hamburger when what we want is the original cow.  Chris Surridge of PLoS puts it this way:

A figure in a paper is a way of representing the raw data in such a way to best illustrate the point the author is making. A figure then is the product of an operation upon the raw data, and that operation results in a loss of information.

The raw data could have been presented in a host of different ways possibly supporting other conclusions not thought of by the author. Equally if a reader had raw data compatible with that the author obtained wouldn’t it be useful if it could be processed in the same way for comparison? Wouldn’t it be much better for readers to have access not only to the figures in a paper but also to the underlying data and the transform that created it. In this way no information, neither implicit nor explicit, is lost.

So if authors want to make their data openly and usefully available, they will need to host it themselves or find someone to host it for them.  Many journals will host supplementary information, and many institutional repositories will take datasets as well as manuscripts.  I have been saying for some time that it should by now be de rigueur to make one’s raw data available with each publication. This is very rarely done — even supplementary information, when I have come across it, tends to be of the hamburger-rather-than-cow variety and so not very useful.  (The situation speaks sad volumes about the emphasis on competition over cooperation within the scientific community and, perhaps in many cases, about the quality of the raw data in question, if only one were ever able to see it; but I digress.)

It may be a discipline-specific thing (Bill’s background is in the life sciences), but when I read that, my first thought is “These people have never looked at raw data.” At least in physics, we don’t generally present raw data in papers for the same reason that you can buy hamburger in grocery stores, but can’t get a live cow: you can cook hamburger in your kitchen, but you can’t do much with a cow.

I mean, to take an extreme example, look at particle and nuclear physics. If you want access to the “raw data,” first you need to specify just how raw you’d like it. The lowest level, straight-from-the-detector raw data is nothing but a collection of time series of voltage signals from the thousands of little detectors that make up the drift chambers and calorimeters and the rest. Even the people who run the experiments hardly ever look at this stuff, because it’s just not useful in that form.
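
For concreteness, here’s a rough sketch (in Python, with every channel number and value invented purely for illustration) of what that lowest-level data amounts to: per-channel digitized pulses, with no physics attached to them at all.

```python
# Hypothetical sketch of "straight-from-the-detector" data.  Channel IDs,
# timestamps, and ADC counts are invented; a real DAQ system has its own
# binary format and calibration constants.
from dataclasses import dataclass

@dataclass
class RawHit:
    channel_id: int      # which wire / crystal / phototube fired
    timestamp_ns: float  # when the pulse crossed threshold
    adc_counts: int      # digitized pulse height (uncalibrated voltage)

# One beam crossing produces a pile of these, most of them noise:
event_fragment = [
    RawHit(channel_id=48213, timestamp_ns=112.4, adc_counts=37),
    RawHit(channel_id=48214, timestamp_ns=113.1, adc_counts=41),
    RawHit(channel_id=91002, timestamp_ns=98.7, adc_counts=12),
    # ...thousands more per event, millions of events per run
]
```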

The next level of “raw” data in particle and nuclear experiments is a set of reconstructed particle tracks. This is generated more or less automatically by software routines that stitch together the pulses from the individual components to follow a particle through the detector. They take signals that say “this wire saw a pulse at this time” and convert them to “a charged particle passed through the chamber along this path.” This is the stuff that most nuclear and particle people actually deal with.
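
If you want a cartoon of what “stitching together the pulses” means, here’s a toy version (real tracking code has to deal with magnetic fields, drift-time calibration, and pattern recognition across many detector elements; the basic move, though, is turning a set of rough hit positions into a path through space):

```python
# Toy illustration only; not any experiment's actual reconstruction code.
import numpy as np

def reconstruct_track(hits):
    """Fit a straight line through a set of (z, x) hit positions.

    `hits` are "this wire saw a pulse at this time" signals already
    converted to rough positions; the output describes "a charged
    particle passed through the chamber along this path."
    """
    z = np.array([h[0] for h in hits])
    x = np.array([h[1] for h in hits])
    slope, intercept = np.polyfit(z, x, 1)  # least-squares straight line
    return slope, intercept

# Four hits in successive chamber planes (all values invented):
print(reconstruct_track([(10.0, 0.12), (20.0, 0.25), (30.0, 0.36), (40.0, 0.50)]))
```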

At that level, you’re still talking about terabytes of data and billions of events, the vast majority of which show nothing interesting. In order to actually extract an interesting signal, they write computer codes to sift through the data and pull out those events that look like they might contain the reaction of interest: the ones where the right collection of particles came out of the target after the collision. That gets you down to probably thousands of events, which you can then sort by energy, location within the detector, and so on, to generate the plots and graphs that go into the actual paper.
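
A cartoon of that sifting step, with made-up variable names and cut values, might look something like this:

```python
# Toy event selection; a real analysis applies dozens of cuts, tuned
# against simulation.  All names and thresholds here are invented.
def looks_like_signal(event):
    """Keep events with the right collection of particles coming out."""
    return (
        event["n_electrons"] >= 1
        and event["n_jets"] >= 2
        and event["missing_et_gev"] > 20.0  # sign of an escaping neutrino
    )

def select(events):
    # Billions of recorded events go in; thousands of candidates come out,
    # which then get binned by energy, detector region, etc. for the figures.
    return [e for e in events if looks_like_signal(e)]
```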

So, when you say you’d like to see the raw data, what level of “raw” do you want? Should the D0 collaboration make available the particle track data for all of the 60-odd events that represent single top quark production? Or the thousands of events that were possible single top events, but turned out to have the wrong energy? Or the billions of events that were interesting enough to record in the first place? Do you want particle tracks, or raw detector signals?

An absolute statement like:

A figure then is the product of an operation upon the raw data, and that operation results in a loss of information.

sounds hopelessly naive to me. (To be fair, so do all statements of the form “Information wants to be free,” so it’s not a big surprise…) Yes, it’s true that more physical bits of information would be required to display the raw data than to display the processed figure. It’s also true that it takes more physical bits of information to describe a screen full of static than it does to describe the cheery, untroubled blue of a television tuned to a dead channel. That doesn’t mean that the screen full of static is more useful.
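
If you doubt the static-versus-blue-screen claim, a general-purpose compressor gives a quick-and-dirty measure of how many bits each one actually needs:

```python
# A frame of random noise needs far more bits to describe than a frame
# of uniform blue; zlib's output size is a rough stand-in for that.
import os
import zlib

n_pixels = 640 * 480
static = os.urandom(3 * n_pixels)     # a frame of random RGB "static"
blue = bytes([0, 0, 255]) * n_pixels  # a frame of uniform blue

print(len(zlib.compress(static)))  # random data hardly compresses at all
print(len(zlib.compress(blue)))    # collapses to a tiny fraction of the frame
```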

7 comments

  1. Of course you can go to the absolute extreme example and it will seem ‘ludicrous!’ to want raw data, but there are already lots of examples of raw data out there that are regularly used by lots of people. WMAP data is out there, and was analyzed by others, and there are all kinds of CO2 data sets available from places like NOAA. There’s even a group across town from me that posts some of their data for anyone to use. I see no reason why, for the vast majority of experiments, we can’t all do the same.

    Your example makes a valid point that not all raw datasets are going to be useful to everyone, but it’s obvious to me from that example that this is a discipline-specific issue.

  2. hello chad,

    I agree that the level of abstraction of particle collider data needs to be agreed upon, but I still find fascinating the idea that one might, one day, click on a figure – wherever it is posted on the web – and get access to the data used to produce it, and the code needed to make sense of it. To stick with D0’s single top signal, it is based on a detailed analysis of some 600 events. To characterize these in great detail, one could just refer to a high-level ROOT ntuple, containing not particles but jets, electrons, neutrinos (from missing Et), and b-tag information (see the sketch after the comments). I am thinking of something like 100 variables per event – not far from what we deal with in most analyses, after the primary reconstruction and a selection of the information we want to study in detail.

    Think about it – in the future, you might just cut that information and paste it into a web-based neural network or some other fancy algorithm, and then do all sorts of tricks.

    People like Tony Smith, who has interesting ideas about the top quark at the Tevatron but no access to the data – or to details about how the results were obtained, for that matter – would finally be able to get their hands on it. Many people playing with our ntuples would just create background noise, but a few could come up with interesting new ideas.

    Melissa Franklin once said that “one’s [high-level] ntuples are just like one’s genitals. You may be allowed to play with them, but not too much.” I disagree on both counts 😉

    Cheers,
    T.

    PS actually I’m going to post this comment on my blog, with a link to yours, do you mind?

  3. Of course you can go to the absolute extreme example and it will seem ‘ludicrous!’ to want raw data, but there are already lots of examples of raw data out there that are regularly used by lots of people.

    Oh, absolutely. And particle physics is a really extreme example, in terms of the amount of data produced. This is very much a discipline-specific issue.

    The larger point is that even most of the scientists working on a given project will do most of their work with data a level or two removed from “raw.” My own postdoctoral research, for example, involved reducing hundreds of image files, and while the analysis procedure did require me to at least glance at every image, I really didn’t do much beyond that.

    The useful data for my purposes was a level up from that: the numbers generated by a series of automated fits to the image files. If a particular point looked really odd, I could go back and double-check the raw images, but that was pretty rare. Once we had the analysis codes set up, we ran all the data through the fits, and just worked with the numbers. The only time I looked at the image files was when it came time to try to make spiffy graphics for PR purposes.

    If even the people working on the project mostly don’t deal with the raw data, I’m not sure what the point of making it generally available would be.

    PS actually I’m going to post this comment on my blog, with a link to yours, do you mind?

    Go nuts.
    Cross-linking is half the point of blogs, after all…

  4. Is there a physics equivalent of the National Center for Biotechnology Information (NCBI)? This web resource maintains a collection of pretty much every DNA sequence published. It’s become standard fare to provide your DNA sequences to NCBI — these are cleaned-up files — when you publish your paper. But pretty much every whole-genome sequencing project even provides the data in the rawest form used (trace files with information on the quality of each base-pair read). In fact, the Bermuda accords ensure such practices.

  5. RPM – not as far as I know, since everyone’s techniques are so different. There isn’t a critical mass of physicists doing the same kinds of experiments to justify it yet.

    Chad – I don’t get your larger point. If I’m working on atom trapping and think I’m seeing a weird effect in my own data, wouldn’t it be cool if I had access to your images to see whether it’s an artifact of my instrument, or whether others are seeing it too? And who do I ask? Do I ask you, the lead author of a paper from 5 years ago, your past advisor, or a current grad student in your past advisor’s lab? Given that you surely publish some form of data-analysis technique in your papers (or at least reference some old article that does), I could then reprocess your data with new image-analysis algorithms, or who knows what.

    Given that scientists as a collective try to be as transparent as possible (well, at least, we should…), what possible *disadvantage* is there to making the associated data accessible?

    Here’s another really simple example. My friend in another lab needed the data out of an extremely old paper (conductivity curves for lead, or something). These were then going to be fit and integrated as part of the analysis he was doing now. The authors were long since retired, possibly even deceased. The quality of the scanned paper was terrible. Wouldn’t it be great to have access to the raw data?

  6. you can cook hamburger in your kitchen, but you can’t do much with a cow

    You just don’t have the right kitchen.

    But seriously, if you can work with it, so can someone else. And as PhilipJ points out, analysis techniques and tools change (and improve).

  7. “But seriously, if you can work with it, so can someone else.”
    But it is also important to take into account the energy it would take someone else to work with it, and whether it would be worth it. At least with the work I do (detector development), there is no standard data format, nor any useful format information embedded in the data files, so access to my raw data is about as useful as access to strings of slightly correlated unsigned integers.

    To take a less extreme version of Chad’s example, look at area detectors. By “raw data,” do you want the values at the pixels, the values background-subtracted but not corrected, the values subtracted and corrected, and so on?
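
For readers wondering what the “high-level ntuple” mentioned in comment 2 might look like, here is a rough sketch. The field names and types are purely illustrative, and a NumPy structured array stands in for what would really be a ROOT file with on the order of a hundred branches per event:

```python
# Illustrative only: a flat, roughly-100-column-per-event record of the
# kind described in comment 2.  All names and sizes are invented.
import numpy as np

event_dtype = np.dtype([
    ("n_jets", np.int32),
    ("jet1_pt_gev", np.float32),
    ("jet2_pt_gev", np.float32),
    ("jet1_btag", np.float32),        # b-tagging discriminant
    ("electron_pt_gev", np.float32),
    ("missing_et_gev", np.float32),   # stand-in for the neutrino
    # ...and so on, up to ~100 columns per event
])

# A few hundred selected events fit in one small, easily shared table,
# which is exactly what makes this level of "raw" practical to post.
candidates = np.zeros(600, dtype=event_dtype)
```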
