{"id":939,"date":"2006-12-18T09:56:54","date_gmt":"2006-12-18T09:56:54","guid":{"rendered":"http:\/\/scienceblogs.com\/principles\/2006\/12\/18\/you-cant-cook-a-cow-the-proble\/"},"modified":"2006-12-18T09:56:54","modified_gmt":"2006-12-18T09:56:54","slug":"you-cant-cook-a-cow-the-proble","status":"publish","type":"post","link":"http:\/\/chadorzel.com\/principles\/2006\/12\/18\/you-cant-cook-a-cow-the-proble\/","title":{"rendered":"You Can&#8217;t Cook a Cow: The Problem with Raw Data"},"content":{"rendered":"<p>Bill Hooker is a regular advocate of &#8220;open science,&#8221; and is currently supporting a <a href=\"http:\/\/www.sennoma.net\/main\/archives\/2006\/12\/where_are_the_data_can_i_have.php\">new subversive proposal<\/a>: to make all raw data freely available on some sort of Creative Commons type license.<\/p>\n<p>It sounds like a perfectly reasonable idea on the face of it, but I have to say, I&#8217;m a little dubious about it when I read things like this:<\/p>\n<blockquote>\n<p>First, note that papers do not usually contain raw (useful, useable) data. They contain, say, graphs made from such data, or bitmapped images of it &#8212; as Peter <a href=\"http:\/\/wwmm.ch.cam.ac.uk\/blogs\/murrayrust\/?p=28\" title=\"says\">says<\/a>, the paper offers hamburger when what we want is the original cow.&nbsp; Chris Surridge of PLoS <a href=\"http:\/\/www.plos.org\/cms\/node\/34\" title=\"puts it this way\">puts it this way<\/a>:<\/p>\n<blockquote>\n<p> A figure in a paper is a way of representing the raw data in such a way to best illustrate the point the author is making. A figure then is the product of an operation upon the raw data, and that operation results in a loss of information. <\/p>\n<p>The raw data could have been presented in a host of different ways possibly supporting other conclusions not thought of by the author. 
Equally if a reader had raw data compatible with that the author obtained wouldn&#8217;t it be useful if it could be processed in the same way for comparison? Wouldn&#8217;t it be much better for readers to have access not only to the figures in a paper but also to the underlying data and the transform that created it. In this way no information, neither implicit nor explicit, is lost.<\/p>\n<\/blockquote>\n<p>So if authors want to make their data openly <i>and usefully<\/i> available, they will need to host it themselves or find someone to host it for them.&nbsp; Many journals will host supplementary information, and many institutional repositories will take datasets as well as manuscripts.&nbsp; I have been saying for <a href=\"http:\/\/www.sennoma.net\/main\/archives\/2004\/05\/scooped_again.php\">some time<\/a> that it should by now be <i>de rigueur<\/i> to make one&#8217;s raw data available with each publication. This is very rarely done &#8212; even supplementary information, when I have come across it, tends to be of the hamburger-rather-than-cow variety and so not very useful.&nbsp; (The situation speaks sad volumes about the emphasis on competition over cooperation within the scientific community and, perhaps in many cases, about the quality of the raw data in question, if only one were ever able to see it; but I digress.)<\/p>\n<\/blockquote>\n<p>It may be a discipline-specific thing (Bill&#8217;s background is in the life sciences), but when I read that, my first thought is &#8220;These people have never looked at raw data.&#8221; At least in physics, we don&#8217;t generally present raw data in papers for the same reason that you can buy hamburger in grocery stores, but can&#8217;t get a live cow: you can cook hamburger in your kitchen, but you can&#8217;t do much with a cow.<\/p>\n<p><!--more--><\/p>\n<p>I mean, to take an extreme example, look at particle and nuclear physics. 
If you want access to the &#8220;raw data,&#8221; first you need to specify just how raw you&#8217;d like it. The lowest level, straight-from-the-detector raw data is nothing but a collection of time series of voltage signals from the thousands of little detectors that make up the drift chambers and calorimeters and the rest. Even the people who run the experiments hardly ever look at this stuff, because it&#8217;s just not useful in that form.<\/p>\n<p>The next level of &#8220;raw&#8221; data in particle and nuclear experiments is a set of reconstructed particle tracks. This is generated more or less automatically by software routines that stitch together the pulses from the individual components to follow a particle through the detector. They take signals that say &#8220;this wire saw a pulse at this time&#8221; and convert them to &#8220;a charged particle passed through the chamber along this path.&#8221; This is the stuff that most nuclear and particle people actually deal with.<\/p>\n<p>At that level, you&#8217;re still talking about terabytes of data, and billions of events, the vast majority of which show nothing interesting. In order to actually extract an interesting signal, they write computer codes to sift through the data, and pull out those events that look like they might contain the reaction of interest&#8211; the ones where the right collection of particles came out of the target after the collision. That gets you down to probably thousands of events, which you can then sort by energy and location within the detector, and so on to generate the plots and graphs that go into the actual paper.<\/p>\n<p>So, when you say you&#8217;d like to see the raw data, what level of &#8220;raw&#8221; do you want? Should the D0 collaboration make available the particle track data for all of the 60-odd events that represent <a href=\"http:\/\/scienceblogs.com\/principles\/2006\/12\/single_top_quark_seeking_antiq.php\">single top quark production<\/a>? 
Or the thousands of events that were possible single top events, but turned out to have the wrong energy? Or the billions of events that were interesting enough to record in the first place? Do you want particle tracks, or raw detector signals?<\/p>\n<p>An absolute statement like:<\/p>\n<blockquote>\n<p>A figure then is the product of an operation upon the raw data, and that operation results in a loss of information.<\/p>\n<\/blockquote>\n<p>sounds hopelessly naive to me. (To be fair, so do all statements of the form &#8220;Information wants to be free,&#8221; so it&#8217;s not a big surprise&#8230;) Yes, it&#8217;s true that more physical bits of information would be required to display the raw data than to display the processed figure. It&#8217;s also true that it takes more physical bits of information to describe a screen full of static than it does to describe the cheery, untroubled blue of a television tuned to a dead channel. That doesn&#8217;t mean that the screen full of static is more <strong>useful<\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Bill Hooker is a regular advocate of &#8220;open science,&#8221; and is currently supporting a new subversive proposal: to make all raw data freely available on some sort of Creative Commons type license. 
It sounds like a perfectly reasonable idea on the face of it, but I have to say, I&#8217;m a little dubious about it&hellip; <a class=\"more-link\" href=\"http:\/\/chadorzel.com\/principles\/2006\/12\/18\/you-cant-cook-a-cow-the-proble\/\">Continue reading <span class=\"screen-reader-text\">You Can&#8217;t Cook a Cow: The Problem with Raw Data<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"1","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[19,7,11],"tags":[],"class_list":["post-939","post","type-post","status-publish","format-standard","hentry","category-experiment","category-physics","category-science","entry"],"_links":{"self":[{"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/posts\/939","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/comments?post=939"}],"version-history":[{"count":0,"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/posts\/939\/revisions"}],"wp:attachment":[{"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/media?parent=939"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/categories?post=939"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/chadorzel.com\/principles\/wp-json\/wp\/v2\/tags?post=939"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}