In the same basic vein as last week’s How to Read a Scientific Paper, here’s a kind of online draft of the class I’m going to give Friday on the appropriate ways to present scientific data. “Present” here meaning the more general “display in some form, be it a talk, a poster, a paper, or just a graph taped into a lab notebook,” not specifically standing up and doing a PowerPoint talk (which I’ve posted about before).
So, you’ve made some measurements of a natural phenomenon. Congratulations, you’ve done Science! Now, you need to tell the world all about it, in a compact form that allows the viewer to make a good assessment of your results. Here are some rough notes on the best ways to go about this, starting with:
STEP ZERO: Know what point you’re trying to make. If you’re trying to interpret brand-new data, in the privacy of your own lab, office, or coffee shop, you can just slap together any quick-and-dirty sort of graph that you like, so you can see what you’re dealing with. When you’re preparing to present data to somebody else, though, you need to have a specific purpose in mind. Are you just comparing two numbers? Looking at how some property changes over time? Trying to characterize a distribution of numbers? Different goals will be best served by different types of presentations, and it’s important to have a clear idea of what you want to accomplish, so you can choose the right sort of graph for the job.
STEP ONE: Know your options. There are a whole host of different options when making a graph, as even a casual glance at Excel will show you. Some of these are versatile and powerful; others are only useful for such a ridiculously narrow range of purposes that I’ve never seen one used effectively. And, of course, if you look into data visualization, you’ll find a whole community of people who are really hard-core about this stuff, crafting wholly original graphics specifically designed for each new data set they work with.
If you’re at a point where you have need of my input, though, there are really only a handful of options that you need to be aware of. As you get a better feel for your subject, you can start to explore others, but these will get you started. The “starter set” of data presentation methods, with appropriate applications, is:
Option 0: Data Table. On the one hand, you can’t go wrong with just giving the reader the numbers and having done with it. There’s almost no way to mislead people with a table full of raw data, and they can always do their own analysis of the data and make whatever kind of graph they like best. The downside of this is that it’s pretty much a punt– admitting that you really don’t have any good idea how to visualize your numbers.
Option 1: Scatter Plot. This is the most basic form of plot you can make: you just plot one quantity versus another, as a bunch of dots on a square field. As basic as it is, though, it’s the best application for a lot of data in physics. If you’re looking at the position of a moving object, or the rate at which something heats up, a scatter plot is the way to go. It gives you a very clean presentation of the data, and an easy way to analyze the behavior as you make quantitative changes in some parameter.
A special case of the scatter plot is the logarithmic plot, which is the appropriate choice for data spanning a range of multiple orders of magnitude, such as book sales. This is a little trickier to interpret, but still has the same ability to give you a clear idea of quantitative trends.
The students in my class are currently making their own measurements of time-related quantities, and among the obvious choices there, the long-term measurement of a timer’s performance is best demonstrated by a scatter plot:
This clearly shows that there’s a linear drift for each timer, and a simple linear fit to the data lets you determine the rate of that drift.
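As a rough illustration of that last step, here is a minimal Python sketch of a least-squares line through drift data. The readings are made up for the example (they are not the class's measurements), `fit_line` and `plot_drift` are hypothetical helper names, and the plotting function assumes matplotlib is installed.

```python
# Hypothetical data: reference time elapsed (hours) and how far the timer
# under test has drifted from the reference (seconds). Not real measurements.
hours = [0, 6, 12, 18, 24, 30, 36, 42, 48]
drift_s = [0.00, 0.31, 0.58, 0.92, 1.19, 1.52, 1.79, 2.11, 2.40]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(hours, drift_s)
drift_per_day = slope * 24  # seconds of drift accumulated per day


def plot_drift():
    """Scatter plot of the data with the fitted line (needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(hours, drift_s, marker="o", label="measured drift")
    ax.plot(hours, [slope * h + intercept for h in hours],
            label=f"linear fit: {slope:.3f} s/hour")
    ax.set_xlabel("Elapsed time (hours)")
    ax.set_ylabel("Timer drift (s)")
    ax.legend()
    plt.show()
```

The slope of the fitted line is the drift rate, which is the single number the scatter plot is there to support.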
Option 2: Bar Graph. This is the other most common form of graph, the one you’re most likely to see in the newspaper or on TV. Here, you represent the magnitude of some quantity by the length of a horizontal or vertical bar.
A bar graph is the appropriate choice when you want to compare a small number of qualitatively different scenarios, and so it’s very common in social-science sorts of applications, comparing the earnings of people with different levels of education, for example. There’s no simple quantitative relationship between the different levels that would let you make a scatter plot (I suppose you could do “years of schooling” as one axis, but that’s kind of contrived…). If you’re comparing two or more quantities for each of your qualitatively different conditions, bar graphs give you a very quick visual way to identify the relative sizes.
Bar graphs are also one of the easiest forms to make annoyingly deceptive, so they need to be used with care.
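One common way a bar graph ends up deceptive is a value axis that doesn't start at zero, so the bar lengths no longer scale with the quantities. The short Python sketch below works through the arithmetic; the numbers and the `apparent_ratio` helper are hypothetical.

```python
def apparent_ratio(a, b, axis_start):
    """Ratio of the drawn bar lengths for values a and b when the value
    axis begins at axis_start instead of zero."""
    return (b - axis_start) / (a - axis_start)


# Two hypothetical values that differ by 4%.
honest = apparent_ratio(50, 52, axis_start=0)      # ratio of the values
truncated = apparent_ratio(50, 52, axis_start=49)  # ratio the eye sees

print(f"axis from 0:  second bar is {honest:.2f}x the first")
print(f"axis from 49: second bar is {truncated:.2f}x the first")
```

With the axis starting at 49, a 4% difference is drawn as a bar three times as long.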
Option 3: Histogram. At first glance, a histogram appears to be just a special case of a bar graph, but it’s different enough to rate its own category. When you make a histogram, you’re not representing the size of a single parameter, but characterizing a distribution of measurements. For a histogram, the lengths of the bars represent the number of measurements falling into a particular range of values.
I’ve posted a bunch of histograms here over the years, for everything from the distribution of baby feeding times to commute times. A histogram gives you a good sense of not only the size of an effect, but the spread of the measured values. It’s the best way to tell whether you’re dealing with a nice, normal “bell curve” type distribution or something more complicated.
For the class of the moment, the measurement best represented by a histogram is the test of a cheap sand timer:
This graph lets you see right away that the two ends of the timer have different characteristic emptying times, and that there’s virtually no overlap between them.
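For a sense of what goes into a histogram like this, here is a minimal Python sketch that sorts measurements into bins. The emptying times are invented for illustration (the class data are not reproduced here), and `bin_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical emptying times (seconds) for the two ends of a sand timer.
end_a = [176.2, 177.0, 177.4, 176.8, 177.9, 176.5, 177.1, 177.6]
end_b = [181.3, 182.0, 181.7, 182.4, 181.1, 182.2, 181.9, 181.5]


def bin_counts(values, start, width):
    """Count values per bin; bin k covers [start + k*width, start + (k+1)*width).

    Returns a dict mapping each occupied bin's left edge to its count.
    """
    counts = Counter(int((v - start) // width) for v in values)
    return {start + k * width: n for k, n in sorted(counts.items())}


# Use the same bin edges for both data sets so the bars are comparable.
counts_a = bin_counts(end_a, 176, 1)
counts_b = bin_counts(end_b, 176, 1)

# A crude text histogram: one '#' per measurement in each one-second bin.
for label, counts in (("end A", counts_a), ("end B", counts_b)):
    for left_edge, n in counts.items():
        print(f"{label} {left_edge}-{left_edge + 1} s: {'#' * n}")
```

The bin width is the main judgment call: too wide and the two ends blur together, too narrow and every bar holds one measurement.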
Option 4: Color Map. Sometimes, you need to characterize the behavior of some measured quantity as you change not one but two other parameters. In such cases, you can make a visual representation of the system by mapping your measured value onto the color (or greyscale density) of points on a two-dimensional grid, where the grid coordinates represent the values of the two variable parameters. These are tricky to interpret, and a friend at work still gives me grief about the color plots of SteelyKid’s feeding schedule from back in the day, but it’s a category of plot that’s reasonably common these days, since computers have gotten powerful enough to make these more or less effortlessly.
There are tons of variants on these– you can turn a color map into a surface plot, or make a scatter plot with two different axes, or stacked-bar graphs– but these are the most basic methods for presenting data to someone else who might be interested in it.
STEP TWO: Remember your audience. No matter who you’re preparing the graph for, even other people in the same lab, they won’t be as familiar with the data as you are. Keep that in mind, and work to make your graph as self-explanatory as possible:
—Keep it simple. Yes, you can use modern scientific software to make a scatter plot with fifteen different quantities, two separate vertical axes, with three inset plots and a 3-D surface map on the side. But nobody will ever be able to make sense of that unless they already understand the data as well as you do. As much as possible, you want to keep your graph simple: if you’re comparing things, pare it down to only the 2-3 most representative of whatever it is you’re trying to show.
—Make it clear. If you’re plotting multiple quantities, make sure that they’re visually distinct. Don’t make a graph with points that are distinguished only by color (some people won’t be able to see that); instead, make sure that the shapes of the markers are easily distinguishable. Make sure that different datasets in a scatter plot aren’t on top of each other (unless that’s the point you’re trying to make), that the bars on your bar graph are wide enough to show up clearly, that your histogram doesn’t have an excessive number of bins, and so on.
—Label everything. No matter what sort of plot you’re making, make sure that it has clear, comprehensible descriptive labels for everything that matters. Your axes should be readily identifiable, with appropriate units (or lack thereof) provided. If you’re plotting a calculated quantity that isn’t represented by absolutely standard notation, label the relevant axis in words. There’s nothing worse than coming across a scatter plot with axes labelled only with those squiggly Greek letters that nobody can keep straight (lowercase zeta? lowercase xi? who can tell?) and having to go searching through the body of a paper to find the definition. If you’re using multiple symbols, there should either be a clear legend in the plot itself, or a clear statement of what each represents in the figure caption.
A good visual presentation of data can make a complicated result come clear in an instant. A bad visual presentation of data will remain baffling no matter how many times you read its description. There aren’t any hard and fast rules here that can never be broken, but if you take this advice as a starting point, you’ll be fairly safe.
(I’m probably forgetting a few items that will come to me five minutes after this post goes live. Just in case I don’t think of them, though, feel free to point them out in the comments.)
Some extra cautions regarding color plots:
1. Certain choices work better than others. Rainbow plots (of which I have admittedly produced too many since this color scale option is frequently the default) run into the same issues with color blindness as multiple lines that differ only by color; furthermore, if it is for a print publication, somebody may well print it out on a black-and-white printer (this is less of a problem than it once was, but it hasn’t disappeared yet), and in the transition the color scale becomes non-monotonic. Grayscales and variants thereof (i.e., have the change in only one color parameter, typically saturation, denote the magnitude) often work best. If you are emphasizing the difference from zero of a quantity which can have either sign, make white represent zero and use two contrasting colors (but not red/green, again because of color blindness) for positive and negative values.
2. I have seen some people try to represent multiple independent variables by varying different color space parameters for each variable. If you are inclined to try, terminate that impulse with extreme prejudice, for you are implicitly assuming that your audience’s hardware and wetware have the same color calibration as yours (they don’t, which is why such attempts invariably fail).
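A minimal Python sketch of the signed-quantity suggestion in point 1: white at zero, with two contrasting colors for the two signs. The specific blue and orange endpoints and the `diverging_color` helper are assumptions chosen for illustration; plotting libraries generally ship comparable diverging color maps.

```python
# Assumed endpoint colors as (red, green, blue) fractions; any contrasting,
# non-red/green pair would serve equally well.
NEGATIVE_END = (0.0, 0.3, 0.8)  # blue
POSITIVE_END = (0.9, 0.4, 0.0)  # orange


def diverging_color(value, max_abs):
    """Map a signed value to an RGB triple, with zero drawn as white.

    Values beyond +/- max_abs are clamped to the endpoint colors.
    max_abs must be positive.
    """
    t = max(-1.0, min(1.0, value / max_abs))
    end = POSITIVE_END if t >= 0 else NEGATIVE_END
    s = abs(t)  # 0 at the center, 1 at either extreme
    # Linear blend from white (1, 1, 1) toward the endpoint color.
    return tuple((1.0 - s) + s * channel for channel in end)


print(diverging_color(0.0, 5.0))   # white
print(diverging_color(2.5, 5.0))   # pale orange
print(diverging_color(-5.0, 5.0))  # full blue
```

Only one thing varies on each side of zero (distance from white), which is in the spirit of the single-parameter scales recommended above.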
I didn’t see any mention of error bars which are usually crucial.
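For repeated measurements, one common choice of error bar is the standard error of the mean. A minimal Python sketch, using invented timing data and a hypothetical `mean_and_sem` helper (the plotting function assumes matplotlib is installed):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical repeated timings (seconds) at three settings of some parameter.
trials = {
    1.0: [2.01, 1.98, 2.03, 2.00, 1.99],
    2.0: [2.84, 2.80, 2.87, 2.82, 2.85],
    3.0: [3.47, 3.51, 3.44, 3.49, 3.46],
}


def mean_and_sem(values):
    """Return the mean and the standard error of the mean."""
    return mean(values), stdev(values) / sqrt(len(values))


summary = {x: mean_and_sem(ys) for x, ys in trials.items()}


def plot_with_error_bars():
    """Scatter plot with one error bar per point (needs matplotlib)."""
    import matplotlib.pyplot as plt

    xs = sorted(summary)
    fig, ax = plt.subplots()
    ax.errorbar(xs, [summary[x][0] for x in xs],
                yerr=[summary[x][1] for x in xs],
                fmt="o", capsize=3)
    ax.set_xlabel("Parameter setting (arbitrary units)")
    ax.set_ylabel("Measured time (s)")
    plt.show()
```

Whatever the bars represent (standard deviation, standard error, a confidence interval), the caption should say so.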
How about a list of things to avoid including?
1) Titles or legends that repeat the axis labels
2) Grid lines (or at least keep them under control)
3) Gratuitous trend lines or interpolation (especially evil is using every possible trend line until one fits. Behold! My 6th order polynomial fits the data perfectly!)
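The polynomial jab in 3) is easy to demonstrate: a polynomial of order n-1 passes exactly through any n points, so a perfect fit proves nothing. A minimal Python sketch, with seven made-up, roughly linear data points; `lagrange` and `fit_line` are hypothetical helpers.

```python
# Seven hypothetical measurements that follow y = x plus a little noise.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8]


def lagrange(xs, ys, x):
    """Evaluate at x the unique polynomial of order len(xs) - 1 that passes
    through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The 6th-order polynomial reproduces every data point...
worst_residual = max(abs(lagrange(xs, ys, x) - y) for x, y in zip(xs, ys))

# ...but compare what each model predicts one step past the data.
slope, intercept = fit_line(xs, ys)
poly_at_7 = lagrange(xs, ys, 7)
line_at_7 = slope * 7 + intercept

print(f"worst polynomial residual: {worst_residual:.2e}")
print(f"prediction at x = 7: polynomial {poly_at_7:.1f}, line {line_at_7:.1f}")
```

For these numbers the polynomial predicts about 16.9 at x = 7, while the straight line predicts about 7.0: the "perfect" fit has modeled the noise.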
Complaints from biology, some redundant with the excellent article. We are better at deception than any of you others.
1) Horrible overuse of bar graphs with error bars, even when there are only 3 data points per condition. By Darwin’s liver, just show us the damned data. I’ve heard folks call plotting the data a “dot plot” – they have never seen such a thing so they need a new word for it. It is also common to see “Box plots” where they forget to tell us what the box means (often 25th and 75th percentiles) or what the whiskers mean (often 5th and 95th) and some other details.
I dispute that bar graphs are appropriate just because you have “group” on the X-axis rather than some quantity. They do become more necessary when the number of data points gets too large to plot the data conveniently, so a summary is better.
2) Not plotting on log scale when that is almost always better. People are plotting number of cells (or tumor size) vs time and not using logs. Often this is by design – they don’t want us to see that on log scale either (a) the lines are almost perfectly parallel, it was just the starting amounts that differed (and hint: your statistical test should be about the slopes of those lines) or (b) the lines are not nearly straight lines so you fear showing them.
3) Doing analysis on the log-transformed data (or it even starts as logs: with PCR you measure the number of doublings), but then plotting anti-logs. They do a bastard thing to estimate the standard deviations of the anti-logs that makes the errors seem symmetrical about the mean when that is a) not what the statistical test employed and b) makes no sense. The plotted mean is then not the mean of the anti-logs (it’s the anti-log of the mean, or “geometric mean”), and sometimes when the data is also plotted, you can see that with your eyes. Justification is by arguments from popularity or authority (of the previous sinner), never from “good”.
Sadly the argument that the reader won’t really get it if the Y-axis is log scale is not without some truth in this field. Hey, we do squishy science because we hated math, and what is an anti-log anyway, I’m taking 2 to the power delta-delta-CT instead.
4) Tremendous waste of real estate, with the majority of the figure being white space.
5) Showing heatmaps (like a table where colors represent quantities) without clearly showing what the colors mean. This is not saying what the units are in a graph, and is an abomination. Impossible you say – no, it happens. Usually, they don’t want you to know: for example they want you to think it is fold change but it’s really standard deviations from the mean. They sometimes admit it’s stdev’s from mean (or such) – in those cases you can’t tell if the difference was 15% or 15000%.
6) Fussing about creative overuse of color, but doing more basic things without thought.
7) Principal components shots where the axes are not labeled – cause I’m showing components 1, 3 and 4. It looked better than 1,2, and 3 we thought, cause component 2 picks up that half our samples have crappy RNA, and makes it hard to see the split in the groups we are trying to emphasize. Alternatively rotate a 3D depiction so that component 2 points in your face, and becomes more invisible.
There’s also the really lying trick of showing such a shot after you selected those vectors of data that are most significant, and use only them, but not say so. This is a flaw of honesty rather than graphing.
A general trend is to care only about what you want the reader to think, not what the reader wants to know. For example the critical plot every reader wants to see – don’t show it, only show a subset that makes your pet point.
How to make a good figure needs to be taught much more. Thank you, Prof Orzel.
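Points 2) and 3) in the list above can be sketched in a few lines of Python. The growth curves and fold changes below are synthetic, generated for illustration rather than taken from any experiment, and `fit_line` is a hypothetical helper.

```python
from math import log2
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Point 2: two synthetic growth curves with the same doubling time (2 days)
# but different starting amounts.
days = [0, 2, 4, 6, 8]
cells_a = [1000 * 2 ** (d / 2) for d in days]
cells_b = [4000 * 2 ** (d / 2) for d in days]

# Fit straight lines to log2(count) versus time.
slope_a, start_a = fit_line(days, [log2(n) for n in cells_a])
slope_b, start_b = fit_line(days, [log2(n) for n in cells_b])
# Both slopes come out as 0.5 doublings per day; only the intercepts differ.

# Point 3: summarize in log space, then back-transform the endpoints.
log2_fold = [1.0, 2.0, 3.0]  # hypothetical log2 fold changes
m, s = mean(log2_fold), stdev(log2_fold)
center = 2 ** m                            # geometric mean of the fold changes
lower, upper = 2 ** (m - s), 2 ** (m + s)  # not symmetric about the center
arithmetic_mean = mean([2 ** v for v in log2_fold])

print(f"slopes: {slope_a:.3f} and {slope_b:.3f} doublings/day")
print(f"geometric mean {center:.2f}, interval {lower:.2f} to {upper:.2f}")
print(f"arithmetic mean of the anti-logs: {arithmetic_mean:.2f}")
```

On a linear axis the second curve looks like it is growing four times faster; on a log axis the two lines are parallel. Likewise, the back-transformed interval (2 to 8 around a center of 4) is lopsided, which is what an honest plot of log-analyzed data should show.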
This is a bit off-topic, but I think it’s quite interesting.
I was watching the finalists’ videos from the spacelab competition (where teenagers come up with experiments to be done on the ISS), and this kid particularly caught my attention (video here: http://goo.gl/HIbKf ), especially at the end, where an annotation says that this could be useful to medical research. Do you really think that we could use sponge-like spicules to somehow concentrate drugs near the tumors?
Does anyone know if there are sample data sets online with which students can practice making plots? This would be a great exercise to do in class, to have them work in groups to see who can come up with the best looking plot.
Best advice of the article is keep it simple, no need to complicate things in presentations of complicated data. The visual simplicity is key to understanding various graphs and gathering the needed information from them. Confusing colors are a definite no-no.