There have been a bunch of interesting things written about education recently that I’ve been too busy teaching to comment on. I was pulling them together this morning to do a sort of themed links dump, when the plot at the right, from Kevin Drum’s post about school testing jumped out at me. This shows test scores for black students in various age groups over time, but more importantly, it demonstrates one of my pet peeves about Excel.
If you look at the horizontal axis of this plot, it shows regularly spaced intervals. If you actually read the labels, though, you’ll see that they’re anything but regular– the first few points show test scores at four- or five-year intervals, then there’s a cluster of two-year intervals, then some more long gaps.
These points are plotted at regular intervals because Kevin used Excel to make the graph (the ugly colors are another dead giveaway), and did what looks like the reasonable thing to do when you want to plot lines connecting your data points, namely choosing “Line Plot” from the plot options Excel offers. As I tell my intro physics students over and over, though, “Line Plot” is never the right choice, because of this exact problem. If you have both x and y values (say, test scores and the years in which those scores were reported), you want “Scatter Plot,” which will properly space the data on the horizontal axis. “Line Plot” always plots points as evenly spaced, because Excel is designed for middle managers who deal with sales data that occur only at regular intervals (and who secretly hate and fear real math).
What difference does it make? Here’s the data for 17-year-old students, plotted properly in the original report:
In the “Line Plot” version, it looks like something incredibly dramatic happened three points into the data series, with scores shooting way up overnight, then abruptly flattening out. Those first few points have twice the interval between them of the next few, though, so they ought to be twice as far apart horizontally. The original plot gets that right, and as a result, the big step up looks far less dramatic, and the flattening-out much less abrupt. There’s still a fairly clear step, but with the points properly spaced out, this looks a lot more like a small fluctuation above a general slow increase than a one-time dramatic jump in scores.
That doesn’t mean there isn’t something interesting going on in the early 1980’s– that’s still a significant increase in a short period– but when the plot’s done right, you get a more accurate picture of what’s going on. Which is to say, a muddled and slightly ambiguous picture that would really benefit from somebody with a time machine going back and getting some data from 1982 and 1986 so we have a clearer idea about what test scores really did in that period. Get on that, social scientists.
And the rest of you, remember: Excel was written for (and possibly by) monkeys, and no matter how reasonable it might seem, “Line Plot” is never the right choice.
David Hilbert famously complained that the problem with physics was that it was too hard for physicists. I think it’s evident here that Education is too hard for, well, whatever you call the pundits, administrators, bureaucrats, and other self-styled “educators,” particularly those who freely opine about education without spending any time in front of a classroom.
Excel was written for (and possibly by) monkeys
Tell us how you really feel, Chad.
That what should we use instead?
(Instead of Excel, not lineplot which I don’t think I’ve ever used.)
The other problem (with the original graph as well as yours) is that the Y axis cuts out most of the range below the lowest points, making the changes look much larger than they are.
The increase in black scores between 1980 and 1988 looks like more than a doubling, but it’s really only about 13%. This isn’t strictly deceptive, as the axis correctly shows what’s been done, but it’s very misleading to anyone who doesn’t examine the graph closely.
There are a lot of different programs out there that are designed to handle scientific data, and thus have sensible default settings, unlike Excel. I’m in a PC world these days, and tend to use SigmaPlot for anything serious. It’s on the expensive side, but it will do basically anything you want, and its help files are orders of magnitude better than Excel’s, so you have a fighting chance of finding the right command to do what you want.
The astronomers in the department tend to favor IDL, which I know basically nothing about. That’s more of an image-processing package, so it’s probably overkill for scatter plots. The person to ask about low-cost options is Rhett, who does a lot of data analysis using free tools.
Excel is certainly not a scientific tool. One of my pet peeves is that large numbers on axes are for example formatted as “1E10” (unacceptable for publishing) where they could have easily formatted it as 10 followed by a superscript 10 (which I’ve done many times in plotting routines I’ve written).
[IDL is] more of an image-processing package
As an IDL user, I’ll speak up here. Yes, it does image processing. It also does a whole bunch of other stuff: Fourier and wavelet transforms, matrix operations, and a whole bunch of routine data processing. There are several satellite missions for which the analysis code is almost entirely in IDL (some decommutators, which pull the actual data out of the telemetry stream so that they can be converted into a useful form, are written in C). Yes, it’s overkill if all you want is a quick-and-dirty scatter plot, but if what you want is a publication quality scatter plot, IDL does well provided you remember to select hardware fonts for your PS plot (as Chad correctly notes, Excel is a bad choice if you are aiming for publication quality graphics). However, all of this power comes at a price, which is not cheap–indeed, it probably would make SigmaPlot look cheap. There is also something of a learning curve because you must, in effect, learn a new programming language (the acronym stands for Interactive Data Language).
I know people who swear by Gnuplot, which is free but limited to *nix systems (and you may have to take the time to compile the code yourself). I used it for some graphics back when I was a grad student, but I haven’t touched it in years.
Gnuplot has been available for windows and macs (as well as *nix) for some years now. It works well and the documentation is decent, but by default the graphs are not publication quality, and the interface is command-line based. Gnuplot is quirky, but it is also amazingly fast, and it can easily plot data with thousands and thousands of points. All that said, it still beats Excel by far.
R (http://www.r-project.org/) produces publication-quality graphs with not too much effort, but it is far from being a click-and-drag simple plotting program.
Grace (http://plasma-gate.weizmann.ac.il/Grace/) is free, and awesome, as long as you avoid clicking on any of the buttons on the main screen which never do what you expect. Unfortunately, it only exists on Windows via cygwin.
Unfortunately, there is no modern version of Cricket Graph for windows, which would probably be ideal for lots of people.
Some extra help for Gnuplot users:
http://gnuplot-tricks.blogspot.com/
http://www.phyast.pitt.edu/~zov1/gnuplot/html/intro.html
Other free solutions (both “as in beer”, and Stallman stylee) include
R http://www.r-project.org/
SciPy http://www.scipy.org/
Both run on most any computer you’re likely to be using.
It looks like scidavis (http://scidavis.sourceforge.net/index.html) and qtiplot (http://soft.proindependent.com/qtiplot.html) might be reasonable alternatives to origin and sigmaplot.
The underlying problem with these data is that the demographic composition of the tested students has changed. Rather than being a flat line (which is what you correctly show), test scores have increased steadily when broken down by demographic subgroups.
To put this another way, how do we square the first figure and the second?
The underlying problem with these data is that the demographic composition of the tested students has changed. Rather than being a flat line (which is what you correctly show), test scores have increased steadily when broken down by demographic subgroups.
To put this another way, how do we square the first figure and the second?
I’m not trying to claim that there isn’t a real increase– as I said, there is a clear increase from the start of that dataset until the end of it. The two figures are the same data– the lower line in the second figure is the light-green line from the first figure. When you do the horizontal axis correctly, the trend looks less like a sudden jump followed by an abrupt flattening, and more like a steady upward trend with a good bit of noise (and you would expect somewhat more noise in the data for black students, given the smaller population– something like a factor of three more noise just from statistics). But there’s definitely an upward trend.
With effort, duct tape, and bailing wire, you can get a line plot to display this type of data properly in Excel. However, it requires knowing exactly how to input your data into your data grid and remembering what those unfindable check box with the arcane names do.
This should be blamed entirely on me, not Excel. I wanted to plot this correctly, but I had no idea Excel had an option to do that. That’s my fault, not Microsoft’s.
That said, I have to say that I don’t think the two charts look quite as different as you’re suggesting. (Dangerous words to a physicist, I know.) I honestly don’t think anyone looking at the line plot version comes away with a wildly incorrect view of what happened.
As for the y-axis, I disagree. It would be misleading to plot data like this starting at zero. The lines would essentially look flat, showing virtually none of the changes. This isn’t a case where the data starts at zero and the chart cuts it off. It’s a case where the data fluctuates in a small range around a particular value for the entire time series.
FWIW, I’ve corrected the plot in the original post. Looking at it, I think the real issue isn’t so much the line/scatter difference, but that I formatted my chart tall and thin, while the NAEP chart is short and wide.
Chad,
In this case I must disagree with your total prohibition on line plots. You’re assuming that this is a time series with a constant sampling rate. Such is correct for regularly sampled data, but for this data set the sampling rate is never constant.
However, as you’ve presented it here, the point in this data is the fact that there has been changes over time, either plot can be used to make that point. Not how much change, as a function of time, but the fact itself that there has been change.
Your valid suggestion for improving data transmission simply does not provide support for a total ban.
No, agm, Chad is entirely correct here. In the special case where your data really are sampled at regular intervals, line plot and scatter plot yield the same result. If the sampling is irregular in any way (that can include missing samples), they will yield different results, and in physics applications that difference matters. Since it is not significantly more difficult to generate a scatter plot than a line plot, there is no reason to use line plot, at least if the X axis is quantifiable (and if it isn’t, you probably should be using a bar chart or a pie chart instead).
Excel drives me nuts. I have to change so many defaults to make a plot look the way I want.
Eric,
A scatter plot is a cross-plot. You use a cross-plot to determining whether a relationship exist between two quantities. By demanding the data can only be properly displayed on a scatter plot, and hammering on the appearance of the slope around 1984, Chad is explicitly stating that he thinks that df/dt contains information, which requires that there is a relationship, scores = f(t).
This is a fundamental misunderstanding of the nature of this data. The data is not a series of samples of a function of time. It is a series of samples of cohorts, that happen to have been taken at different times. There is no reason to think that there should be time dependency here – for example, the cohorts could’ve been sampled geographically, for example, rural vs urban vs suburban, and only for one year. What if it was population counts by country? What if it was the value of a stock in one’s portfolio – are you going to argue that the DOW varies dominantly as a function of time?
A line plot does something different. It can display variability without assuming a relationship. In that respect it does contain less information than a scatter plot, but this may be desirable. For example, Mike the Mad Biologist’s comment about demographic shift would imply that the variation is unlikely to be a function of time – df/dt in the scatter plot would not contain real information, and representing it with a mechanism that assumes real information is represented can be misleading.
agm,
In what way does a line plot not assume a relationship? Line plots emphasize the *lines*. The lines are the interpolations between data. If there’s no suggestion of a relationship, the interpolation doesn’t make sense. In that case, the proper graph would be something like a bar chart or the data points without connecting lines.
There is at least one use for Excel’s line plots, though. If you have regularly spaced data and want to label the horizontal axis with something non-numerical (months, days of the week, seasons, …) it’s simpler to do with the line plot than with the scatter plot.
This comment is waaaay late, but for the record:
Line charts in Excel CAN correctly display the data if it knows what the units are for the x axis. It stupidly assumed that 1971, 1975, etc. are just names like red, blue, etc. But I just made a line chart where the axis is specifically formatted with the date type, and it displayed correctly with unequal intervals.