The Scores Are Falling?

The science story of the day is probably the Department of Education report on science test scores, cited in this morning’s New York Times. They administered a test to fourth-, eighth-, and twelfth-graders nationwide, asking basic science questions, and compared the scores to similar tests given in 1996 and 2000. (Update: John Lynch has some thoughts, and includes a couple of the questions.)

The headline-grabbing result is that the twelfth-grade scores are down over the last ten years, while the fourth-grade scores rose. The educational system of the nation is clearly in free-fall, and we’ll all be speaking Chinese by the end of the decade…

Or, possibly, there may be a bit less to the story than there appears… (Continued below the fold).

I haven’t read the full report, and I’m not likely to (I have to sit in on a class, meet with a half-dozen students, meet with the committee that makes tenure decisions, attend a colloquium, and somehow squeeze in an hour in the lab so my student can do useful work tomorrow…), but the scores reported in the Times piece don’t really leap out at me as a gigantic crisis. The increase in the fourth-grade scores is about five percentage points in the number of students scoring at the “basic” or “proficient” level (from 63 to 68), while the decrease in the twelfth-grade scores is about three percentage points (from 57 to 54). This is deemed statistically significant, but I’m not terribly worried.

Also, the decline in scores is only relative to the 1996 numbers– between 2000 and 2005, the scores actually went up slightly, from 52 to 54. So you might just as well say that scores have increased over the last five years, rather than wringing your hands over the last decade’s decrease. Yeah, the increase might not be statistically significant, but saying that three percent is a crisis but two percent is not requires an awfully Manichean view of statistical significance…
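
(A quick back-of-the-envelope for the statistically inclined: here’s a minimal sketch of the sort of two-proportion test presumably behind the “statistically significant” label. The sample size is pure assumption on my part– the real numbers are in the report I haven’t read– and the point is that the verdict depends on n as much as on the two or three points of change.)

```python
# Minimal sketch: two-proportion z-test on the quoted percentages.
# The per-grade sample size n is an assumption for illustration only;
# the actual sample sizes would be in the full report.
from math import sqrt
from scipy.stats import norm

def two_prop_z(p1, n1, p2, n2):
    """z statistic and two-sided p-value for a difference of two proportions."""
    pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))

n = 10_000  # assumed number of students tested per grade per year
print(two_prop_z(0.57, n, 0.54, n))  # 1996 vs. 2005 twelfth grade: |z| ~ 4.3
print(two_prop_z(0.52, n, 0.54, n))  # 2000 vs. 2005: |z| ~ 2.8, also "significant"
```

At an assumed n of 10,000, even the two-point rise since 2000 clears the significance bar, which rather makes the point: the label tells you about sample size as much as about education.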

More importantly, this comparison is really between two different groups of students– the twelfth graders of 1996 are not the same people as the twelfth graders of 2005. At least, I devoutly hope that nobody took the twelfth-grade test in 2005 who previously took it in 1996. And given that the test is only administered once every four or five years, it’s impossible to say how much of the change is just natural fluctuation– maybe the Class of ’96 was just an exceptionally strong bunch, science-wise, while the Class of ’05 are a bunch of laggards. Having taught freshman physics to five different classes, I can easily believe that test scores would fluctuate by a couple of percent from one year to the next.

A better comparison might be to look at the trajectory of one group of students, but unfortunately, the data don’t quite allow that, as 2005’s twelfth graders were third graders in 1996 (unless I’m miscounting), and didn’t take the test. You can get a rough sense of the trajectory, though, by looking at the progress of scores: the 1996 fourth-grade group (plus or minus a year) goes from 63 to 59 to 54, a drop that could sensibly be attributed to the fact that senior-level questions ought to be harder than fourth-grade-level questions. How do they stack up to their peers? Well, we don’t have full data, but 2000’s fourth-grade class suffered exactly the same drop from 63 to 59 in the move to eighth grade, and the eighth-grade class of 1996 dropped an amazing eight points, from 60 to 52, over their final four years. That doesn’t look like a big change in the quality of education over the last several years.
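
(If you want to fiddle with the cohort arithmetic yourself, here’s a toy version built from just the scores quoted above– obviously not the full dataset:)

```python
# Percent of students at "basic" or above, as quoted in the Times piece.
# Keys are test years; inner keys are grade levels.
scores = {
    1996: {4: 63, 8: 60, 12: 57},
    2000: {4: 63, 8: 59, 12: 52},
    2005: {4: 68, 8: 59, 12: 54},
}

# Follow each cohort diagonally (grade 4 -> 8 -> 12, one test cycle apart).
# The cycles are four or five years, so the cohort tracking is approximate.
years = sorted(scores)
for i, start in enumerate(years):
    path = [
        (years[i + step], grade, scores[years[i + step]][grade])
        for step, grade in enumerate((4, 8, 12))
        if i + step < len(years)
    ]
    print(f"Cohort in grade 4 around {start}:", " -> ".join(
        f"{score} (grade {grade}, {year})" for year, grade, score in path))
```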

Of course, if you’d like to believe this represents a real crisis, and want a way to blame it on George Bush, the Times has got you covered:

Some teachers cited the decreasing amount of time devoted to science in schools, which they attributed in part to the annual tests in reading and math required by the No Child Left Behind law. That has led many elementary schools to cancel some science classes. On average, the time devoted to science instruction among elementary teachers across the nation declined from a weekly average of 2.6 hours in 2000 to 2.3 hours in 2004, Department of Education statistics show.

Again, the drop is not terribly impressive– the real scandal here is that science only gets two and a half hours a week. The kids spend more time than that eating lunch, for God’s sake… But if you want something to blame for the arguably-significant drop in scores, NCLB is as good as anything.

8 comments

  1. The effect of NCLB is that the kids stop getting taught a month earlier than they would have (testing occurs in mid-May, and the school year ain’t over ’til mid-June). At the beginning of the year, the first month is testing and eval to find out which parts of the tests the kids are most likely to fail, so those parts can be covered more heavily. You’ve lost a month of teaching on either end, and you’ve squeezed bits of the curriculum out of the schedule, since the kids will likely at least meet the minimum requirement. It just means playing catch-up in those areas next year, after you’ve wasted a month determining that not covering those areas in the intended detail has left the kids behind in them. What a vicious little circle we’ve created. But don’t take my word for it; this is just what my wife, a teacher of 11 years, tells me.

  2. Hopefully the national test of K-12 science proficiency draws a more representative sample than the freshman class at any given university… I’d expect your class to show significantly larger year-to-year proficiency fluctuations than the national mean.

    It would be interesting to see if there are statistically significant cultural causal factors on these time scales– say, any particularly dumb TV shows that were popular in the relevant period and likely to actively decrease scientific comprehension.
    Or vice versa; maybe we’ll see a “CSI” effect in the next cohort 😉

  3. Hopefully the national test of K-12 science proficiency draws a more representative sample than the freshman class at any given university… I’d expect your class to show significantly larger year-to-year proficiency fluctuations than the national mean.

    Absolutely.
    The fluctuations I see are probably more on the 5% level, if not bigger, so a percent or two nationally wouldn’t surprise me.

    The problem is, given that they only give the test every four or five years, there’s really no way to say what the natural fluctuations are. For the data to really be useful, you’d need annual scores, and the only really sensible comparison would probably be to something like a three-year moving average.
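
    (A toy sketch of that moving-average idea, on made-up annual scores– made up precisely because the test only runs every four or five years:)

    ```python
    # Toy three-year moving average over hypothetical annual scores.
    # The actual test runs every 4-5 years, so no such series exists.
    def moving_average(values, window=3):
        return [sum(values[i:i + window]) / window
                for i in range(len(values) - window + 1)]

    annual = [57, 55, 58, 54, 56, 53, 54]  # invented twelfth-grade scores
    print(moving_average(annual))  # smooths out single-class fluctuations
    ```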

  4. I’m a little surprised to see a physicist waving off statistically significant results because they didn’t “leap out” at you. One hopes you’ll get a similar criticism on your next NSF grant application from some math-ignorant reviewer and take it with equal aplomb. (It’s also disingenuous to later call the results “arguably significant”.)

    I won’t argue with your political stance except to say that if the scores had risen, the White House would be all over television claiming a victory for NCLB and the Bush education policy.

  5. Oh, come on, Dr. Pain. Surely there are bound to be deviations for whatever reasons; it’s not a measurement of the electron mass. A 2-3% change, up or down, is no reason to be alarmed. I think the general state of education (which to me sounds like it has not changed much in the past 10 years) is what we should be concerned about– the absolute value, rather than some insignificant relative change.

    I am not sure how pointing out (correctly, I might add!) the differences between the 5- and 10-year-old datasets– which to me simply do not paint a self-consistent story at all– equals being “math-ignorant”.

    It seems like the kind of situation where journalists can make any point they want– “test scores are up!” or “test scores are down!”– depending on the spin they want to put on it. Since people don’t go into the details of the actual data, the way Chad did, and because, let’s face it, most people, journalists included, are “math-ignorant”, the public will buy anything.

    I think there ought to be a “fact-checking” police that can say to any newspaper or internet story, “Now, wait a minute…”

  6. “Statistical significance” is good, but it only goes so far. Saying that a difference between two results is statistically significant is a narrow mathematical statement that the difference is larger than can be explained by random fluctuations due to the sampling process. It’s not the whole story, by any means– that’s why, in precision physics measurements, they always include extra uncertainty due to systematic effects that can cause changes in the measured values that have nothing to do with statistics.

    “Statistically significant” is also often an artificially rigid distinction. A colleague told a story about hearing a presentation on some medical experiments, where including one patient who had been excluded changed a result from statistically insignificant to statistically significant (a toy demonstration of how that happens is at the end of this comment). To a physicist, that just sounds crazy– if your signal is that dependent on the precise value of one data point, you need to do the experiment again, or nobody is going to believe you.

    In this specific case, I’m happy to believe that the difference between the 1996 scores and the 2005 scores is larger than can be explained through statistical uncertainty alone. I’d need to see some more data before I’ll believe that the difference is large enough compared to the natural variation from one class of students to the next (or one test to the next) to constitute a real crisis in science education.
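
    (The toy demonstration mentioned above– entirely invented numbers, chosen only to show how one point can drag a test across the p = 0.05 line:)

    ```python
    # One extra data point drags a one-sample t-test across p = 0.05.
    # The "effects" are invented for illustration, not real medical data.
    from scipy.stats import ttest_1samp

    effects = [3.0, -1.0, 2.5, -0.5, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0]
    print(ttest_1samp(effects, 0.0).pvalue)          # ~0.055: not "significant"
    print(ttest_1samp(effects + [2.0], 0.0).pvalue)  # ~0.027: "significant"
    ```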

  7. A statistically significant difference is not necessarily a substantive difference. I can show you tons of papers in AI that show statistically significant improvements in algorithms that no human using those algorithms would ever be able to detect.

    Also, the bigger the sample size, the more likely you are to find a significant difference. That’s simply because null hypotheses are much narrower than alternatives (mu = 0 vs. mu != 0, for example), so with a large enough sample, the tiniest deviation from the null can be statistically significant.
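
    (A quick sketch of that effect: hold a tiny difference in proportions fixed and grow the sample– illustrative numbers only:)

    ```python
    # Fixed tiny effect (57.5% vs. 57.0%), growing sample size:
    # the two-sided p-value eventually dives under 0.05 on its own.
    from math import sqrt
    from scipy.stats import norm

    def p_value(p1, p2, n):
        pool = (p1 + p2) / 2  # equal group sizes assumed
        se = sqrt(pool * (1 - pool) * 2 / n)
        return 2 * norm.sf(abs(p1 - p2) / se)

    for n in (100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,}: p = {p_value(0.575, 0.570, n):.4f}")
    ```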

  8. maybe it’s a good example why people should be taught statistics? someone ought to blog about that!
    oh, wait…

Comments are closed.