Neither a Neuroscientist Nor a Statistician

A bunch of people I follow on social media were buzzing about this blog post yesterday, taking Jonah Lehrer to task for “getting spun” in researching and writing this column in the Wall Street Journal about this paper on the “wisdom of crowds” effect. The effect in question is a staple of pop psychology these days: the claim that an aggregate of many guesses by people with little or no information will often turn out to be a very reasonable estimate of the true value. The new paper aims to show the influence of social effects, and in particular, that providing people with information about the guesses of others leads them to revise their guesses in a way that can undo the “wisdom of crowds” effect completely– you can get tight clusters of guesses around wrong values for purely social reasons.

The blog post is very long and takes quite a while to come around to the point, which is really two related claims: first, that Lehrer cherry-picked the example he used in his column, and second, that Lehrer was deceived by unjustifiable claims made about the research. The whole argument centers on this table:

[Table from the paper: the true value for each of the six questions, alongside the arithmetic mean, geometric mean, and median of the initial guesses.]

The cherry-picking charge is, I think, on the mark– Lehrer used the median value for the third question as his illustration, because that particular aggregate value of the guesses falls within 1% of the true value. However, it’s the only one of the 18 aggregate values to get that close– in fact, it’s the only one to come within 10% of the true value. Using it as the only example is a little sleazy (though not really outside the bounds of normal journalistic practice, which says more about journalistic practice than anything else).

As for the other claim… Well, as I said on Facebook last night, the author of the post, Peter Freed, lost me completely when he wrote “The geometric mean (circled in blue, below) – whatever in God’s name that is.”

This puts me in a slightly odd position, I realize, as I have often expressed skepticism about statistical hair-splitting in psychology experiments. In fact, another much-derided piece by Lehrer on the “decline effect,” which he describes as an apparent loss of validity of scientific studies over time, struck me as probably the result of taking some fairly arbitrary statistical measures a little too seriously.

In this case, though, I think the criticism of the research, and thus of Lehrer for taking it too seriously, is off base. And the manner in which it is expressed really rubs me the wrong way.

For one thing, the geometric mean is a fairly standard technique in statistical analysis for dealing with certain kinds of data. It’s something to be used cautiously– as Bill Phillips noted once in a group meeting, one second is the geometric mean of a nanosecond and thirty years (or whatever choice of absurdly large and small values you prefer– an attosecond and the age of the universe, or whatever)– but it’s a perfectly legitimate characterization of certain types of distributions.
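Just to put numbers on that anecdote, here is a minimal back-of-the-envelope check (my own sketch; the values are just the rough ones from the story, not anything from the paper):

```python
import math

def geometric_mean(values):
    """n-th root of the product of n values, computed via logs
    so that widely spread numbers don't overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# The cautionary example: a nanosecond and roughly thirty years, both in seconds.
nanosecond = 1e-9
thirty_years = 30 * 365.25 * 24 * 3600  # about 9.5e8 seconds

print(geometric_mean([nanosecond, thirty_years]))  # ~0.97, i.e. about one second
```

Which is exactly the reason for the caution: the geometric mean will happily summarize two numbers that differ by eighteen orders of magnitude as “about one second.”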

If you’re going to take people to task for abusing statistics, you really ought to know what a statistic like the geometric mean is and when it’s appropriate. Brushing it off with a “whatever in God’s name that is” comes off as just a nicely dressed version of the “just plain folks” anti-intellectualism of groups like the Tea Party, rejecting overly eggheaded science.

It’s particularly bad because the geometric mean is defined in the paper. Twice: once in the Results section, and once in the Methods and Materials section. Both of these mentions also include detailed justifications of why it’s an appropriate choice. Freed does airily dismiss the Introduction and Conclusion sections as too spin-laden to be worth reading, but evidently didn’t look all that closely at the sections he claimed to focus on, either.

Since this is the point on which the whole argument turns, Freed’s proud ignorance of the underlying statistics completely undermines everything else. His core argument is that the “wisdom of crowds” effect is bunk because the arithmetic mean of the guesses is a lousy estimate of the real value. Which is not surprising, given the nature of the distribution– that’s why the authors prefer the geometric mean.

He blasts Lehrer for using a median value as his example, without noting that the median values are generally pretty close to the geometric means– all but one are within 20% of the geometric mean– making the median a not-too-bad (and much easier to explain) characterization of the distribution. He derides the median as the guess of a “single person,” which completely misrepresents the nature of that measure– the median is the central value of a group of numbers, and thus while the digits of the median value come from a single individual guess, the concept would be meaningless without all the other guesses.

Median values are often the more appropriate choice for data that are unbounded on the high end, and thus tend to be skewed by outliers– as noted in the comments, the original “wisdom of crowds” paper a century ago used the median value as its aggregate guess. And, as the authors note at the end of the Methods and Materials section, the median should be equal to the geometric mean for a log-normal distribution, so it is a perfectly reasonable choice to characterize their data.
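To make that last point concrete, here is a minimal simulation, using made-up log-normal “guesses” rather than the paper’s actual data: the median and the geometric mean land in essentially the same place, while the arithmetic mean gets dragged upward by the long right tail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "guesses": log-normally distributed around a true value of 1000,
# with a wide enough spread that a few huge guesses dominate the right tail.
true_value = 1000.0
guesses = rng.lognormal(mean=np.log(true_value), sigma=1.5, size=10_000)

arithmetic_mean = guesses.mean()                 # inflated by the tail (roughly 3x the true value here)
geometric_mean = np.exp(np.log(guesses).mean())  # close to the true value
median = np.median(guesses)                      # close to the geometric mean, as expected for log-normal data

print(arithmetic_mean, geometric_mean, median)
```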

Even the claim that the data fail to show a “wisdom” effect (“Every single question, the arithmetic mean, and really even the geometric mean, was from a human standpoint wrong, wrong, wrong, wrong, wrong and wrong. The end.”) is off the mark for statistical reasons explained in the article. Though I prefer the description offered by one of Freed’s commenters, who makes an analogy to a “Fermi problem.” Given the open-ended nature of the questions, and the fact that they are by design questions that people in the sample wouldn’t have expert knowledge of, the guesses span many orders of magnitude, so you’re doing pretty well if you can get the answer to within a factor of 10. For this sort of process, the geometric mean is an appropriate choice of aggregate measure, and the fact that all the aggregated guesses come within a factor of 3 (and most within a factor of 2) is reasonably impressive.
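If you want to make the “factor of 10” standard concrete, the natural way to score a Fermi-style guess is by its ratio to the true value rather than its difference. A small, purely illustrative helper (the numbers are made up, not taken from the table):

```python
def within_factor(estimate, true_value, factor):
    """True if estimate is within a multiplicative factor of true_value,
    i.e. true_value / factor <= estimate <= true_value * factor."""
    ratio = estimate / true_value
    return 1.0 / factor <= ratio <= factor

# An aggregate guess of 2700 against a true value of 1000 is off by 170%
# in absolute terms, but comfortably inside a factor of 3.
print(within_factor(2700, 1000, 10))  # True
print(within_factor(2700, 1000, 3))   # True
print(within_factor(2700, 1000, 2))   # False
```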

So, essentially, Freed’s post is a long discussion of the dangers of being “spun” through not carefully considering scientific data that is based largely on not reading the underlying paper very carefully. The authors actually address most of the issues Freed raises, and have solid arguments for why they did what they did with their data. He doesn’t engage with their arguments at all, preferring to pontificate grandly about how it’s all just spin.

Moreover, this table isn’t even the point of the research– it’s just a summary of the starting conditions. The actual research looks at how the initial measures of “wisdom” change as the information available to the guessers changes, and when they look at that, they see a fairly clear narrowing of the range toward results that aren’t necessarily better than the initial guesses– in fact, the final answers are generally worse. At the same time, the self-reported confidence in the answers increases, as the guesses all converge toward the same (wrong) value. That’s the core finding, which is exactly as Lehrer describes it in the article. You can argue about how significant this effect really is– but again, if you’re going to do that, you need to engage with what they actually wrote.

So, despite all the social-media buzz about this (which I suspect has more to do with the way the conclusion plays into people’s pre-existing negative opinions of the mainstream press), I find myself deeply unimpressed. It’s an argument about the misuse of statistics by someone who proudly proclaims ignorance of some of the key statistical techniques used in the research. And it’s an argument about the dangers of not reading research carefully enough by someone who apparently hasn’t read the relevant paper very carefully.

6 comments

  1. Hey – I really like this response to my blog post & appreciate it sincerely! However in my defense, I said “whatever that is” in an attempt to empathize with my readers; I was not going to bother readers with an explanation of logarithms that had nothing to do with my point, which is that Lehrer used a statistic from a column the authors did not recommend the use of, and did not use a statistic from a column that the authors did recommend the use of. This was a paper on journalism, not statistics – Call me old school, but I believe that a journalist should represent papers as-is, not as-coud-have-been. Second, when all six answers are considered, even in the median column, rather than just cherry-picking the best one, the crowd does not look wise. I didn’t bother with the rest of the paper because that’s all I cared about – it’s premise, as I said in the piece. And third, as I will talk about in an upcoming blog post, statistics’ dislike of long right tails is *not a scientific position.* It is an aesthetic position that, at least personally, I find robs us of a great deal of psychological richness. If you care about a mean per se, go ahead and use a median. But to understand the behavior of a crowd – a real world crowd, not a group of prisoners in segregation – or of society in general, right tails matter, and extreme opinions are over-weighted. The general public knows this, as all of Lehrer’s examples – presidential elections, American Idol, and stock markets – manifest repeatedly the outsized effects of right tails on the mean. To me that is the bigger issue. I will try to get to this in my next post!

  2. Call me old school, but I believe that a journalist should represent papers as-is, not as-coud-have-been. (sic) … I didn’t bother with the rest of the paper because that’s all I cared about – it’s premise, as I said in the piece.

    How do these two statements jibe? “I want to represent the papers as is, without tainting the author(s)’s original message,” and, “I will focus only on small portions of this paper and neglect what does not fit the argument I am trying to construct,” seems a bit contradictory.

  3. I’ve let this go for the whole day, because I really don’t know what to say in response. I agree that criticizing Lehrer for cherry-picking the best data point is valid, but after that… I just really don’t have a response to blithely waving off a vast amount of research into the statistics of various types of distributions. I mean, I guess you can assert by fiat that there is no mean but the arithmetic mean, but it doesn’t exactly improve my opinion of your credibility.

  4. Thank you for illuminating this issue. I would like to add that this misuse of statistics extends to pharmaceutical studies as well. I was recently at a conference and someone in the audience asked the presenter (a Ph.D.) why the study she was presenting on neurofeedback training was so small (albeit, very impressive). She answered that the study was actually quite large (n=121) for an independent University study. She continued to say that healthcare practitioners are conditioned to believe that larger studies are better, however, pharmaceutical companies often fund their own research; therefore, they have the resources to champion larger studies. When the study is conducted by an independent institution such as a university there are often less participants, but the results are unbiased and often more credible than the larger studies.

  5. Chad – hey – I’m confused – where did I say (let alone decree by fiat) that the arithmetic mean is the only average worth having? I really don’t believe that. I feel like you focused on the right tail of the points in my post and not my median point, which is about fairly representing the data of a science article in a pop neuroscience article. Are you saying Lehrer did a good job representing that paper? That’s the real question.

    I’m not saying he did it intentionally – my whole post was aimed at saying he didn’t – but his datapoint misrepresented the paper’s underlying data. But when a reporter cites a single central tendency value with no SD, no skew, no kurtosis I think it is fair to imagine that the median reader won’t get a fair picture of the underlying data bc they are assuming its a bell shaped curve. You and I both know that in science nobody would present the median all by its lonesome; really only means are talked about that way. In my experience medians are always presented with additional information that allow the reader to understand the shape of the curve. As we both know, some people really care about tails and want to know if the curve is skewed.

    By far and away, however, my biggest point was that when you report the results of a study you need to report the outcome measure used in the paper, not some other measure they could have used but didn’t.

    I’m interested in whether you think the WSJ article fairly represented the underlying study. To me, that’s the heart of the curve – this median/mean business is a right tail.
