A substantial fraction of statistical misunderstandings fall into a half-dozen categories - the Six Deadly Sins of Statistical Misrepresentation. I offer examples of these errors below; while they are drawn mostly from criminology and aviation (domains with which I am particularly familiar), they have plenty of counterparts elsewhere. My hope is to help audiences of the popular media - that is, just about everybody - to detect difficulties often apparent only to those with independent information about the subject, and to discourage fellow citizens from taking a strong position or course of action based solely on a press report.
Statistics about unusual sub-populations are often interpreted as applying to an entire population. Such extrapolation can yield misleading and even ludicrous results.
There's a problem here: the only contributors to the data analysis were people who had suffered heart attacks - and had survived them. Thus, the newspaper's implied advice to the broader population - "keep cool" - may have been misguided. Although the study indicated that vigorously "blowing off steam" seems to raise the immediate risk of heart attack, such releases of tension could serve to reduce the overall long-term risk. But people who had freely expressed anger throughout their lives - and who had, perhaps as a result, managed to live to old age without a heart attack - could never make it into the researchers' sample. It is also possible, though perhaps far-fetched, that those whose heart attacks were instigated by anger were better able to survive them than are other such victims. Were that the case, angry people could be overrepresented in the sample by virtue of their ability to survive a heart attack and thus become available for an interview.
To illuminate the difficulty, let's look at a couple of examples - one real, the other hypothetical. If you looked at the age at death of deceased rock-and-roll stars (Buddy Holly, Jimi Hendrix, Janis Joplin, Jim Morrison, Kurt Cobain, et al.), you might superficially conclude that rock stars die about 40 years younger than the general population. This interpretation is invalid, though, because the sample is biased, systematically excluding those icons of rock who are still alive; for all we know, Mick Jagger might live to be 90. The same problem afflicts analyses of angry Americans - the analyses are restricted to those among them who get heart attacks.
More hypothetically, suppose that disease X, if untreated, is fatal 20 percent of the time. Now imagine that there is a widely used surgical procedure for this disease that kills 1 percent of the patients who undergo it but that cures the other 99 percent. Of those people whose deaths are attributable to disease X, an awful lot will have spent their final hours on the operating table. Viewed in isolation, this might suggest that the surgery is highly dangerous. But in neglecting to look at the 99 percent who were cured, this "last hour before death" analysis totally ignores the benefits of the procedure. In the same way, a study that limits its purview to known victims of heart attacks obscures any possible benefit to the heart of releasing tension through anger.
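A quick back-of-the-envelope tally makes the distortion concrete. The sketch below assumes a hypothetical cohort of 10,000 patients, all of whom undergo the surgery; the rates are the ones given above.

```python
# Hypothetical cohort illustrating the "last hour before death" fallacy.
# Assumptions: 10,000 patients with disease X; untreated, 20 percent would die;
# the surgery kills 1 percent of patients and cures the remaining 99 percent.
patients = 10_000

deaths_without_surgery = 0.20 * patients   # 2,000 deaths if no one operates
deaths_with_surgery = 0.01 * patients      # 100 deaths, all on the operating table
lives_saved = deaths_without_surgery - deaths_with_surgery

print(f"Deaths if untreated:         {deaths_without_surgery:.0f}")
print(f"Deaths if everyone operated: {deaths_with_surgery:.0f}")
print(f"Net lives saved by surgery:  {lives_saved:.0f}")
# Yet 100 percent of the disease-X deaths in the operated group occur during
# surgery - which is exactly what a sample restricted to the deceased would show.
```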
The mystery was solved in the fine print at the bottom of the ad, which revealed that the relevant survey was "conducted among Midway Metrolink passengers between New York and Chicago." In other words, the only passengers eligible for inclusion in the survey were the 8 percent who were already flying Midway. To treat the sample as representative, one would have to make the startling assumption that Midway's popularity among those who fly it was the same as among those who don't. If there was any surprise at all in the results, it was that one in six Midway passengers apparently preferred to be flying on a different airline.
Journalists sometimes attach great importance to random data shifts that may already be irrelevant by the time they are reported. Admittedly, it's not always easy to distinguish a mere fluctuation from the start of a meaningful trend. The effort to do so is worth making, however, and in some cases pays off quickly.
But because fatal air accidents involving U.S. jets are exceedingly rare, even airlines with the same safety record over the long run can differ in safety performance over short spans. Indeed, if a ranking of carriers by safety reflects mere fluctuations, it should be highly changeable as the observation period varies. As the table below shows, this is indeed the case. The table ranks the eight large U.S. jet carriers by the death risk for a person who randomly chose one of the airline's flights during 10-year periods ending in 1983, 1988, and 1993. The lower the numbers, the fewer the fatalities. (Airlines with no deaths at all during a period are starred; these are ranked by number of flights performed.)
To put it delicately, the results cannot be characterized as stable. The first-ranked airline was different in all three periods and, strikingly, the airline that was best in one period always fell in the bottom half of the rankings in the other two. Southwest Airlines had a perfect record over all three periods but, because it had far fewer flights than the other carriers, was in a better position than they to avoid fatalities. The two airlines that were ranked lowest in the two most recent periods (Northwest and USAir) had no passenger deaths at all in the third. The mortality data, in short, provide a pitifully tenuous basis for putting these airlines into two distinct categories - a point that was overlooked both by IAPA's analysts and by the newspapers that publicized their results.
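A simple simulation illustrates how unstable such rankings must be when fatal events are this rare. The sketch below is not based on the IAPA data: it assumes eight hypothetical carriers with identical underlying risk and an illustrative number of flights, and asks how often the "safest" carrier in one period lands in the bottom half of the ranking in the next.

```python
import numpy as np

# A sketch, not the IAPA data: eight hypothetical carriers with IDENTICAL
# per-flight fatality risk. Flight counts and the risk level are illustrative.
rng = np.random.default_rng(0)
n_carriers = 8
flights_per_period = 2_000_000   # flights each carrier performs in one period
risk_per_flight = 1e-6           # chance that a given flight ends in fatalities
n_trials = 10_000

best_falls_to_bottom_half = 0
for _ in range(n_trials):
    # Fatal accidents for each carrier in two independent observation periods.
    period1 = rng.binomial(flights_per_period, risk_per_flight, n_carriers)
    period2 = rng.binomial(flights_per_period, risk_per_flight, n_carriers)
    best = np.argmin(period1)                        # "safest" carrier in period 1
    strictly_better = np.sum(period2 < period2[best])
    if strictly_better >= n_carriers // 2:           # bottom half in period 2
        best_falls_to_bottom_half += 1

print(f"The period-1 'safest' carrier lands in the bottom half of the period-2 "
      f"ranking in {best_falls_to_bottom_half / n_trials:.0%} of simulations")
```

Even with no real differences among the carriers, the "best" airline in one period routinely looks mediocre or worse in the next.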
Even if homicides against foreign tourists in Florida occur at a low, constant rate over time, there are bound to be some periods when the rare events bunch together, much as there will be other periods when none occur at all. Suppose, for example, that over many years there is on average a 1 percent chance each day that a foreign tourist will be murdered somewhere in Florida. Such killings will average 3.65 per year (365 x .01), and the average interval between successive killings will be 100 days - long enough, presumably, to dispel inclinations to speak of a trend. But probabilistic calculations (not included here) also show that, over a full decade, the chance is nearly 3 in 10 that there will be some 12-month period with 9 or more killings; over a 20-year period, the chance of such a bloody stretch rises to roughly 1 in 2.
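Those figures can be checked with a short Monte Carlo simulation built on the same assumption - an independent 1 percent chance of such a murder on each day:

```python
import numpy as np

# Sketch of the calculation described above: each day carries an independent
# 1 percent chance that a foreign tourist is murdered somewhere in the state.
rng = np.random.default_rng(0)
DAILY_PROB = 0.01
DAYS = 365 * 10        # a full decade
THRESHOLD = 9          # killings in one 12-month window that would look like a "wave"
TRIALS = 20_000

waves = 0
for _ in range(TRIALS):
    days = (rng.random(DAYS) < DAILY_PROB).astype(int)
    cum = np.concatenate(([0], np.cumsum(days)))
    window_counts = cum[365:] - cum[:-365]      # every sliding 365-day total
    if window_counts.max() >= THRESHOLD:
        waves += 1

# The simulated chance should land near the "nearly 3 in 10" figure cited above.
print(f"Chance of some 12-month stretch with {THRESHOLD}+ killings: {waves / TRIALS:.2f}")
```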
In the six months following October 1993, the press fell silent on the subject of murders of foreigners in Florida. Conceivably, a menacing trend was reversed because of sensible measures it provoked, such as the elimination of visible evidence that a car is rented. But it is also quite possible that there was no real trend to reverse, and that the pattern no more signaled heightened danger to foreign tourists than a year without murders would have signaled a future free of risk.
Summary statistics about two large sets of data can invite conclusions that would not stand if the sets were examined individually, in greater detail. Comparisons of overall averages can yield particularly distorted impressions.
But some arithmetic raises doubts that the market's "invisible hand" was responsible for this sag. It seems reasonable to assume that perhaps 25 percent of the doctors practicing in 1970 (some 83,500) had retired by 1982, leaving about 250,000 at work. This means that roughly half the 480,000 doctors working in 1982 had begun practicing during the intervening 12 years. Because of this large influx, the typical physician in 1982 was probably younger than his or her 1970 counterpart. And since salaries tend to increase with age, the decline the magazine saw might well have reflected a downward shift in the age distribution among doctors rather than reduced compensation at any given age.
In fact, it is possible that the salaries of doctors in every age group actually went up during the period 1970-82, but that a dramatic downward shift in the overall age profile of physicians overshadowed this rise and pushed down the profession's average pay. Indeed, the minimal size of the reported drop in salary (4 percent) suggests that an age-by-age comparison might well have shown that doctors' annual pay was rising along with their numbers.
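A stylized example shows how every age group's pay can rise while the overall average falls. The head counts below echo the rough estimates above; the salary figures and the 1970 age split are invented solely to illustrate the arithmetic.

```python
# Hypothetical illustration of the composition effect described above.
# The head counts echo the rough estimates in the text; the salary figures
# and the 1970 age split are invented purely to show the arithmetic.
docs_1970 = {"senior": (234_000, 60_000), "junior": (100_000, 35_000)}
docs_1982 = {"senior": (250_000, 63_000), "junior": (230_000, 36_750)}  # 5% raises in both groups

def average_pay(groups):
    total_pay = sum(n * salary for n, salary in groups.values())
    total_docs = sum(n for n, _ in groups.values())
    return total_pay / total_docs

avg_1970, avg_1982 = average_pay(docs_1970), average_pay(docs_1982)
print(f"1970 average pay: ${avg_1970:,.0f}")   # about $52,500
print(f"1982 average pay: ${avg_1982:,.0f}")   # about $50,400 - roughly 4 percent lower,
                                               # even though pay rose 5 percent in each group
```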
Each airline's on-time score depends on its performance ratings at the 30 individual airports, but the airports an airline serves frequently have a greater effect on its score than those it serves rarely. The averages thus naturally favor an airline that mostly flies in and out of fair-weather airports over airlines that serve cities frequently socked in by rain or fog.
For example, America West Airlines routinely outperforms Alaska Airlines in overall on-time performance, but on further inspection this victory seems hollow. Alaska serves only five of the thirty busiest airports and, as we can see from the following table, it was prompter than America West in June 1991 at all five. But if one computes the average performance for flights into those five airports, America West receives a better rating. This counterintuitive result arises because a large majority (73 percent) of America West's flights into these five airports arrive at desert-sun Phoenix. Thus, America West's 92.1 percent on-time record at Phoenix dominates its five-airport statistic. Alaska Airlines scored even better in Phoenix than America West did (94.8 percent on time), but because only 6 percent of Alaska Airlines's flights go into or out of Phoenix, this result has little effect on its five-city average. By contrast, 57 percent of Alaska's flights arrive at Seattle - one of the moody weather capitals of the world - as opposed to only 4 percent of America West's. In the five-city average, in other words, America West gets to put its best foot forward and bury one of its weakest scores; Alaska Airlines is forced into the opposite position.
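The reversal is easy to reproduce. In the sketch below, the Phoenix on-time rates and the Phoenix and Seattle traffic shares are the figures quoted above; the remaining airports' rates and shares are hypothetical, chosen only so that Alaska beats America West at every single airport.

```python
# Weighted on-time averages for two carriers across five airports.
# The Phoenix on-time rates (92.1 vs. 94.8) and the Phoenix/Seattle traffic
# shares come from the figures quoted above; everything else is hypothetical.
airports = ["Phoenix", "Seattle", "Airport C", "Airport D", "Airport E"]

on_time = {                      # percent of flights arriving on time
    "America West": [92.1, 78.0, 80.0, 81.0, 82.0],
    "Alaska":       [94.8, 80.0, 83.0, 84.0, 85.0],   # better at EVERY airport
}
traffic_share = {                # fraction of each carrier's flights at each airport
    "America West": [0.73, 0.04, 0.09, 0.08, 0.06],
    "Alaska":       [0.06, 0.57, 0.15, 0.12, 0.10],
}

for carrier in on_time:
    avg = sum(rate * share for rate, share in zip(on_time[carrier], traffic_share[carrier]))
    print(f"{carrier:12s} weighted average: {avg:.1f}% on time")
# America West ends up near 89 percent and Alaska near 82 percent, even though
# Alaska is prompter at all five airports - the weights, not the performance, decide.
```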
Fundamental misunderstandings of statistical results can arise when two words or phrases are unwisely viewed as synonyms, or when an analyst applies a particular term inconsistently.
The Supreme Court understood the study the same way. Its majority opinion noted that "even after taking account of 39 nonracial variables, defendants charged with killing white victims were 4.3 times as likely to receive a death sentence as defendants charged with killing blacks."
But the Supreme Court, the New York Times, and countless other newspapers and commentators were laboring under a major misconception. In fact, the statistical study in McCleskey v. Kemp never reached the "factor of four" conclusion so widely attributed to it. What the analyst did conclude was that the odds of a death sentence in a white-victim case were 4.3 times the odds in a black-victim case. The difference between "likelihood" and "odds" (defined as the likelihood that an event will happen divided by the likelihood that it will not) might seem like a semantic quibble, but it is of major importance in understanding the results.
The likelihood, or probability, of drawing a diamond from a deck of cards, for instance, is 1 in 4, or 0.25. The odds are, by definition, 0.25/0.75, or 0.33. Now consider the likelihood of drawing any red card (heart or diamond) from the deck. This probability is 0.5, which corresponds to odds of 0.5/0.5, or 1.0. In other words, a doubling of the probability from 0.25 to 0.5 results in a tripling of the odds.
The death penalty analysis suffered from a similar, but much more serious, distortion. Consider an extremely aggravated homicide, such as the torture and killing of a kidnapped stranger by a prison escapee. Represent as PW the probability that a guilty defendant would be sentenced to death if the victim were white, and as PB the probability that the defendant would receive the death sentence if the victim were black. Under the "4.3 times as likely" interpretation of the study, the two values would be related by the equation:

PW = 4.3 x PB
If, in this extreme killing, the probability of a death sentence is very high, such that PW = 0.99 (that is, 99 percent), then it would follow that PB = 0.99/4.3 = 0.23. In other words, even the hideous murder of a black would be unlikely to evoke a death sentence. Such a disparity would rightly be considered extremely troubling.
But under the "4.3 times the odds" rule that reflects the study's actual findings, the discrepancy between PW and PB would be far less alarming. This yields the equation:

PW/(1 - PW) = 4.3 x PB/(1 - PB)
If PW = 0.99, the odds in a white-victim case are 0.99/0.01, or 99; in other words, a death sentence is 99 times as likely as the alternative. But even after being cut by a factor of 4.3, the odds in the case of a black victim would take the revised value of 99/4.3 = 23, meaning that the perpetrator would be 23 times as likely as not to be sentenced to death. That is:

PB/(1 - PB) = 23
Work out the algebra and you find that PB = 0.96. In other words, while a death sentence is almost inevitable when the murder victim is white, it is nearly as inevitable when the victim is black - a result that few readers of the "four times as likely" statistic would infer. While not all Georgia killings are so aggravated that PW = 0.99, the study found that the heavy majority of capital verdicts arose in circumstances in which PW, and thus PB, is very high.
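The conversion from odds back to probability is mechanical enough to check in a few lines - a sketch of the arithmetic above, not of the study's full statistical model:

```python
def apply_odds_multiplier(p_white: float, multiplier: float) -> float:
    """Given the probability of a death sentence in a white-victim case and an
    odds multiplier, return the implied probability in a black-victim case."""
    odds_white = p_white / (1 - p_white)     # e.g. 0.99 -> odds of 99
    odds_black = odds_white / multiplier     # cut the odds by the multiplier
    return odds_black / (1 + odds_black)     # convert odds back to probability

p_white = 0.99
print(f"'4.3 times the odds'  -> PB = {apply_odds_multiplier(p_white, 4.3):.2f}")  # ~0.96
print(f"'4.3 times as likely' -> PB = {p_white / 4.3:.2f}")                        # ~0.23
```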
None of this is to deny that there is some evidence of race-of-victim disparity in sentencing. The point is that the improper interchange of two apparently similar words greatly exaggerated the general understanding of the degree of disparity. Blame for the confusion should presumably be shared by the judges and the journalists who made the mistake and the researchers who did too little to prevent it.
(Despite its uncritical acceptance of an overstated racial disparity, the Supreme Court's McCleskey v. Kemp decision upheld Georgia's death penalty. The court concluded that a defendant must show race prejudice in his or her own case to have the death sentence countermanded as discriminatory.)
Fortunately, the debris fell harmlessly in a remote part of Australia. But the lesson is that an elusive word like "someone" is not useful in describing an event. When a word can be construed in different ways, the reader and even the data analyst can unintentionally jump from one interpretation to another, as presumably NASA did when it first equated "someone" to "at least one" but then shifted to "exactly one" in the middle of its calculation.
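The general distinction is easy to illustrate with made-up numbers. The sketch below is a generic example, not a reconstruction of NASA's actual reentry calculation: it assumes each of a million people independently faces a one-in-a-million chance of being struck, and shows that "at least one person is struck" and "exactly one person is struck" are very different events.

```python
# Generic illustration (not NASA's actual figures): N people, each independently
# facing a small probability p of being struck by falling debris.
N = 1_000_000
p = 1e-6

p_no_one = (1 - p) ** N                      # probability nobody is struck
p_at_least_one = 1 - p_no_one                # "someone" read as "at least one"
p_exactly_one = N * p * (1 - p) ** (N - 1)   # "someone" read as "exactly one"

print(f"P(at least one person struck) = {p_at_least_one:.3f}")   # ~0.632
print(f"P(exactly one person struck)  = {p_exactly_one:.3f}")    # ~0.368
# Sliding between the two readings mid-calculation changes the answer substantially.
```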
Press accounts of scientific studies sometimes invite readers to reach conclusions by comparing a reported statistic with some other that supposedly represents a natural baseline. But the proposed baseline may be anything but natural.
But the magazine advanced this point with incomplete evidence. In fact, it was perfectly conceivable that the top 10 percent was paying a growing share of the nation's taxes simply because this group's share of the nation's income was going up. In this particular case, the faulty assumption was not fatal because the unchecked data about earnings among the wealthy supported the story's claim: the share of income amassed by the wealthiest 10 percent of Americans changed very little from 1975 to 1980. But the glib comparison between the two years was unsound, and invoking the same "top tenth" argument for the 1980s - when the Census Bureau reports that U.S. income inequality did indeed rise sharply - would produce a quite misleading result.
Questionable analyses in the first five categories can be spotted by anyone with a little knowledge of statistical methods. A more insidious type of misinformation unravels only when a reader probes the numbers and looks at their source.
A brochure put out separately by the manufacturer of this test kit, however, was extremely disturbing. It showed that the 99.5 percent estimate was based on data summarized in the table that follows. The table does indicate only 1 error in 200 assessments, but it raises two questions. Why were 99 percent of the women tested - 198 out of 200 - pregnant? And, even more strangely, why was the accuracy of the test for nonpregnant women estimated from a sample size of two?
Things got worse as the brochure went on. The 2-for-2 accuracy statistic about nonpregnant women was based on an analysis of the test results by laboratory technicians. But the main advantage of a home pregnancy test is that women can use it themselves. The brochure took account of this issue by reporting what happened when the women interpreted results on their own: of 101 such women who were not pregnant, 8 mistakenly concluded that they were.
In other words, the manufacturer had two accuracy results about nonpregnant women. One, based on a (presumably) representative sample of the product's users, showed an error rate of 8 percent in 101 trials. The other, based on a "sample" of laboratory technicians, obtained a 0 percent error rate over 2 trials. In its advertising, the manufacturer applied the 0 percent rate in the small expert sample and ignored the 8 percent rate in the large, unbiased one.
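A quick calculation shows just how little two trials can prove. The 95 percent bound below is a standard exact binomial upper limit, not a figure from the brochure:

```python
def exact_upper_bound(errors: int, trials: int, confidence: float = 0.95) -> float:
    """Exact binomial 95% upper confidence bound on the true error rate
    when zero errors are observed in `trials` attempts."""
    assert errors == 0, "this shortcut handles only the zero-error case"
    alpha = 1 - confidence
    return 1 - alpha ** (1 / trials)

print(f"0 errors in 2 trials -> true error rate could be as high as "
      f"{exact_upper_bound(0, 2):.0%}")                                # about 78%
print(f"Observed error rate in the 101-woman sample: {8 / 101:.0%}")   # about 8%
```

In other words, a perfect score on two trials is statistically consistent with error rates far worse than the 8 percent actually observed among the product's users.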
The analysis began with the overall death rate per mile driven on rural interstate highways, the main thoroughfares for intercity auto trips. The researchers then revised this initial risk estimate using multipliers that reflected various characteristics of cars and drivers. Having a heavier-than-average car multiplied the risk estimate by 0.77 (that is, reduced it by 23 percent), while having a 40-year-old driver multiplied the estimate by 0.68. The final risk estimate for a particular combination of characteristics was the product of the individual multipliers.
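The method amounts to a simple chain of multiplications, sketched below. The 0.77 and 0.68 multipliers are the study's; the seat-belt and sobriety multipliers are hypothetical placeholders added to show how the chain extends - and, as the next paragraph explains, how easily correlated factors get counted more than once.

```python
# Sketch of the product-of-multipliers approach described above.
# 0.77 and 0.68 are the multipliers quoted in the text; the seat-belt and
# sobriety multipliers are hypothetical placeholders, and the base rate is
# expressed in relative terms (1.0 = the overall rural-interstate death rate).
base_risk = 1.0

multipliers = {
    "heavier-than-average car": 0.77,
    "40-year-old driver":       0.68,
    "seat belt worn":           0.80,   # hypothetical
    "sober driver":             0.70,   # hypothetical
}

adjusted_risk = base_risk
for factor, m in multipliers.items():
    adjusted_risk *= m

print(f"Adjusted risk: {adjusted_risk:.2f} x the baseline")   # about 0.29
# Multiplying all four factors assumes they are independent; to the extent that
# 40-year-olds already drive heavier cars, buckle up, and stay sober, the same
# protection is being credited several times over.
```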
Unfortunately for those who prefer to drive, this analysis greatly exaggerates the safety of driving because the risk-reduction factors are not truly independent: Part of the reason 40-year-olds die less frequently in car crashes than 18-year-olds is that the middle-aged motorists tend to drive heavier cars, wear seat belts, and stay off the road when intoxicated. Taking credit for each of these factors separately, as the study did, amounts to quadruple-counting and greatly overstates the safety of driving versus flying.
The study exacerbated this error by failing to distinguish between the safety records of different types of aircraft. In their risk calculations for 600-mile flights, the researchers worked with merged accident data for all types of aircraft. But a flight of 600 miles is almost always performed by a jet, and jets have far better safety records than propeller planes. The peculiar approximations of this study led it to conclude that the mortality risk from driving 600 miles was comparable to that of flying 600 miles. A fairer and more logical analysis would show that flying is safer by a factor of at least five.
The most cautious general course for the reader is to treat such reports more as public announcements that studies have been done than as clear guides to their content or reliability. Readers might not only look for evidence that researchers, reporters, or advertisers have committed one or more of the six deadly sins but also cultivate a general awareness that statistics can yield highly divergent interpretations. When a particular interpretation of the reported data pattern is advanced, have the analysts reasonably excluded other possibilities, or failed even to recognize them?
Ultimately, if the conclusions really matter to the reader, there is no avoiding the arduous task of finding the study and reading it. And contacting the author for further details is both wise and legitimate.
For the alert individual, statistical humbug should be no harder to ferret out than other forms of illogical argument. It just takes practice and time.