Web Support Services's eStoryboard

Interpreting the Numbers 

Statistical Fallacies.

Arguments that draw conclusions from numerical evidence are subject to a variety of statistical fallacies. Many of these are variants of the general argumentation fallacy of mistaking correlation for causation. Among the most common fallacies are:

Reversing causal direction (post hoc ergo propter hoc). A relationship between two variables does not in and of itself reveal which might be the cause of the other, even if the variables are measured at different points in time. Roosters do not cause the sun to rise.

In the U.S., states with the highest rates of political corruption tend to have low rates of voter turnout (note: check the validity of the measure of political corruption). Does the low voter turnout foster greater corruption, or vice versa?

Similarly, countries with corrupt regimes tend to have higher poverty rates, but there is considerable debate over what causes what.

Simpson's paradox. That some third variable might account for an observed relationship is the reason we have social scientists. One form of a spurious relationship is Simpson's paradox, in which a relationship reverses itself when the data are broken down by demographic categories.

In the case of the relationship at left, the city of Chicago has a lower pass rate on the 8th grade math test than the state as a whole (note: Chicago is included in the state totals), but for every demographic group of students, Chicago outperforms the rest of the state. 
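The reversal is easier to see with counts in hand. The sketch below uses invented numbers (not the actual Chicago or Illinois results) for two hypothetical student groups, `group_x` and `group_y`, and compares a city with the rest of its state. Under these made-up counts the city leads within both groups (85% vs. 80%, 45% vs. 40%) yet trails overall (53% vs. 76%), because most of its students belong to the lower-scoring group.

```python
# Hypothetical counts for illustration only: (students tested, students passing).
# These are NOT the Chicago/Illinois figures discussed in the text.
city = {"group_x": (200, 170), "group_y": (800, 360)}
rest_of_state = {"group_x": (900, 720), "group_y": (100, 40)}


def pass_rate(tested, passed):
    """Share of tested students who passed."""
    return passed / tested


def overall_rate(groups):
    """Pooled pass rate across all groups, so each group is weighted by its size."""
    tested = sum(t for t, _ in groups.values())
    passed = sum(p for _, p in groups.values())
    return passed / tested


for group in city:
    print(f"{group}: city {pass_rate(*city[group]):.0%}, "
          f"rest of state {pass_rate(*rest_of_state[group]):.0%}")
print(f"overall: city {overall_rate(city):.0%}, "
      f"rest of state {overall_rate(rest_of_state):.0%}")
```

The overall rate is a weighted average of the group rates, and the weights (each group's share of students) differ sharply between the two areas; that difference in composition is the "third variable" doing the work.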

Here, Gerald Bracey describes a Simpson's paradox involving SAT scores.

Cherry picking. Biased selection of social indicator data to support preconceived ideas is probably the most common statistical fallacy in public debate, a phenomenon fostered by the increasing availability of alternative social indicator measurements, especially annual time series data that permit the researcher to choose beginning and ending points for comparison.
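A minimal sketch of why free choice of endpoints is so powerful: given one hypothetical annual series (values invented for illustration, not real data), it counts how many start/end pairs show a rise and how many show a fall. A researcher with a preferred conclusion can find windows supporting either story.

```python
from itertools import combinations

# Hypothetical annual values of some social indicator (invented for illustration).
years = list(range(2000, 2008))
values = [14.0, 13.0, 12.0, 11.5, 12.0, 12.5, 13.0, 12.8]
by_year = dict(zip(years, values))


def change(start, end):
    """Change in the indicator from the start year to the end year."""
    return by_year[end] - by_year[start]


windows = list(combinations(years, 2))  # every (start, end) pair with start < end
rising = [w for w in windows if change(*w) > 0]
falling = [w for w in windows if change(*w) < 0]

print(f"{len(rising)} windows show a rise, {len(falling)} show a fall")
print(f"2000 to 2007: {change(2000, 2007):+.1f}")  # the full series falls
print(f"2002 to 2006: {change(2002, 2006):+.1f}")  # a chosen sub-window rises
```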

One brazen example of cherry picking was talk show host Bill O’Reilly’s attempt to argue that the Bush administration deserved credit for lowering the poverty rate. Comparing poverty rates, O’Reilly insisted: “The only fair comparison is halfway through Clinton's term, halfway through Bush's term” (Media Matters, 2005). O’Reilly was correct, sort of, but failed to see that a measure of the change in the poverty rate over the first four years of each president’s term would tell an entirely different story.

Edward Tufte is at his best in condemning the impact of cherry picking and related corruptions on our intellectual life. Cherry picking, he says, is the “most serious threat to learning the truth from an evidence-based report” (2006, 144). He cites evidence of a disproportionately high ratio of published studies reporting relationships that are just barely statistically significant to those reporting relationships that are barely insignificant. Most likely this is caused by researchers who fiddle with their regression equations until they get the results they want.

For the most part, cherry picking is an unintentional process: politicians and policy advocates readily embrace data that confirm their preconceived ideas and rigorously scrutinize evidence that does not fit with their view of how the world works.


How President Bush lowered the poverty rate

CALLER: Hi, Bill.
O'REILLY: Larry.
CALLER: Let's see, poverty is up since Bush took office.
O'REILLY: That's not true.
CALLER: It is true.
O'REILLY: I have the stats right here, Larry.
CALLER: I just looked at the figures. Gun crime is up since George Bush took office.
O'REILLY: All right, Larry, hold it, hold it, hold it. Let's deal with one at a time. The only fair comparison is halfway through Clinton's term, halfway through Bush's term, OK? That's the only fair comparison. You gotta go real time.
CALLER: Bill, I --
O'REILLY: Poverty is down, Larry, one full percent in real time from 1996, halfway through Clinton, 2004, halfway through Bush. That is the truth, Larry, and if you're not willing to acknowledge that's the truth, this conversation is over.
(Source: Media Matters, 2005)

Instrumentation. Instrumentation error suggests that the observed numerical comparison may be due to measurement unreliability.  

Throughout the 1970s and 1980s, the FBI measure of violent crime, based on police reports of violent crime, increased, while the National Crime Victimization Survey, based on an annual survey of personal crime victimization, indicated that the rate was falling. The FBI measure is generally regarded as less reliable, however, because reporting of crimes to police has increased and police departments have improved their record keeping. Oddly, it is because the FBI data have become more reliable over time that time series analysis of the data has become suspect.

Instrumentation can also affect conclusions drawn from cross-sectional numerical comparisons. The high U.S. infant mortality rate, often cited as a product of the lack of universal health insurance, may also be at least partly due to the way the U.S. counts live births.

Sampling error statistics and tests of statistical significance measure one form of measurement unreliability: the probability of error in the measurement of a single statistic and the likelihood that a numerical difference is merely an artifact of sampling. For the most part, however, error due to small samples is the least consequential source of measurement error. Because most social indicator analysis does not include analysis of sampling error or statistical significance, two common statistical fallacies are avoided: concluding from a significant relationship that a meaningful relationship exists and concluding from an insignificant relationship that no relationship exists.
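Both fallacies can be illustrated with a standard pooled two-proportion z-test. In this sketch all proportions and sample sizes are hypothetical: a trivial 0.1-point difference becomes "significant" with five million cases per group, while a 15-point difference in samples of 30 does not clear the conventional 1.96 cutoff for a two-sided test at the 5 percent level.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic for two independent samples."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


CRITICAL_Z = 1.96  # conventional two-sided 5% cutoff

# Hypothetical case 1: a tiny difference measured on enormous samples.
z_big = two_proportion_z(0.501, 5_000_000, 0.500, 5_000_000)
# Hypothetical case 2: a large difference measured on small samples.
z_small = two_proportion_z(0.60, 30, 0.45, 30)

print(f"0.1-point gap, n=5,000,000 each: z = {z_big:.2f}, "
      f"significant: {abs(z_big) > CRITICAL_Z}")
print(f"15-point gap, n=30 each: z = {z_small:.2f}, "
      f"significant: {abs(z_small) > CRITICAL_Z}")
```

Significance here depends as much on sample size as on the size of the gap, which is why a significant result is not automatically a meaningful one, and an insignificant one is not proof of no relationship.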

Rate of Change Fallacy (the "Politician's Error"). The rate of change fallacy occurs when comparing rates of change (usually rates of change in rates) in two numbers that start out at different levels. The misinterpretation is so common in the interpretation of educational statistics (often involving conclusions that disadvantaged kids are improving at a faster rate) that one educational researcher, Stephen Gorard (1999), has given it a name: the “politician’s error”.

In 2006, Chicago public school officials trumpeted the apparent gain in student scores on the annual ISAT tests mandated by the No Child Left Behind law. The state pass rate on the exams had increased 8 percentage points (most probably because of revisions to the test and changes in the passing score for the 8th grade math tests), while the Chicago pass rate increased 14 percentage points.

But is an increase from a 48% to a 62% pass rate a bigger change than an increase from 69% to 77%? Consider a more extreme example: would a student who increased his test score from 40 to 64 be improving at a faster rate than one who increased his score from 91 to 99?

One way of testing for the rate of change fallacy is to look at the complement of the data, using the failure rate rather than the pass rate. In the case of the Illinois data, both the state and the city have seen similar percentage declines in their students’ failure rates.
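The check can be run on the figures quoted above (state 69% to 77%, Chicago 48% to 62%) and on the two hypothetical students. The sketch computes three views of each change: the point gain, the relative growth of the value, and the relative decline of its complement (the failure rate, or points short of 100). Treating 100 as the ceiling for the student scores is an assumption; the text does not state the maximum score.

```python
def relative_change(before, after):
    """Proportional change from before to after."""
    return (after - before) / before


def three_views(before, after, ceiling=100):
    """Point gain, relative gain, and relative decline of the complement (ceiling - value)."""
    points = after - before
    growth = relative_change(before, after)
    complement_decline = -relative_change(ceiling - before, ceiling - after)
    return points, growth, complement_decline


cases = {
    "state pass rate": (69, 77),
    "Chicago pass rate": (48, 62),
    "student A score": (40, 64),  # hypothetical student from the text
    "student B score": (91, 99),  # hypothetical student from the text
}
for name, (before, after) in cases.items():
    points, growth, decline = three_views(before, after)
    print(f"{name}: +{points} points, up {growth:.1%}, complement down {decline:.1%}")
```

On the pass-rate view Chicago's gain (29.2%) looks more than twice the state's (11.6%); on the failure-rate view the two are nearly identical (26.9% vs. 25.8%). For the two students the ranking flips entirely: student A's score grows 60.0% against B's 8.8%, but B's shortfall from 100 shrinks 88.9% against A's 40.0%.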

Interpreting the changes in black and white out-of-wedlock birthrates, Andrew Hacker concludes that “even though the number of births to unwed black women has ascended to an all-time high, white births outside of marriage have been climbing at an even faster rate” (86).

In one sense Hacker is correct: the black rate was four times higher in 1992 than it was in 1950, and the white rate was ten times higher. That something is going wrong here becomes clearer when you consider what would have happened if the black rate had increased as fast as the white rate: because a rate cannot exceed 100 percent, a tenfold increase is possible only for a rate that starts below 10 percent.

Another way of looking at these data, shown in table 1.5B, is to consider the in-wedlock birth rate instead of the out-of-wedlock birth rate. Had Hacker used these data, the complement of his own numbers, he would have had to draw the opposite conclusion: the black in-wedlock birth rate is falling much faster than the white in-wedlock rate. Hacker’s conclusion is not wrong so much as it is incomplete and misleading.

A recent study argues that high-achieving students have been left behind by the No Child Left Behind law (Duffett, Farkas, and Loveless, 2008). On the 4th grade reading test, the average scale scores for students in the lowest decile improved from 157 to 173, while scores for students in the top decile went from 260 to 263. On the 8th grade math test, the lowest decile score went up 13 points, while the top decile score increased 5. Because of the way the National Assessment of Educational Progress calculates the scale scores (there is no meaningful maximum score on the test), there is no simple way of determining whether the gains for the lowest decile are “bigger” than the gains for the top decile.

The report is also guilty of cherry picking. Note the oddity of choosing fourth grade reading and eighth grade math. The National Assessment of Educational Progress tests reading and math in both grades and has other subject matter tests for these grades and for grade 12. Of ten possible subject-grade comparisons, none shows as great a discrepancy in decile score change as the two comparisons selected.

Ecological Fallacy. The ecological fallacy occurs when drawing a conclusion about individuals from aggregate data. It is related to the logical fallacy of division. States with higher per capita incomes generally have lower rates of homeownership. The fallacy here would be to conclude that wealthier individuals are less likely to own homes.
A recent study (Galbraith and Hale, 2006) finds that the Democratic Party receives a higher share of the presidential vote in wealthier states and in states with greater income inequality. It would be an ecological fallacy to conclude from the first finding that wealthier voters vote Democratic, or from the second finding that both rich and poor voters vote Democratic.

Sociologist William S. Robinson coined the term in a 1950 article in which he observed that states with the highest percentages of foreign-born residents also had the highest literacy rates, even though the foreign born had lower literacy rates than the native-born population.
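Robinson's pattern can be reproduced with two hypothetical states (all counts invented for illustration). At the state level, the state with the larger foreign-born share has the higher literacy rate; at the individual level, the foreign born are less literate than the native born, both within each state and when everyone is pooled.

```python
# Hypothetical populations: (people, literate people) by birthplace, per state.
states = {
    "state_a": {"foreign": (100, 70), "native": (900, 720)},
    "state_b": {"foreign": (300, 270), "native": (700, 665)},
}


def rate(people, literate):
    """Literacy rate for a group."""
    return literate / people


def state_totals(groups):
    """Total people and literate people in one state."""
    people = sum(p for p, _ in groups.values())
    literate = sum(lit for _, lit in groups.values())
    return people, literate


# Aggregate (ecological) view: one data point per state.
for name, groups in states.items():
    people, literate = state_totals(groups)
    share = groups["foreign"][0] / people
    print(f"{name}: foreign-born share {share:.0%}, literacy {rate(people, literate):.1%}")

# Individual view: pool people across states by birthplace.
pooled = {}
for groups in states.values():
    for birthplace, (people, literate) in groups.items():
        prev_people, prev_literate = pooled.get(birthplace, (0, 0))
        pooled[birthplace] = (prev_people + people, prev_literate + literate)

for birthplace, (people, literate) in pooled.items():
    print(f"all {birthplace}-born: literacy {rate(people, literate):.1%}")
```

The state-level association is positive only because, in this made-up example, the foreign born happen to settle in the state where everyone is more literate.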


Measurement Validity. Measurement validity (how well a measurement captures the concept of interest) falls under Campbell’s threats to external validity and is a form of the logical fallacy of hasty generalization.

When voter turnout, traditionally measured as the percentage of the voting age population that votes, fell below 50 percent in the 1996 election, political commentators blamed turned-off voters, partisan politics (the only kind of politics), negative campaigning, and the rise of conservative talk radio.

In 2001, however, political scientist Michael McDonald compiled new data suggesting that talk about the vanishing American voter was “a myth” (2001, 963). McDonald’s analysis called attention to the denominator in the voter turnout statistic, the voting age population, and argued that we should instead use the voting-eligible population. Over recent elections, an increasing percentage of the American voting age population has not been eligible to vote. Mostly this is because of increasing immigration: both legal and unauthorized noncitizen residents are counted in the Census Bureau’s voting age population figures. In addition, in all but two states prisoners are not allowed to vote, and in 12 states even ex-felons are disenfranchised. Because the percentage of the American population that is either incarcerated or has ex-felon status has gone up dramatically since the 1980s, an increasing percentage of the voting age population cannot vote. Taking votes cast as a percentage of the voting-eligible population (the “VC/VEP” trend in figure 4.6) as our measure of turnout, we see no general decline in voter turnout since 1972, when 18-year-olds were given the franchise.
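The effect of switching denominators can be sketched with hypothetical numbers (in millions, invented for illustration). The simplified voting-eligible population here just subtracts noncitizens and disenfranchised felons from the voting age population; McDonald's published estimates involve further adjustments that this sketch does not attempt.

```python
def turnout(votes, population):
    """Votes cast as a share of the chosen denominator."""
    return votes / population


def simple_vep(vap, noncitizens, ineligible_felons):
    """Simplified voting-eligible population (an assumption, not McDonald's full method)."""
    return vap - noncitizens - ineligible_felons


# Hypothetical elections, figures in millions.
elections = {
    "earlier": {"votes": 55.0, "vap": 100.0, "noncitizens": 3.0, "felons": 0.5},
    "later": {"votes": 104.0, "vap": 200.0, "noncitizens": 18.0, "felons": 3.0},
}
for name, e in elections.items():
    vep = simple_vep(e["vap"], e["noncitizens"], e["felons"])
    print(f"{name}: VAP turnout {turnout(e['votes'], e['vap']):.1%}, "
          f"VEP turnout {turnout(e['votes'], vep):.1%}")
```

In this made-up pair, VAP turnout falls (55.0% to 52.0%) while VEP turnout rises (57.0% to 58.1%), because the ineligible share of the voting age population grows from 3.5% to 10.5%.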

So which is the better measure of voter turnout? If you look at voter turnout as a measure of how democratic a society is, the traditional voting-age numbers have greater validity. Although voting-eligible turnout is increasing and at a long-time high, this is true because so many young black males (unlikely voters to begin with) have been put in jail and, in many states, denied the right to vote for the rest of their lives, and because so many of our nation’s poor are not citizens. If all young voters (the age group least likely to vote) were incarcerated and all the poor (the economic group least likely to vote) were declared noncitizens, the American voter turnout rate would be among the highest in the world, but the United States would not be a more democratic society.

Population/sample mortality. Although sample mortality is usually not a concern with social indicator measurements, because they do not involve repeated measurements of the same sample, changes and differences in the inclusiveness of the populations surveyed can have significant effects on data comparisons. This is especially true in the case of measures of educational achievement where, sometimes deliberately, the students who are least likely to do well on the tests are often excluded from the testing.

In 1983, the “A Nation at Risk” report began with disturbing evidence of the weak performance of American students on international academic achievement tests (National Commission). Among the studies cited, American high school seniors recorded the lowest grades on the First International Science Study, but at least part of the reason for the low U.S. scores had to do with the relatively high rates of U.S. students completing high school (Medrich and Griffith, 1992).

In some situations there may also be a reverse population mortality effect. Although black 4th grade and 8th grade reading scores have improved in recent years, 12th grade scores for black students have not. Part of the reason the 12th grade scores have not gone up, however, may be the decline in the black high school dropout rate, which keeps more students who might otherwise have dropped out in the tested population (Klass, 2008, 108).

Regression artifact. Measuring before-and-after change from a base year with unusually low or high values risks a regression artifact fallacy as the indicator “naturally” regresses to the mean. 
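A small simulation makes the artifact visible. It assumes, purely for illustration, that each unit has a fixed underlying level plus independent year-to-year noise (normal distributions with arbitrary parameters). The units that look worst in the base year score noticeably better the next year on average, even though nothing about them has changed.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible
N = 10_000

# Assumption: each unit has a stable true level; each year's measurement adds independent noise.
true_level = [random.gauss(100, 10) for _ in range(N)]
base_year = [t + random.gauss(0, 10) for t in true_level]
next_year = [t + random.gauss(0, 10) for t in true_level]  # no real change between years

# Select the units that look worst in the base year (bottom 10 percent).
cutoff = sorted(base_year)[N // 10]
worst = [i for i in range(N) if base_year[i] < cutoff]

before = statistics.mean([base_year[i] for i in worst])
after = statistics.mean([next_year[i] for i in worst])
print(f"bottom-decile units: base year {before:.1f}, next year {after:.1f}")
```

Any "improvement" measured from that unusually low base year would be credited to whatever policy happened to be adopted in between, although here the gain comes entirely from the noise averaging out.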

Advocates of the “Laffer Curve” (the idea that cutting taxes will increase government revenue) often cite the beneficial impact of the Reagan administration tax cuts that were partially implemented in the 1982 fiscal year and fully implemented in 1983. Heritage Foundation economist Daniel Mitchell (2003) argues the point: “Once the economy received an unambiguous tax cut in January 1983, income tax revenues climbed dramatically, increasing by more than …28 percent after adjusting for inflation.” Note, however, that Mitchell begins his calculation in 1983, near the bottom of the Reagan recession.

Mitchell is also guilty of cherry picking. Instead of a before-and-after measurement, he has done an after-and-long-after measurement, ignoring the two years of income tax revenue reductions that took place while the tax cuts were at least partially in effect and the higher rates of revenue growth that took place in the Carter years.

When gasoline prices topped four dollars a gallon in the summer of 2008, Congress was quick to identify the culprit: oil price speculators who had bid up the price of oil to nearly $140 a barrel (the price does not even include the cost of the barrel).

If indeed the price of oil was driven up by a speculative bubble, it was a bubble that would eventually burst on its own. At the time the Senate hearings were held in June, the only question was whether a short-term correction of the bubble would be credited to the hearings or to the actual enactment of the bill.

In truth speculators are to blame for the price rise. They are speculating that Congress will do nothing to increase the supply or reduce the demand for energy and that the Congress’s budget deficits will continue to increase, driving down the value of the dollar. As Congress spends its time passing legislation such as this, it is wise speculation.

The Trend is not your Friend. Wise investors know this, but when things are looking up it is easy to forget.

It seemed to be a good time to buy a home in 2006 (and in 2000). Adjustable, no-interest, no-down-payment loans were cheap, and with rising home prices the prospects of getting another loan before the rate adjustment kicked in were good. Some, but not enough, of those who profited from the expectation that the trend would continue upward are now on their way to jail.

For one of the most interesting academic debates about allegedly dubious trend forecasting, see Julian Simon’s critique of the Club of Rome’s 1972 Limits to Growth report, which forecast the exhaustion of much of the world’s resources and an ensuing worldwide economic crisis in the 1980s. Or read about Simon’s wager with Paul Ehrlich, author of The Population Bomb.

Unfortunately, discerning which trends are merely speculative bubbles soon to be corrected by market forces and which represent inexorable forces is no easy task. The upward trend in university tuition in a later example is of the inexorable variety.


Dominant Denominator. Most social indicators are ratio measures, such as infant deaths as a proportion of live births or traffic fatalities per million miles traveled, and occasionally analysts can be deceived by a change in an indicator that is more a function of a change in the denominator than in the numerator. This can be a particular problem in the case of divisors, such as GDP, that are more volatile than the indicator’s numerator.

In 2004, a Wall Street Journal editorial criticized the Clinton administration for cuts in the defense budget: “Bill Clinton and a GOP Congress balanced the budget by withdrawing a "peace dividend" at a time when al Qaeda was declaring war” (2004). Its evidence was a chart showing the decline in defense spending as a percent of GDP (note that the WSJ time series began in 1990, ignoring earlier declines in the measure). From fiscal year 1993, the budget year before Clinton took office, to 2001 (the year before the post-9/11 increases), defense spending fell from 4.4% of GDP to 3%, a dramatic and steady decline. Much of the decline, however, was due less to cuts in military spending (already underway after the end of the Cold War) than to the dramatic economic growth and the increase in GDP in the Clinton years. In real dollars (adjusted for the GDP price deflator), military spending actually increased in Clinton’s second term, a dramatic turnaround from the post-Cold War decline that began in the first Bush administration.
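One way to check for a dominant denominator is to split the change in a ratio into the part due to the numerator and the part due to the denominator; in logs the two pieces add up exactly, since ln(N/D) = ln N - ln D. The figures below are hypothetical, chosen only so that the ratio falls from 4.4% to 3.0% as in the text; they are not actual budget or GDP numbers.

```python
import math


def decompose_ratio_change(num_before, num_after, den_before, den_after):
    """Split the log change in a ratio into numerator and denominator contributions."""
    numerator_part = math.log(num_after / num_before)
    denominator_part = -math.log(den_after / den_before)
    return numerator_part, denominator_part


# Hypothetical real-dollar figures chosen so the ratio falls from 4.4% to 3.0%.
spending = (44.0, 45.0)
gdp = (1000.0, 1500.0)

num_part, den_part = decompose_ratio_change(spending[0], spending[1], gdp[0], gdp[1])
print(f"ratio: {spending[0] / gdp[0]:.1%} -> {spending[1] / gdp[1]:.1%}")
print(f"numerator contribution {num_part:+.3f}, denominator contribution {den_part:+.3f}")
```

In this made-up case the numerator actually rises; the entire drop in the ratio, and then some, comes from growth in the denominator.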

Similarly, throughout the 1990s Ireland reported dramatic reductions in most categories of government revenues and outlays as a percent of GDP, largely due to the dramatic growth in the country’s GDP.

A general inattention to denominators, other than the basic per capita measure, is the cause of much statistical misinformation. The most commonly used adjustment for inflation, the consumer price index, overestimates inflation by an estimated one percent per year (BLS, 2007). As a result, measures of real growth are underestimated (most monetary measures, including income, spending, and prices, are actually growing faster in real terms than the inflation-adjusted measures would indicate). Because the CPI results in an artificially high poverty threshold, the overestimate has the opposite effect on poverty, making poverty appear higher over time than it actually is.
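Because the bias compounds, a one-point annual overstatement adds up. This sketch assumes hypothetical rates (4% nominal income growth, 2% true inflation, and a price index reading 3%) over 20 years; it is an arithmetic illustration, not a reconstruction of the BLS estimate.

```python
def real_growth_factor(nominal_rate, inflation_rate, years):
    """Cumulative growth of a nominal series after deflating by a price index."""
    return ((1 + nominal_rate) / (1 + inflation_rate)) ** years


YEARS = 20
NOMINAL = 0.04        # hypothetical nominal income growth per year
TRUE_INFL = 0.02      # hypothetical true inflation per year
MEASURED_INFL = 0.03  # hypothetical index that overstates inflation by 1 point

true_real = real_growth_factor(NOMINAL, TRUE_INFL, YEARS)
measured_real = real_growth_factor(NOMINAL, MEASURED_INFL, YEARS)
# A poverty threshold indexed to the biased measure drifts upward in real terms.
threshold_drift = ((1 + MEASURED_INFL) / (1 + TRUE_INFL)) ** YEARS

print(f"true real income growth: {true_real - 1:.1%}")
print(f"measured real income growth: {measured_real - 1:.1%}")
print(f"real rise in an indexed poverty threshold: {threshold_drift - 1:.1%}")
```

The same factor that understates real income growth also inflates the real value of an indexed poverty line, which is why the bias pushes the two measures in opposite directions.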

Graphical distortion.

The side-by-side charts used in a Wall Street Journal editorial, "No Politician Left Behind: Lack of money isn't the problem with education," are a classic example of data distortion. Note first that the spending data are not adjusted for inflation or for growth in the number of pupils. In theory, 500 is the maximum score on the NAEP scale-scored math tests, but no student ever reaches this standard. The average score for high school seniors on the same scale is just over 300.
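The two adjustments the editorial skipped are simple to apply: divide spending by enrollment, then deflate by a price index. The figures below are hypothetical (not the WSJ's data), chosen so that nominal spending doubles while prices rise 60% and enrollment rises 15%.

```python
def real_per_pupil(nominal_spending, pupils, price_index, base_index=100.0):
    """Spending per pupil expressed in base-period dollars."""
    return (nominal_spending / pupils) * (base_index / price_index)


# Hypothetical figures for two years.
early = real_per_pupil(100e9, 40e6, price_index=100.0)
late = real_per_pupil(200e9, 46e6, price_index=160.0)

print(f"nominal total spending growth: {200e9 / 100e9 - 1:.0%}")
print(f"real per-pupil spending growth: {late / early - 1:.1%}")
```

Under these assumptions a doubling in the raw series shrinks to a gain of under 9 percent once both adjustments are made, which changes how a side-by-side chart of spending and scores ought to look.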

Including some more recent data, and adjusting the reading score scale, we get quite a different picture:
 


Here is the fairest comparison, using 8th-grade data:


Hawthorne effect

De Moivre's paradox
