So, as I promised “yesterday,” here are some additional reflections on the deficiencies of “null hypothesis testing” (NHT).
Actually, my objection is to the convention of permitting researchers to treat “rejection of the null, p < 0.05” as evidence for crediting their study hypotheses.
In one fit-statistic variation or another, “p < 0.05” is the modal “reported result” in social science research.
But the idea that a p-value supports any inference from the data is an out-and-out fallacy of the rankest sort!
Because of measurement error, any value will have some finite probability of being observed whatever the “true” value of the quantity being measured happens to be. Nothing at all follows from learning that the probability of obtaining the precise value observed in a particular sample was less than 5%—or even less than 1% or less than 0.000000001%—on the assumption that the true value is zero or any other particular quantity.
What matters is how much more or less likely the observed result is in relation to one hypothesized true value than another. From that information, we can determine the inferential significance of the data: that is, we can determine whether the data support a particular hypothesis, and if so, how strongly. But if we don’t have that information at our disposal and a researcher doesn’t supply it, then anything the researcher says about his or her data is literally meaningless.
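In standard likelihood-ratio notation (my formalization, not any particular critic's), the quantity that matters is

$$ \mathrm{LR} = \frac{P(\text{observed data} \mid H_1)}{P(\text{observed data} \mid H_0)} $$

If LR exceeds 1, the data support $H_1$ over $H_0$, and its magnitude tells us how strongly; if it is less than 1, the data support $H_0$. A p-value, which conditions on $H_0$ alone, supplies neither piece of information.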
This is likely to seem obvious to most of the 14 billion readers of this blog. It is--thanks to a succession of super smart people who've helped to spell out this "NHT fallacy" critique (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999a, 1999b; Gigerenzer 2004).
As these critics note, though, the problem with NHT is that it supplies a mechanical testing protocol that elides these basic points. Researchers who follow the protocol can appear to be furnishing us with meaningful information even if they are not.
Or worse, they can declare that a result that is “significant at p < 0.05” supports all manner of conclusions that it just doesn’t support—because as improbable as it might have been that the reported result would be observed if the “true” value were zero, the probability of observing such a result if the researcher’s hypothesis were true is even smaller.
2. This straw man has legs
I know: you think I’m attacking a straw man.
I might be. But that straw man publishes a lot of studies. Let me show you an example.
In one recent paper--one reporting the collection of a trove of interesting data that definitely enrich scholarly discussion--a researcher purported to test the “core hypothesis” that “analytic thinking promotes endorsement of evolution.”
That researcher, a very good scholar, reasoned that if this was so, “endorsement of evolution” ought to be correlated with “performance on an analytic thinking task.” The task he chose was the Cognitive Reflection Test (Frederick 2005), the leading measure of the capacity and motivation of individuals to use conscious, effortful “System 2” information processing rather than intuitive, affect-driven “System 1” processing.
After administering a survey to a sample of University of Kentucky undergraduates, the researcher reported finding the predicted correlation between the subjects' CRT scores and their responses to a survey item on beliefs in evolution (p < 0.01). He therefore concluded:
- "analytic thinking consistently predicts endorsement of evolution”;
- “individuals who are better able to analytically control their thoughts are more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species";
- “[the] results suggest that it does not take a great deal of analytic thinking to overcome creationist intuitions.”
If you are nodding your head at this point, you really shouldn’t be. This is not nearly enough information to know whether the author’s data support any of the inferences he draws.
In fact, they demonstrably don’t.
Here is a model in which belief in science's understanding of evolution (i.e., one that doesn't posit “any supreme being guid[ing] ... [it] for the purpose of creating humans”) is regressed on the CRT scores of the student-sample respondents:
The outcome variable is the probability that a student will believe in evolution.
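For readers who want to see the mechanics, here is a minimal sketch of how a logit model of this kind can be fit in Python with statsmodels. The data are simulated stand-ins and every variable name is hypothetical; this is not the study's actual dataset, just the same type of analysis:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in for the study data (everything here is hypothetical):
# a 0-3 CRT score and a 0/1 evolution-belief indicator with a weak
# positive relationship and a minority of "believers" overall.
n = 300
crt = rng.integers(0, 4, size=n)
p_believe = 1 / (1 + np.exp(-(-0.9 + 0.2 * crt)))   # weak CRT effect
df = pd.DataFrame({'crt': crt,
                   'believes_evolution': rng.binomial(1, p_believe)})

X = sm.add_constant(df['crt'])                       # intercept + CRT
fit = sm.Logit(df['believes_evolution'], X).fit(disp=False)
print(fit.summary())                                 # coefficient & p-value
print('odds ratio for CRT:', np.exp(fit.params['crt']))
```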
If, as the author concludes, “analytic thinking consistently predicts endorsement of evolution,” then we should be able to use this model to, well, predict whether subjects in the sample believe in evolution, or at least to predict that with a higher degree of accuracy than we would be able to without knowing the subjects’ CRT scores.
But we can’t.
Yes, just as the author reported, there is a positive & significant coefficient on CRT.
But look at the "Count" & "Adjusted Count" R^2s.
The first reports the proportion of subjects whose “belief in evolution” was correctly predicted (based on whether the predicted probability for them was > or < 0.50): 62%.
That's exactly the proportion of the sample that reported not believing in evolution.
As a result, the "adjusted count R^2" is "0.00." This statistic reflects the correct predictions the model makes in excess of those one would get by just predicting the most frequent outcome in the sample for every case, expressed as a proportion of the cases that simple strategy gets wrong.
Imagine that a reasonably intelligent person were offered a prize for correctly “predicting” any study respondent's “beliefs,” knowing only that a majority of the sample professed not to accept science's account of the natural history of human beings. Obviously, she'd “predict” that any given student “disbelieves” in evolution. This “everyone disbelieves” model would have a predictive accuracy rate of 62% were it applied to the entire sample.
Knowing each respondent's CRT score would not enable that person to predict “beliefs” in evolution with any greater accuracy than that! The students’ CRT scores, in other words, are useless, predictively speaking.
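To make the arithmetic concrete, here is how those two statistics are computed, carrying over the hypothetical `df`, `X`, and `fit` objects from the sketch above:

```python
# Count R^2: share of cases classified correctly when a predicted
# probability > 0.50 is scored as "believes."
pred = (fit.predict(X) > 0.50).astype(int)
count_r2 = (pred == df['believes_evolution']).mean()

# Baseline: just predict the modal outcome ("disbelieves") for everyone.
modal_share = df['believes_evolution'].value_counts(normalize=True).max()

# Adjusted count R^2: correct predictions in excess of the baseline,
# as a share of the cases the baseline strategy gets wrong.
adj_count_r2 = (count_r2 - modal_share) / (1 - modal_share)
print(count_r2, modal_share, adj_count_r2)
```

With an effect as weak as the one simulated above, the model will typically classify everyone as a disbeliever, so `count_r2` equals `modal_share` and `adj_count_r2` comes out at zero--exactly the pattern in the study's data.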
Here's a classification table that helps us to see exactly what's happening:
The CRT predictor, despite being "positive" & "significant," is so weak that the regression model that included it just threw up its hands and defaulted to the "everyone disbelieves” strategy.
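A table like that is nothing more than a cross-tabulation of actual against predicted outcomes. Continuing with the hypothetical `df` and `pred` from the snippets above:

```python
# Rows: actual belief; columns: the model's 0.50-threshold prediction.
# When the model defaults to "everyone disbelieves," every case lands
# in the predicted-0 column.
print(pd.crosstab(df['believes_evolution'], pred,
                  rownames=['actual'], colnames=['predicted']))
```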
The reason the “significant” difference in the CRT scores of believers & nonbelievers in the sample doesn’t support the author's conclusion-- that “analytic thinking consistently predicts endorsement of evolution”--is that the size of the effect isn’t nearly as big as it would have to be to furnish actual evidence for his hypothesis (something that one can pretty well guess is the case by just looking at the raw data).
Indeed, as the analysis I’ve just done illustrates, the observed effect is actually more consistent with the prediction that “CRT makes no goddam difference” in what people say they believe about the natural history of human beings than with the author’s hypothesis.
Why the hell (excuse my French) would we expect any other result? As I’ve pointed out 17,333,246 times, answers to this facile survey question do not reflect respondents' science comprehension; they express their cultural identity!
But that's not a very good reply. Empirical testing is all about looking for surprises, or at least holding oneself open to the possibility of being surprised by evidence that cuts against what one understands to be the truth.
That didn't happen, however, in this particular case.
Actually, I should point out that the author constructs two separate models: one relating CRT to the probability that someone will believe in “young earth creationism” as opposed to “evolution according to a divine plan”—something akin to “intelligent design”; and another relating CRT to the probability that someone will believe in “young earth creationism” as opposed to “evolution without any divine agency”—science’s position. It seems odd to me to do that, given that the author's theory was that “analytic thinking tends to reduce belief in supernatural agents.”
So my model just looks at whether CRT scores predict belief in science’s view of evolution—man evolving without any guidance from or plan by God—versus belief in any alternative account. That’s why there is a tiny discrepancy between my logit model’s "odds ratio" coefficient for CRT (OR = 1.23, p < 0.01) and the author’s (OR = 1.28, p < 0.01).
But it doesn’t matter. The CRT scores are just as useless for predicting simply whether someone believes in “young earth” creationism versus either “intelligent design” or the modern synthesis. Thirty-three percent of the author’s Univ. Ky undergrad sample reported believing in “young earth creationism.” A model that regresses that “belief” on CRT classifies everyone in the sample as rejecting that position, and thus gets a predictive accuracy rate of 67%.
3. What’s the question?
So there you go: a snapshot of the pernicious vitality of the NHT fallacy in action. A researcher who has in fact collected some very interesting data announces empirical support for a bunch of conclusions that aren’t supported by them. What licenses him to do so is a “statistically significant” difference between an observed result and a value—zero difference in mean CRT scores—that turns out to be way too small to support his hypothesis.
The relevant question, inferentially speaking, is,
How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection do “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they don't?
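Or, in the same likelihood-ratio notation as before, with $\Delta$ standing for the observed believer-nonbeliever difference in mean CRT scores:

$$ \mathrm{LR} = \frac{P(\Delta \mid \text{CRT predicts evolution beliefs})}{P(\Delta \mid \text{CRT makes no difference})} $$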
That’s a super interesting problem, the sort one actually has to use reflection to solve. It's one I hadn't thought hard enough about until engaging the author's interesting study results. I wish the author, a genuinely smart guy, had thought about it in analyzing his data.
I’ll give this problem a shot myself “tomorrow.”
For now, my point is simply that the convention of treating "p < 0.05" as evidence in support of a study hypothesis is what prevents researchers from figuring out what question they should actually be posing to their data.
Cohen, J. The Earth Is Round (p < .05). American Psychologist 49, 997-1003 (1994).
Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference for Psychological Research. Psychological Review 70, 193-242 (1963).
Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).
Gigerenzer, G. Mindless Statistics. Journal of Socio-Economics 33, 587-606 (2004).
Goodman, S.N. Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Annals of Internal Medicine 130, 995-1004 (1999a).
Goodman, S.N. Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. Annals of Internal Medicine 130, 1005-1013 (1999b).
Rozeboom, W.W. The Fallacy of the Null-Hypothesis Significance Test. Psychological Bulletin 57, 416-428 (1960).