## NHT: A malignant craft norm, part 1

So, as I promised “yesterday,” here are some additional reflections on the deficiencies of “null hypothesis testing” (NHT).

Actually, my objection is to the convention of permitting researchers to treat “rejection of the null, p < 0.05” as evidence for crediting their study hypotheses.

In one fit-statistic variation or another, “*p* < 0.05” is the modal “reported result” in social science research.

But the idea that a *p*-value supports any inference from the data is an *out-and-out fallacy* of the rankest sort!

Because of measurement error, *any* value will have some finite probability of being observed whatever the “true” value of the quantity being measured happens to be. Nothing at all follows from learning that the probability of obtaining the precise value observed in a particular sample was less than 5%--or even less than 1% or less than 0.000000001%--on the assumption that the true value is zero or any other particular quantity.

What matters is how much more or less likely the observed result is in relation to *one* hypothesized true value than another. From that information, we can determine the *inferential* significance of the data: that is, we can determine whether the data support a particular hypothesis, and if so, how strongly. But if we don’t have that information at our disposal and a researcher doesn’t supply it, then anything the researcher says about his or her data is literally meaningless.

This is likely to seem obvious to most of the 14 billion readers of this blog. It is--thanks to a succession of super smart people who've helped to spell out this "NHT fallacy" critique (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999, 1999; Gigerenzer 2004).

As these critics note, though, the problem with NHT is that it supplies a mechanical testing protocol that elides these basic points. Researchers who follow the protocol can appear to be furnishing us with meaningful information even if they are not.

Or worse, they can declare that a result that is “significant at *p* < 0.05” supports all manner of conclusions that it just doesn’t support—because as improbable as it might have been that the reported result would be observed if the “true” value were zero, the probability of observing such a result if the *researcher’s hypothesis* were true is even smaller.
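The point can be made concrete with a few lines of arithmetic. The sketch below uses hypothetical numbers (an estimate of 0.20 with a standard error of 0.10, and a study hypothesis that would require an effect of 0.70) and assumes a normal sampling distribution; none of these values come from any actual study.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def two_sided_p(estimate, sd):
    """Two-sided p-value for an estimate, assuming the true value is zero."""
    return math.erfc(abs(estimate) / (sd * math.sqrt(2.0)))

# Hypothetical numbers, chosen only for illustration.
observed = 0.20    # estimated effect
std_error = 0.10   # standard error of the estimate
h1_effect = 0.70   # effect size the study hypothesis would require

p_value = two_sided_p(observed, std_error)
lik_null = normal_pdf(observed, 0.0, std_error)      # likelihood if true value is 0
lik_h1 = normal_pdf(observed, h1_effect, std_error)  # likelihood if true value is 0.70
ratio = lik_h1 / lik_null                            # > 1 favors the hypothesis

print(f"p-value against zero: {p_value:.4f}")
print(f"likelihood ratio (hypothesis vs. zero): {ratio:.2e}")
```

With these made-up inputs the result is "significant" (*p* is about 0.046), yet the observed value is roughly 36,000 times more likely if the true value is zero than if it is 0.70. The *p*-value alone reveals none of that.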

**2. This straw man has legs**

I know: you think I’m attacking a straw man.

I might be. But that straw man publishes a lot of studies. Let me show you an example.

In one recent paper--one reporting the collection of a trove of interesting data that definitely enrich scholarly discussion-- a researcher purported to test the “core hypothesis” that “analytic thinking promotes endorsement of evolution.”

That researcher, a very good scholar, reasoned that if this was so, “endorsement of evolution” ought to be correlated with “performance on an analytic thinking task.” The task he chose was the Cognitive Reflection Test (Frederick 2005), the leading measure of the capacity and motivation of individuals to use conscious, effortful “System 2” information processing rather than intuitive, affect-driven “System 1” processing.

After administering a survey to a sample of University of Kentucky undergraduates, the researcher reported finding the predicted correlation between the subjects' CRT scores and their responses to a survey item on beliefs in evolution (*p* < 0.01). He therefore concluded:

- "analytic thinking consistently predicts endorsement of evolution”;

- “individuals who are better able to analytically control their thoughts are more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species";

- “[the] results suggest that it does not take a great deal of analytic thinking to overcome creationist intuitions.”

If you are nodding your head at this point, you really shouldn’t be. This is not nearly enough information to know whether the author’s data support any of the inferences he draws.

In fact, they demonstrably don’t.

Here is a model in which belief in science's understanding of evolution (i.e., one that doesn't posit "any supreme being guid[ing] ... [it] for the purpose of creating humans") is regressed on the CRT scores of the student-sample respondents:

The outcome variable is the *probability* that a student will believe in evolution.

If, as the author concludes, “analytic thinking consistently predicts endorsement of evolution,” then we should be able to use this model to, well, **predict** whether subjects in the sample believe in evolution, or at least to predict that with a higher degree of accuracy than we would be able to without knowing the subjects’ CRT scores.

But we can’t.

Yes, just as the author reported, there is a positive & significant correlation coefficient for CRT.

But look at the "Count" & "Adjusted Count" R^2s.

The first reports the proportion of subjects whose “belief in evolution” was correctly predicted (based on whether the predicted probability for them was > or < 0.50): 62%.

That's exactly the proportion of the sample that reports not believing in evolution.

As a result, the "**adjusted count R^2**" **is "0.00."** This statistic reflects the proportion of correct predictions the model makes in *excess* of the proportion one would have made by just predicting the *most frequent outcome* in the sample for all the cases.

Imagine a reasonably intelligent person were offered a prize for correctly “predicting” any study respondent’s “beliefs” knowing only that a *majority* of the sample purported not to accept science's account of the natural history of human beings. Obviously, she’d “predict” that any given student “disbelieves” in evolution. This “everyone disbelieves” model would have a predictive accuracy rate of 62% were it applied to the entire sample.

*Knowing each respondent's CRT score would not enable that person to predict “beliefs” in evolution with any greater accuracy than that!* The students’ CRT scores, in other words, are useless, predictively speaking.
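For readers who want to see the two statistics computed, here is a minimal sketch. It uses the usual definitions (count R^2 is the share of correct classifications; the adjusted version subtracts the number of cases in the most frequent outcome from both numerator and denominator). The 100-person sample is hypothetical, built only to match the 62% figure above, and the function names are mine.

```python
from collections import Counter

def count_r2(actual, predicted):
    """Share of cases whose outcome the model classified correctly."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def adjusted_count_r2(actual, predicted):
    """Share of correct classifications beyond always guessing the modal outcome.

    Undefined (division by zero) if every case has the same outcome.
    """
    n = len(actual)
    correct = sum(a == p for a, p in zip(actual, predicted))
    n_modal = Counter(actual).most_common(1)[0][1]  # size of most frequent outcome
    return (correct - n_modal) / (n - n_modal)

# Hypothetical sample: 62 students "disbelieve" (0), 38 "believe" (1).
actual = [0] * 62 + [1] * 38
# A model whose predicted probability never exceeds 0.50 classifies everyone as 0.
predicted = [0] * 100

print(count_r2(actual, predicted))           # share correct
print(adjusted_count_r2(actual, predicted))  # improvement over the modal guess
```

A model that classifies everyone as a "disbelievers" gets 62 of 100 right, which is exactly what the modal guess gets, so the adjusted statistic is zero.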

Here's a classification table that helps us to see exactly what's happening:

The CRT predictor, despite being "positive" & "significant," is so weak that the regression model that included it just threw up its hands and defaulted to the "everyone disbelieves” strategy.

The reason the “significant” difference in the CRT scores of believers & nonbelievers in the sample doesn’t support the author's conclusion-- that “analytic thinking consistently predicts endorsement of evolution”--is that the *size* of the effect isn’t nearly as big as it would have to be to furnish actual evidence for his hypothesis (something that one can pretty well guess is the case by just looking at the raw data).

Indeed, as the analysis I’ve just done illustrates, **the observed effect is actually more consistent with the prediction that “CRT makes no goddam difference” in what people say they believe** **about the natural history of human beings.**

Why the hell (excuse my French) would we expect any other result? As I’ve pointed out 17,333,246 times, answers to this facile survey question do not reflect respondents' science comprehension; they express their cultural identity!

But that's not a very good reply. Empirical testing is all about looking for surprises, or at least holding oneself open to the possibility of being surprised by evidence that cuts against what one understands to be the truth.

That didn't happen, however, in this particular case.

Actually, I should point out the author constructs two separate models: one relating CRT to the probability that someone will believe in “young earth creationism” as opposed to “evolution according to a divine plan”--something akin to “intelligent design”; and another relating CRT to the probability that someone will believe in “young earth creationism” as opposed to “evolution *without* any divine agency”--science’s position. It seems odd to me to do that, given that the author's theory was that “analytic thinking tends to reduce belief in supernatural agents.”

So my model just looks at whether CRT scores predict that someone believes in science’s view of evolution--man evolves without any guidance from or plan by God--vs. belief in any alternative account. That’s why there is a tiny discrepancy between my logit model’s "odds ratio" coefficient for CRT (OR = 1.23, p < 0.01) and the author’s (OR = 1.28, p < 0.01).

But it doesn’t matter. The CRT scores are just as useless for predicting simply whether someone believes in “young earth” creationism versus either “intelligent design” *or* the modern synthesis. Thirty-three percent of the author’s Univ. Ky undergrad sample reported believing in “young earth creationism.” A model that regresses that “belief” on CRT classifies everyone in the sample as *rejecting* that position, and thus gets a predictive accuracy rate of 67%.
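One way to see how a "significant" odds ratio of this size can still leave every prediction on the same side of 0.50 is to trace the predicted probabilities across the range of CRT scores. The sketch below assumes the standard three-item CRT (scores 0 to 3) and a hypothetical baseline probability of belief at CRT = 0. Only the odds ratio of 1.23 comes from the model discussed above; the baseline is an illustrative guess, not an estimate from the data.

```python
def predicted_probability(baseline_prob, odds_ratio, score):
    """Logit-model probability when each point of score multiplies the odds."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio ** score
    return odds / (1.0 + odds)

ODDS_RATIO = 1.23      # CRT odds ratio reported for the model above
BASELINE_PROB = 0.33   # hypothetical P(believe) at CRT = 0; NOT from the study
CRT_SCORES = range(4)  # assumes the three-item CRT, scored 0 through 3

probs = [predicted_probability(BASELINE_PROB, ODDS_RATIO, s) for s in CRT_SCORES]
classified_believer = [p > 0.5 for p in probs]

for score, prob, flag in zip(CRT_SCORES, probs, classified_believer):
    print(f"CRT = {score}: P(believe) = {prob:.3f}, classified as believer: {flag}")
```

Under that assumed baseline, even a perfect CRT score leaves the predicted probability below 0.50, so the model classifies every respondent as a disbeliever. Moving from the bottom to the top of the scale multiplies the odds by only 1.23^3, or about 1.86.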

**3. What’s the question?**

So there you go: a snapshot of the pernicious vitality of the NHT fallacy in action. A researcher who has in fact collected some very interesting data announces empirical support for a bunch of conclusions that aren’t supported by them. What licenses him to do so is a “statistically significant” difference between an observed result and a value--zero difference in mean CRT scores--that turns out to be way too small to support his hypothesis.

The relevant question, inferentially speaking, is,

**How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection do “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they don't?**

That’s a super interesting problem, the sort one actually has to use reflection to solve. It's one I hadn't thought hard enough about until engaging the author's interesting study results. I wish the author, a genuinely smart guy, had thought about it in analyzing his data.

I’ll give this problem a shot myself “tomorrow.”

For now, my point is simply that the convention of treating "*p* < 0.05" as evidence in support of a study hypothesis is what prevents researchers from figuring out what question they should actually be posing to their data.

**References**

Cohen, J. The Earth is Round (p < .05). *Am Psychol* **49**, 997-1003 (1994).

Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. *Psych Rev* **70**, 193-242 (1963).

Frederick, S. Cognitive Reflection and Decision Making. *Journal of Economic Perspectives* **19**, 25-42 (2005).

Goodman, S.N. Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. *Ann Int Med* **130**, 995-1004 (1999).

Goodman, S.N. Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. *Ann Int Med* **130**, 1005-1013 (1999).

Rozeboom, W.W. The Fallacy of the Null-Hypothesis Significance Test. *Psychological Bulletin* **57**, 416 (1960).

## Reader Comments (5)

"But the idea that a p-value supports any inference from the data is an out-and-out fallacy of the rankest sort!"

Agreed. It has a somewhat different purpose.

The Bayesian approach asks for the likelihood ratio P(O|H) / P(O|~H), which requires the probability of the observation both with and without the hypothesised model being true. The probability given the hypothesis is usually fairly easy. But if you don't have a complete menu of options - if you've only got one idea and otherwise don't have a clue - how the heck can you calculate the probability of the observed observations under an unknown and completely undefined alternative hypothesis? All you've got is that it's "not H", and that's not helpful!

So the hypothesis testing approach assumes the existence of an alternative that predicts the outcome perfectly. Presumably, if you knew the location of every particle in the universe and had an infinitely powerful computer, you'd be able to do so, so such a hypothesis must exist. Your alternative hypothesis is therefore "The true physics, whatever that is".

So now P(O|~H) is well-defined, and equal to 1. The Bayesian recipe is complete.

It's a fudge and an approximation, to be used in the absence of better understanding. And of course, the likelihood ratio isn't determinative of the conclusion - that's the role of the posterior probability.

But it shouldn't be the role of reported experimental results to be drawing conclusions, anyway. Everyone has different priors, and you lose generality if you start imposing one particular set. The 0.05 p-value limit is simply an informal editorial threshold to limit the amount of spurious rubbish that gets published, while telling everybody about the more significant lumps of evidence. In fields that generate a lot more results (particle physics does computerised data dredges that can search for thousands of hypotheses) the threshold for publication is usually a lot higher. In other fields where there are very few ideas at all, a lower threshold might be appropriate.

I think a big problem here is a fundamental misunderstanding of what the peer-reviewed journals are actually *for*. They are not, as some people seem to think, the outlet for "settled science". They are merely the bucket of work in progress wherein evidence is accumulated or dismissed, on its way to becoming "settled science", to the extent that any science is.

And yes, journal papers frequently ask the wrong questions. Journals are the arena in which wrong questions die, by being exposed to criticism and challenge.

So in my view it is a fallacy (but only a *slightly* rank one) to object to the 5% p-value threshold on the grounds that it allows dubious results through into the journals, because I consider that to be precisely the purpose of the journals. They are the place dubious results go to get cut down.

I think the real fallacy you're talking about here is the *false dilemma*: the assumption that if one of the two alternatives offered is false then the other must be true, when the offered alternatives are not known to be a complete list.

Rejecting the null only rejects the null. It doesn't endorse any other specific alternative hypothesis unless an argument has been made that the alternative is the only other possibility.

There *is* a common p-value fallacy - the belief that it indicates a strong posterior rather than a strong likelihood ratio - but I don't think that's what you're complaining about.

But please feel free to argue if I've misunderstood.

@NiV:

I think we are on the same page

There are plenty of people, myself included, who have made moral and policy arguments on the basis of hypotheses in favor of which a null (and only a null) was rejected. Frankly, it seems to be the Western version of the Chinese index fallacy (by which I mean, I've seen a lot of Chinese environmental scientists make indexes to represent outcomes and then claim that the index moving in the right direction over time means that environmental outcomes are improving over time.)

NiV, your idea that P(O|~H) is being identified with 1 is strange to me, and I hope to clarify this point. Would you be so kind as to walk it through with me?

I thought the reason Bayesian statisticians didn't like p-value inference was the following: in p-value inference, we set up the straw man ~H = null. Bayesians look for a large likelihood ratio, P(O|H) / P(O|~H). Based on a small p-value (defined as the short tail of the cumulative distribution function evaluated at the point of observation; I have never known why we give the name "p-value" to something that's an evaluation of the cumulative distribution function instead of the probability distribution function), we reject the null. Inferring based upon "rejecting the null" is just an argument that because the p-value is small, the likelihood of the observation is likely also small, and the likelihood ratio is likely large. These two "likely"s are both mathematically unsound, which makes the inference fallacious in Bayesian terms. First, P(O|~H) > the p-value for most common distributions. Second, you can't just not compute P(O|H), when P(O|H) < 1. To this, we're now adding, "Zeroth, the assertion [ ~H = null ] is already a straw man."

@dypoon & @NiV:--keep discussing! I am working on a "Bayesian analysis"--in the sense of a diligent, honest, transparent effort that is open to critical engagement w/ others who think there is a better way, which I'm sure there is--to measure how much more probable it is we'd observe this difference in mean CRT scores assuming that "there is no goddam difference" in evolution "believers'" & "disbelievers'" cognitive reflection level than that we'd see it if there was a "true" difference in reflection big enough to support any inference about the contribution cognitive reflection could be making to their respective "beliefs."

Will post "tomorrow"...

"NiV, your idea that P(O|~H) is being identified with 1 is strange to me, and I hope to clarify this point. Would you be so kind as to walk it through with me?"

Sure. Let's take a concrete example.

We consider the claim that the numbers 1.31, -6.27, 0.48, 0.23 come from a process that generates independent random numbers with a Gaussian N(0,1) distribution. What are our hypotheses? H is that the numbers are independently distributed N(0,1), ~H is that they are distributed according to a distribution that is something other than N(0,1). We can work out what P(O|H) is fairly easily, but what the heck is P(O|~H)?

(For simplicity, I'm going to ignore the point that the probability of any specific output from a Gaussian is always zero because the distribution is continuous, and assume you can insert the tedious technicalities for yourself. It's not relevant to the argument here.)

Should we consider, say, Gaussians with different means? How should we weight them? Consider the following argument: the mean can be any number from minus infinity to plus infinity, and almost all of the possibilities are unimaginably huge. The probability of the size of the mean being under 10^100 is infinitesimal. So what if we chose ~H to be the set of all possibilities N(m,1) with m uniformly weighted over the real numbers? Then P(O|~H) = 0. Expanding the possibilities for ~H (non-independent, non-Gaussian, non-random, ...) only makes things worse. We have no idea how to proceed.

Yet we would like to have some way of testing whether our given sequence is plausibly Gaussian N(0,1). In particular, we'd like to be able to say that the value -6.27 in the sequence is *pretty unlikely* under the assumption, and therefore there is considerable evidence *against* it being N(0,1). Granted, it's far more likely to be N(0,1) than something north of 10^100, but we can't help feeling that's somehow not a proper alternative to be considering.

What would we *really* conclude if we decide to reject H? We will presumably conclude that there's *some* pattern to the phenomenon - there aren't any real physical processes uniformly distributed over the reals - it's just unknown. And given that there is one, then the probability of the observations happening under the *true* mechanism is some much higher number, far bigger than the infinitesimal. After all, it just happened.

So now we come to one of those sticky philosophical questions: is the universe deterministic? Because if it is, then the internal details of the "random" process are themselves specified as part of the alternative hypothesis, and the probability of any outcome that actually happens is always 1. (This is perfectly OK in a Bayesian framework because probability is considered to be the result of ignorance rather than any fundamental physical indeterminism. How else could you talk about the "probability" of your opponent having an Ace when he can already see that he does?) A sufficiently detailed true hypothesis always exists where this is the case.

If our philosophy is that the universe is not deterministic, our justification is a little harder. The best I can offer is that we've just observed the given outcome in a sample size of 1, so it can't be *too* small. Our best estimate of the probability is that it is 1, with some margin of error. (i.e. If we observe k successes out of n samples, our best estimate of the probability is k/n.) The argument is a bit iffy, but it sort of works.

Or we can just say we don't know, anything we say will be wrong, and we pick 1 as a more *convenient* wrong number than the alternatives.

Some people do say that million-to-one shots come up nine times out of ten. Maybe they're more right than they know?
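The numbers in this example are easy to check. The sketch below takes the four values from the example, treats them as independent N(0,1) draws, and works with probability densities in place of probabilities (the technicality set aside earlier). It reports the log-likelihood under H and the lower-tail probability of the -6.27 value; the function names are illustrative.

```python
import math

def std_normal_logpdf(x):
    """Log density of the standard normal N(0,1) at x."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def std_normal_lower_tail(x):
    """P(X <= x) for X ~ N(0,1), via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

observations = [1.31, -6.27, 0.48, 0.23]

# Log-likelihood of the whole sample under H: independent N(0,1) draws.
log_lik = sum(std_normal_logpdf(x) for x in observations)
# Contribution of the single extreme value.
outlier_log_lik = std_normal_logpdf(-6.27)
# Chance of a draw at or below -6.27 under H.
tail = std_normal_lower_tail(-6.27)

print(f"log-likelihood under N(0,1): {log_lik:.3f}")
print(f"contribution of -6.27 alone: {outlier_log_lik:.3f}")
print(f"P(X <= -6.27) under N(0,1): {tail:.2e}")
```

Nearly all of the (very negative) log-likelihood comes from the single value -6.27, which is what makes the sequence look implausible under H even though nothing has been said about what the alternative is.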

--

"Bayesians look for a large likelihood ratio,"

Personally, I prefer to use log-likelihood. It's more intuitive in information theory terms.