In a paper forthcoming in Nature Human Behaviour (I think it is still “in press”), a large & distinguished group of social scientists propose nudging (shoving?) the traditional NHST threshold from p ≤ 0.05 to p ≤ 0.005. A response to the so-called “replication crisis,” this “simple step would immediately improve the reproducibility of scientific research in many fields,” the authors (all 72 of them!) write.
To disagree with a panel of experts this distinguished & this large is a daunting task. Nevertheless, I do disagree. Here’s why:
1. There is no reason to think a p-value of 0.005 would reduce the ratio of valid to invalid studies; it would just make all studies—good as well as bad—cost a hell of a lot more.
The only difference between a bad study at p ≤ 0.05 and a bad study at p ≤ 0.005 is sample size. The same for a good study in which p ≤ 0.005 rather than p ≤ 0.05.
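To put a rough number on the cost point (my illustration, not the authors’): under the usual normal approximation for a two-sided two-sample test, moving the threshold from p ≤ 0.05 to p ≤ 0.005 at fixed power inflates the required sample size by about 70%, whatever the effect size.

```python
from scipy.stats import norm

def n_per_group(d, alpha, power=0.80):
    """Approximate n per group for a two-sided two-sample z-test
    detecting standardized effect size d (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

# d = 0.4 is an arbitrary illustrative effect size; the *ratio* of the two
# sample sizes does not depend on d.
n_05 = n_per_group(0.4, 0.05)     # roughly 98 per group
n_005 = n_per_group(0.4, 0.005)   # roughly 166 per group
print(n_05, n_005, n_005 / n_05)  # the ratio is about 1.7
```

The point holds for good and bad studies alike: both pay the same ~70% surcharge.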
What makes an empirical study “good” or “bad” is the quality of the inference strategy—i.e., the practical logic that connects measured observables to the not-directly-observable quantities of interest.
If a researcher can persuade reviewers to accept a goofy theory for a bad study (say, one on the impact of “himmicanes” on storm-evacuation advisories, the effect of ovulation on women’s voting behavior, or the influence of egalitarian sensibilities on the rate of altercations between economy class and business class airline passengers) at p ≤ 0.05, then the only thing that researcher has to do to get the study published at p ≤ 0.005 is collect more observations.
Of course, because sample recruitment is costly, forcing researchers to recruit massive samples will make it harder to generate bad studies.
But for the same reason, a p ≤ 0.005 standard will make it much harder for researchers doing good studies—ones that rest on plausible mechanisms—to generate publishable papers, too.
Accordingly, to believe that p ≤ 0.005 will improve the ratio of good studies to bad, one has to believe that scholars doing good studies will be more likely to get their hands on the necessary research funding than will scholars doing bad studies.
That’s not particularly plausible: if it were, then funders would be favoring good over bad research already—at p ≤ 0.05.
At the end of the day, a p ≤ 0.005 standard will simply reduce the stock of papers deemed publishable—period—with no meaningful impact on the overall quality of research.
2. It’s not the case that a p ≤ 0.005 standard will “dramatically reduce the reporting of false-positive results—studies that claim to find an effect when there is none—and so make more studies reproducible.”
The mistake here is to think that there will be fewer borderline studies at p ≤ 0.005 than at p ≤ 0.05.
P is a random variable. Thus, under a p ≤ 0.05 publication standard, there is a 50% chance that a study finding that is “significant” at exactly p = 0.05 will be “nonsignificant” (p > 0.05) on the next trial, even assuming both studies were conducted identically & flawlessly. (That so many replicators don’t seem to get this boggles one’s mind.)
If the industry norm is adjusted to p ≤ 0.005, we’ll simply see another random distribution of p-values, now centered on the new threshold for borderline studies. So again, if a paper reports a finding at exactly p = 0.005, there will be a 50% chance that the next, replication trial will produce a result that’s not significant at p ≤ 0.005. . . .
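The borderline-study point can be checked by simulation. A minimal sketch, under my own simplifying assumptions (the replication’s test statistic is normal with unit variance and centered on the original, exactly-at-threshold result):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def replication_success_rate(alpha, trials=200_000):
    """Fraction of identical replications that clear the threshold, assuming
    the true effect equals the original, exactly-at-threshold observed effect
    (so replication z-statistics are distributed N(z_crit, 1))."""
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_rep = rng.normal(loc=z_crit, scale=1.0, size=trials)
    # One-sided check in the direction of the original effect (a simplification)
    return float(np.mean(z_rep > z_crit))

print(replication_success_rate(0.05))   # close to 0.5
print(replication_success_rate(0.005))  # still close to 0.5
```

Lowering the threshold moves the goalposts but does nothing to the coin flip: a just-barely-significant result replicates about half the time at either alpha.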
Certifying reproducibility won’t be any “easier” or any more certain. And for the reasons stated above, there will be no more reason to assume that studies that either clear or just fall short of clearing the bar at p ≤ 0.005 are any more valid than ones that occupy the same position in relation to p ≤ 0.05.
3. The problem of NHST cannot be fixed with more NHST.
Finally and most importantly, the p ≤ 0.005 standard misdiagnoses the problem behind the replication crisis: the malignant craft norm of NHST.
Part of the malignancy is that mechanical rules like p ≤ 0.005 create a thought-free, “which button do I push” mentality: researchers expect publication for research findings that meet this standard whether or not the study is internally valid (i.e., even if it is goofy). They don’t think about how much more probable a particular hypothesis is than the null—or even whether the null is uniquely associated with some competing theory of the observed effect.
A practice that would tell us exactly those things is better not only substantively but also culturally, because it forces the researcher to think about exactly those things.
Ironically, it is clear that a substantial fraction of the “Gang of 72” believes that p-value–driven NHST should be abandoned in favor of some type of “weight of the evidence” measure, such as the Bayes Factor. They signed on to the article, apparently, because they believed, in effect, that ratcheting up (down?) the p-value norm would generate even more evidence of the defects of any sort of threshold for NHST, and thus contribute to more widespread appreciation of the advantages of a “weight of the evidence” alternative.
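For the curious, here is a minimal sketch of what a “weight of the evidence” measure looks like. The model (normal mean, known variance, point null vs. a normal prior on the alternative), the prior scale τ = 0.5, and the illustrative numbers are my assumptions, not the article’s:

```python
import math
from scipy.stats import norm

def bayes_factor_10(xbar, sigma, n, tau):
    """Bayes factor for H1: mu ~ N(0, tau^2) over the point null H0: mu = 0,
    given a sample mean xbar from a normal model with known sigma."""
    se = sigma / math.sqrt(n)
    m0 = norm.pdf(xbar, loc=0, scale=se)                   # marginal likelihood under H0
    m1 = norm.pdf(xbar, loc=0, scale=math.hypot(tau, se))  # marginal likelihood under H1
    return m1 / m0

# Hypothetical data: n = 100, sigma = 1, prior scale tau = 0.5.
# xbar is chosen so the z-statistic sits exactly at each threshold.
bf_at_05 = bayes_factor_10(1.960 / 10, 1.0, 100, 0.5)   # just significant at p = 0.05
bf_at_005 = bayes_factor_10(2.807 / 10, 1.0, 100, 0.5)  # just significant at p = 0.005
print(bf_at_05, bf_at_005)  # roughly 1.2 vs. roughly 8.7
```

Under these particular assumptions, a result just significant at p = 0.05 is nearly uninformative (BF ≈ 1.2), while one at p = 0.005 carries only moderate evidence—the sort of correspondence the signatories themselves invoke. The point is that the measure says *how much* the data favor one hypothesis over another, rather than whether a button-push cleared a line.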
All I can say about that is that researchers have for decades understood the inferential barrenness of p-values and advocated for one or another Bayesian alternative instead.
Their advocacy has gotten nowhere: we’ve lived through decades of defective null hypothesis testing, and the response has always been “more of the same.”
What is the theory of disciplinary history that predicts a sudden radicalization of the “what button do I push” proletariat of social science?
As intriguing and well-intentioned as the p ≤ 0.005 proposal is, arguments about standards aren’t going to break the NHST norm.
“It must get worse in order to get better” is no longer the right attitude.
Only demonstrating the superiority of a “weight of the evidence” alternative by doing it—and even more importantly teaching it to the next generation of social science researchers—can really be expected to initiate the revolution that the social sciences need.