How big a difference in mean CRT scores is

The nerve of it is that “rejection of the null [however it is arbitrarily defined] at p < 0.05 [or p < 10^-50 or whatever]” furnishes no inferentially relevant information in hypothesis testing. To know whether an observation counts as evidence in support of a hypothesis, the relevant information is not how likely we were to observe a particular value if the “null” is true but how much more or less likely we were to observe that value if a particular hypothesized true “value” is correct than if another hypothesized “true” value is correct (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999a; Gigerenzer 2004).

Actually, I’m not sure when the first formulation of the critique appeared. Amusingly, in his 1960 classic The Fallacy of the Null-hypothesis Significance Test , Rosenbloom, apologetically characterized his own incisive attack on the inferential barrenness of NHT as “not a particularly original view”!

The critique has been refined and elaborated many times, in very useful ways, since then, too. Weirdly, the occasion for so many insightful elaborations has been the persistence of NHT despite the irrefutable proofs of those critiquing it.

More on that in in a bit, but probably the most interesting thing that has happened in the career of the critique in the last 50 yrs. or so has been the project to devise tractable alternatives to NHT that really do quantify the evidentiary weight of any particular set of data.

I’m certainly not qualified to offer a reliable account of the intellectual history of using Bayesian likelihood ratios as a test statistic in the social sciences (cf. Good. But the utlity of this strategy was clearly recognized by Rozeboom, who observed that the inferential defects in NHT could readily be repaired by analytical tools forged in the kiln of “the classic theory inverse probabilities.”

The “Bayes Factor” –actually “the” misleadingly implies that there is only one variant of it—is the most muscular, deeply theorized version of the strategy.

But one can, I believe, still get a lot of mileage out of less technically elaborate analytical strategies using likelihood ratios to assess the weight of the evidence in one’s data (e.g., Goodman, 1999b).

For many purposes, I think, the value of using Bayesian likelihood ratios is largely heuristic: having to specify the predictions that opposing plausible hypotheses would generate with respect to the data, and to formulate an explicit measure of the relative consistency of the observed outcome with each, forces the researcher to do what the dominance of NHT facilitates the evasion of: the reporting of information that enables a reflective person to draw an inference about the weight of the evidence in relation to competing explanations of the dynamic at issue.

That’s all that’s usually required for others to genuinely learn from and critically appraise a researcher’s work. For sure there are times when everything turns on how precisely one is able to estimate some quantity of interest, where key conceptual issues about how to specify one or another parameter of a Bayes Factor will have huge consequence for interpretation of the data.

But in lots of experimental models, particularly in social psychology, it’s enough to be able to say “yup, that evidence is definitely more consistent—way more consistent—with what we’d expect to see if H1 rather than H2 is true”—or instead, “wait a sec, that result is not really any more supportive of that hypothesis than this one!” In which case, a fairly straightforward likelihood ratio analysis can, I think, add a lot, and even more importantly avoid a lot of the inferential errors that accompany permitting authors to report “p < 0.05” and then make sweeping, unqualified statements not supported by their data.

That’s exactly the misadventure, I said “yesterday,” that a smart researcher experienced with NHT. That researcher found a “statistically significant” correlation (i.e., rejection of the “null at p<0.0xxx”) between a sample of Univ of Ky undergraduate’s CRT scores (Frederick 2005) and their responses to a standard polling question on “belief in” evolution; he then treated that as corroboration of his hypothesis that “individuals who are better able to analytically control their thoughts are more likely” to overcome the intuitive attraction of the idea that “living things, are … intentionally designed by some external agent” to serve some “function and purpose,” and thus “more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species.”

But as I pointed out, the author’s data, contrary to his assertion, unambiguously didn’t support that hypothesis.

Rather than showing that “analytic thinking consistently predicts endorsement of evolution,” his data demonstrated that knowing the study subjects’ CRT scores furnished absolutely no predictive insight into their “evolution beliefs.” The CRT predictor in the author’s regression model was “statistically significant” (p < 0.01), but was way too small in size to outperform a “model” that simply predicted “everyone” in the author’s sample—regardless of their CRT score—rejected science’s account of the natural history of human beings.

(Actually, there were even more serious—or maybe just more interesting—problems having to do with the author’s failure to test the data’s relative support for a genuine alternative about how cognitive reflection relates to “beliefs” in evolution: by magnifying the opposing positions of groups for whom “evolution beliefs” have become (sadly, pointlessly, needlessly) identity defining. But I focused “yesterday” on this one b/c it so nicely illustrates the NHT fallacy.)

Had he asked the question that his p-value necessarily doesn’t address—how much more consistent is the data with one hypothesis than another—he would have actually found out that the results of his study was more consistent with the hypothesis that “cognitive reflection makes no goddam difference” in what people say when they answer a standard “belief in evolution” survey item of the sort administered by Gallup or Pew.

The question I ended on, then, was,

How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection do “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they don’t ?

That’s a very complicated and interesting question, and so now I’ll offer my own answer, one that uses the inference-disciplining heuristic of forming a Bayesian likelihood ratio.

1. Using a Baysian likelihood ratio is not, in my view, the only device that can be used to extract from data like these the information necessary to form cogent inferences about the support fo the data for study hypotheses. Anything that helps the analyst and reader guage the relative support of the data for the study hypothesis in relation to a meaningful or set of meaningful alternatives can do that.

Often it will be *obvious* how the data do that, given the sign of the value observed in the data or the size of it in relation to what common understanding tells one the competing hypotheses would predict.

Motivated Reasoning

How big a difference in mean CRT scores is "big enough" to matter? or NHT: A malignant craft norm, part 2 - Cultural Cognition of Health

Key Insight