How big a difference in mean CRT scores is "big enough" to matter? or NHT: A malignant craft norm, part 2

1. Now where was I . . . ?

Right . . . So yesterday I posted part I of this series, which is celebrating the bicentennial, or perhaps it’s the tricentennial (one loses track after a while), of the “NHT Fallacy” critique.

The nerve of it is that “rejection of the null [however it is arbitrarily defined] at p < 0.05 [or p < 10^-50 or whatever]” furnishes no inferentially relevant information in hypothesis testing. To know whether an observation counts as evidence in support of a hypothesis, the relevant information is not how likely we were to observe a particular value if the “null” is true but how much more or less likely we were to observe that value if a particular hypothesized true “value” is correct than if another hypothesized “true” value is correct (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999a; Gigerenzer 2004).

Actually, I’m not sure when the first formulation of the critique appeared. Amusingly, in his 1960 classic The Fallacy of the Null-hypothesis Significance Test, Rozeboom apologetically characterized his own incisive attack on the inferential barrenness of NHT as “not a particularly original view”!

The critique has been refined and elaborated many times, in very useful ways, since then, too. Weirdly, the occasion for so many insightful elaborations has been the persistence of NHT despite the irrefutable proofs of those critiquing it.

More on that in a bit, but probably the most interesting thing that has happened in the career of the critique in the last 50 yrs. or so has been the project to devise tractable alternatives to NHT that really do quantify the evidentiary weight of any particular set of data.

I’m certainly not qualified to offer a reliable account of the intellectual history of using Bayesian likelihood ratios as a test statistic in the social sciences (cf. Good). But the utility of this strategy was clearly recognized by Rozeboom, who observed that the inferential defects in NHT could readily be repaired by analytical tools forged in the kiln of “the classic theory inverse probabilities.”

The “Bayes Factor” –actually “the” misleadingly implies that there is only one variant of it—is the most muscular, deeply theorized version of the strategy.

But one can, I believe, still get a lot of mileage out of less technically elaborate analytical strategies using likelihood ratios to assess the weight of the evidence in one’s data (e.g., Goodman, 1999b).

For many purposes, I think, the value of using Bayesian likelihood ratios is largely heuristic: having to specify the predictions that opposing plausible hypotheses would generate with respect to the data, and to formulate an explicit measure of the relative consistency of the observed outcome with each, forces the researcher to do what the dominance of NHT facilitates the evasion of: the reporting of information that enables a reflective person to draw an inference about the weight of the evidence in relation to competing explanations of the dynamic at issue.

That’s all that’s usually required for others to genuinely learn from and critically appraise a researcher’s work. For sure there are times when everything turns on how precisely one is able to estimate some quantity of interest, where key conceptual issues about how to specify one or another parameter of a Bayes Factor will have huge consequences for interpretation of the data.

But in lots of experimental models, particularly in social psychology, it’s enough to be able to say “yup, that evidence is definitely more consistent—way more consistent—with what we’d expect to see if H1 rather than H2 is true”—or instead, “wait a sec, that result is not really any more supportive of that hypothesis than this one!” In which case, a fairly straightforward likelihood ratio analysis can, I think, add a lot, and even more importantly avoid a lot of the inferential errors that accompany permitting authors to report “p < 0.05” and then make sweeping, unqualified statements not supported by their data.

That’s exactly the misadventure, I said “yesterday,” that a smart researcher experienced with NHT. That researcher found a “statistically significant” correlation (i.e., rejection of the “null at p<0.0xxx”) between a sample of Univ of Ky undergraduates’ CRT scores (Frederick 2005) and their responses to a standard polling question on “belief in” evolution; he then treated that as corroboration of his hypothesis that “individuals who are better able to analytically control their thoughts are more likely” to overcome the intuitive attraction of the idea that “living things, are … intentionally designed by some external agent” to serve some “function and purpose,” and thus “more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species.”

But as I pointed out, the author’s data, contrary to his assertion, unambiguously didn’t support that hypothesis.

Rather than showing that “analytic thinking consistently predicts endorsement of evolution,” his data demonstrated that knowing the study subjects’ CRT scores furnished absolutely no predictive insight into their “evolution beliefs.” The CRT predictor in the author’s regression model was “statistically significant” (p < 0.01), but was way too small in size to outperform a “model” that simply predicted “everyone” in the author’s sample—regardless of their CRT score—rejected science’s account of the natural history of human beings.

(Actually, there were even more serious—or maybe just more interesting—problems having to do with the author’s failure to test the data’s relative support for a genuine alternative about how cognitive reflection relates to “beliefs” in evolution: by magnifying the opposing positions of groups for whom “evolution beliefs” have become (sadly, pointlessly, needlessly) identity defining. But I focused “yesterday” on this one b/c it so nicely illustrates the NHT fallacy.)

Had he asked the question that his p-value necessarily doesn’t address (how much more consistent is the data with one hypothesis than another?), he would have actually found out that the results of his study were more consistent with the hypothesis that “cognitive reflection makes no goddam difference” in what people say when they answer a standard “belief in evolution” survey item of the sort administered by Gallup or Pew.

The question I ended on, then, was,

How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection do “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they don’t?

That’s a very complicated and interesting question, and so now I’ll offer my own answer, one that uses the inference-disciplining heuristic of forming a Bayesian likelihood ratio.

2 provisos:

1. Using a Bayesian likelihood ratio is not, in my view, the only device that can be used to extract from data like these the information necessary to form cogent inferences about the support of the data for study hypotheses. Anything that helps the analyst and reader gauge the relative support of the data for the study hypothesis in relation to a meaningful alternative or set of meaningful alternatives can do that.

Often it will be *obvious* how the data do that, given the sign of the value observed in the data or the size of it in relation to what common understanding tells one the competing hypotheses would predict.

But sometimes those pieces of information might not be so obvious, or might be open to debate. Or in any case, there could be circumstances in which extracting the necessary information is not so straightforward and in which a device like forming a Bayesian likelihood ratio in relation to the competing hypotheses helps, a lot, to figure out what the inferential import of the data is.

That’s the pragmatic position I mean to be staking out here in advocating alternatives to the pernicious convention of permitting researchers to treat “p < 0.05” as evidence in support of a study hypothesis.

2. My “Bayesian likelihood ratio” answer here is almost surely wrong!

But it is at least trying to answer the right question, and by putting it out there, maybe I can entice someone else who has a better answer to share it.

Indeed, it was exactly by enticing others into scholarly conversation that I came to see what was cool and important about this question. Without implying that they are at all to blame for any deficiencies in this analysis, it’s one that emerged from my on-line conversations with Gordon Pennycook, who commented on my original post on this article, and my off-line ones with Kevin Smith, who shared a bunch of enlightening thoughts with me in correspondence relating to a post that I did on an interesting paper that he co-authored.

2. What sorts of differences can the CRT reliably measure?

Here’s the most important thing to realize: the CRT is friggin hard!

It turns out that the median score on the CRT, a three-question test, is zero when administered to the general population. I kid you not: studies w/ general population samples (not student samples, not M Turk samples, and not samples recruited from visitors to a website that offers to furnish study subjects with information on the relationship between their moral outlooks and their intellectual styles) show that 60% of the subjects can’t get a single answer correct.

Hey, maybe 60% of the population falls short of the threshold capacity in conscious, effortful information processing that critical reasoning requires. I doubt that but it’s possible.

What that means, though, is that if we use the CRT in a study (as it makes a lot of sense to do; it’s a pretty amazing little scale), we necessarily can’t get any information from our data on differences in cognitive reflection among a group of people comprising 60% of the population. Accordingly, if we had two groups neither of whose mean scores were appreciably above the “population mean,” we’d be making fools of ourselves to think we were observing any real difference: the test just doesn’t have any measurement precision or discrimination at that “low” a level of the latent disposition.

We can be even more precise about this — and we ought to be, in order to figure out how “big” a difference in mean CRT scores would warrant saying stuff like “group x is more reflective than group y” or “differences in cognitive reflection ‘predict’/’explain’ membership in group x as opposed to y….”

Using item response theory, which scores the items on the basis of how likely a person with any particular level of the latent disposition (theta) is to get that particular item correct, we can assess the measurement precision of an assessment instrument at any point along theta. We can express that measurement precision in terms of a variable “reliability coefficient,” which reflects what fraction of the differences in individual test scores in that vicinity of theta is attributable to “true differences” & how much to measurement error.
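To make the idea of a “variable reliability coefficient” concrete, here is a minimal sketch of one common convention: sum the item information at a given theta and convert it to a reliability of information / (information + 1), which assumes theta is scaled to unit variance. The two-parameter logistic model and the item parameters below are hypothetical, chosen only to mimic three hard items; they are not estimates from the sample discussed in this post.

```python
import math

def item_information_2pl(theta, a, b):
    """Fisher information of one two-parameter-logistic item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # probability of a correct answer
    return a * a * p * (1.0 - p)

def conditional_reliability(theta, items):
    """Reliability at theta, assuming theta is scaled to unit variance."""
    info = sum(item_information_2pl(theta, a, b) for a, b in items)
    return info / (info + 1.0)

# HYPOTHETICAL (discrimination, difficulty) pairs for three hard items.
items = [(1.5, 0.8), (1.7, 1.0), (1.6, 1.2)]

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(conditional_reliability(theta, items), 2))
```

With difficult items like these, the information, and so the reliability, piles up above the mean of theta and falls away below it, which is the qualitative pattern described here.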

Here’s what we get for CRT (based on a general population sample of about 1800 people):

The highest degree of measurement precision occurs around +1 SD, or approximately “1.7” answers correct. Reliability there is 0.60, which actually is pretty mediocre; for something like the SAT, it would be pretty essential to have 0.8 along the entire continuum from –2 to +2 SD. That’s b/c there is so much at stake, both for schools that want to rank students pretty much everywhere along the continuum, and for the students they are ranking.

But I think 0.60 is “okay” if one is trying to make claims about groups in general & not rank individuals. If one gets below 0.5, though, the correlations between the latent variable & anything else will be so attenuated as to be worthless….
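The attenuation point can be sketched with Spearman's classic formula, under which the expected observed correlation is the true correlation multiplied by the square root of the product of the two measures' reliabilities. The “true” correlation of 0.30 below is an assumption made purely for illustration, as is treating the other variable as perfectly reliable.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y=1.0):
    """Expected observed correlation under Spearman's attenuation formula."""
    return true_r * math.sqrt(reliability_x * reliability_y)

true_r = 0.30  # ASSUMED true correlation, for illustration only
for rel in (0.8, 0.6, 0.5, 0.3):
    print(rel, round(attenuated_correlation(true_r, rel), 3))
```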

So here are some judgments I’d make based on this understanding of the psychometric properties of CRT:

If the “true” mean CRT scores of two groups (like “conservatives” & “liberals” or “evolution believers” & “disbelievers”) are both within the red zone, then one has no reasonable grounds for treating the two as different in their levels of reflection: CRT just doesn’t have the measurement precision to justify the claim that the higher-scoring group is “more reflective” even if the difference in means is “statistically significant.”
Obviously, if one group’s true mean is in the red zone and another’s in the green or yellow, then we can be confident the two really differ in their disposition to use conscious, effortful processing.
Groups within the green zone probably can be compared, too. There’s reasonable measurement precision there– although it’s still iffy (alpha is about 0.55 on avg…).

If I want to see if groups differ in their reflectiveness, then, I should not be looking to see if the difference in their CRT scores is “significant p < 0.05,” since that by itself won’t support any inferences relating to the hypotheses given my guidelines above.

If one group has a “true” mean CRT score that is in the “red” zone, the hypothesis that it is less reflective than another group can be supported with CRT results only if the latter group’s “true” mean score is in the green zone.

3. Using likelihood ratios to weigh the evidence on “whose is bigger?”

So how can we use this information to form a decent hypothesis testing strategy here?

Taking the “CRT makes no goddam difference” position, I’m going to guess that those who “don’t believe” in evolution are pretty close to the population mean of “0.7.” If so, then those who “do believe” will need to have a “true” mean score of +0.5 SD or about “1.5 answers correct” before there is a “green to red” zone differential.

That’s a difference in mean score of approximately “0.8 answers correct.”

Thus, the “believers more reflective” hypothesis says we should expect to find that believers will have a mean score 0.8 points higher than the population mean, or 1.5 correct.

The “no goddam difference” hypothesis, we’ll posit, predicts the “null”: no difference whatsoever in mean CRT scores of the believers & nonbelievers.

Now turning to the data, it turns out the “believers” in the author’s sample had a mean CRT of 0.86, SEM = 0.07. The “nonbelievers” had a mean CRT score of 0.64, SEM = 0.05.

I calculate the difference as 0.22, SEM = 0.08.
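For readers who want to check this kind of arithmetic, here is a minimal sketch of the textbook calculation, assuming the two groups are independent so that the standard error of the difference is the square root of the sum of the squared SEMs. The function name is mine, and the inputs are the rounded values reported above.

```python
import math

def diff_of_means(mean_a, sem_a, mean_b, sem_b):
    """Difference between two independent group means and its standard error."""
    diff = mean_a - mean_b
    se = math.sqrt(sem_a ** 2 + sem_b ** 2)  # assumes independent groups
    return diff, se

diff, se = diff_of_means(0.86, 0.07, 0.64, 0.05)
print(round(diff, 2), round(se, 3))
```

Starting from the rounded SEMs, this lands between 0.08 and 0.09, in the same neighborhood as the 0.08 reported; a small gap of that kind could reflect rounding in the inputs or a calculation from the raw data.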

Again, it doesn’t matter that this difference is “statistically significant”—at p < 0.01 in fact. What we want to know is the inferential import of this data for our competing hypotheses. Which one does it support more—and how much more supportive is it?

As indicated at the beginning, a really good (or Good) way to gauge the weight of the evidence in relation to competing study hypotheses is through the use of Bayesian likelihood ratios. To calculate them, we look at where the observed difference in mean CRT scores falls in the respective probability density distributions associated with the “no goddam difference” and “believers more reflective” hypotheses.

By comparing how probable it is that we’d observe such a value under each hypothesis, we get the Bayesian likelihood ratio, which is how much more consistent the data are with one hypothesis than the other:

The author’s data are thus roughly 2000 times more consistent with the “no goddam difference” prediction than with the “believers more reflective” prediction.
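Here is a minimal sketch of how a ratio of this kind can be computed. It rests on assumptions of my own choosing: each hypothesis is represented by a normal density centered on its predicted difference (0 for “no goddam difference,” 0.8 for “believers more reflective”), and both densities share the reported standard error of 0.08. The number it produces need not match the rough figure above, because the ratio is extremely sensitive to the assumed spread of each distribution; what carries over is which hypothesis the data favor.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_ratio(observed, se, mean_h_a, mean_h_b):
    """How many times more probable `observed` is under hypothesis A than B."""
    return normal_pdf(observed, mean_h_a, se) / normal_pdf(observed, mean_h_b, se)

# Observed difference and SE as reported; hypothesis means as specified in the post.
lr = likelihood_ratio(observed=0.22, se=0.08, mean_h_a=0.0, mean_h_b=0.8)
print(f"{lr:.3g}")
```

A quick sanity check on the logic: an observed difference of 0.4, exactly halfway between the two predictions, gives a ratio of 1, meaning the data would not favor either hypothesis.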

Roughly! Figuring out the exact size of this likelihood ratio is not important.

All that matters (all I’m using the likelihood ratio, heuristically, to show) is that we can now see, given what we know CRT is capable of measuring among groups whose scores are so close to the population mean, that the size of the observed difference in mean CRT scores is orders of magnitude more consistent with the “no goddam difference” hypothesis than with the “believers more reflective” hypothesis, notwithstanding its “statistical significance.”

That’s exactly why it’s not a surprise that a predictive model based on CRT scores does no better than a model that just uses the population (or sample) frequency to predict whether any given student (regardless of his or her CRT scores) believes in evolution.

Constructing a Bayesian likelihood ratio here was so much fun that I’m sure you’ll agree we should do it one more time.

In this one, I’m going to re-analyze data from another study I recently did a post on: “Reflective liberals and intuitive conservatives: A look at the Cognitive Reflection Test and ideology,” Judgment and Decision Making, July 2015, pp. 314–331, by Deppe, Gonzalez, Neiman, Jackson Pahlke, the previously mentioned Kevin Smith & John Hibbing.

Here the authors reported data on the correlation between CRT scores and individuals identified with reference to their political preferences. They reported that CRT scores were negatively correlated (p < 0.05) with various conservative position “subscales” in various of their convenience samples, and with a “conservative preferences overall” scale in a stratified nationally representative sample. They held out these results as “offer[ing] clear and consistent support to the idea that liberals are more likely to be reflective compared to conservatives.”

As I pointed out in my earlier post, I thought the authors were mistaken in reporting that their data showed any meaningful correlation—much less a statistically significant one—with “conservative preferences overall” in their nationally representative sample; they got that result, I pointed out, only because they left 2/3 of the sample out of their calculation.

I did point out, too, that the reported correlations seemed way too small, in any case, to support the conclusion that “liberals” are “more reflective” than conservatives. It was Smith’s responses in correspondence that moved me to try to formulate in a more systematic way an answer to the question that a p-value, no matter how minuscule, begs: namely, just “how big” the difference in two groups’ “true” mean CRT scores has to be before one can declare one to be “more reflective,” “analytical,” “open-minded,” etc. than the other.

Well, let’s use likelihood ratios to measure the strength of the evidence in the data in just the 1/3 of the nationally representative sample that the authors used in their paper.

Once more, I’ll assume that “conservatives” are about average in CRT—0.7.

So again, the “liberal more reflective” hypothesis predicts we should expect to find that liberals will have a mean score 0.8 points higher than the population mean, or 1.5 correct. That’s the minimum difference for group means on CRT necessary to register a difference for a group to be deemed more reflective than another whose scores are close to the population mean.

Again, the “no goddam difference” hypothesis predicts the “null”: here no difference whatsoever in mean CRT scores of liberals & conservatives.

By my calculation, in the subsample of the data in question “conservatives” (individuals above the mean on the “conservative positions overall” scale) have a mean CRT of 0.55, SE = 0.08; “liberals” a mean score of 0.73, SE = 0.08.

The estimated difference (w/ rounding) in means is 0.19, SE = 0.09.

So here is the likelihood ratio assessment of the relative support of the evidence for the two hypotheses:

Again, the data are orders of magnitude more consistent with “makes no goddam difference.”

Once more, whether the difference is “5×10^3” or 4.6×10^3 or even 9.7×10^2 or 6.3×10^4 is not important.
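One way to see why the exact figure matters so little is to sweep the assumed spread and watch what happens to the ratio. The sketch below again represents each hypothesis as a normal density with a shared standard deviation (my assumption, as is the range of spreads tried) and uses the closed-form log likelihood ratio for that case, with the observed difference of 0.19 and the predicted difference of 0.8 from above.

```python
import math

def log_lr_null_vs_alt(observed, se, alt_mean):
    """Natural-log likelihood ratio, null (mean 0) over alternative,
    for two normal densities sharing the same standard deviation `se`."""
    return ((observed - alt_mean) ** 2 - observed ** 2) / (2.0 * se ** 2)

observed, alt_mean = 0.19, 0.8
for se in (0.09, 0.11, 0.13, 0.15):  # ASSUMED range of spreads, for illustration
    lr = math.exp(log_lr_null_vs_alt(observed, se, alt_mean))
    print(se, f"{lr:.1e}")
```

Across that whole range the ratio stays in the thousands or higher in favor of “makes no goddam difference,” even though its exact magnitude swings by many orders of magnitude.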

What is important is that there’s clearly much, much, much more reason for treating this data as supporting an inference diametrically opposed to the one drawn by the authors.

Or at least there is if I’m right about how to specify the range of possible observations we should expect to see if the “makes no goddam difference” hypothesis is true and the range of possible observations we should expect to see if the “liberals are more reflective than conservatives” hypothesis is true.

Are those specifications correct?

Maybe not! They’re just the best ones I can come up with for now!

If someone sees a problem & better still a more satisfying solution, it would be very profitable to discuss that!

What’s not even worth discussing, though, is whether “rejecting the null at p<0.05” is the way to figure out if the data support the strong conclusions these papers purport to draw, because in fact that information does not support any particular inference on its own.

4. What to make of this

The point here isn’t to suggest any distinctive defects in these papers, both of which actually report interesting data.

Again, these are just illustrations of the manifest deficiency of NHT, and in particular the convention of treating “rejection of the null at p < 0.05”—by itself! – as license for declaring the observed data as supporting a hypothesis, much less as “proving” or even furnishing “strong,” “convincing” etc. evidence in favor of it.

And again in applying this critique to these particular papers, and in using Bayesian likelihood ratios to liberate the inferential significance locked up in the data, I’m not doing anything the least bit original!

On the contrary, I’m relying on arguments that were advanced over 50 years ago, and that have been strengthened and refined by myriad super smart people in the interim.

For sure, exposure of the “NHT fallacy” reflected admirable sophistication on the part of those who developed the critique.

But as I hope I’ve been showing in the last couple of posts, the defect in NHT that these scholars identified is really, really easy to understand. Once it’s been pointed out, any smart middle schooler can readily grasp it!

So what the hell is going on?

I think the best explanation for the persistence of the NHT fallacy is that it is a malignant craft norm.

Treating “rejection of the null at p < 0.05” as license for asserting support of one’s hypothesis is “just the way the game works,” “the way it’s done.” Someone being initiated into the craft can plainly see that in the pages of the leading journals, and in the words and attitudes (the facial expressions, even) of the practitioners whose competence and status are vouched for by all of their NHT-based publications and by the words and attitudes (and even facial expressions) of other certified members of the field.

Most of those who enter the craft will therefore understandably suppress whatever critical sensibilities might otherwise have alerted them to the fallacious nature of this convention. Indeed, if they can’t do that, they are likely to find the path to establishing themselves barred by jagged obstacles.

The way to progress freely down the path is to produce and get credit and status for work that embodies the NHT fallacy. Once a new entrant gains acceptance that way, then he or she too acquires a stake in the vitality of the convention, one that not only reinforces his or her aversion to seriously interrogating studies that rest on the fallacy but that also motivates him or her to evince thereafter the sort of unquestioning, taken-for-granted assent that perpetuates the convention despite its indisputably fallacious character.

And in case you were wondering, this diagnosis of the malignancy of NHT as a craft norm in the social sciences is not the least bit original to me either! It was Rozeboom’s diagnosis over 50 yrs ago.

So I guess we can see it’s a slow-acting disease. But make no mistake, it’s killing its host.

Refs

Cohen, J. The Earth is Round (p < .05). Am Psychol 49, 997 – 1003 (1994).

Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. Psych Rev 70, 193 – 242 (1963).

Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).

Gigerenzer, G. Mindless statistics. Journal of Socio-Economics 33, 587-606 (2004).

Goodman, S.N. Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Ann Int Med 130, 995 – 1004 (1999a).

Goodman, S.N. Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. Ann Int Med 130, 1005 – 1013 (1999b).

Rozeboom, W.W. The fallacy of the null-hypothesis significance test. Psychological bulletin 57, 416 (1960).
