## How big a difference in mean CRT scores is "big enough" to matter? or NHT: A malignant craft norm, part 2

**1. Now where was I . . . ?**

Right . . . So yesterday I posted part I of this series, which is celebrating the bicentennial, or perhaps it’s the tricentennial (one loses track after a while), of the “NHT Fallacy” critique.

The nerve of it is that “rejection of the null [however it is arbitrarily defined] at p < 0.05 [or p < 10^-50 or whatever]” furnishes no inferentially relevant information in hypothesis testing. To know whether an observation counts as evidence in support of a hypothesis, the relevant information is not how likely we were to observe a particular value if the “null” is true but how much more or less likely we were to observe that value if a particular hypothesized true “value” is correct than if another hypothesized “true” value is correct (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999a; Gigerenzer 2004).

Actually, I’m not sure when the first formulation of the critique appeared. Amusingly, in his 1960 classic *The Fallacy of the Null-hypothesis Significance Test*, Rozeboom apologetically characterized his own incisive attack on the inferential barrenness of NHT as “not a particularly original view”!

The critique has been refined and elaborated many times, in very useful ways, since then, too. Weirdly, the occasion for so many insightful elaborations has been the persistence of NHT despite the irrefutable proofs of those critiquing it.

More on that in a bit, but probably the most interesting thing that has happened in the career of the critique in the last 50 yrs. or so has been the project to devise tractable alternatives to NHT that really do quantify the evidentiary weight of any particular set of data.

I’m certainly not qualified to offer a reliable account of the intellectual history of using Bayesian likelihood ratios as a test statistic in the social sciences (cf. Good). But the utility of this strategy was clearly recognized by Rozeboom, who observed that the inferential defects in NHT could readily be repaired by analytical tools forged in the kiln of “the classic theory inverse probabilities.”

The “Bayes Factor” (actually, “the” misleadingly implies that there is only one variant of it) is the most muscular, deeply theorized version of the strategy.

But one can, I believe, still get a lot of mileage out of less technically elaborate analytical strategies using likelihood ratios to assess the weight of the evidence in one’s data (e.g., Goodman, 1999b).

For many purposes, I think, the value of using Bayesian likelihood ratios is largely heuristic: having to specify the predictions that opposing plausible hypotheses would generate with respect to the data, and to formulate an explicit measure of the relative consistency of the observed outcome with each, forces the researcher to do what the dominance of NHT lets him or her evade: report information that enables a reflective person to draw an inference about the weight of the evidence in relation to competing explanations of the dynamic at issue.

That’s all that’s usually required for others to genuinely learn from and critically appraise a researcher’s work. For sure there are times when everything turns on how precisely one is able to estimate some quantity of interest, where key conceptual issues about how to specify one or another parameter of a Bayes Factor will have huge consequences for interpretation of the data.

But in lots of experimental models, particularly in social psychology, it’s enough to be able to say “yup, that evidence is definitely more consistent—way more consistent—with what we’d expect to see if H1 rather than H2 is true”—or instead, “wait a sec, that result is not really any more supportive of that hypothesis than this one!” In which case, a fairly straightforward likelihood ratio analysis can, I think, add a lot, and even more importantly avoid a lot of the inferential errors that accompany permitting authors to report “p < 0.05” and then make sweeping, unqualified statements not supported by their data.

That’s exactly the misadventure, I said “yesterday,” that a smart researcher experienced with NHT. That researcher found a “statistically significant” correlation (i.e., rejection of the “null at p<0.0xxx”) between the CRT scores (Frederick 2005) of a sample of Univ. of Ky. undergraduates and their responses to a standard polling question on “belief in” evolution; he then treated that as corroboration of his hypothesis that “individuals who are better able to analytically control their thoughts are more likely” to overcome the intuitive attraction of the idea that “living things, are ... intentionally designed by some external agent” to serve some “function and purpose,” and thus “more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species.”

But as I pointed out, the author’s data, contrary to his assertion, unambiguously *didn’t* support that hypothesis.

Rather than showing that “analytic thinking consistently predicts endorsement of evolution,” his data demonstrated that knowing the study subjects’ CRT scores furnished absolutely no predictive insight into their "evolution beliefs." The CRT predictor in the author’s regression model was “statistically significant” (p < 0.01), but was *way too small in size* to outperform a “model” that simply predicted “everyone” in the author’s sample, regardless of their CRT score, rejected science’s account of the natural history of human beings.
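Here is a minimal Python sketch of why a small but "significant" coefficient can leave classification untouched. The intercept and slope are hypothetical values chosen only to give a positive, modest CRT effect; they are not the author's estimates, and the function names are made up for illustration.

```python
import math

def predicted_prob(crt, intercept, slope):
    """Logistic-model probability of endorsing evolution at a given CRT score."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * crt)))

def classify(crt, intercept, slope):
    """Predict 'believer' (True) only when the modeled probability exceeds 0.5."""
    return predicted_prob(crt, intercept, slope) > 0.5

# Hypothetical coefficients: a positive but small CRT slope (assumption).
INTERCEPT, SLOPE = -0.6, 0.15

# CRT is a three-item test, so scores run from 0 to 3.
predictions = {crt: classify(crt, INTERCEPT, SLOPE) for crt in range(4)}
for crt in range(4):
    p = predicted_prob(crt, INTERCEPT, SLOPE)
    print(f"CRT={crt}: P(believe)={p:.2f} -> predict believer? {predictions[crt]}")
```

With these made-up numbers the predicted probability rises with CRT but never crosses 0.5, so the model labels every subject a nonbeliever, which is exactly what the "everyone rejects" baseline does.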

(Actually, there were even more serious, or maybe just more interesting, problems having to do with the author’s failure to test the data's relative support for a genuine alternative hypothesis about how cognitive reflection relates to "beliefs" in evolution: namely, by magnifying the opposing positions of groups for whom "evolution beliefs" have become (sadly, pointlessly, needlessly) identity defining. But I focused “yesterday” on this one b/c it so nicely illustrates the NHT fallacy.)

Had he asked the question that his p-value necessarily doesn’t address (how much more consistent are the data with one hypothesis than another?), he would have actually found out that the results of his study were more consistent with the hypothesis that “cognitive reflection makes no goddam difference” in what people say when they answer a standard “belief in evolution” survey item of the sort administered by Gallup or Pew.

The question I ended on, then, was,

*How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection **do** “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they **don't**?*

That’s a very complicated and interesting question, and so now I’ll offer my own answer, one that uses the inference-disciplining heuristic of forming a Bayesian likelihood ratio.

2 provisos:

1. Using a Bayesian likelihood ratio is not, in my view, the *only* device that can be used to extract from data like these the information necessary to form cogent inferences about the support of the data for study hypotheses. Anything that helps the analyst and reader gauge the relative support of the data for the study hypothesis in relation to a meaningful alternative or set of meaningful alternatives can do that.

Often it will be *obvious* how the data do that, given the sign of the value observed in the data or the size of it in relation to what common understanding tells one the competing hypotheses would predict.

But sometimes those pieces of information might not be so obvious, or might be open to debate. Or in any case, there could be circumstances in which extracting the necessary information is not so straightforward and in which a device like forming a Bayesian likelihood ratio in relation to the competing hypotheses helps, a lot, to figure out what the inferential import of the data is.

That's the pragmatic position I mean to be staking out here in advocating alternatives to the pernicious convention of permitting researchers to treat "p < 0.05" as evidence in support of a study hypothesis.

2. My "Bayesian likelihood ratio" answer here is almost surely wrong!

But it *is* at least trying to answer the right question, and by putting it out there, maybe I can entice someone else who has a better answer to share it.

Indeed, it was exactly by enticing others into scholarly conversation that I came to see what was cool and important about this question. Without implying that they are at all to blame for any deficiencies in this analysis, it’s one that emerged from my on-line conversations with Gordon Pennycook, who commented on my original post on this article, and my off-line ones with Kevin Smith, who shared a bunch of enlightening thoughts with me in correspondence relating to a post that I did on an interesting paper that he co-authored.

**2. What sorts of differences can the CRT reliably measure?**

Here’s the most important thing to realize: the CRT is friggin hard!

It turns out that the *median* score on the CRT, a three-question test, is *zero* when administered to the general population. I kid you not: studies w/ general population samples (not student or M Turk samples, or ones recruited from visitors to a website that offers to furnish study subjects with information on the relationship between their moral outlooks and their intellectual styles) show that 60% of the subjects can't get a single answer correct.

Hey, maybe 60% of the population falls short of the threshold capacity in conscious, effortful information processing that critical reasoning requires. I doubt that but it's possible.

What that means, though, is that if we use the CRT in a study (as it makes a lot of sense to do; it’s a pretty amazing little scale), we necessarily can't get any information from our data on *differences* in cognitive reflection among a group of people comprising 60% of the population. Accordingly, if we had two groups *neither of whose* mean scores were appreciably above the "population mean," we'd be making fools of ourselves to think we were observing any real difference: the test just doesn't have any measurement precision or discrimination at that "low" a level of the latent disposition.

We can be even more precise about this -- and we ought to be, in order to figure out how "big" a difference in mean CRT scores would warrant saying stuff like "group x is more reflective than group y" or "differences in cognitive reflection 'predict'/'explain' membership in group x as opposed to y...."

Using item response theory, which scores the items on the basis of how likely a person with any particular level of the latent disposition (theta) is to get that particular item correct, we can assess the measurement precision of an assessment instrument at any point along theta. We can express that measurement precision in terms of a variable "reliability coefficient," which reflects what fraction of the differences in individual test scores in that vicinity of theta is attributable to "true differences" & how much to measurement error.

Here's what we get for CRT (based on a general population sample of about 1800 people):

The highest degree of measurement precision occurs around +1 SD, or approximately "1.7" answers correct. Reliability there is 0.60, which actually is pretty mediocre; for something like the SAT, it would be pretty essential to have 0.8 along the entire continuum from -2 to +2 SD. That’s b/c there is so much at stake, both for schools that want to rank students pretty much everywhere along the continuum, and for the students they are ranking.

But I think 0.60 is "okay" if one is trying to make claims about groups in general & not rank individuals. If one gets below 0.5, though, the correlations between the latent variable & anything else will be so attenuated as to be worthless....
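The attenuation point can be illustrated with the classical (Spearman) attenuation formula, under which the expected observed correlation is the true correlation multiplied by the square root of the product of the two measures' reliabilities. In the Python sketch below, the "true" correlation of 0.30 is a hypothetical value picked for illustration, and the other measure is assumed to be perfectly reliable.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y=1.0):
    """Expected observed correlation given the reliability of each measure."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Hypothetical true correlation between reflection and some outcome.
TRUE_R = 0.30

for rel in (0.8, 0.6, 0.5, 0.3):
    observed = attenuated_correlation(TRUE_R, rel)
    print(f"reliability={rel:.1f}: expected observed r = {observed:.2f}")
```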

So here are some *judgments* I'd make based on this understanding of the psychometric properties of CRT:

- If the "true" mean CRT scores of two groups -- like "conservatives" & "liberals" or "evolution believers" & "disbelievers" -- are **both** within the red zone, then one has no reasonable grounds for treating the two as different in their levels of reflection: CRT just doesn't have the measurement precision to justify the claim that the higher-scoring group is "more reflective" even if the difference in means is "statistically significant."
- Obviously, if one group's true mean is in the red zone and another's in the green or yellow, then we can be confident the two really differ in their disposition to use conscious, effortful processing.
- Groups within the green zone probably can be compared, too. There's reasonable measurement precision there -- although it's still iffy (alpha is about 0.55 on avg...).

If I want to see if groups differ in their reflectiveness, then, I should not be looking to see if the difference in their CRT scores is "significant p < 0.05," since that by itself *won't support any inferences* relating to the hypotheses given my guidelines above.

If one group has a "true" mean CRT score that is in the "red" zone, the hypothesis that it is less reflective than another group can be supported with CRT results *only* if the latter group's "true" mean score is in the green zone.

**3. Using likelihood ratios to weigh the evidence on “whose is bigger?”**

So how can we use this information to form a decent hypothesis testing strategy here?

Taking the "CRT makes no goddam difference" position, I'm going to guess that those who "don't believe" in evolution are pretty close to the population mean of "0.7." If so, then those who "do believe" will need to have a “true” mean score of +0.5 SD or about "1.5 answers correct" before there is a "green to red" zone differential.

That's a difference in mean score of approximately "0.8 answers correct."

The "believers more reflective" hypothesis, then, says we should expect to find that believers will have a mean score 0.8 points higher than the population mean, or 1.5 correct.

The “no goddam difference” hypothesis, we’ll posit, predicts the "null": no difference whatsoever in mean CRT scores of the believers & nonbelievers.

Now turning to the data, it turns out the "believers" in the author’s sample had a mean CRT of 0.86, SEM = 0.07. The "nonbelievers" had a mean CRT score of 0.64, SEM = 0.05.

I calculate the difference as 0.22, SEM = 0.08.

Again, it doesn’t matter that this difference is “statistically significant”—at p < 0.01 in fact. What we want to know is the inferential import of this data for our competing hypotheses. Which one does it support more—and how much more supportive is it?

As indicated at the beginning, a really good (or Good) way to gauge the weight of the evidence in relation to competing study hypotheses is through the use of Bayesian likelihood ratios. To calculate them, we look at where the observed difference in mean CRT scores falls in the respective probability density distributions associated with the “no goddam difference” and “believers more reflective” hypotheses.

By comparing how probable it is that we’d observe such a value under each hypothesis, we get the Bayesian likelihood ratio, which is how much more consistent the data are with one hypothesis than the other:

The author’s data are thus roughly 2000 times more consistent with the “no goddam difference” prediction than with the “believers more reflective” prediction.

**Roughly!** Figuring out the exact size of this likelihood ratio is *not* important.

All that matters (all I’m using the likelihood ratio, heuristically, to show) is that, given what we know CRT is capable of measuring among groups whose scores are so close to the population mean, the size of the observed difference in mean CRT scores is **orders of magnitude** more consistent with the “no goddam difference” hypothesis than with the “believers more reflective” hypothesis, notwithstanding its "statistical significance."
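For readers who want to see the mechanics, here is a bare-bones Python sketch of a likelihood ratio for two point hypotheses. It is not the calculation behind the figure quoted above, whose distributional details are not given in the text. It assumes a normal density centered on each predicted difference and uses the reported SEM as the spread of both; the function names are illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_ratio(observed, mean_h0, mean_h1, sd_h0, sd_h1):
    """How many times more probable the observation is under H0 than under H1."""
    return normal_pdf(observed, mean_h0, sd_h0) / normal_pdf(observed, mean_h1, sd_h1)

# Observed difference and the two predictions come from the text; using the
# reported SEM as the spread of both distributions is this sketch's assumption.
OBSERVED, H0_DIFF, H1_DIFF, SEM = 0.22, 0.0, 0.8, 0.08

lr = likelihood_ratio(OBSERVED, H0_DIFF, H1_DIFF, SEM, SEM)
print(f"LR (H0 : H1) = {lr:.3g}")
```

The result is extremely sensitive to the spread assigned to each hypothesis (widening either distribution shrinks the ratio by orders of magnitude), so the number this prints should not be expected to match the one reported above. The qualitative conclusion, that an observation this close to zero sits far out in the tail of the "0.8 difference" distribution, does not depend on that choice.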

That’s exactly why it’s not a surprise that a predictive model based on CRT scores does no better than a model that just uses the population (or sample) frequency to predict whether any given student (regardless of his or her CRT scores) believes in evolution.

Constructing a Bayesian likelihood ratio here was so much fun that I’m sure you’ll agree we should do it one more time.

In this one, I’m going to re-analyze data from another study I recently did a post on: “Reflective liberals and intuitive conservatives: A look at the Cognitive Reflection Test and ideology,” Judgment and Decision Making, July 2015, pp. 314–331, by Deppe, Gonzalez, Neiman, Jackson Pahlke, the previously mentioned Kevin Smith & John Hibbing.

Here the authors reported data on the correlation between CRT scores and individuals identified with reference to their political preferences. They reported that CRT scores were negatively correlated (p < 0.05) with various conservative position “subscales” in various of their convenience samples, and with a “conservative preferences overall” scale in a stratified nationally representative sample. They held out these results as “offer[ing] clear and consistent support to the idea that liberals are more likely to be reflective compared to conservatives.”

As I pointed out in my earlier post, I thought the authors were mistaken in reporting that their data showed any meaningful correlation—much less a statistically significant one—with “conservative preferences overall” in their nationally representative sample; they got that result, I pointed out, only because they left 2/3 of the sample out of their calculation.

I did point out, too, that the reported correlations seemed way too small, in any case, to support the conclusion that “liberals” are “more reflective” than conservatives. It was Smith’s responses in correspondence that moved me to try to formulate in a more systematic way an answer to the question that a *p*-value, no matter how minuscule, begs: namely, just “how big” the difference in two groups’ “true” mean CRT scores has to be before one can declare one to be “more reflective,” “analytical,” “open-minded,” etc. than the other.

Well, let’s use likelihood ratios to measure the strength of the evidence *in* the data in just the 1/3 of the nationally representative sample that the authors used in their paper.

Once more, I’ll assume that “conservatives” are about average in CRT—0.7.

So again, the "liberals more reflective" hypothesis predicts we should expect to find that liberals will have a mean score 0.8 points higher than the population mean, or 1.5 correct. That’s the minimum difference in group means on CRT necessary for one group to be deemed more reflective than another whose scores are close to the population mean.

Again, the “no goddam difference” hypothesis predicts the "null": here no difference whatsoever in mean CRT scores of liberals & conservatives.

By my calculation, in the subsample of the data in question “conservatives” (individuals above mean on the “conservative positions overall” scale) have a mean CRT of 0.55, SE = 0.08; “liberals” a mean score of 0.73, SE = 0.08.

The estimated difference (w/ rounding) in means is 0.19, SE = 0.09.

So here is the likelihood ratio assessment of the relative support of the evidence for the two hypotheses:

Again, the data are orders of magnitude more consistent with “makes no goddam difference.”

Once more, whether the difference is “5x10^3” or 4.6x10^3 or even 9.7x10^2 or 6.3x10^4 is not important.

What is important is that there’s clearly much much much more reason for treating this data as supporting an inference diametrically opposed to the one drawn by the authors.

Or at least there is if I’m right about how to specify the *range* of possible observations we should expect to see *if* the “makes no goddam difference” hypothesis is true and the *range* of possible observations we should expect to see if the “liberals are more reflective than conservatives” hypothesis is true.

Are those specifications correct?

Maybe not! They're just the best ones I can come up with for now!

If someone sees a problem & better still a more satisfying solution, it would be very profitable to discuss that!

What's not even worth discussing, though, is whether "rejecting the null at p<0.05" is the way to figure out if the data support the strong conclusions these papers purport to draw, because in fact that information does not support any particular inference on its own.

**4. What to make of this**

The point here isn’t to suggest any distinctive defects in these papers, both of which actually report interesting data.

Again, these are just illustrations of the manifest deficiency of NHT, and in particular the convention of treating “rejection of the null at p < 0.05” (by itself!) as license for declaring the observed data as supporting a hypothesis, much less as “proving” or even furnishing “strong,” “convincing” etc. evidence in favor of it.

And **again**, in applying this critique to these particular papers, and in using Bayesian likelihood ratios to liberate the inferential significance locked up in the data, I’m *not* doing anything the least bit original!

On the contrary, I’m relying on arguments that were advanced over 50 years ago, and that have been strengthened and refined by myriad super smart people in the interim.

For sure, exposure of the “NHT fallacy” reflected admirable sophistication on the part of those who developed the critique.

But what I hope I’ve shown in the last couple of posts is that the defect in NHT that these scholars identified is really, really easy to understand. Once it’s been pointed out, any smart middle schooler can readily grasp it!

So what the hell is going on?

I think the best explanation for the persistence of the NHT fallacy is that it is a **malignant craft norm**.

Treating “rejection of the null at p < 0.05” as license for asserting support of one’s hypothesis is “just the way the game works,” “the way it’s done.” Someone being initiated into the craft can plainly see that in the pages of the leading journals, and in the words and attitudes (the facial expressions, even) of the practitioners whose competence and status are vouched for by all of their NHT-based publications and by the words and attitudes (and even facial expressions) of other certified members of the field.

Most of those who enter the craft will therefore understandably suppress whatever critical sensibilities might otherwise have alerted them to the fallacious nature of this convention. Indeed, if they can’t do that, they are likely to find the path to establishing themselves barred by jagged obstacles.

The way to progress freely down the path is to produce and get credit and status for work that embodies the NHT fallacy. Once a new entrant gains acceptance that way, then he or she too acquires a *stake* in the vitality of the convention, one that not only reinforces his or her aversion to seriously interrogating studies that rest on the fallacy but that also motivates him or her to evince thereafter the sort of unquestioning, taken-for-granted assent that perpetuates the convention despite its indisputably fallacious character.

And in case you were wondering, this diagnosis of the malignancy of NHT as a craft norm in the social sciences is not the least bit original to me either! It was Rozeboom’s diagnosis over 50 yrs ago.

So I guess we can see it’s a slow-acting disease. But make no mistake, it’s killing its host.

**Refs**

Cohen, J. The Earth is Round (p < .05). *Am Psychol* **49**, 997-1003 (1994).

Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. *Psych Rev* **70**, 193-242 (1963).

Frederick, S. Cognitive Reflection and Decision Making. *Journal of Economic Perspectives* **19**, 25-42 (2005).

Gigerenzer, G. Mindless statistics. *Journal of Socio-Economics* **33**, 587-606 (2004).

Goodman, S.N. Toward evidence-based medical statistics. 2: The Bayes factor. *Annals of internal medicine* **130**, 1005-1013 (1999a).

Goodman, S.N. Towards Evidence-Based Medical Statistics. 1: The P Value Fallacy. *Ann Int Med* **130**, 995-1004 (1999b).

Rozeboom, W.W. The fallacy of the null-hypothesis significance test. *Psychological bulletin* **57**, 416 (1960).

## Reader Comments (11)

Hi Dan,

I agree with you wholeheartedly about the challenges of trying to extract any useful information from the CRT. The median score is zero, it consists of only three items, and the three items are all dichotomous. This makes for a really crude measurement, and I'm skeptical of its utility in measuring anything but the most obvious of associations.

I'd also like to applaud you for embracing the Bayesian hypothesis-comparison approach as an alternative to the NHST hypothesis-rejection approach. This process of comparing predictions has, as you note, tremendous value.

With regard to the Bayes factors, however, you might be overstating the evidence for the null by a bit. Bayesians generally do not like to choose a point-alternative hypothesis like the H1: delta = 0.8 you show here, because such a point-alternative is too restrictive. Often, there are many possible effect sizes that would be relevant to a theory. For that reason, we often choose diffuse alternative hypotheses like H1: delta ~ Normal(0, .5) or their one-tailed equivalents.

As an example, I used Zoltan Dienes' Bayes calculator at http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/bayes_factor.swf, using d = .2, se = .09

When I compare against H1: delta = .8, I also get something like 4x10^3 in favor of H0.

However, when I compare against H1: delta ~ Normal(0, .5), one-tailed, I got modest evidence in favor of the alternative hypothesis, 3.82. A two-tailed test would instead find little evidence one way or the other, BF = 1.94.
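The two diffuse-prior figures above can be approximately reproduced in closed form. The Python sketch below does not use or emulate the calculator itself. It assumes a normal likelihood for the observed difference and treats the ".5" in Normal(0, .5) as the prior's standard deviation, with the one-tailed version being the positive half of that prior. Function names are illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bf_normal_prior(observed, se, prior_sd, one_tailed=False):
    """Bayes factor for H1: delta ~ Normal(0, prior_sd) against H0: delta = 0."""
    total_var = se ** 2 + prior_sd ** 2
    # Marginal likelihood under a two-tailed normal prior is Normal(0, total_var).
    bf = normal_pdf(observed, 0.0, math.sqrt(total_var)) / normal_pdf(observed, 0.0, se)
    if one_tailed:
        # Half-normal prior: double the density, keep only the posterior mass above 0.
        post_mean = observed * prior_sd ** 2 / total_var
        post_sd = math.sqrt(se ** 2 * prior_sd ** 2 / total_var)
        bf *= 2.0 * normal_cdf(post_mean / post_sd)
    return bf

two_tailed = bf_normal_prior(0.2, 0.09, 0.5)
one_tailed = bf_normal_prior(0.2, 0.09, 0.5, one_tailed=True)
print(f"two-tailed BF = {two_tailed:.2f}, one-tailed BF = {one_tailed:.2f}")
```

Under these assumptions the two values come out close to the 1.94 and 3.82 reported above.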

So what do I make of this? I agree with you that the difference is *certainly* not so large as delta = 0.8. However, if one had specifically predicted a small but positive association, this paper provides a little evidence for that. But if you didn't have a strong *a priori* prediction for which way people would fall (e.g. maybe evolution skeptics could be more questioning of things in general and thereby less likely to fall for the trap answers on the CRT), then you're more or less where you started.

@Joe--

Thanks!

Would you agree that we are trying to specify some plausible range of values, ones that can be arrayed in a density distribution, to represent the respective hypotheses?

If so, then one question is how to structure those values, correct?

I have spelled out my reason for thinking a "true" effect size of 0.8 is the right one for H1--and have obviously availed myself of a tremendous amount of simplicity by treating the density distribution for "observed values" consistent with H1 as being normal (why not Cauchy distribution? or uniform distribution?), 2-tail, etc. I think that's defensible given my modest conception of LRs as "heuristic" gauges of evidentiary weight. But if not, then I think those who think Bayes Factor involves too much guesswork on those sorts of matters have a point ....

But isn't the nerve of your point simply that you think the range of normally distributed values for H1 should be made much more dense to left of 0.8 -- because "[o]ften, there are many possible effect sizes that would be relevant to a theory"?

To me the biggest problem w/ NHT was that by accepting it we give the researcher a complete pass on specifying what "effect sizes would be relevant" to his theory!

Here there is to be no pass-- whether in the form of "reject the null, p < 0.05" *or* "often there are many possible values that are relevant ...."

I think the effort to specify the *relevant* "true" effect size here -- the effect we'd have to see before we took any difference in means seriously-- is the whole point of using a likelihood ratio alternative to NHT.

If someone told me that their prediction was a true mean effect size difference of "0.3-0.8" for groups whose CRT scores are as close to the population mean as one would expect all "liberals" & all "conservatives" to be, I'd say "why should I care? I know that you are comparing values within a zone of CRT that has no meaningful discrimination!"

So tell me how you think H1 should be specified given what the psychometric properties of CRT are. And tell me how H0 should be specified too-- b/c it doesn't have to be a point estimate of 0 for someone who understands that the CRT lacks discrimination except for groups who are very widely separated or groups more closely separated within a very very very narrow portion of theta. I think making H0 the null was the most conservative thing I did here in trying to use likelihood ratios to discipline the inquiry for inferential meaning in the data.

Disagree? You might well-- vehemently! That's fine: I'm eager to be set right if I'm off.

Hi Dan,

I'm noticing I misunderstood your post slightly and thought you were talking about a *standardized mean difference* of 0.8, which would be unrealistically large for most of social psychology, instead of a *raw mean difference* of 0.8, equivalent to 0.5 Cohen's d here. That's more reasonable for effects in social psych. That means that, as the observed raw mean difference is 0.2, and SD is 1.6, then the observed effect is d = 0.125, which will be rather more consistent with the null. It just won't be 4000 : 1 evidence for the null.

My point about distributed ranges isn't to say that any value is consistent with theory (e.g. NHST's notorious delta != 0), but that a *constrained* range of values would be consistent. Instead of taking all your probability and spending it on delta = .5, you spend some of it between 0 and .2, some more between .2 and .5, some more between .5 and 1... There's a whole literature on how to choose a prior (typically using normal-like distributions), and it is contentious, but it's not as though one is allowed completely free rein with such choices. To say that either it's 0.8 or it isn't seems to me a little too strident.

As for psychometric properties, I'm not well-acquainted with IRT, but low reliability would, if anything, reduce the *observed* correlation between the latent construct (thinking style) and outcome (beliefs), as many people with "true" CRT scores of 0.7 will be sloppily categorized as 0 or 1 (or maybe even 2 or 3, if they're lucky). This introduction of measurement error would reduce the effect size, if I understand it correctly, as it increases the error while keeping the mean difference the same. So the question of whether there's a *practical* difference, as though you were one day given the task of predicting people's CRT scores given their evolutionary beliefs, is a little different from the question of whether there's a *theoretical*, population-scale difference of the type that could be studied given a big sample.

Anyway, rescaling things in light of standardized vs. unstandardized means, I get 2.53 : 1 in favor of one-tailed vs. null, 1.28 : 1 in favor of the two-tailed vs. the null. It's not strong evidence either way, and so it's not a hypothesis I'd wager money on.

@Joe:

I agree this is a very simplified way to approach calculating Bayesian likelihood ratios here-- & I have deliberately avoided characterizing my calculation as the "Bayes factor," in deference to how admirably attentive those who use them are to diffuse hypotheses, the properties of the probability density distribution associated with them, and the like.

But I'm not sure why you believe it makes a difference to transform the raw difference in means into a Cohen's d or any other standardized effect size measure.

If we are warranted in treating each hypothesis as generating a point estimate for "true differences in means" & know the standard errors associated with those estimates, we can calculate the probability of the *observed difference in means* conditional on each hypothesis & compare directly. That will give us a likelihood ratio representation of the evidence of the data in relation to the hypotheses--the straightforward procedure I've followed here.

Disagree?

I would have thought the more contentious issue was whether it's justified to be unimpressed with any hypothesis that doesn't predict a difference of at least 0.8 in mean CRT scores--but given how ridiculously little reliability there is in scores that differ from one another by less than that (when the two means are themselves less than 0.5 SDs from the mean on theta), I do think that's justified. Unless & until we have a cognitive reflection measure that doesn't give up entirely on distinguishing differences in that disposition for 60% of the population, I think that's the high bar that has to be cleared.

You note that I connect the "how big" question for difference in CRT to the "practical" issue of being able to give an account of how differences in reflection might figure in some other difference in the beliefs or behaviors of the groups. You say you aren't sure how to think about these things; I'm perplexed too. Imagine we *simulated* a difference in CRT scores between groups in which the lower-scoring one had a mean score approximately 0.15 correct answers below the population mean on CRT; how much *bigger* do you think the difference would have to be before CRT would actually improve our ability to classify individuals correctly into the groups? Why isn't that a helpful start? Remember too, that we aren't talking about contracting lung cancer or some other very infrequent disease; the groups are divided about 62-38 on "evolution disbelief." (BTW, I can tell you a model that uses religiosity, CRT, and interaction of the two will *definitely* kick the ass out of the "everyone disbelieves" model!)
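One rough way to get a feel for that question is the sketch below. It assumes, purely for illustration, that scores in both groups are normal with equal SDs (real CRT scores are 0-3 counts, so this is a loud simplification), that the minority group's mean sits `d` SDs above the majority group's, and that we use the best possible single cutoff. It returns how much classification accuracy improves over always guessing the 62% group. The function names and the values of `d` are hypothetical.

```python
from math import erfc, log, sqrt


def upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * erfc(x / sqrt(2.0))


def accuracy_gain(d, p_majority=0.62):
    """Accuracy gained over always guessing the majority group.

    Assumes normal scores with equal SDs in both groups (a simplification),
    with the minority group's mean d SDs above the majority group's.
    """
    if d <= 0:
        return 0.0  # no separation, so the score cannot help
    # Bayes-optimal cutoff: label someone "minority" only above this score.
    cutoff = d / 2.0 + log(p_majority / (1.0 - p_majority)) / d
    minority_caught = (1.0 - p_majority) * upper_tail(cutoff - d)
    majority_mislabeled = p_majority * upper_tail(cutoff)
    return minority_caught - majority_mislabeled


if __name__ == "__main__":
    # Arbitrary standardized separations, for illustration only.
    for d in (0.1, 0.2, 0.5, 0.8, 1.5):
        print(f"d = {d:.1f}: accuracy gain over baseline = {accuracy_gain(d):.4f}")
```

Note that `d` here is in SD units, not in raw CRT answers, so it is not directly comparable to the 0.8 figure above. The point of the sketch is only that, with a 62-38 split, a small separation moves the optimal cutoff so far into the tail that almost nobody is reclassified.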

Last point: If you are willing to bet in favor of finding a 0.8 difference in means of "evolution believers" & "disbelievers" in any given randomly selected N=700 sample of University of Ky undergrads, I'll very happily lay 10^3 odds against & accept any wager of up to 1 "Romney" ($10,000).

I have a basic question for you, Dan. If you're interested in asking if a person's CRT score makes a difference in whether or not they "believe in evolution" (call that variable Evolution), why are you testing these hypotheses about the distribution of CRT | Evolution? Isn't the hypothesis you're interested in a statement about the distribution of Evolution | CRT? This is confusing me; maybe I've lost the train of argument in this extended thread.

I wouldn't expect to see a difference in CRT | Evolution vs. CRT | ~Evolution because there are plenty of people who "believe in evolution" without thinking carefully about counterintuitive math questions. I know plenty of people who went into the life sciences because they didn't like math riddles.

Hi Dan. I was struck by your comment about how adherence to the NHT is a "malignant craft norm."

I think you've identified another group that might be suffering from identity-protective cognition.

Just replace "white males" with social scientists.

A historical aside on the question of when this critique was discovered:

Cohen, at 1000, states that the earliest identifier of this problem he was aware of was "Some difficulties of interpretation encountered in the application of the chi-square test", by Joseph Berkson in 1938. Tracing back further is difficult since Rozeboom and Berkson's articles predate current citation practices, but at a minimum Berkson appears to have noticed the problem independently of others.

The worst possible coda for the faulty methodological memory of the sciences, though, is Cohen's citation to Popper's Logic(!), at 999, as an alternate approach to "strong hypothesis testing." This would mean the problem, and its solutions, are as old as modern scientific practice itself, especially if Popper addressed the issue in the original 1934 edition.

@Dypoon -- isn't it just a question of what term one puts on which side of "=" in one or another model of how the two relate?

I do think the best test is impact of CRT on evolution *conditional on cultural type*; that, not the "reject the null," tests the relative plausibility of the most important alternative hypothesis to the author's own.

@Robert--

take a look at this....

@dmk38:

I'm not sure whether to laugh or cry; so long as I don't need to apply the triadic model when I'm writing up my concept explications, I'm happy to call Peirce the original source.

Uh...Which of those we're using as the predictor and which as the predicted actually matters a lot, doesn't it? The "=" in a model specification isn't generally a symmetric thing. Philosophically, when we make a garden-variety linear model, we attribute residuals to the response variable, and not the predictors. If you're doing projection to latent structures, that changes, of course, but I don't think that's what we've been doing.

I think this would be especially important in this instance where we have a population that dominantly doesn't "believe in evolution", instead of one where that binary were more balanced 50/50. If you look at CRT | Evolution, background variation in CRT will swamp a signal generated by the imbalanced population unless the effect size is large enough to compensate. The same effect would be relatively easier to distinguish by looking at Evolution | CRT.

I agree with your suggestion of a better test, of course. Incidentally, I've been trying to get the data myself, but couldn't find it on Gervais's website where he said he put it. He gave no URL in the manuscript, and the place is a mess of links. Did you find the data on his website?