I really really really like the Cognitive Reflection Test--or "CRT" (Frederick 2005).
The CRT is a compact three-item assessment of the disposition to rely on conscious, effortful, "System 2" reasoning as opposed to rapid, heuristic-driven "System 1" reasoning. As an objective, performance-based measure, CRT has been shown to be vastly superior to self-report measures like "need for cognition" ("agree or disagree: 'thinking is not my idea of fun'; 'the notion of thinking abstractly is appealing to me' . . .") in predicting vulnerability to the various biases that reflect over-reliance on System 1 information processing (Toplak, West & Stanovich 2011).
As far as I’m concerned, Shane Frederick deserves a Nobel Prize in economics for inventing this measure every bit as much as Daniel Kahneman deserved his for systematizing knowledge of the sorts of reasoning deficits that CRT predicts.
Nevertheless, CRT is just not as useful for the study of cognition as it ought to be.
The problem is not that the correct answers to its three items are too likely to be known at this point by MTurk workers—whose scores exceed those of MIT undergraduates (Chandler, Mueller & Paolacci 2014).
The problem is that the test is simply too hard for most people: the mean score when it is administered to a general population sample is about 0.65 correct responses out of a possible 3 (Kahan 2013; Weller, Dieckmann, Tusler, Mertz, Burns & Peters 2012; Campitelli & Labollita 2010). A mean that low reflects a distribution in which the most common score, by a wide margin, is 0.
Accordingly, if we want to study how individual differences in System 1 vs. System 2 reasoning styles interact with other dynamics—like motivated reasoning—or respond to interventions designed to improve engagement with technical information, then for roughly half the population CRT necessarily gives us zero information: everyone stuck at the floor of the scale looks exactly the same.
Unless one makes the exceedingly implausible assumption that there's no variance to measure among this huge swath of people, this is a severe limitation on the value of the measure.
I've addressed this previously on this blog, but I had occasion to underscore and elaborate on the point recently in correspondence with a friend who does outstanding work in the study of cognition and who (with good reason) is a big fan of CRT.
Here are some of the points I made:
I don’t doubt that CRT measures the disposition to use System 2 information processing more faithfully than, say, Numeracy [a scale that assesses proficiency in quantitative reasoning].
But the fact remains that Numeracy outperforms CRT in predicting exactly what CRT is supposed to predict—namely, vulnerability to heuristic biases (Weller et al. 2012; Liberali et al. 2011). Numeracy is getting a bigger piece of the latent disposition that CRT measures—and that's strong evidence of the need for a better CRT.
Or consider the Ordinary Science Intelligence assessment, “OSI_2.0,” the most recent version of a scale I've been working on to measure a disposition to recognize and give appropriate effect to scientific information relevant to ordinary, everyday decisions (Kahan 2014).
Cognitive reflection is among the combination of reasoning proficiencies that this (unidimensional) disposition comprises.
But for sure, I didn't construct OSI_2.0 to be "CRT_2.0." I created it to help me & others do a better job of investigating the relationship between science comprehension and dynamics that constrain the effectiveness of public science communication.
With Item Response Theory, one can assess scale reliability continuously along the range of the underlying latent disposition (DeMars 2010). Doing so for OSI_2.0 shows that what CRT contributes to OSI_2.0's measurement precision is concentrated at the very upper end of the range of the "ordinary science intelligence" aptitude.
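For readers who want to see the mechanics, here is a minimal sketch in Python of how that kind of analysis works. Everything in it is illustrative: the item parameters are invented, not the estimated OSI_2.0 ones. Under a two-parameter logistic (2PL) IRT model, an item's Fisher information at ability level theta is a^2 * P(theta) * (1 - P(theta)), so a battery made up entirely of hard items (like CRT's) necessarily concentrates its measurement precision at the top of the trait:

```python
import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model
    (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 121)  # range of the latent disposition

# Invented parameters: three uniformly hard items (CRT-like) vs. a
# battery that adds easier items spread across the range.
hard_only = [(2.0, 1.5), (2.0, 1.8), (1.8, 2.1)]
mixed = hard_only + [(1.5, -1.5), (1.5, -0.5), (1.5, 0.5)]

info_hard = sum(item_information(theta, a, b) for a, b in hard_only)
info_mixed = sum(item_information(theta, a, b) for a, b in mixed)

# Test information (the sum of the item curves) tells you where on the
# trait the scale measures precisely -- and where it is flying blind.
for t in (-1.0, 0.0, 2.0):
    i = int(np.searchsorted(theta, t))
    print(f"theta {t:+.1f}: hard-only info {info_hard[i]:.2f}, "
          f"mixed info {info_mixed[i]:.2f}")
```

A battery consisting only of hard items supplies almost no information below the population mean (theta = 0)—which is the IRT way of saying what the scatter plots discussed below say visually.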
This feature can be shown to make CRT less effective at what it is supposed to do—viz., predict individual differences in the disposition to resist over-reliance on heuristic processing.
The covariance problem is considered diagnostic of that sort of disposition (Stanovich 2009, 2011). Those vulnerable to over-reliance on heuristic processing tend to make snap judgments based on the relative magnitudes of the numbers in “cell A” and either “cell B” or “cell C” in a 2x2 contingency table or equivalent. Because they don't go to the trouble of comparing the ratio of A to B with the ratio of C to D, people draw faulty inferences about the significance of the information presented (Arkes & Harkness 1983).
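To make the bias concrete, here's a toy version of that comparison (the cell counts are hypothetical, chosen only for illustration):

```python
# A 2x2 contingency table: does a treatment work?
#                 improved   did not improve
# treated          A = 200       B = 100
# untreated        C =  50       D =  20
A, B, C, D = 200, 100, 50, 20

# System 1 snap judgment: cell A dwarfs cells B and C, so the
# treatment "obviously" works.
snap_judgment = A > B

# System 2 judgment: compare the ratio of A to B with the ratio of C to D.
treated_ratio = A / B      # 2.0 improvers per non-improver when treated
untreated_ratio = C / D    # 2.5 improvers per non-improver when untreated
considered_judgment = treated_ratio > untreated_ratio

print(f"snap judgment says effective: {snap_judgment}")
print(f"ratio comparison says effective: {considered_judgment}")
# Here the snap judgment says "yes," while the ratio comparison reveals
# that the untreated group actually fared better -- the classic
# covariance-detection error.
```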
As it should, CRT predicts resistance to this bias (Toplak, West & Stanovich 2011).
But not as well as OSI_2.0.
These are scatter plots of performance on the covariance problem (N = 400 or so) in relation to OSI_2.0 & CRT, respectively, w/ lowess regression plots superimposed.
The crook in the profile of the OSI_2.0 plot, compared to the flat, boring profile of the CRT plot, shows that the former has superior discrimination (that is, it identifies in a more fine-grained way how differences in reasoning ability affect the probability of getting the right answer).
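(For the curious: this kind of diagnostic takes only a few lines to produce. The sketch below uses the lowess smoother from statsmodels on simulated data, since the actual dataset isn't reproduced here; the data-generating assumptions are mine, purely for illustration.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-ins for the real data: a continuous scale score and a
# binary correct/incorrect response whose probability rises with the score.
score = rng.normal(0, 1, 400)                      # e.g., a z-scored scale
p_correct = 1 / (1 + np.exp(-1.5 * (score - 0.5)))
correct = rng.binomial(1, p_correct)               # covariance-problem outcome

# Lowess traces out Pr(correct) as a smooth function of the score --
# the superimposed regression lines in the scatter plots.
fit = sm.nonparametric.lowess(correct, score, frac=0.6)
for x, y in fit[::80]:
    print(f"score {x:+.2f} -> Pr(correct) ~ {y:.2f}")
```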
Relatedly, the interspersing of the color-coded observations on the OSI_2.0 scatter plot shows how CRT is dividing people into groups that are both under- & over-inclusive w/r/t the proficiencies that OSI_2.0 is sorting out more reliably.
Or more concretely still, if I had only CRT, then I'd predict that there is only a 40% probability that someone who is +1 on OSI_2.0 -- just short of "1" on CRT -- would get the covariance problem correct, when in fact the probability such a person will get the right answer is about 60%.
Similarly, if I used CRT to predict how someone at +1.5 on OSI_2.0 is likely to do on the problem, I'd predict about a 50% probability of him or her selecting the correct response -- when in fact the probability of a correct response for that person is closer to 75%.
Essentially, I'm going to be as satisfied with CRT as I am with OSI_2.0 only if my interest is to predict performance of those who score either 2 or 3 on CRT -- the 90th percentile or above in a general population sample.
But as can be seen from the OSI_2.0 scatter plot, it’s simply not the case that there’s no variance in people’s vulnerability to this particular heuristic bias in the rest of the population. A measure that doesn't permit examination of how so substantial a fraction of the population thinks should really disappoint cognitive psychologists, assuming their goal is to study critical reasoning in human beings generally.
Now, it's absolutely no surprise that OSI_2.0 dominates CRT in this regard: the CRT items are all members of the OSI_2.0 scale, which comprises 18 items whose covariance structure is consistent with measurement of a unidimensional latent disposition. So of course OSI_2.0 is going to be a more discerning measure of whatever it is CRT is itself measuring -- even if OSI_2.0 isn't faithfully measuring only that, as CRT presumably is.
But that’s the point: we need a “better” CRT—one that is as tightly focused as the current version on the construct the scale is supposed to measure but that gets at least as big a piece of the underlying disposition as OSI_2.0, Numeracy or other scales that outperform CRT in predicting resistance to heuristic biases.
For that, "CRT 2.0" is going to need not only more items but items that add information to the scale in the middle and lower levels of the disposition that CRT is assessing. IRT is much more suited for identifying such items than are the methods that those working on CRT scale development now seem to be employing.
I could certainly understand why a researcher might not want a scale with as many as 18 items.
But again IRT can help here: use it to develop a longer, comprehensive battery of such items, ones that cover a large portion of the range of the relevant disposition. Then administer an "adaptive testing" battery that uses strategically selected subsets of items to zero in on any individual test-taker’s location on the range of the measured "cognitive reflection" disposition (DeMars 2010). Presumably, no one would need to answer more than a half dozen items in order to enable a very precise measure of his or her proficiency -- assuming one has a good set of items in the adaptive testing battery.
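Here is a bare-bones sketch of that adaptive logic under a 2PL model. The item bank, its parameters, and the estimator details are all hypothetical stand-ins; a real computerized adaptive test would draw on a properly calibrated battery:

```python
import numpy as np

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

def eap_estimate(responses, grid=np.linspace(-4, 4, 161)):
    """Expected-a-posteriori theta given (a, b, answered-correctly)
    triples, with a standard-normal prior on the latent disposition."""
    log_post = -0.5 * grid**2
    for a, b, correct in responses:
        p = p_2pl(grid, a, b)
        log_post += np.log(p if correct else 1.0 - p)
    post = np.exp(log_post - log_post.max())
    return float((grid * post).sum() / post.sum())

# Hypothetical calibrated bank spanning easy through hard items.
bank = [(1.8, b) for b in np.linspace(-2.0, 2.5, 19)]

def run_cat(true_theta, n_items=6, rng=np.random.default_rng(1)):
    remaining, responses, theta_hat = list(bank), [], 0.0
    for _ in range(n_items):
        # Pick the unused item most informative at the current estimate.
        a, b = max(remaining, key=lambda ab: item_info(theta_hat, *ab))
        remaining.remove((a, b))
        correct = rng.random() < p_2pl(true_theta, a, b)
        responses.append((a, b, correct))
        theta_hat = eap_estimate(responses)
    return theta_hat

print("true theta -0.5, estimated", round(run_cat(-0.5), 2))
```

The design choice doing the work is the selection rule: each round, the test administers whichever unused item is most informative at the current ability estimate, so respondents at the floor, the middle, and the ceiling of the disposition all end up answering items pitched at their own level.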
Anyway, I just think it is obvious that researchers here can and should do better--and not just b/c MTurk workers have all learned at this point that the ball costs 5 cents!
Arkes, H.R. & Harkness, A.R. Estimates of Contingency Between Two Dichotomous Variables. J. Experimental Psychol. 112, 117-135 (1983).
Campitelli, G. & Gerrans, P. Does the cognitive reflection test measure cognitive reflection? A mathematical modeling approach. Memory & Cognition, 1-14 (2013).
Campitelli, G. & Labollita, M. Correlations of cognitive reflection with judgments and choices. Judgment and Decision Making 5, 182-191 (2010).
Chandler, J., Mueller, P. & Paolacci, G. Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior research methods 46, 112-130 (2014).
DeMars, C. Item response theory (Oxford University Press, Oxford; New York, 2010).
Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).
Kahan, D.M. "Ordinary Science Intelligence: A Science Comprehension Measure for Use in the Study of Science Communication, with Notes on "Belief in" Evolution and Climate Change. CCP Working Paper No. 112 (2014).
Liberali, J.M., Reyna, V.F., Furlan, S., Stein, L.M. & Pardo, S.T. Individual Differences in Numeracy and Cognitive Reflection, with Implications for Biases and Fallacies in Probability Judgment. Journal of Behavioral Decision Making (2011).
Stanovich, K.E. Rationality and the reflective mind (Oxford University Press, New York, 2011).
Stanovich, K.E. What intelligence tests miss: the psychology of rational thought (Yale University Press, New Haven, 2009).
Toplak, M.E., West, R.F. & Stanovich, K.E. The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory & Cognition 39, 1275-1289 (2011).
Weller, J.A., Dieckmann, N.F., Tusler, M., Mertz, C., Burns, W.J. & Peters, E. Development and testing of an abbreviated numeracy scale: A Rasch analysis approach. Journal of Behavioral Decision Making (2012).