Covers familiar ground for the 14.6 billion regular readers of this blog, but for the benefit of the 2 or so billion nonregulars who tune in on a given day, here is a portion of the Measurement Problem paper exposing the invalidity of the NSF Science Indicators’ “evolution” measure. What is obsessing and confounding me, as I indicated in the recent “What exactly is going on in their heads?” post, is how to understand and make sense of the perspective of the “knowing disbeliever”: in that context, the individual who displays high comprehension of the mechanisms and consequences of human-caused climate change but “disbelieves it”; here, the bright student who (unlike the vast majority of people who say they “believe in” evolution) displays comprehension of the modern synthesis, and who might well go on to be a scientist or other professional who uses such knowledge, but who nevertheless “disbelieves” evolution. . . .
2. What does “belief in evolution” measure?
But forget climate change for a moment and consider instead another controversial part of science: the theory of evolution. Around once a year, Gallup or another major commercial survey firm releases a poll showing that approximately 45% of the U.S. public rejects the proposition that human beings evolved from another species of animal. The news is inevitably greeted by widespread expressions of dismay from media commentators, who lament what this finding says about the state of science education in our country.
Actually, it doesn’t say anything. There are many ways to assess the quality of instruction that U.S. students receive in science. But what fraction of them say they “believe” in evolution is not one of them.
Numerous studies have found that profession of “belief” in evolution has no correlation with understanding of basic evolutionary science. Individuals who say they “believe” are no more likely than those who say they “don’t” to give the correct responses to questions pertaining to natural selection, random mutation, and genetic variance—the core elements of the modern synthesis (Shtulman 2006; Demastes, Settlage & Good 1995; Bishop & Anderson 1990).
Nor can any valid inference be drawn about a U.S. survey respondent’s profession of “belief” in human evolution and his or her comprehension of science generally. The former is not a measure of the latter.
To demonstrate this point requires a measure of science comprehension. Since Dewey (1910), general education has been understood to have the aim of imparting the capacity to recognize and use pertinent scientific information in ordinary decisionmaking: personal, professional, and civic (Baron 1993). Someone who attains this form of “ordinary science intelligence” will no doubt have acquired knowledge of a variety of important scientific findings. But to expand and use what she knows, she will also have to possess certain qualities of mind: critical reasoning skills essential to drawing valid inferences from evidence; a faculty of cognitive perception calibrated to discerning when a problem demands such reasoning; and the intrinsic motivation to perform the effortful information processing such analytical tasks entail (Stanovich 2011).
The aim of a valid science comprehension instrument is to measure these attributes. Rather than certifying familiarity with some canonical set of facts or abstract principles, we want satisfactory performance on the instrument to vouch for an aptitude comprising the “ordinary science intelligence” combination of knowledge, skills, and dispositions.
Such an instrument can be constructed by synthesizing items from standard “science literacy” and critical reasoning measures (cf. Kahan, Peters et al. 2012). These include the National Science Foundation’s Science Indicators (2014) and Pew Research Center’s “Science and Technology” battery (2013), both of which emphasize knowledge of core scientific propositions from the physical and biological sciences; the Lipkus/Peters Numeracy scale, which assesses quantitative reasoning proficiency (Lipkus et al. 2001; Peters et al. 2006; Weller et al. 2012); and Frederick’s Cognitive Reflection Test, which measures the disposition to consciously interrogate intuitive or pre-existing beliefs in light of available information (Frederick 2005; Kahneman 1998).
The resulting 18-item “Ordinary Science Intelligence” scale is highly reliable (α = 0.83) and displays a unidimensional covariance structure when administered to a representative general population sample (N = 2000). Scored with Item Response Theory to enhance its discrimination across the range of the underlying latent (not directly observable) aptitude that it can be viewed as measuring, OSI strongly predicts proficiency on tasks such as covariance detection, a form of reasoning elemental to properly drawing causal inferences from data (Stanovich 2009). It also correlates (r = 0.40, p < 0.01) with Baron’s Actively Open-minded Thinking test, which measures a person’s commitment to applying her analytical capacities to find and properly interpret evidence (Haron, Ritov & Mellers 2013; Baron 2008).
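For readers unfamiliar with the reliability statistic reported above, here is a minimal sketch of how Cronbach’s α is computed from a respondents-by-items score matrix. The function name `cronbach_alpha` and the toy data are illustrative assumptions; this is not the paper’s data or analysis code.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 5 respondents, 4 right/wrong items.
toy = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
print(round(cronbach_alpha(toy), 2))
```

The statistic rises toward 1 as the items covary more strongly relative to their individual variances, which is why a high α is read as evidence that the items are tapping a common quantity.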
Consistent with the goal of discerning differing levels of this proficiency (Embretson & Reise 2000), OSI contains items that span a broad range in difficulty. For example, the NSF Indicator Item “Electrons”—“Electrons are smaller than atoms—true or false?”—is comparatively easy (Figure 1). Even at the mean level of science comprehension, test-takers from a general population sample are approximately 70% likely to get the “right” answer. Only someone a full standard deviation below the mean is more likely than not to get it wrong.
“Nitrogen,” the Pew multiple choice item on which gas is most prevalent in the atmosphere, is relatively difficult (Figure 1). Someone with a mean OSI score is only about 20% likely to give the correct response. A test taker has to possess an OSI aptitude one standard deviation above the mean before he or she is more likely than not to supply the correct response.
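The difficulty pattern described for these two items can be sketched with a two-parameter logistic (2PL) item response function. The text does not say which IRT model was fit, so the 2PL form and the parameter values below are assumptions, back-solved only to reproduce the approximate probabilities quoted above (about 70% and 20% at the mean, and 50% at one standard deviation below and above the mean, respectively).

```python
import math


def irf(theta, a, b):
    """2PL item response function: P(correct | theta).

    theta: latent aptitude in standard-deviation units (0 = mean)
    a: discrimination (slope); b: difficulty (theta at which P = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical parameters chosen to match the figures quoted in the text.
ELECTRONS = {"a": math.log(7 / 3), "b": -1.0}  # easy: 50% at -1 SD
NITROGEN = {"a": math.log(4), "b": 1.0}        # hard: 50% at +1 SD

for name, item in (("Electrons", ELECTRONS), ("Nitrogen", NITROGEN)):
    probs = [round(irf(t, **item), 2) for t in (-1, 0, 1)]
    print(name, probs)
```

In this parameterization a flat curve (slope `a` near zero) would describe an item whose answers do not track the aptitude at all.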
“Conditional Probability” is a Numeracy battery item (Weller et al. 2012). It requires a test-taker to determine the probability that a woman who is selected randomly from the population and who tests positive for breast cancer in fact has the disease; to do so, the test-taker must appropriately combine information about the population frequency of breast cancer with information about the accuracy rate of the screening test. A problem that assesses facility in drawing the sort of inferences reflecting the logic of Bayes’s Theorem, Conditional Probability turns out to be super hard. At the mean level of OSI, there is virtually no chance a person will get this one right. Even those over two standard deviations above the mean are still no more likely to get it right than to get it wrong (Figure 1).
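The kind of calculation this item calls for can be sketched as follows. The base rate and accuracy figures are hypothetical stand-ins, not the numbers used in the actual Numeracy item, and the function name is my own.

```python
def posterior_given_positive(base_rate, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes's theorem."""
    true_pos = base_rate * sensitivity                     # has disease, tests positive
    false_pos = (1.0 - base_rate) * false_positive_rate    # healthy, tests positive
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs: 1% prevalence, 90% sensitivity, 9% false-positive rate.
p = posterior_given_positive(base_rate=0.01, sensitivity=0.90,
                             false_positive_rate=0.09)
print(f"{p:.3f}")
```

With these made-up inputs the posterior comes out at roughly 9%, far below the 90% accuracy figure that intuition tends to latch onto, which is one way to see why an item of this type is so hard.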
With this form of item response analysis (Embretson & Reise 2000), we can do two things. One is to identify invalid items: ones that don’t genuinely measure the underlying disposition in an acceptably discerning manner. We’ll recognize an invalid item if the probability of answering it correctly doesn’t bear the sort of relationship with OSI that valid items do.
The NSF Indicator’s “Evolution” item—“human beings, as we know them today, developed from earlier species of animals, true or false?”—is pretty marginal in that regard. People who vary in science comprehension, we’ve seen, vary correspondingly in their ability to answer questions that pertain to their capacity to recognize and give effect to valid empirical evidence. The probability of getting the answer “right” on “Evolution,” in contrast, varies relatively little across the range of OSI (Figure 1). In addition, the probability of getting the right answer is relatively close to 50% at both one standard deviation below and one standard deviation above the OSI mean, as well as at every point in between. The relative unresponsiveness of the item to differences in science comprehension, then, is reason to infer that it is either not measuring anything or is measuring something that is independent of science comprehension.
Second, item-response functions can be used to identify items that are “biased” in relation to a subgroup. “Bias” in this context is used not in its everyday moral sense, in which it connotes animus, but rather in its measurement sense, where it signifies a systematic skew toward either high or low readings in relation to the quantity being assessed. If an examination of an item’s response profile shows that it tracks the underlying latent disposition in one group but not in another, then that item is biased in relation to members of the latter group—and thus not a valid measure of the disposition for a test population that includes them (Osterlind & Everson 2009).
That’s clearly true for the NSF’s Evolution item as applied to individuals who are relatively religious. Such individuals—who we can identify with a latent disposition scale that combines self-reported church attendance, frequency of prayer, and perceived importance of religion in one’s life (α = 0.86)—respond the same as relatively nonreligious ones with respect to Electron, Nitrogen, and Conditional Probability. That is, in both groups, the probability of giving the correct response varies in the same manner with respect to the underlying science comprehension disposition that OSI measures (Figure 2).
Their performance on the Evolution item, however, is clearly discrepant. One might conclude that Evolution is validly measuring science comprehension for non-religious test takers, although in that case it is a very easy question: the likelihood a nonreligious individual with a mean OSI score will get the “right” answer is 80%—even higher than the likelihood that this person would respond correctly to the relatively simple Electron item.
In contrast, for a relatively religious individual with a mean OSI score, the probability of giving the correct response is around 30%. This 50 percentage-point differential tells us that Evolution does not have the same relationship to the latent OSI disposition in these two groups.
Indeed, it is obvious that Evolution has no relation to OSI whatsoever in relatively religious respondents. For such individuals, the predicted probability of giving the correct answer does not increase as individuals display a higher degree of science comprehension. On the contrary, it trends slightly downward, suggesting that religious individuals highest in OSI are even more likely to get the question “wrong.”
It should be obvious but just to be clear: these patterns have nothing to do with any correlation between OSI and religiosity. There is in fact a modest negative correlation between the two (r = -0.17, p < 0.01). But the “differential item functioning” test (Osterlind & Everson 2009) I’m applying identifies differences among religious and nonreligious individuals of the same OSI level. The difference in performance on the item speaks to the adequacy of Evolution as a measure of knowledge and reasoning capacity and not to the relative quality of those characteristics among members of the two groups.
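The logic of comparing groups at the same level of the latent trait can be illustrated with the Mantel-Haenszel common odds ratio, one standard differential-item-functioning statistic. This is a sketch of that technique only, not necessarily the procedure used in the analysis described here; the function name and all counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio of a correct answer, reference vs. focal group.

    strata: one tuple per matched ability level, holding the counts
    (ref_right, ref_wrong, focal_right, focal_wrong).
    A value near 1.0 suggests no DIF; a value far from 1.0 suggests the
    item behaves differently for equally able members of the two groups.
    """
    num = 0.0
    den = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        num += ref_right * focal_wrong / n
        den += ref_wrong * focal_right / n
    return num / den


# Hypothetical counts at three matched score levels (low, middle, high).
unbiased_item = [(20, 30, 20, 30), (35, 15, 35, 15), (45, 5, 45, 5)]
biased_item = [(35, 15, 18, 32), (40, 10, 15, 35), (45, 5, 12, 38)]

print(mantel_haenszel_odds_ratio(unbiased_item))
print(mantel_haenszel_odds_ratio(biased_item))
```

Because each stratum compares only respondents at the same score level, a group difference in overall ability does not by itself move the statistic away from 1.0; only a difference in how the item behaves for equally able respondents does.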
The bias with respect to religious individuals—and hence the invalidity of the item as a measure of OSI for a general population sample—is most striking in relation to respondents’ performance on Conditional Probability. There is about a 70% (± 10 percentage points, at the 0.95 level of confidence) probability that someone two and a quarter standard deviations above the mean on OSI will answer this extremely difficult question correctly. Of course, there aren’t many people two and a quarter standard deviations above the mean (the 99th percentile), but certainly they do exist, and they are not dramatically less likely to be above average in religiosity. Yet if one of these exceptionally science-comprehending individuals is relatively religious, the probability that he or she will give the right answer to the NSF Evolution item is about 25% (± 10 percentage points, at the 0.95 level of confidence)—compared to 80% for the moderately nonreligious person who is merely average in OSI and whose probability of answering Conditional Probability correctly is epsilon.
Under these conditions, one would have to possess a very low OSI score (or a very strong unconscious motivation to misinterpret these results (Kahan, Peters, et al. 2013)) to conclude that a “belief in evolution” item like the one in the NSF Indicators battery validly measures science comprehension in a general population test sample. It is much more plausible to view it as measuring something else: a form of cultural identity that either does or does not feature religiosity (cf. Roos 2012).
One way to corroborate this surmise is to administer to a general population sample a variant of the NSF’s Evolution item designed to disentangle what a person knows about science from who he or she is culturally speaking. When the clause, “[a]ccording to the theory of evolution . . .” introduces the proposition “human beings, as we know them today, developed from earlier species of animals” (NSF 2006, 2014), the discrepancy between relatively religious and relatively non-religious test-takers disappears! Freed from having to choose between conveying what they understand to be the position of science and making a profession of “belief” that denigrates their identities, religious test-takers of varying levels of OSI now respond very closely to how nonreligious ones of corresponding OSI levels do. The profile of the item response curve—a positive slope in relation to OSI for both groups—supports the inference that answering this variant of Evolution correctly occupies the same relation to OSI as do the other items in the scale. However, this particular member of the scale turns out to be even easier—even less diagnostic of anything other than a dismally low comprehension level in those who get it wrong—than the simple NSF Indicator Electron item.
As I mentioned, there is no correlation between saying one “believes” in evolution and meaningful comprehension of natural selection and the other elements of the modern synthesis. Sadly, the proportion who can give a cogent and accurate account of these mechanisms is low among both “believers” and “nonbelievers,” even in highly educated samples, including college biology students (Bishop & Anderson 1990). Increasing the share of the population that comprehends these important—indeed, astonishing and awe-inspiring—scientific insights is very much a proper goal for those who want to improve the science education that Americans receive.
The incidence of “disbelief” in evolution in the U.S. population, moreover, poses no barrier to attaining it. This conclusion, too, has been demonstrated by outstanding empirical research in the field of education science (Lawson & Worsnop 1992). The most effective way to teach the modern synthesis to high school and college students who “do not believe” in evolution, this research suggests, is to focus on exactly the same thing one should focus on to teach evolutionary science to those who say they do “believe” but very likely don’t understand it: the correction of various naive misconceptions that concern the tendency of people to attribute evolution not to supernatural forces but to functionalist mechanisms and to the hereditability of acquired traits (Demastes, Settlage & Good 1995; Bishop & Anderson 1990).
Not surprisingly, the students most able to master the basic elements of evolutionary science are those who demonstrate the highest proficiency in the sort of critical reasoning dispositions on which science comprehension depends. Yet even among these students, learning the modern synthesis does not make a student who started out professing “not to believe in” evolution any more likely to say she now does “believe in” it (Lawson & Worsnop 1992).
Indeed, treating profession of “belief” as one of the objectives of instruction is thought to make it less likely that students will learn the modern synthesis. “[E]very teacher who has addressed the issue of special creation and evolution in the classroom,” the authors of one study (Lawson & Worsnop 1992, p. 165) conclude,
already knows that highly religious students are not likely to change their belief in special creation as a consequence of relative brief lessons on evolution. Our suggestion is that it is best not to try to [change students’ beliefs], not directly at least. Rather, our experience and results suggest to us that a more prudent plan would be to utilize instruction time, much as we did, to explore the alternatives, their predicted consequences, and the evidence in a hypothetico-deductive way in an effort to provoke argumentation and the use of reflective thought. Thus, the primary aims of the lesson should not be to convince students of one belief or another, but, instead, to help students (a) gain a better understanding of how scientists compare alternative hypotheses, their predicated consequences, and the evidence to arrive at belief and (b) acquire skill in the use of this important reasoning pattern—a pattern that appears to be necessary for independent learning and critical thought.
This research is to the science of science communication’s “measurement problem” what the double slit experiment is to quantum mechanics’. All students, including the ones most readily disposed to learn science, can be expected to protect their cultural identities from the threat that denigrating cultural meanings pose to them. But all such students, all of them, can also be expected to use their reasoning aptitudes to acquire understanding of what is known to science. They can and will do both, at the very same time, but only when the dualistic quality of their reason as collective-knowledge acquirers and identity-protectors is not interfered with by forms of assessment that stray from science comprehension and intrude into the domain of cultural identity and expression. A simple (and simple-minded) test can be expected to force disclosure of only one side of their reason. And what enables the most exquisitely designed course to succeed in engaging the student’s reason as an acquirer of collective knowledge is exactly the care and skill with which the educator avoids provoking the student into using her reason for purposes of identity-protection only.
The items comprising the OSI scale appear in the Appendix. The psychometric performance of the OSI scale is presented in greater detail in Kahan (2014).
Baron, J. Why Teach Thinking? An Essay. Applied Psychology 42, 191-214 (1993).
Bishop, B.A. & Anderson, C.W. Student conceptions of natural selection and its role in evolution. Journal of Research in Science Teaching 27, 415-427 (1990).
Demastes, S.S., Settlage, J. & Good, R. Students’ conceptions of natural selection and its role in evolution: Cases of replication and comparison. Journal of Research in Science Teaching 32, 535-550 (1995).
Dewey, J. Science as Subject-matter and as Method. Science 31, 121-127 (1910).
Embretson, S.E. & Reise, S.P. Item response theory for psychologists (L. Erlbaum Associates, Mahwah, N.J., 2000).
Kahan, D.M. “Ordinary Science Intelligence”: A Science Comprehension Measure for Use in the Study of Risk Perception and Science Communication. Cultural Cognition Project Working Paper No. 112 (2014).
Kahan, D.M., Peters, E., Dawson, E. & Slovic, P. Motivated Numeracy and Enlightened Self Government. Cultural Cognition Project Working Paper No. 116 (2013).
Kahan, D.M., Peters, E., Wittlin, M., Slovic, P., Ouellette, L.L., Braman, D. & Mandel, G. The polarizing impact of science literacy and numeracy on perceived climate change risks. Nature Climate Change 2, 732-735 (2012).
Lawson, A.E. & Worsnop, W.A. Learning about evolution and rejecting a belief in special creation: Effects of reflective reasoning skill, prior knowledge, prior belief and religious commitment. Journal of Research in Science Teaching 29, 143-166 (1992).
Lipkus, I.M., Samsa, G. & Rimer, B.K. General Performance on a Numeracy Scale among Highly Educated Samples. Medical Decision Making 21, 37-44 (2001).
National Science Foundation. Science and Engineering Indicators (Wash. D.C. 2014).
National Science Foundation. Science and Engineering Indicators (Wash. D.C. 2006).
Osterlind, S.J. & Everson, H.T. Differential item functioning (SAGE, Thousand Oaks, Calif., 2009).
Peters, E., Västfjäll, D., Slovic, P., Mertz, C.K., Mazzocco, K. & Dickert, S. Numeracy and Decision Making. Psychol Sci 17, 407-413 (2006).
Pew Research Center for the People & the Press. Public’s Knowledge of Science and Technology (Pew Research Center, Washington D.C., 2013).
Roos, J.M. Measuring science or religion? A measurement analysis of the National Science Foundation sponsored science literacy scale 2006–2010. Public Understanding of Science (2012).
Shuman, H. Interpreting the Poll Results Better. Public Perspective 1, 87-88 (1998).
Stanovich, K.E. What intelligence tests miss : the psychology of rational thought (Yale University Press, New Haven, 2009).
Weller, J.A., Dieckmann, N.F., Tusler, M., Mertz, C., Burns, W.J. & Peters, E. Development and testing of an abbreviated numeracy scale: A Rasch analysis approach. Journal of Behavioral Decision Making 26, 198-212 (2012).