Key Insight

Covers familiar ground for the 14.6 billion regular readers of this blog, but for the benefit of the 2 or so billion nonregulars who tune in on a given day, here is a portion of the Measurement Problem paper exposing the invalidity of the NSF Science Indicators’ “evolution” measure.  What is obsessing and confounding me — as I indicated in the recent “What exactly is going on in their heads?” post — is how to understand and make sense of the perspective of the “knowing disbeliever”: in that context, the individual who displays high comprehension of the mechanisms and consequences of human-caused climate change but “disbelieves it”; here, the bright student who (unlike the vast majority of people who say they “believe in” evolution) displays comprehension of the modern synthesis, and who might well go on to be a scientist or other professional who uses such knowledge, but who nevertheless “disbelieves” evolution. . . .

2.  What does “belief in evolution” measure?

But forget climate change for a moment and consider instead another controversial part of science: the theory of evolution. Around once a year, Gallup or another major commercial survey firm releases a poll showing that approximately 45% of the U.S. public rejects the proposition that human beings evolved from another species of animal. The news is inevitably greeted by widespread expressions of dismay from media commentators, who lament what this finding says about the state of science education in our country.

Actually, it doesn’t say anything. There are many ways to assess the quality of instruction that U.S. students receive in science.  But the fraction of them who say they “believe” in evolution is not one of them.

Numerous studies have found that profession of “belief” in evolution has no correlation with understanding of basic evolutionary science. Individuals who say they “believe” are no more likely than those who say they “don’t” to give the correct responses to questions pertaining to natural selection, random mutation, and genetic variance—the core elements of the modern synthesis (Shtulman 2006; Demastes, Settlage & Good 1995; Bishop & Anderson 1990).

Nor can any valid inference be drawn from a U.S. survey respondent’s profession of “belief” in human evolution to his or her comprehension of science generally.  The former is not a measure of the latter.

To demonstrate this point requires a measure of science comprehension.  Since Dewey (1910), general education has been understood to have the aim of imparting the capacity to recognize and use pertinent scientific information in ordinary decisionmaking—personal, professional, and civic (Baron 1993).  Someone who attains this form of “ordinary science intelligence” will no doubt have acquired knowledge of a variety of important scientific findings.  But to expand and use what she knows, she will also have to possess certain qualities of mind: critical reasoning skills essential to drawing valid inferences from evidence; a faculty of cognitive perception calibrated to discerning when a problem demands such reasoning; and the intrinsic motivation to perform the effortful information processing such analytical tasks entail (Stanovich 2011).

The aim of a valid science comprehension instrument is to measure these attributes.  Rather than certifying familiarity with some canonical set of facts or abstract principles, we want satisfactory performance on the instrument to vouch for an aptitude comprising the “ordinary science intelligence” combination of knowledge, skills, and dispositions.

Such an instrument can be constructed by synthesizing items from standard “science literacy” and critical reasoning measures (cf. Kahan, Peters et al. 2012). These include the National Science Foundation’s Science Indicators (2014) and Pew Research Center’s “Science and Technology” battery (2013), both of which emphasize knowledge of core scientific propositions from the physical and biological sciences; the Lipkus/Peters Numeracy scale, which assesses quantitative reasoning proficiency (Lipkus et al. 2001; Peters et al. 2006; Weller et al. 2012); and Frederick’s Cognitive Reflection Test, which measures the disposition to consciously interrogate intuitive or pre-existing beliefs in light of available information (Frederick 2005; Kahneman 1998).

The resulting 18-item “Ordinary Science Intelligence” scale is highly reliable (α = 0.83) and displays a unidimensional covariance structure when administered to a representative general population sample (N = 2000). [1] Scored with Item Response Theory to enhance its discrimination across the range of the underlying latent (not directly observable) aptitude that it can be viewed as measuring, OSI strongly predicts proficiency on tasks such as covariance detection, a form of reasoning elemental to properly drawing causal inferences from data (Stanovich 2009).  It also correlates (r = 0.40, p < 0.01) with Baron’s Actively Open-minded Thinking test, which measures a person’s commitment to applying her analytical capacities to find and properly interpret evidence (Baron, Ritov & Mellers 2013; Baron 2008).
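The reliability figure reported above, α = 0.83, is Cronbach’s alpha. As a rough sketch of how such a coefficient is computed from a matrix of scored item responses (the tiny 0/1 matrix here is made up for illustration, not actual OSI data):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative 0/1 response matrix (5 respondents x 4 items) -- not real data.
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(responses), 2))  # → 0.8
```

The coefficient rises as items covary: respondents who do well on one item tend to do well on the others, which is what a unidimensional scale requires.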

Consistent with the goal of discerning differing levels of this proficiency (Embretson & Reise 2000), OSI contains items that span a broad range in difficulty.  For example, the NSF Indicator Item “Electrons”—“Electrons are smaller than atoms—true or false?”—is comparatively easy (Figure 1). Even at the mean level of science comprehension, test-takers from a general population sample are approximately 70% likely to get the “right” answer.  Only someone a full standard deviation below the mean is more likely than not to get it wrong.

“Nitrogen,” the Pew multiple choice item on which gas is most prevalent in the atmosphere, is relatively difficult (Figure 1).  Someone with a mean OSI score is only about 20% likely to give the correct response. A test-taker has to possess an OSI aptitude one standard deviation above the mean before he or she is more likely than not to supply the correct response.
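The easy-versus-hard contrast can be pictured with a two-parameter logistic item response model, a standard tool for this kind of analysis. The difficulty and discrimination parameters below are assumptions chosen to roughly mimic the patterns just described, not the fitted OSI values:

```python
import math

def p_correct(theta: float, difficulty: float, discrimination: float = 1.7) -> float:
    """Two-parameter logistic item characteristic curve: probability of a
    correct response at latent ability theta (in standard-deviation units)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Assumed parameters: an easy item crosses 50% below the mean ability level,
# a hard item crosses well above it (values chosen for illustration only).
easy, hard = -0.5, 1.0
for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  easy item: {p_correct(theta, easy):.2f}  "
          f"hard item: {p_correct(theta, hard):.2f}")
```

With these assumed values, a test-taker at the mean (theta = 0) answers the easy item correctly about 70% of the time but the hard item only about 15% of the time, echoing the Electrons/Nitrogen contrast.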

“Conditional Probability” is a Numeracy battery item (Weller et al. 2012). It requires a test-taker to determine the probability that a woman who is selected randomly from the population and who tests positive for breast cancer in fact has the disease; to do so, the test-taker must appropriately combine information about the population frequency of breast cancer with information about the accuracy rate of the screening test. A problem that assesses facility in drawing the sort of inferences reflecting the logic of Bayes’s Theorem, Conditional Probability turns out to be super hard. At the mean level of OSI, there is virtually no chance a person will get this one right.  Even those over two standard deviations above the mean are still no more likely to get it right than to get it wrong (Figure 1).
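The arithmetic the item demands is a direct application of Bayes’s Theorem. The prevalence and test-accuracy figures below are illustrative stand-ins, not the actual numbers from the survey item:

```python
# Bayes's Theorem applied to the screening setup described above,
# with illustrative numbers (not the figures from the Weller et al. item):
prevalence = 0.01           # P(disease): 1% of the population has the disease
sensitivity = 0.80          # P(positive test | disease)
false_positive_rate = 0.10  # P(positive test | no disease)

# Total probability of a positive test, summed over both possibilities.
p_positive = (sensitivity * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes's Theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")  # → 0.075
```

Even with a positive test, the probability of disease is only about 7.5%; the low base rate swamps the test’s accuracy, which is precisely the counterintuitive step that makes the item so difficult.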

With this form of item response analysis (Embretson & Reise 2000), we can do two things. One is to identify invalid items—ones that don’t genuinely measure the underlying disposition in an acceptably discerning manner. We’ll recognize an invalid item if the probability of answering it correctly doesn’t bear the sort of relationship with OSI that valid items do.

The NSF Indicator’s “Evolution” item—“human beings, as we know them today, developed from earlier species of animals, true or false?”—is pretty marginal in that regard. People who vary in science comprehension, we’ve seen, vary correspondingly in their ability to answer questions that pertain to their capacity to recognize and give effect to valid empirical evidence. The probability of getting the answer “right” on “Evolution,” in contrast, varies relatively little across the range of OSI (Figure 1). In addition, the probability of getting the right answer is relatively close to 50% at both one standard deviation below and one standard deviation above the OSI mean, as well as at every point in between. The relative unresponsiveness of the item to differences in science comprehension, then, is reason to infer that it is either not measuring anything or is measuring something that is independent of science comprehension.

Second, item-response functions can be used to identify items that are “biased” in relation to a subgroup.  “Bias” in this context is used not in its everyday moral sense, in which it connotes animus, but rather in its measurement sense, where it signifies a systematic skew toward either high or low readings in relation to the quantity being assessed.  If an examination of an item’s response profile shows that it tracks the underlying latent disposition in one group but not in another, then that item is biased in relation to members of the latter group—and thus not a valid measure of the disposition for a test population that includes them (Osterlind & Everson 2009).
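One simple way to see such bias is to compute an item’s empirical response function separately for each group: the proportion answering correctly within each band of the latent aptitude. The function and the synthetic data below are an illustration of the idea, not the actual survey analysis:

```python
import numpy as np

def group_irf(correct, ability, group, bins):
    """Empirical item response function per group: proportion answering
    correctly within each ability bin, computed separately by group."""
    idx = np.digitize(ability, bins)
    return {
        str(g): [float(correct[(group == g) & (idx == b)].mean())
                 for b in range(1, len(bins))]
        for g in np.unique(group)
    }

# Synthetic illustration (not real survey data): in group "B" the item
# ignores ability entirely, so its response curve stays flat.
ability = np.array([-1.5, -0.5, 0.5, 1.5] * 2)
group = np.array(["A"] * 4 + ["B"] * 4)
correct = np.array([0, 0, 1, 1,    # group A: responses track ability
                    0, 0, 0, 0])   # group B: flat -- the mark of a biased item
bins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(group_irf(correct, ability, group, bins))
```

A curve that rises with ability in one group but stays flat in the other is exactly the discrepant pattern the following paragraphs describe for the Evolution item.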

That’s clearly true for the NSF’s Evolution item as applied to individuals who are relatively religious.  Such individuals—whom we can identify with a latent disposition scale that combines self-reported church attendance, frequency of prayer, and perceived importance of religion in one’s life (α = 0.86)—respond the same as relatively nonreligious ones with respect to Electron, Nitrogen, and Conditional Probability. That is, in both groups, the probability of giving the correct response varies in the same manner with respect to the underlying science comprehension disposition that OSI measures (Figure 2).

Their performance on the Evolution item, however, is clearly discrepant. One might conclude that Evolution is validly measuring science comprehension for non-religious test takers, although in that case it is a very easy question:  the likelihood a nonreligious individual with a mean OSI score will get the “right” answer is 80%—even higher than the likelihood that this person would respond correctly to the relatively simple Electron item.

In contrast, for a relatively religious individual with a mean OSI score, the probability of giving the correct response is around 30%.  This 50 percentage-point differential tells us that Evolution does not have the same relationship to the latent OSI disposition in these two groups.

Indeed, it is obvious that Evolution has no relation to OSI whatsoever in relatively religious respondents.  For such individuals, the predicted probability of giving the correct answer does not increase as individuals display a higher degree of science comprehension. On the contrary, it trends slightly downward, suggesting that religious individuals highest in OSI are even more likely to get the question “wrong.”