As the 14 billion readers of this blog are aware, I’ve been working for the last 37 years—making steady progress all the while—on developing a “public science comprehension measure” suited for use in the study of public risk perception and science communication.
The most recent version of the resulting scale—“Ordinary Science Intelligence 2.0” (OSI_2.0)—informs the study reported in Climate Science Communication and the Measurement Problem. That paper also presents the results of a prototype public climate-science comprehension instrument, the “Ordinary Climate Science Intelligence” assessment (OCSI_0.01).
Both scales were developed and scored using Item Response Theory.
Since I’m stuck on an 18-hour flight to Australia & don’t have much else to do (shouldn’t we touch down in Macao or the Netherlands Antilles or some other place with a casino to refuel?!), I thought I’d post something (something pretty basic, but the internet is your oyster if you want more) on IRT and how cool it is.
Like other scaling strategies, IRT conceives of responses to questionnaire items as manifest or observable indicators of an otherwise latent or unobserved disposition or capacity. When the items are appropriately combined, the resulting scale will be responsive to the items’ covariance, which reflects their shared correlation with the latent disposition. At the same time, the scale will be relatively unaffected by the portions of variance in each item that are random in relation to the latent disposition and that should more or less cancel each other out when the items are aggregated.
By concentrating the common signal associated with the items and muting the noise peculiar to each, the scale furnishes a more sensitive measure than any one item (DeVellis 2012).
While various scaling methods tend to differ in the assumptions they make about the relative strength or weight of individual items, nearly all treat items as making fungible contributions to measurement of the latent variable conceived of as some undifferentiated quantity that varies across persons.
IRT, in contrast, envisions the latent disposition as a graded continuum along which individuals can be arrayed. It models the individual items as varying in measurement precision across the range of that continuum, and weights the items appropriately in aggregating responses to them to form a scale (Embretson & Reise 2000).
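In the widely used two-parameter logistic (2PL) IRT model, for instance, the probability of a correct response is a logistic function of the gap between the respondent's latent trait level θ and the item's difficulty b, scaled by the item's discrimination a. A minimal sketch in Python (the parameter values are invented for illustration, not drawn from any actual scale):

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer
    given latent trait theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An easy item (b = -1) vs. a hard item (b = +2), equal discrimination:
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta,
          round(p_correct(theta, 1.5, -1.0), 2),   # easy item
          round(p_correct(theta, 1.5, 2.0), 2))    # hard item
```

By construction, a respondent whose θ equals an item's difficulty b has exactly a 50% chance of answering it correctly—which is how "difficulty" gets located on the same continuum as the disposition itself.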
The difference in these strategies will matter most when the point of making measurements is not simply to characterize the manner in which the latent disposition (“cultural individualism,” say) varies relative to one or another individual characteristic within a sample (“global warming risk concern”) but to rank particular sample members (“law school applicants”) in relation to the disposition (“critical reasoning ability”).
In the former case, I’ll do fine with measures that enable me to sum up the “amount” of the disposition across groups and relate them to levels of some covariate of interest. But in the latter case I’ll also value measures that enable me to discriminate between varying levels of the disposition at all the various points where accurate sorting of the respondents or test takers matters to me.
IRT is thus far and away the dominant scaling strategy in the design and grading of standardized knowledge assessments, which are all about ranking individuals in relation to some aptitude or skill of interest.
Not surprisingly, then, if one is trying to figure out how to create a valid public science comprehension instrument, one can learn a ton from looking at the work of researchers who use IRT to construct standardized assessments.
Indeed, it’s weird to me, as I said in a previous post, that the development of public science comprehension instruments like the NSF Indicators (2014: ch. 7)—and research on public understanding of science generally—has made so little use of this body of knowledge.
I used IRT to help construct OSI_2.0.
Below are the “item response curves” of four OSI_2.0 items, calibrated to the ability level of a general population sample. The curves (derived via a form of logistic regression) plot the probability of getting the “correct” answer to the specified items in relation to the latent “ordinary science intelligence” disposition. (If you want item wording, check out the working paper.)
One can see the relative “difficulty” of these items by observing the location of their respective “response curves” in relation to the y-axis: the further to the right, the “harder” it is.
Accordingly, “Prob1_nsf,” one of the NSF Indicators “science methods” questions, is by far the easiest: a test taker has to be about one standard deviation below the mean on OSI before he or she is more likely than not to get this one wrong.
“Cond_prob,” a Bayesian conditional probability item from the Lipkus/Peters Numeracy battery, is hardest: one has to have a total score two standard deviations above the mean before one has a better than 50% chance of getting this one right (why are conditional probability problems so hard? SENCER should figure out how to teach teachers to teach Bayes’s Theorem more effectively!).
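For anyone wondering what makes such items so punishing, here's a worked conditional-probability problem of the same general type (the numbers are my own, not from the Lipkus/Peters battery):

```python
# A classic conditional-probability setup: 1% of patients have a
# disease; the test is 90% sensitive with a 10% false-positive rate.
# Given a positive result, what's the probability of disease?

p_disease = 0.01
sensitivity = 0.90      # P(positive | disease)
false_pos = 0.10        # P(positive | no disease)

# Bayes's Theorem: P(disease | positive)
#   = P(positive | disease) * P(disease) / P(positive)
p_positive = sensitivity * p_disease + false_pos * (1 - p_disease)
posterior = sensitivity * p_disease / p_positive
print(round(posterior, 3))  # 0.083 -- far lower than most people guess
```

Most test takers anchor on the 90% sensitivity and ignore the base rate, which is exactly the kind of error these items are designed to detect.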
“Copernicus_nsf,” the “earth around the sun or sun around the earth?” item, and “Widgets_CRT,” a Cognitive Reflection Test item, are in between.
It's because IRT scoring weights items in relation to their difficulty—and, if one desires, in relation to their “discrimination,” which refers to the steepness of the item-response curve slope (the steeper the curve, the more diagnostic a correct response is of the disposition level of the respondent)—that one can use it to gauge a scale's variable measurement precision across the range of the relevant latent disposition.
All 4 of these OSI_2.0 items are reliable indicators of the latent disposition in question (if they weren’t, the curves would be flatter). But because they vary in difficulty, they generate more information about the relative level of OSI among heterogeneous test takers than would a scale that consisted, say, of four items of middling difficulty, not to mention four that were all uniformly easy or hard.
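In the 2PL model, the “information” an item contributes at trait level θ works out to a²·P(θ)·(1 − P(θ)): it peaks where θ equals the item's difficulty b and grows with the square of the discrimination a. A quick sketch (parameters again invented for illustration):

```python
import math

def p2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P).
    Peaks at theta = b; grows with the square of a."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# A steep (a = 2.0) vs. a flat (a = 0.8) item, both of difficulty b = 0:
print(round(item_info(0.0, 2.0, 0.0), 2))   # 1.0  -- peak information
print(round(item_info(0.0, 0.8, 0.0), 2))   # 0.16 -- flatter curve, less info
print(round(item_info(1.5, 2.0, 0.0), 2))   # 0.18 -- info decays away from b
```

This is why a perfectly flat response curve would be useless: P·(1 − P) never varies with θ, so the item tells you nothing about where a respondent sits on the continuum.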
The figures illustrate the variable measurement precision of two instruments: the NSF Indicators battery, formed by combining its nine “factual knowledge” and three “science methods” items; and a long (10-item) version of Frederick’s Cognitive Reflection Test (Frederick 2005).
The “Test Information Curves” plotted in the left panel illustrate the relative measurement precision of each in relation to the latent dispositions each is measuring. Note, the disposition isn’t the same one for both scales; by plotting the curves on one graph, I am enabling comparative assessment of the measurement precision of the two instruments in relation to the distinct latent traits that they respectively assess.
“Information” units are the inverse of the scale's measurement variance—a concept that I think isn’t particularly informative (as it were) for those who haven’t used IRT extensively enough to experience the kind of cognitive rewiring that occurs as one becomes proficient with a statistical tool.
So the right-hand panel conveys the same information for each assessment in the form of a variable “reliability coefficient.” It’s not the norm for IRT write-ups, but I think it’s easier for reflective people generally to grasp.
The reliability coefficient is conceptualized as the proportion of the variance in the observed score that can be attributed to variance in the "true score" or actual disposition levels of the examined subjects. A test that was perfectly reliable—that had no measurement error in relation to the latent disposition—would have a coefficient of 1.0.
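One common way to make the conversion—assuming the latent trait is scaled to unit variance—is r(θ) = I(θ)/(I(θ) + 1): the true-score variance as a share of true-score plus error variance, where the error variance is the inverse of the information. A two-line sketch:

```python
def reliability(info):
    """Conditional reliability from test information, assuming the
    latent trait has variance 1: r = I / (I + 1). Error variance
    is 1/I, so this is true variance over total variance."""
    return info / (info + 1.0)

print(round(reliability(9.0), 2))    # 0.9
print(round(reliability(2.33), 2))   # 0.7
```

So an information level of roughly 2.3 corresponds to a conditional reliability of about 0.7 at that point on the trait continuum.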
Usually 0.7 is considered decent enough, although for “high stakes” testing like the SAT, 0.8 would probably be the lowest anyone would tolerate.
Ordinarily, when one is assessing the performance of a latent-variable scale, one would have a reliability coefficient—like Cronbach’s α, something I’ve mentioned now and again—that characterizes the measurement precision of the instrument as a whole.
But with IRT, the reliability coefficient is a continuous variable: one can compute it—and hence gauge the measurement precision of the instrument—at any specified point along the range of the latent disposition the instrument is measuring.
What one can see from the Figure, then, is that these two scales, while comparable in “reliability,” actually radically differ with respect to the levels of the latent disposition in relation to which they are meaningfully assessing individual differences.
The NSF Indicators battery is concentrating most of its discrimination within the space between -1.0 and -2.0 SDs. So it will do a really really good job in distinguishing people who are merely awful from those who are outrageously awful.
You can be pretty confident that someone who scores above the mean on the test is at least average. But the measurement beyond that is so pervaded with error as to make it completely arbitrary to treat differences in scores as representing genuinely different levels in ability.
The test is just too darn easy!
This is one of the complaints that people who study public science comprehension have about the Indicators battery (but one they don’t voice nearly as often as they ought to).
The CRT has the opposite problem!
If you want to separate out Albert Einstein from Johnny von Neumann, you probably can with this instrument! (Actually, you will be able to do that only if “cognitive reflection” is the construct that corresponds to what makes them geniuses; that’s doubtful.) The long CRT furnishes a high degree of measurement reliability way out into the Mr. Spock Zone of +3 SDs, where only about 0.1% of the human population (roughly 1 person in 1,000) hangs out.
In truth, I can’t believe that there really is any value in distinguishing between levels of reflection beyond +2.0 (about the 98th percentile) if one is studying the characteristics of critical reasoning in the general population. Indeed, I think you can do just fine in investigating critical reasoning generally, as opposed to grading exams or assessing admissions applications etc., with an instrument that maintains its reliability out to 1.8 (96th percentile).
There’d be plenty of value for general research purposes, however, in being able to distinguish people whose cognitive reflection level is a respectable average from those whose level qualifies them as legally brain dead.
But you can’t with this instrument: there’s zero discrimination below the population mean.
Too friggin’ hard!
The 10-item battery was supposed to remedy this feature of the standard 3-item version but really doesn't do the trick—because the seven new items were all comparably difficult to the original three.
Now, take a look at this:
These are the test information and IRT reliability coefficients for OSI 2.0 as well as for each of the different sets of items it comprises.
The scale has its highest level of precision at about +1 SD, but has relatively decent reliability continuously from -2.0 to +2.0. It accomplishes that precisely because it combines sets of items that vary in difficulty. This is all very deliberate: using IRT in scale development made it possible to select an array of items from different measures to attain decent reliability across the range of the latent "ordinary science intelligence" disposition.
Is it “okay” to combine the measures this way? Yes, but only if it is defensible to understand them as measuring the same thing—a single, common latent disposition.
That’s a psychometric quality of a latent-variable measurement instrument that IRT presupposes (or in any case can’t itself definitively establish), so one uses different tools to assess it.
Factor analysis, the uses and abuses of which I’ve also discussed a bit before, is one method of investigating whether a set of indicators measure a single latent variable.
I’ve gone on too long—we are almost ready to land!—to say more about how it works (and how it doesn’t work if one has a “which button do I push” conception of statistics). But just to round things out, here is the output from a common-factor analysis (CFA) of OSI_2.0.
It suggests that a single factor or unobserved variable accounts for 87% of the variance in responses to the items, as compared to a residual second factor that explains another 7%. That’s pretty strong evidence that treating OSI_2.0 as a “unidimensional” scale—or a measure of a single latent disposition—is warranted.
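For the flavor of where such a number comes from: the variance attributable to the first factor is the dominant eigenvalue of the (communality-adjusted) inter-item correlation matrix, taken as a share of the total. Here's a crude sketch using an invented correlation matrix and power iteration—a real CFA adjusts the diagonal for communalities and extracts multiple factors, so treat this only as an illustration of the arithmetic:

```python
# Hypothetical inter-item correlation matrix for four indicators:
corr = [
    [1.00, 0.45, 0.40, 0.35],
    [0.45, 1.00, 0.50, 0.42],
    [0.40, 0.50, 1.00, 0.38],
    [0.35, 0.42, 0.38, 1.00],
]

def dominant_eigenvalue(m, iters=200):
    """Power iteration: repeatedly apply the matrix to a vector;
    the normalizing constant converges to the largest eigenvalue."""
    n = len(m)
    v = [1.0] * n
    norm = 1.0
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    return norm

lam = dominant_eigenvalue(corr)
trace = sum(corr[i][i] for i in range(len(corr)))  # eigenvalues sum to trace
print(round(lam / trace, 2))  # first factor's share of total variance
```

When one factor dominates the rest this way—and the residual factors hover near the noise floor—treating the items as indicators of a single latent variable is defensible.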
At this point, the only question is whether what it is measuring is really “ordinary science intelligence,” or the combination of knowledge, motivations, and reasoning dispositions that I’m positing enable an ordinary citizen to recognize and give proper effect to valid scientific evidence in ordinary decisionmaking contexts.
That’s a question about the “external validity” of the scale.
I say something about that, too, in “ ‘Ordinary Science Intelligence’: A Science Comprehension Measure for Use in the Study of Risk Perception and Science Communication,” CCP Working Paper No. 112.
I won’t say more now (they just told us to turn off electronic devices. . .) except to note that to me one of the most interesting questions is whether OSI_2.0 is a measure of ordinary science intelligence or simply a measure of intelligence.
A reflective commentator put this question to me. As I told him/her, that’s a challenging issue, not only for OSI_2.0 but for all sorts of measures that purport to be assessing one or another critical reasoning proficiency . . . .
Holy smokes--is that George Freeman?!
DeVellis, R.F. Scale Development: Theory and Applications (SAGE, Thousand Oaks, Calif., 2012).
Embretson, S.E. & Reise, S.P. Item Response Theory for Psychologists (L. Erlbaum Associates, Mahwah, N.J., 2000).
Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).
National Science Foundation. Science and Engineering Indicators (Wash., D.C., 2014).