This one at Annenberg Public Policy Center last week, to discuss progress in one of our collaborative initiatives: evidence-based science documentary filmmaking.
It never fails! My own best efforts (here & here) to explain the startling and increasingly notorious paper by Miller & Sanjurjo have prompted the authors to step forward and try to restore the usual state of perfect comprehension enjoyed by the 14.3 billion regular readers of this blog. They have determined, in fact, that it will take three separate guest posts to undo the confusion, so apparently I've carried out my plan to a [GV]T.
As cool as the result of the M&S paper is, I myself remain fascinated by what it tells us about cognition, particularly among those with exquisitely fine-tuned statistical intuitions. How did the analytical error they uncovered in the classic "hot hand fallacy" studies remain undetected for some thirty years, and why does it continue to provoke stubborn resistance on the part of very very smart people?? To Miller & Sanjurjo's credit, they have happily and persistently shouldered the immense burden of explication necessary to break the grip of the pesky intuition that their result "just can't be right!"
Here’s our plan for the upcoming 3 posts:
In the seminal hot hand fallacy paper, Gilovich, Vallone and Tversky (1985; “GVT”, also see the 1989 Tversky & Gilovich “Cold Facts” summary paper) set out to conduct a truly informative scientific test of hot hand shooting. After studying two types of in-game shooting data, they conducted a controlled shooting study (experiment) with the Cornell University men’s and women’s basketball teams. This was an effective "...method for eliminating the effects of shot selection and defensive pressure" that were present as confounds in their analysis of game data (we will return to the issue of game data in a follow-up post; for now click to the first page of Dixit & Nalebuff’s 1991 classic book “Thinking Strategically”, and this comment on Andrew Gelman’s blog). While the common use of the term “hot hand” shooting is vague and complex, everybody agrees that it refers to a temporary elevation in a player’s ability, i.e. the probability of a successful shot. Because the hot state is unobservable to the researcher (perhaps not to the player/teammate/coach!), we cannot simply measure a player’s probability of success in the hot state; we need an operational definition. A natural idea is to take a streak of sufficient length as a good signal of whether or not a player is in the hot state, and define a player as having the hot hand if his/her probability of success is greater after a streak of successful shots (hits) than after a streak of unsuccessful shots (misses). GVT designed a test for this.
Suppose we wanted to test whether Stephen Curry has the hot hand; how would we apply GVT’s test to Curry? The answer is that we would have Curry attempt 100 shots at locations from which he is expected to have a 50% chance of success (like a coin). Next, we would calculate Curry’s field goal percentage on those shots that immediately follow a streak of successful shots (hits), and test whether it is bigger than his field goal percentage on those shots that immediately follow a streak of unsuccessful shots (misses); the larger the difference that we observe, the stronger the evidence of the hot hand. GVT performed this test on the Cornell players, and found that this difference in field goal percentages was statistically significant for only one of the 26 players (two sample t-test), which is consistent with the chance variation that the coin model predicts.
Now, one can ask oneself: if Stephen Curry doesn’t get hot, that is, for each of his 100 shot attempts he has exactly a 50% chance of hitting his next shot, then what would I expect his field goal percentage to be when he is on a streak of three (or more) hits? Similarly, what would I expect his field goal percentage to be when he is on a streak of three (or more) misses?
Following GVT’s analysis, one can form two groups of shots:
Group “3hits”: all shots in which the previous three shots (or more) were a hit,
Group “3misses”: all shots in which the previous three shots (or more) were a miss,
From here, it is natural to reason as follows: if Stephen Curry always has the same chance of success, then he is like a coin, so we can consider each group of shots as independent; after all, each shot has been assigned at random to one of three groups: “3hits,” “3misses,” or neither. So far this reasoning is correct. Now, GVT (implicitly) took this intuitive reasoning one step further: because all shots, which are independent, have been assigned at random to each of the groups, we should expect the field goal percentages to be the same in each group. This is the part that is wrong.
Where does this seemingly fine thinking go wrong? The first clue that there is a problem is that the variable that is being used to assign shots to groups is also showing up as a response variable in the computation of the field goal percentage, though this does not fully explain the problem. The key issue is that there is a bias in how shots are being selected for each group. Let’s see this by first focusing on the “3hits” group. Under the assumptions of GVT’s statistical test, Stephen Curry has a 50% chance of success on each shot, i.e. he is like a coin: heads for hit, and tails for miss. Now, suppose we plan on flipping a coin 100 times, then selecting at random among the flips that are immediately preceded by three consecutive heads, and finally checking to see if the flip we selected is a heads, or a tails. Now, before we flip, what is the probability that the flip we end up selecting is a heads? The answer is that this probability is not 0.50, but 0.46! Herein lies the selection bias. The flips that are being selected for analysis are precisely the flips that are immediately preceded by three consecutive heads. Now, returning to the world of basketball shots, this way of selecting shots for analysis implies that for the “3hits” group, there would be a 0.46 chance that the shot we are selecting is a hit, and for the “3misses” group, there would be a 0.54 chance that the shot we are selecting is a hit.
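For readers who would rather check the coin-flip claim than take it on faith, here is a minimal simulation sketch. It assumes a fair coin, 100 flips per sequence, and a streak length of three, exactly as in the thought experiment above; the function names, the number of simulated sequences, and the random seed are illustrative choices and do not come from the paper. Sequences in which no flip follows three consecutive heads are dropped, since there is nothing to select from in them.

```python
import random


def prop_heads_after_streak(flips, k=3):
    """Share of heads among flips immediately preceded by k straight heads.

    Returns None when no flip in the sequence follows such a streak.
    """
    followers = [flips[i] for i in range(k, len(flips)) if all(flips[i - k:i])]
    return sum(followers) / len(followers) if followers else None


def expected_proportion(n_flips=100, k=3, n_sequences=5000, seed=0):
    """Monte Carlo average of the per-sequence proportion for a fair coin."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    props = []
    for _ in range(n_sequences):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        p = prop_heads_after_streak(flips, k)
        if p is not None:  # skip sequences with no eligible flips
            props.append(p)
    return sum(props) / len(props)


if __name__ == "__main__":
    # Should land noticeably below 0.50, close to the 0.46 quoted in the text.
    print(round(expected_proportion(), 3))
```

Averaging the per-sequence proportion is the same thing as picking one eligible flip at random from each sequence and asking how often it comes up heads, which is why the estimate speaks directly to the question posed above.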
Therefore, if Stephen Curry does not get hot, i.e. if he always has a 50% chance of success for the 100 shots we study, we should expect him to shoot 46% after a streak of three or more hits, and 54% after a streak of three or more misses. This is the order of magnitude of the bias that was built into the original hot hand study, and this is the bias that is depicted in Figure 2 on page 13 of our new paper; a simpler version of this figure is below. This bias is large in basketball terms: a difference of 8 percentage points is nearly the difference between the median NBA three-point shooter and the very best. Another way to look at this bias is to imagine what would happen if we were to invite 100 players to participate in GVT’s experiment, with each player shooting from positions in which the chance of success on each shot were 50%. For each player, check to see if his/her field goal percentage after a streak of three or more hits is higher than his/her field goal percentage after a streak of three or more misses. For how many players should we expect this to be true? Correct answer: 40 out of 100 players.
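The "100 players" thought experiment can be sketched the same way. The code below is an illustration under stated assumptions, not the analysis from the paper: every simulated player is a fair coin taking 100 shots, the streak length is three, players who never produce both a hit streak and a miss streak are dropped, and ties are not counted as "higher." The function names, the number of simulated players, and the seed are hypothetical.

```python
import random


def pct_after(shots, k, streak_value):
    """FG% on shots whose previous k shots all equal streak_value.

    Returns None if the player never has such a streak before a shot.
    """
    picked = [shots[i] for i in range(k, len(shots))
              if all(s == streak_value for s in shots[i - k:i])]
    return sum(picked) / len(picked) if picked else None


def simulate_players(n_players=5000, n_shots=100, k=3, seed=0):
    """Return (mean difference, share of players shooting better after hits)."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    diffs = []
    for _ in range(n_players):
        shots = [rng.random() < 0.5 for _ in range(n_shots)]
        after_hits = pct_after(shots, k, True)
        after_misses = pct_after(shots, k, False)
        if after_hits is not None and after_misses is not None:
            diffs.append(after_hits - after_misses)
    mean_diff = sum(diffs) / len(diffs)
    share_higher = sum(d > 0 for d in diffs) / len(diffs)
    return mean_diff, share_higher


if __name__ == "__main__":
    mean_diff, share_higher = simulate_players()
    print(f"mean (after hits - after misses): {mean_diff:.3f}")
    print(f"share of players better after hits: {share_higher:.3f}")
```

If the reasoning above is right, the mean difference should come out negative, on the order of 8 percentage points, and clearly fewer than half of these streak-free "players" should look better after hits than after misses.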
This selection bias is large enough to invalidate the main conclusion of GVT's original study, without having to analyze any data. However, beyond this “negative” message, there is also a way forward. Namely, we can re-analyze the original Cornell dataset, but in a way invulnerable to the bias. It turns out that when we do this, we find considerable evidence of the hot hand in this data. First, if we look at Table 4 in GVT (page 307), we see that, on average, players shot around 3.5 percentage points better when on a hit streak of three or more shots, and that 64% of the players shot better when on a hit streak than when on a miss streak. While GVT do not directly analyze these summary averages, given our knowledge of the bias, they are telling (in fact, you can do much more with Table 4; see Kenny LJ respond to his own question here). With the correct analysis (described in the next post), there is statistically significant evidence of the hot hand in the original data set, and, as can be seen in Table 2 on page 23 of our new paper, the point estimate of the average hot hand effect size is large (further details in our “Cold Shower” paper here). If one adjusts for the bias, what one now finds is that: (1) hitting a streak of three or more shots in a row is associated with an expected 10 percentage point boost in a player’s field goal percentage, (2) 76% of players have a higher field goal percentage when on a hit vs. miss streak, and (3) 4 out of 26 players have a large enough effect to be individually significant by conventional statistical standards (p<.05), which itself is a statistically significant result on the number of significant effects, by conventional standards.
In a later post, we will return to the details of GVT’s paper, and talk about the evidence for the hot hand found across other datasets. If you prefer not to wait, please take a look at our Cold Shower paper, and the related comments on Gelman’s blog.
In the next installment, we will discuss the counter-intuitive probability problem that reveals the bias, and explain what is driving the selection bias there. We will then discuss some common misconceptions about the nature of the selection bias, and some very interesting connections with classic probability paradoxes.
Reports on road shows:
This is the future of the Liberal Republic of Science: a society filled with culturally diverse citizens whose common interest in enjoying the benefit of all the knowledge their way of life makes possible is secured by scientists, science communication professionals, educators, and public officials using and extending the "new political science" of science communication.
I did a presentation on "'Ideology' or 'Situation Sense?'," the CCP study on the interaction of cultural worldviews and legal reasoning in the public, law students, lawyers & judges, respectively. Lots of great feedback.
A small selection of other papers definitely worth taking a look at (a very frustrating element of a conference like this is having to choose between concurrent sessions featuring really interesting stuff):
Chen, Moskowitz & Shue, Decision-Making Under the Gambler's Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires
Thorley, Green et al., Please Recuse Yourself: A Field Experiment Exploring the Relationship between Campaign Donations and Judicial Recusal
MacDonald, Fagan & Geller, The Effects of Local Police Surges on Crime and Arrests in New York City
Ramseyer, Nuclear Power and the Mob: Extortion and Social Capital in Japan
Scurich, Jurors’ Presumption of Innocence: Impact on Cumulative Evidence Evaluation and Verdicts
Sommers, Perplexing Public Attitudes Toward Consent: Implications for Sex, Law, and Society
Robertson, 535 Felons? An Empirical Investigation into the Law of Political Corruption
Baker & Malani, Do Judges Really Care About Law? Evidence from Circuit Split Data
Was just reading a really cool article, Aasen, M. The polarization of public concern about climate change in Norway, Climate Policy (2015), advance online publication.
Constructing Individualism and Egalitarian scales with items from Norwegian Gallup polls conducted between 2003-11, Aasen does find that both dispositions predict differences in concern w/ climate change -- less for former, more for latter.
"Climate change concern was measured with the single item ‘How concerned are you about climate change?’ The response categories were ‘Quite concerned’, ‘Very concerned’, ‘A little concerned’, and ‘Not at all concerned.'" Assuming, as seems certain!, that Norwegians have attitudes about climate change, it's pretty safe to expect a single item like this to tap into it in the same way that the Industrial Strength Risk Perception Measure would. Aasen likely handicapped her detection of the strength of the influences she measured, however, by dichotomizing this measure ("Quite concerned" & "Very concerned" vs. "A little concerned" & "Not at all") rather than treating it as a 4-point ordinal one.
Aasen's "individualism" scale was apparently substantially more reliable than her "egalitarianism" one (the α's are reported as "> 0.70" and "> 0.30," respectively). But assuming the indicators have the requisite relationship with the underlying disposition, low reliability doesn't bias results; it just attenuates the strength of them.
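To put a rough number on "attenuates," here is a back-of-the-envelope sketch using the classical test theory attenuation formula (observed correlation = true correlation times the square root of the product of the two reliabilities). It assumes Cronbach's α can stand in for reliability and that the outcome is measured without error; the "true" correlation of 0.5 is purely hypothetical, and the α values are just the lower bounds reported above, not estimates from Aasen's data.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y=1.0):
    """Correlation one would expect to observe under classical test theory.

    reliability_x, reliability_y: reliabilities of the two measures
    (here Cronbach's alpha is assumed to approximate them).
    """
    return true_r * math.sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    true_r = 0.5  # hypothetical true correlation, for illustration only
    for label, alpha in [("individualism", 0.70), ("egalitarianism", 0.30)]:
        observed = attenuated_correlation(true_r, alpha)
        print(f"{label}: alpha={alpha:.2f} -> expected observed r = {observed:.2f}")
```

On these assumptions the sign of the relationship survives while its size shrinks, which is the sense in which low reliability weakens an estimate without biasing its direction.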
So it's pretty cool to now see evidence of the same sorts of cultural divisions in Norway as we see in the US (Kahan et al. 2012), UK (Kahan et al. 2015), Australia (Guy, Kashima & O'Neill 2014), & Switzerland (Yi et al. 2015), etc. Maybe Aasen will follow up by adapting the "cultural cognition worldview" scales for a Norwegian sample!
But what really got my attention was the overall level of concern in the sample:
Yes, "individualism" and "Hierarchy" (the attitude opposite in valence to "egalitarianism") predict a steeper decline in concern after 2007, and obviously explain a lot more variance in 2011 than in 2003.
But look, first, at how modest "concern" was even for the most "egalitarian" and "communitarian" (opposite of individualistic) respondents; and, second, at the universality of the decline in concern since 2007.
The climate-concern item seems to be the international equivalent of a Gallup item that asks U.S. respondents "how worried" they are about "global warming" or "climate change" ("great deal," "fair amount," "only a little," or "not at all"). Here's what U.S. responses (combining the equivalent response categories) look like (with the period that overlaps w/ Aasen's data bounded by dotted lines):
You can see that the divide along "individualist-communitarian" and "egalitarian-hierarchy" lines in Norway is less extreme than the Democrat-Republican one in the U.S. Actually, if we had data for the U.S. respondents' cultural worldviews, the greater degree of polarization in the U.S. would be shown to be even more substantial.
But again, that's not as intriguing to me as what the data show about the relative levels of "concern"/"worry" in the two nations. The U.S. population is not particularly "worried" on average, but apparently Norwegians are even less "concerned," as can be seen in this composite graphic, which charts the corresponding sets of responses for both nations, respectively, in the years for which there are data (note: Aasen supplied me with the Norwegian means; this Figure supersedes a slightly but not materially different one reflecting estimates from the model presented in the paper):
The trends are very comparable, and maybe the question wording or some cross-cultural exchange rate in how respondents indicate their attitudes explains the gap.
But clearly (by this measure at least) Norway is not more concerned than the U.S., which according to common wisdom "leads the world in climate denial."
Indeed, the segment of society most culturally predisposed to worry about climate change in Norway is no more concerned than the "average" American.
So what's going on in that country?!
Maybe we can entice Aasen into a guest post. I've already offered her the standard MOP$50,000.00 fee (payable in future stock options in CCP, Inc.), but I'm confident she, like other guests, will waive the fee to affirm that enlarging human knowledge is her only motivation for being a scholar (of course, there is still ambiguity, given the fame & celebrity endorsements, particularly in Macao, that come with being a CCP Blog guest poster).
We'll see what she says!
But in the meantime, this very interesting & cool paper supplies material for a fresh lesson about the dangers of "selecting on the dependent variable" in the science of science communication: If one tests one's theory of U.S. public opinion on climate change by considering only how well it "fits" the data in the U.S., then obviously one will be excluding the possibility of observing both comparable states of public opinion in societies where the asserted explanation ("balanced media norms," a creeping public "anti-science" sensibility, Republican brains, etc.) doesn't apply and divergent states of public opinion in societies in which the asserted explanation applies just as well (Shehata & Hopmann 2012).
Aasen, M. The polarization of public concern about climate change in Norway. Climate Policy (2015), advance online publication.
Kahan, D.M., Jenkins-Smith, H., Tarantola, T., Silva, C. & Braman, D. Geoengineering and Climate Change Polarization: Testing a Two-Channel Model of Science Communication. Annals of the American Academy of Political and Social Science 658, 192-222 (2015).
Kahan, D.M., Peters, E., Wittlin, M., Slovic, P., Ouellette, L.L., Braman, D. & Mandel, G. The polarizing impact of science literacy and numeracy on perceived climate change risks. Nature Climate Change 2, 732-735 (2012).
Or at least not at Cornell University, where I gave 3 lectures Thurs. & had follow up meetings w/ folks Friday.
This is a university that gets the importance of integrating the practice of science and science-informed policymaking with the science of science communication. The number of scholars across various departments in both the natural and social sciences who are applying themselves to this objective in their scholarship and pedagogy is pretty amazing.
No. 1 was a talk for the Global Leadership Fellows affiliated with the Cornell Alliance for Science (“a global initiative for science-based communication”). B/c the Fellows--an amazingly smart & talented group of science communication professionals & students--were going to tail me for the rest of the day, I thought I should pose a couple of questions that they could think about & that I’d answer in later lectures. Of course, I asked them for their own answers in the meantime. Since their answers were, predictably, better than the ones I was going to give, I just substituted theirs for mine later in the day--who would notice, right?
The questions were:
1. Do U.S. farmers believe in climate change? &
2. Do evolution non-believers enjoy watching documentaries on human evolution?
No. 2 was a lecture to the class “The GMO Debate: Science, Society, and Global Impacts.” Title of my talk was, “Are GMOs toxic for the science communication environment? Vice versa?” I think I might have been the first person to break the news to them that there isn’t any public contestation over GM foods in the U.S.
If so, then, maybe you'll stay tuned. An excerpt from something I'm working on:
. . . . As conceptualized here, science curiosity is not a transient state (see generally Loewenstein 1994), but instead a general disposition, variable in intensity across persons, that reflects the motivation to seek out and consume scientific information for personal pleasure.
A valid measure of this disposition could be expected to make myriad contributions to knowledge. Such an instrument could be used to improve science education, for example, by facilitating investigation of the forms of pedagogy most likely to promote the development of science curiosity and harness it to promote learning (Blalock, Lichtenstein, Owen & Pruski 2008). A science curiosity measure could likewise be used by science journalists, science filmmakers, and similar professionals to perfect the appeal of their work to those individuals who value it the most (Nisbet & Aufderheide 2009). Those who study the science of science communication (Fischhoff & Scheufele 2014; Kahan 2015) could also use a science curiosity measure to deepen their understanding of how public interest in science shapes the responsiveness of democratically accountable institutions to policy-relevant evidence.
Indeed, the benefits of measuring science curiosity are so numerous and so substantial that it would be natural to assume researchers must have created such a measure long ago. But the plain truth is that they have not. “Science attitude” measures abound. But every serious attempt to assess their performance has concluded that they are psychometrically weak and, more importantly, not genuinely predictive of what they are supposed to be assessing—namely, the disposition to seek out and consume scientific information for personal satisfaction.
We report the results of a research measure consciously designed to remedy this research deficit....
Blalock, C.L., Lichtenstein, M.J., Owen, S., Pruski, L., Marshall, C. & Toepperwein, M. In Pursuit of Validity: A comprehensive review of science attitude instruments 1935–2005. International Journal of Science Education 30, 961-977 (2008).
Fischhoff, B. & Scheufele, D.A. The science of science communication. Proceedings of the National Academy of Sciences 110, 14031-14032 (2013).
Been busy at work on the CCP "Evidence-based Science Filmmaking Initiative" (ESFI), and hence neglecting the 14 billion readers of this blog... Sorry!
Am hoping what we will have to say on the progress we've been making will compensate. More on that soon-- very soon.
But just to feed you enough information to prevent utter starvation, the coolest thing so far is a behaviorally validated Science Curiosity Index (SCI), which measures the disposition to seek out & consume science information for personal satisfaction. It's amazing what science curiosity, which is definitely not the same thing as the science-comprehension disposition measured by Ordinary Science Intelligence, tells us about how people process information about contested science issues.
But more soon-- very soon, I promise!
A comment on Lee Jussim's Social Perception and Social Reality: Why Accuracy Dominates Bias and Self-Fulfilling Prophecy (Oxford 2012).
This comment uses the dynamic of identity-protective cognition to pose a friendly challenge to Jussim (2012). The friendly part consists of an examination of how this form of information processing, like many of the ones Jussim describes, has been mischaracterized in the decision science literature as a “cognitive bias”: in fact, identity-protective cognition is a mode of engaging information rationally suited to the ends of the agents who display it. The challenging part is the manifest inaccuracy of the perceptions that identity-protective cognition generates. At least some of the missteps induced by the “bounded rationality” paradigm in decision science reflect its mistaken assumption that the only thing people use their reasoning for is to form accurate beliefs. Jussim’s critique of the bounded-rationality paradigm, the comment suggests, appears to rest on the same mistaken equation of rational information processing with perceptual accuracy.
As discussed previously (here & here), a pair of economists have generated quite a bit of agitation and excitement by exposing an apparent flaw in the methods of the classic “hot hand fallacy” studies.
These studies purported to show that, contrary to popular understanding not only among sports fans but among professional athletes and coaches, professional basketball players do not experience “hot streaks,” or periods of above-average performance longer in duration than one would expect to see by chance. The papers in question have for thirty years enjoyed canonical status in the field of decision science research as illustrations of the inferential perils associated with the propensity of human beings to look for and see patterns in independent events.
Actually, the reality of that form of cognitive misadventure isn’t genuinely in dispute. People are way too quick to discern signal in noise.
But what is open to doubt now is whether the researchers used the right analytical strategy in testing whether this mental foible is the source of the widespread impression that professional basketball players experience "hot hands."
I won’t rehearse the details—in part to avoid the amusingly embarrassing spectacle of trying to make intuitively graspable a proof that stubbornly assaults the intuitions of highly numerate persons in particular—but the nub of the proof supplied by the challenging researchers, Joshua Miller & Adam Sanjurjo, is that the earlier researchers mistakenly treated “hit” and “missed” shots as recorded in a previous, finite sequence of shots as if they were independent. In fact, because the proportion of “hits” and “misses” in a past sequence is fixed, strings of “hits” should reduce the likelihood of subsequent “hits” in the remainder of the sequence. Not taking this feature of sampling without replacement into account caused the original “hot hand fallacy” researchers to miscalculate the “null" in a manner that overstated the chance probability that a player would hit another shot after a specified string of hits....
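For anyone whose intuitions are still protesting, the effect can be checked by brute force on sequences short enough to list out by hand. The sketch below is an illustration, not M&S's proof: it assumes a fair coin, enumerates every equally likely sequence of a given length, and averages the share of "hits" among shots that immediately follow k straight hits, leaving out sequences that contain no such shot. The function name and the choice of three shots with a streak length of one are illustrative only.

```python
from fractions import Fraction
from itertools import product


def exact_expected_proportion(n_shots, k=1):
    """Exact average, over all equally likely fair-coin sequences of length
    n_shots, of the share of hits among shots preceded by k straight hits.

    Sequences with no such shot are left out, as they would be in a data set.
    """
    total, counted = Fraction(0), 0
    for seq in product((0, 1), repeat=n_shots):  # 1 = hit, 0 = miss
        followers = [seq[i] for i in range(k, n_shots) if all(seq[i - k:i])]
        if followers:
            total += Fraction(sum(followers), len(followers))
            counted += 1
    return total / counted


if __name__ == "__main__":
    # Three shots, streak length one: six of the eight sequences qualify,
    # and their proportions (1, 1/2, 0, 0, 1, 0) average to 5/12, not 1/2.
    print(exact_expected_proportion(3, k=1))  # 5/12
```

Exact fractions are used so that the shortfall from 1/2 cannot be written off as simulation noise.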
Bottom line is that the data in the earlier studies didn’t convincingly rule out the possibility that basketball players’ performances did indeed display the sort of “streakiness” that defies chance expectations and supports the “hot hand” conjecture.
But in any case . . . the point of this update is to call attention to the truly admirable and inspiring reaction of the original researchers to the news that their result had been called into question in this way.
As I said, the “hot hand fallacy” studies are true classics. One could understand if those who had authored such studies would react defensively (many others who have been party to celebrating the studies for the last 30 yrs understandably have!) to the suggestion that the studies reflect a methodological flaw, one that itself seems to reflect the mischief of an irresistible but wrong intuition about how to distinguish random from systematic variations in data.
But instead, the reaction of the lead researcher to the M&S result, Tom Gilovich, is: “Coooool!!!!!!!!”
“Unlike a lot of stuff that’s come down the pike since 1985,” Gilovich was quoted as saying in a Wed. Wall Street Journal piece,
“this is truly interesting,” Gilovich said. “What they discovered is correct.” Whether the real effect is “so small that the original conclusion stands or needs to be modified,” he said, “is what needs to be determined.”
The article goes on to report that Gilovich, along with others, is now himself contemplating re-analyses and new experiments to try to do exactly that.
In a word, Gilovich, far from having his nose bent out of joint by the M&S finding, is excited that a truly unexpected development is now furnishing him and others with a chance to resume investigation of an interesting and complex question.
I bet, too, that at least part of what intrigues Gilovich is how a mistake like this could have evaded the attention of decision scientists for this long -- and why even now the modal reaction among readers of the M&S paper is “BS!!” It takes about 45.3 (± 7) readings to really believe M&S’s proof, and even then the process has to be repeated at weekly intervals for a period of two months before the point they are making itself starts to seem intuitive enough to have the ring of truth.
But the point is, Gilovich, whose standing as a preeminent researcher is not diminished one iota by this surprising turn in the scholarly discussion his work initiated, has now enriched us even more by furnishing us with a compelling and inspiring example of the mindset of a real scholar!
Whatever embarrassment he might have been expected to experience (none is warranted in my view, nor evident in the WSJ article) is dwarfed by his genuine intellectual excitement over a development that is truly cool & interesting -- both for what it teaches us about a particular problem in probability and for the opportunity it furnishes to extend examination into human psychology (here, the distinctive vulnerability to error that likely is itself unique to people with intuitions fine-tuned to avoid making the mistakes that intuitions characteristically give rise to when people try to make sense of randomness).
I’m going to try to reciprocate the benefit of the modeling of scholarly virtue Gilovich is displaying by owning up to, and getting excited about, as many mistakes in my own previous work as I can find!
from correspondence ...
Dear Prof Kahan,
I’m working on an article describing how our ideologies skew our ability to deal with the facts, no matter how true/scientifically sound they are. While researching this, I (obviously ;) landed upon your research. I’ve been eagerly reading papers and posts on the Cultural Cognition site, but there are a couple of things I’m still unsure of. Namely:
1 How does cultural cognition differ from motivated reasoning? Or is the latter included in the former; thus motivated reasoning is merely cultural cognition “in action”?
2 Are smart people more prone to twist given facts so that they fit into their existing beliefs/values? Or are intelligent persons just more skillful in this process...?
3 Is motivated reasoning an unconscious reaction? Do we know we do it? Does everybody do it, even the ones who try not to?
4 If motivated reasoning is unconscious (= automatic), how on earth do we stop it? Can we?
I have to confess this whole phenomenon bothers me to the bone, both as a human being and (especially) as a science journalist. How can we, how can anyone promote rational ideas or actions or work towards the kind of society s/he thinks is worthwhile, if s/he doesn’t first know how things are, and thus is able to take in the facts?
I would appreciate it enormously if you found a minute to answer me.
With kind regards,
Okay, here’s a set of reflections that seem topical as another school year begins.
The reflections can be structured with reference to a question:
What’s the difference between a lawyer and a chick sexer?
It’s not easy, at first, to figure out what they have in common. But once one does, the risk that one won’t see what distinguishes them is much bigger, in actuarial and consequential terms.
I tell people about the link between them all the time—and they chuckle. But in fact, I spend hours and hours and hours per semester eviscerating comprehension of the critical distinction between them in people who are filled with immense intelligence and ambition, and who are destined to occupy positions of authority in our society.
Anyway, the chick sexer is the honey badger of cognitive psychology: relentlessly fascinating, and adorable. But because cognitive psychology doesn’t have nearly as big a presence on YouTube as do amusing voice-overs of National Geographic wildlife videos, the chick sexer is a lot less famous.
So likely you haven’t heard of him or her.
But in fact the chick sexer plays a vital role in the poultry industry. It’s his or her responsibility to separate the baby chicks, moments after birth, on the basis of gender.
The females are more valuable, at least from the point of view of the industry. They lay eggs. They are also plumper and juicier, if one wants to eat them. Moreover, the stringy scrawny males, in addition to being not good for much, are ill-tempered & peck at the females, steal their food, & otherwise torment them.
So the poultry industry basically just gets rid of the males (or the vast majority of them; a few are kept on and lead a privileged existence) at soonest opportunity—minutes after birth.
The little newborn hatchlings come flying (not literally; chickens can’t fly at any age) down a roomful of conveyor belts, 100’s per minute. Each belt is manned (personed) by a chick sexer, who deftly plucks (as in grabs; no feathers at this point) each chick off the belt, quickly turns him/her over, and in a split second determines the creature’s gender, tossing the males over his or her shoulder into a “disposal bin” and gently setting the females back down to proceed on their way.
They do this unerringly—or almost unerringly (99.99% accuracy or whatever).
Which is astonishing. Because there’s no discernable difference, or at least one that anyone can confidently articulate, in the relevant anatomical portions of the minutes-old chicks.
You can ask the chick sexer how he or she can tell the difference. Many will tell you some story about how a bead of sweat forms involuntarily on the male chick beak, or how he tries to distract you by asking for the time of day or for a cigarette, or how the female will hold one’s gaze for a moment longer or whatever.
This is all bull/chickenshit. Or technically speaking, “confabulation.”
Indeed, the more self-aware and honest members of the profession just shrug their shoulders when asked what it is that they are looking for when they turn the newborn chicks upside down & splay their little legs.
But while we don’t know what exactly chicksexers are seeing, we do know how they come to possess their proficiency in distinguishing male from female chicks: by being trained by a chick-sexing grandmaster.
For hours a day, for weeks on end, the grandmaster drills the aspiring chick sexers with slides—“male,” “female,” “male,” “male,” “female,” “male,” “female,” “female”—until they finally acquire the same power of discernment as the grandmaster, who likewise is unable to give a genuine account of what that skill consists in.
This is a true story (essentially).
What the chick sexer does to discern the gender of chicks is an instance of pattern recognition.
Pattern recognition is a cognitive operation in which we classify a phenomenon by rapidly appraising it in comparison to a large stock of prototypes acquired by experience.
The classification isn’t made via conscious deduction from a set of necessary and sufficient conditions but rather tacitly, via a form of perception that is calibrated to detect whether the object possesses a sufficient number of the prototypical attributes—as determined by a gestalt, “critical mass” intuition—to count as an instance of it.
All manner of social competence—from recognizing faces to reading others’ emotions—depends on pattern recognition.
But so do many specialized ones. What distinguishes a chess grandmaster from a modestly skilled amateur player isn’t her capacity to conjure and evaluate a longer sequence of potential moves but rather her ability to recognize favorable board positions based on their affinity to a large stock of ones she has determined by experience to be advantageous.
Professional judgment, too, depends on pattern recognition.
For sure, being a good physician requires the capacity and willingness to engage in conscious and unbiased weighing of evidence diagnostic of medical conditions. But that’s not sufficient; unless the doctor includes only genuinely plausible illnesses in her set of maladies worthy of such investigation, the likelihood that she will either fail to test for the correct one, or fail to identify it soon enough to intervene effectively, will be too high.
Expert forensic auditors must master more than the technical details of accounting; they must acquire a properly calibrated capacity to recognize the pattern of financial irregularity that helps them to extract evidence of the same from mountains of business records.
The sort of professional judgment one needs to be a competent lawyer depends on a properly calibrated capacity for pattern recognition, too.
Indeed, this was the key insight of Karl Llewellyn. The most brilliant member of the Legal Realist school, Llewellyn observed that legal reasoning couldn’t plausibly be reduced to deductive application of legal doctrines. Only rarely were outcomes uniquely determined by the relevant set of formal legal materials (statutes, precedents, legal maxims, and the like).
The solution he proposed was professional “situation sense”: a perceptive faculty, acquired by education and experience, that enabled lawyers to reliably appraise specific cases with reference to a stock of prototypical “situation types,” the proper resolution of which was governed by shared apprehensions of “correctness” instilled by the same means.
This feature of Llewellyn’s thought—the central feature of it—is weirdly overlooked by many scholars who characterize themselves as “realists” or “New Realists,” and who think that Llewellyn’s point was that because there’s no “determinacy” in “law,” judges must be deciding on the basis of “political” sensibilities of the conventional “left-right” sort, generating differences in outcome across judges of varying ideologies.
It’s really hard to get Llewellyn more wrong than that!
Again, his project was to identify how there could be pervasive agreement among lawyers and judges on what the law is despite its logical indeterminacy. His answer was that members of the legal profession, despite heterogeneity in their “ideologies” politically understood, shared a form of professionalized perception—“situation sense”—that by and large generated convergence on appropriate outcomes the coherence of which would befuddle non-lawyers.
Llewellyn denied, too, that the content of situation sense admitted of full specification or articulation. The arguments that lawyers made and the justifications that judges give for their decisions, he suggested, were post hoc rationalizations.
Does that mean that for Llewellyn, legal argument is purely confabulatory? There are places where he seems to advance that claim.
But the much more intriguing and I think ultimately true explanation he gives for the practice of reason-giving in lawyerly argument (or just for lawyerly argument) is its power to summon and focus “situation sense”: when effective, argument evokes both apprehension of the governing “situation” and motivation to reach a situation-appropriate conclusion.
Okay. Now what is analogous between lawyering and chick-sexing should be readily apparent.
The capacity of the lawyer (including the one who is a judge) to discern “correct” outcomes as she grasps and manipulates indeterminate legal materials is the professional equivalent of—and involves the exercise of the same cognitive operation as—the chicksexer’s power to apprehend the gender of the day-old chick from inspection of its fuzzy, formless genitalia.
In addition, the lawyer acquires her distinctive pattern-recognition capacity in the same way the chick sexer acquires his: through professional acculturation.
What I do as a trainer of lawyers is analogous to what the chicksexer grandmaster does. “Proximate causation,” “unlawful restraint of trade,” “character propensity proof/permissible purpose,” “collateral (not penal!) law”—“male,” “male,” “female,” “male”: I bombard my students with a succession of slides that feature the situation types that stock the lawyer’s inventory, and inculcate in students the motivation to conform the results in particular cases to what those who practice law recognize—see, feel—to be the correct outcome.
It works. I see it happen all the time.
It’s quite amusing. We admit students to law school in large part because of their demonstrated proficiency in solving the sorts of logic puzzles featured on the LSAT. Then we torment them, Alice-in-Wonderland fashion, by presenting to them as “paradigmatic” instances of legal reasoning outcomes that clearly can’t be accounted for by the contorted simulacra of syllogistic reasoning that judges offer to explain them.
They stare uncomprehendingly at written opinions in which a structural ambiguity is resolved one way in one statute and the opposite way in another--by judges who purport to be following the “plain meaning” rule.
They throw their hands up in frustration when judges insist that their conclusions are logically dictated by patently question-begging standards (“when the result was a reasonably foreseeable consequence of the defendant’s action. . . “) that can be applied only on the basis of some unspecified, and apparently not even consciously discerned, extra-doctrinal determination of the appropriate level of generality at which to describe the relevant facts.
But the students do learn—that the life of the law is not “logic” (to paraphrase Holmes, a proto-realist) but “experience,” or better, perception founded on the “experience” of becoming a lawyer, replete with all the sensibilities that being that sort of professional entails.
The learning is akin to the socialization process that the students all experienced as they negotiated the path from morally and emotionally incompetent child to competent adult. Those of us who are already socially competent model the right reactions for them in our own reactions to the materials—and in our reactions to the halting and imperfect attempts of the students to reproduce it on their own.
“What,” I ask in mocking surprise, “you don’t get why these two cases reached different results in applying the ‘reasonable foreseeability’ standard of proximate causation?”
Seriously, you don’t see why, for an arsonist to be held liable for causing the death of firefighters, it's enough to show that he could ‘reasonably foresee’ 'death by fire,' whether or not he could foresee ‘death by being trapped by fires travelling the particular one of 5x10^9 different paths the flames might have spread through a burning building'?! But why ‘death by explosion triggered by a spark emitted from a liquid nitrate stamping machine when knocked off its housing by a worker who passed out from an insulin shock’—and not simply 'death by explosion'—is what must be "foreseeable" to a manufacturer (one warned of explosion risk by a safety inspector) to be convicted for causing the death of employees killed when the manufacturer’s plant blew up?
"Anybody care to tell Ms. Smith what the difference is,” I ask in exasperation.
Or “Really,” I ask in a calculated (or worse, in a wholly spontaneous, natural) display of astonishment,
you don’t see why someone's ignorance of what's on the ‘controlled substance’ list doesn’t furnish a "mistake of law" defense (in this case, to a prostitute who hid her amphetamines in tin foil wrap tucked in her underwear--is that where you keep your cold medicine or ibuprofen! Ha ha ha ha ha!!), but why someone's ignorance of the types of "mortgage portfolio swaps" that count as loss-generating "realization events" under IRS regs (the sort of tax-avoidance contrivance many of you will be paid handsomely by corporate law firm clients to do) does furnish one? Or why ignorance of the criminal prohibition on "financial structuring" (the sort of stratagem a normal person might resort to to hide assets from his spouse during a divorce proceeding) furnishes a defense as well?!
Here Mr. Jones: take my cellphone & call your mother to tell her there’s serious doubt about your becoming a lawyer. . . .
This is what I see, experience, do. I see my students not so much “learning to think” like lawyers but just becoming them, and thus naturally seeing what lawyers see.
But of course I know (not as a lawyer, but as a thinking person) that I should trust how things look and feel to me only if corroborated by the sort of disciplined observation, reliable measurement, and valid causal inference distinctive of empirical investigation.
So, working with collaborators, I design a study to show that lawyers and judges are legal realists—not in the comic-book “politicians in robes” sense that some contemporary commentators have in mind but in the subtle, psychological one that Llewellyn actually espoused.
Examining a pair of genuinely ambiguous statutes, members of the public predictably conform their interpretation of them to outcomes that gratify their partisan cultural or political outlooks, polarizing in patterns the nature of which are dutifully obedient to experimental manipulation of factors extraneous to law but very relevant indeed to how people with those outlooks think about virtue and vice.
But not lawyers and judges: they converge on interpretations of these statutes, regardless of their own cultural outlooks and regardless of experimental manipulations that vary which outcome gratifies those outlooks.
They do that not because they, unlike members of the public, have acquired some hyper-rational information-processing capacity that blocks out the impact of “motivated reasoning”: the lawyers and judges are just as divided as members of the public, on the basis of the same sort of selective crediting and discrediting of evidence, on issues like climate change and legalization of marijuana and prostitution.
Rather the lawyers and judges converge because they have something else that members of the public don’t: Llewellyn’s situation sense—a professionalized form of perception, acquired through training and experience, that reliably fixes their attention on the features of the “situation” pertinent to its proper legal resolution and blocks out the distracting allure of features of it that might be pertinent to how a non-lawyer—i.e., a normal person, with one or another kind of “sense” reliably tuned to enabling them to be a good member of a cultural group on which their status depends . . . .
So, that’s what lawyers and chick sexers have in common: pattern recognition, situation sense, appropriately calibrated to doing what they do—or in a word professional judgment.
But now, can you see what the chick sexer and the lawyer don’t have in common?
Perhaps you don’t; because even in the course of this account, I feel myself having become an agent of the intoxicating, reason-bypassing process that imparting “situation sense” entails.
But you might well see it—b/c here all I’ve done is give you an account of what I do as opposed to actually doing it to you.
We know something important about the chick sexer’s judgment in addition to knowing that it is an instance of pattern recognition: namely, that it works.
The chick sexer has a mission in relation to a process aimed at achieving a particular end. That end supplies a normative standard of correctness that we can use not only to test whether chick sexers, individually and collectively, agree in their classifications but also on whether they are classifying correctly.
Obviously, we’ll have to wait a bit, but if we collect rather than throw half of them away, we can simply observe what gender the baby chicks classified by the sexer as “male” and “female” grow up to be.
If we do that test, we’ll find out that the chick sexers are indeed doing a good job.
We don’t have that with lawyers’ or judges’ situation sense. We just don’t.
We know they see the same thing; that they are, in the astonishing way that fascinated Llewellyn, converging in their apprehension of appropriate outcomes across cases that “lay persons” lack the power to classify correctly.
But we aren’t in a position to test whether they are seeing the right thing.
What is the goal of the process the lawyers and judges are involved in? Do we even agree on that?
I think we do: assuring the just and fair application of law.
That’s a much more general standard, though, than “classifying the gender of chicks.” There are alternative understandings of “just” and “fair” here.
Actually, though, this is still not the point at which I’m troubled. Although for sure I think there is heterogeneity in our conceptions of the “goals” that the law aims at, I think they are all conceptions of a liberal political concept of “just” and “fair,” one that insists that the state assume a stance of neutrality with respect to the diverse understandings of the good life that freely reasoning individuals (or more accurately groups of individuals) will inevitably form.
But assuming that this concept, despite its plurality of conceptions, has normative purchase with respect to laws and applications of the same (I believe that; you might not, and that’s reasonable), we certainly don’t have a process akin to the one we use for chick sexers to determine whether lawyers and judges’ situation sense is genuinely calibrated to achieving it.
Or if anyone does have such a process, we certainly aren’t using it in the production of legal professionals.
To put it in terms used to appraise scientific methods, we know the professional judgment of the chick sexer is not only reliable—consistently attuned to whatever it is that appropriately trained members of their craft are unconsciously discerning—but also valid: that is, we know that the thing the chick sexers are seeing (or measuring, if we want to think of them as measuring instruments of a special kind) is the thing we want to ascertain (or measure), viz., the gender of the chicks.
In the production of lawyers, we have reliability only, without validity—or at least without validation. We do successfully (remarkably!) train lawyers to make out the same patterns when they focus their gaze at the “mystifying cloud of words” that Cardozo identified the law as comprising. But we do nothing to assure that what they are discerning is the form of justice that the law is held forth as embodying.
Observers fret—and scholars using empirical methods of questionable reliability and validity purport to demonstrate—that judges are mere “politicians in robes,” whose decisions reflect the happenstance of their partisan predilections.
That anxiety that judges will disagree based on their “ideologies” bothers me not a bit.
What does bother me—more than just a bit—is the prospect that the men and women I’m training to be lawyers and judges will, despite the diversity of their political and moral sensibilities, converge on outcomes that defy the basic liberal principles that we expect to animate our institutions.
The only thing that I can hope will stop that from happening is for me to tell them that this is how it works. Because if it troubles me, I have every reason to think that they, as reflective decent people committed to respecting the freedom & reason of others, will find some of this troubling too.
Not so troubling that they can’t become good lawyers.
But maybe troubling enough that they won't stop being reflective moral people in their careers as lawyers; troubling enough so that if they find themselves in a position to do so, they will enrich the stock of virtuous-lawyer prototypes that populate our situation sense by doing something that they, as reflective, moral people—“conservative” or “liberal”—recognize is essential to reconciling being a “good lawyer” with being a member of a profession essential to the good of a liberal democratic regime.
That can happen, too.
Right . . . So yesterday I posted part I of this series, which is celebrating the bicentennial (or perhaps it’s the tricentennial; one loses track after a while) of the “NHT Fallacy” critique.
The nerve of it is that “rejection of the null [however it is arbitrarily defined] at p < 0.05 [or p < 10^-50 or whatever]” furnishes no inferentially relevant information in hypothesis testing. To know whether an observation counts as evidence in support of a hypothesis, the relevant information is not how likely we were to observe a particular value if the “null” is true but how much more or less likely we were to observe that value if a particular hypothesized true “value” is correct than if another hypothesized “true” value is correct (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999a; Gigerenzer 2004).
Actually, I’m not sure when the first formulation of the critique appeared. Amusingly, in his 1960 classic The Fallacy of the Null-hypothesis Significance Test, Rozeboom apologetically characterized his own incisive attack on the inferential barrenness of NHT as “not a particularly original view”!
The critique has been refined and elaborated many times, in very useful ways, since then, too. Weirdly, the occasion for so many insightful elaborations has been the persistence of NHT despite the irrefutable proofs of those critiquing it.
More on that in a bit, but probably the most interesting thing that has happened in the career of the critique in the last 50 yrs. or so has been the project to devise tractable alternatives to NHT that really do quantify the evidentiary weight of any particular set of data.
I’m certainly not qualified to offer a reliable account of the intellectual history of using Bayesian likelihood ratios as a test statistic in the social sciences (cf. Good). But the utility of this strategy was clearly recognized by Rozeboom, who observed that the inferential defects in NHT could readily be repaired by analytical tools forged in the kiln of “the classic theory inverse probabilities.”
The “Bayes Factor” (actually, “the” misleadingly implies that there is only one variant of it) is the most muscular, deeply theorized version of the strategy.
But one can, I believe, still get a lot of mileage out of less technically elaborate analytical strategies using likelihood ratios to assess the weight of the evidence in one’s data (e.g., Goodman, 1999b).
For many purposes, I think, the value of using Bayesian likelihood ratios is largely heuristic: having to specify the predictions that opposing plausible hypotheses would generate with respect to the data, and to formulate an explicit measure of the relative consistency of the observed outcome with each, forces the researcher to do what the dominance of NHT facilitates the evasion of: the reporting of information that enables a reflective person to draw an inference about the weight of the evidence in relation to competing explanations of the dynamic at issue.
That’s all that’s usually required for others to genuinely learn from and critically appraise a researcher’s work. For sure there are times when everything turns on how precisely one is able to estimate some quantity of interest, where key conceptual issues about how to specify one or another parameter of a Bayes Factor will have huge consequence for interpretation of the data.
But in lots of experimental models, particularly in social psychology, it’s enough to be able to say “yup, that evidence is definitely more consistent—way more consistent—with what we’d expect to see if H1 rather than H2 is true”—or instead, “wait a sec, that result is not really any more supportive of that hypothesis than this one!” In which case, a fairly straightforward likelihood ratio analysis can, I think, add a lot, and even more importantly avoid a lot of the inferential errors that accompany permitting authors to report “p < 0.05” and then make sweeping, unqualified statements not supported by their data.
That’s exactly the misadventure, I said “yesterday,” that a smart researcher experienced with NHT. That researcher found a “statistically significant” correlation (i.e., rejection of the “null at p<0.0xxx”) between a sample of Univ of Ky undergraduates’ CRT scores (Frederick 2005) and their responses to a standard polling question on “belief in” evolution; he then treated that as corroboration of his hypothesis that “individuals who are better able to analytically control their thoughts are more likely” to overcome the intuitive attraction of the idea that “living things, are ... intentionally designed by some external agent” to serve some “function and purpose,” and thus “more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species."
But as I pointed out, the author’s data, contrary to his assertion, unambiguously didn’t support that hypothesis.
Rather than showing that “analytic thinking consistently predicts endorsement of evolution,” his data demonstrated that knowing the study subjects’ CRT scores furnished absolutely no predictive insight into their "evolution beliefs." The CRT predictor in the author’s regression model was “statistically significant” (p < 0.01), but was way too small in size to outperform a “model” that simply predicted “everyone” in the author’s sample—regardless of their CRT score—rejected science’s account of the natural history of human beings.
(Actually, there were even more serious—or maybe just more interesting—problems having to do with the author’s failure to test the data's relative support for a genuine alternative about how cognitive reflection relates to "beliefs" in evolution: by magnifying the opposing positions of groups for whom "evolution beliefs" have become (sadly, pointlessly, needlessly) identity defining. But I focused “yesterday” on this one b/c it so nicely illustrates the NHT fallacy.)
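To make concrete how a predictor can be "statistically significant" and still add nothing to prediction, here is a minimal sketch. The intercept and slope are made-up illustrative values, not the coefficients from the author's regression; the only assumptions are that most of the sample rejects evolution and that the CRT effect is small.

```python
import math

# Hypothetical logistic-regression coefficients (NOT the study's estimates):
# a majority of subjects reject evolution, and CRT has a small positive effect.
INTERCEPT = -0.60
SLOPE_PER_CRT_POINT = 0.15


def prob_believes(crt_score):
    """Predicted probability of 'believing in' evolution at a given CRT score."""
    logit = INTERCEPT + SLOPE_PER_CRT_POINT * crt_score
    return 1.0 / (1.0 + math.exp(-logit))


def model_prediction(crt_score):
    """Classify as 'believer' only if the predicted probability exceeds 0.5."""
    return "believer" if prob_believes(crt_score) > 0.5 else "nonbeliever"


def baseline_prediction(crt_score):
    """The no-information 'model': predict the majority response for everyone."""
    return "nonbeliever"


for score in range(4):  # the CRT has three items, so scores run 0..3
    print(score, round(prob_believes(score), 3),
          model_prediction(score), baseline_prediction(score))
```

With these illustrative numbers the predicted probability rises with CRT score but never crosses 0.5, so the regression classifies every subject exactly as the base-rate "model" does.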
Had he asked the question that his p-value necessarily doesn’t address—how much more consistent is the data with one hypothesis than another—he would have actually found out that the results of his study were more consistent with the hypothesis that “cognitive reflection makes no goddam difference” in what people say when they answer a standard “belief in evolution” survey item of the sort administered by Gallup or Pew.
The question I ended on, then, was,
How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection do “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they don't?
That’s a very complicated and interesting question, and so now I’ll offer my own answer, one that uses the inference-disciplining heuristic of forming a Bayesian likelihood ratio.
1. Using a Bayesian likelihood ratio is not, in my view, the only device that can be used to extract from data like these the information necessary to form cogent inferences about the support of the data for study hypotheses. Anything that helps the analyst and reader gauge the relative support of the data for the study hypothesis in relation to a meaningful alternative or set of meaningful alternatives can do that.
Often it will be *obvious* how the data do that, given the sign of the value observed in the data or the size of it in relation to what common understanding tells one the competing hypotheses would predict.
But sometimes those pieces of information might not be so obvious, or might be open to debate. Or in any case, there could be circumstances in which extracting the necessary information is not so straightforward and in which a device like forming a Bayesian likelihood ratio in relation to the competing hypotheses helps, a lot, to figure out what the inferential import of the data is.
That's the pragmatic position I mean to be staking out here in advocating alternatives to the pernicious convention of permitting researchers to treat "p < 0.05" as evidence in support of a study hypothesis.
2. My "Bayesian likelihood ratio" answer here is almost surely wrong!
But it is at least trying to answer the right question, and by putting it out there, maybe I can entice someone else who has a better answer to share it.
Indeed, it was exactly by enticing others into scholarly conversation that I came to see what was cool and important about this question. Without implying that they are at all to blame for any deficiencies in this analysis, it’s one that emerged from my on-line conversations with Gordon Pennycook, who commented on my original post on this article, and my off-line ones with Kevin Smith, who shared a bunch of enlightening thoughts with me in correspondence relating to a post that I did on an interesting paper that he co-authored.
Here’s the most important thing to realize: the CRT is friggin hard!
It turns out that the median score on the CRT, a three-question test, is zero when administered to the general population. I kid you not: studies w/ general population samples (not student or M Turk samples, or ones recruited from visitors to a website that offers to furnish study subjects with information on the relationship between their moral outlooks and their intellectual styles) show that 60% of the subjects can't get a single answer correct.
Hey, maybe 60% of the population falls short of the threshold capacity in conscious, effortful information processing that critical reasoning requires. I doubt that but it's possible.
What that means, though, is that if we use the CRT in a study (as it makes a lot of sense to do; it’s a pretty amazing little scale), we necessarily can't get any information from our data on differences in cognitive reflection among a group of people comprising 60% of the population. Accordingly, if we had two groups neither of whose mean scores were appreciably above the "population mean," we'd be making fools of ourselves to think we were observing any real difference: the test just doesn't have any measurement precision or discrimination at that "low" a level of the latent disposition.
We can be even more precise about this -- and we ought to be, in order to figure out how "big" a difference in mean CRT scores would warrant saying stuff like "group x is more reflective than group y" or "differences in cognitive reflection 'predict'/'explain' membership in group x as opposed to y...."
Using item response theory, which scores the items on the basis of how likely a person with any particular level of the latent disposition (theta) is to get that particular item correct, we can assess the measurement precision of an assessment instrument at any point along theta. We can express that measurement precision in terms of a variable "reliability coefficient," which reflects what fraction of the differences in individual test scores in that vicinity of theta is attributable to "true differences" & how much to measurement error.
Here's what we get for CRT (based on a general population sample of about 1800 people):
The highest degree of measurement precision occurs around +1 SD, or approximately "1.7" answers correct. Reliability there is 0.60, which actually is pretty mediocre; for something like the SAT, it would be pretty essential to have 0.8 along the entire continuum from -2 to +2 SD. That’s b/c there is so much at stake, both for schools that want to rank students pretty much everywhere along the continuum, and for the students they are ranking.
But I think 0.60 is "okay" if one is trying to make claims about groups in general & not rank individuals. If one gets below 0.5, though, the correlations between the latent variable & anything else will be so attenuated as to be worthless....
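For readers who want to see the mechanics, here is a rough sketch of how measurement precision can vary along theta. It assumes a two-parameter logistic (2PL) model with hypothetical item parameters, and it uses one common convention for conditional reliability, information / (information + 1), which presumes theta is scaled to unit variance. None of these numbers are estimates from the CRT data described above. The last function is Spearman's classic attenuation formula.

```python
import math

# Hypothetical 2PL item parameters (discrimination a, difficulty b) for a
# three-item test. These are illustrative, not estimates from the CRT data.
ITEMS = [(1.5, 0.8), (1.5, 1.0), (1.5, 1.2)]


def p_correct(theta, a, b):
    """2PL item response function: chance of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def total_information(theta):
    """Sum of item informations a^2 * P * (1 - P) at ability theta."""
    total = 0.0
    for a, b in ITEMS:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


def reliability(theta):
    """Conditional reliability I / (I + 1), assuming theta has unit variance."""
    info = total_information(theta)
    return info / (info + 1.0)


def attenuated_correlation(true_r, rel_x, rel_y=1.0):
    """Spearman's attenuation: expected observed correlation given reliabilities."""
    return true_r * math.sqrt(rel_x * rel_y)


grid = [i / 10 for i in range(-30, 31)]  # theta from -3 to +3 SD
peak_theta = max(grid, key=reliability)
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(reliability(theta), 2))
print("most precise near theta =", peak_theta)
```

Because all three hypothetical items are hard, precision bunches up around +1 SD and falls away quickly below the mean, which is the qualitative pattern described above.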
So here are some judgments I'd make based on this understanding of the psychometric properties of CRT:
If I want to see if groups differ in their reflectiveness, then, I should not be looking to see if the difference in their CRT scores is "significant at p < 0.05," since that by itself won't support any inferences relating to the hypotheses given my guidelines above.
If one group has a "true" mean CRT score that is in the "red" zone, the hypothesis that it is less reflective than another group can be supported with CRT results only if the latter group's "true" mean score is in the green zone.
So how can we use this information to form a decent hypothesis testing strategy here?
Taking the "CRT makes no goddam difference" position, I'm going to guess that those who "don't believe" in evolution are pretty close to the population mean of "0.7." If so, then those who "do believe" will need to have a “true” mean score of +0.5 SD or about "1.5 answers correct" before there is a "green to red" zone differential.
That's a difference in mean score of approximately "0.8 answers correct."
The "believers more reflective" hypothesis, then, says we should expect to find that believers will have a mean score 0.8 points higher than the population mean, or 1.5 correct.
The “no goddam difference” hypothesis, we’ll posit, predicts the "null": no difference whatsoever in mean CRT scores of the believers & nonbelievers.
Now turning to the data, it turns out the "believers" in the author’s sample had a mean CRT of 0.86, SEM = 0.07. The "nonbelievers" had a mean CRT score of 0.64, SEM = 0.05.
I calculate the difference as 0.22, SEM = 0.08.
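The arithmetic can be checked with a few lines, assuming the two groups are independent so that their standard errors combine in quadrature (that independence assumption is mine, not something stated in the study):

```python
import math

# Group means and standard errors as reported in the post.
believers_mean, believers_sem = 0.86, 0.07
nonbelievers_mean, nonbelievers_sem = 0.64, 0.05

# Difference in means, and its standard error assuming independent groups.
difference = believers_mean - nonbelievers_mean
difference_sem = math.sqrt(believers_sem ** 2 + nonbelievers_sem ** 2)

print(round(difference, 2), round(difference_sem, 3))
```

This gives a difference of 0.22 with a standard error of about 0.086, in line with the rounded figures above.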
Again, it doesn’t matter that this difference is “statistically significant”—at p < 0.01 in fact. What we want to know is the inferential import of this data for our competing hypotheses. Which one does it support more—and how much more supportive is it?
As indicated at the beginning, a really good (or Good) way to gauge the weight of the evidence in relation to competing study hypotheses is through the use of Bayesian likelihood ratios. To calculate them, we look at where the observed difference in mean CRT scores falls in the respective probability density distributions associated with the “no goddam difference” and “believers more reflective” hypotheses.
By comparing how probable it is that we’d observe such a value under each hypothesis, we get the Bayesian likelihood ratio, which is how much more consistent the data are with one hypothesis than the other:
The author’s data are thus roughly 2000 times more consistent with the “no goddam difference” prediction than with the “believers more reflective” prediction.
Roughly! Figuring out the exact size of this likelihood ratio is not important.
All that matters—all I’m using the likelihood ratio, heuristically, to show—is that we can now see that, given what we know CRT is capable of measuring among groups whose scores are so close to the population mean, the size of the observed difference in mean CRT scores is orders of magnitude more consistent with the “no goddam difference” hypothesis than with the “believers more reflective” hypothesis, notwithstanding its "statistical significance."
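Here is a minimal sketch of that kind of calculation. It assumes that each hypothesis implies a normal distribution for the observed difference, centered on the predicted value (0 for “no goddam difference,” 0.8 for “believers more reflective”) with a standard deviation equal to the reported standard error. Those modeling choices are illustrative and need not match the densities behind the rough figure quoted above, so the number this prints will not equal 2000; what carries over is the direction and the orders-of-magnitude scale.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def likelihood_ratio(observed, se, predicted_a, predicted_b):
    """How many times more probable `observed` is under hypothesis A than B.

    Both hypotheses are modeled (an assumption of this sketch) as normal
    distributions with the same standard error, centered on their predictions.
    """
    return normal_pdf(observed, predicted_a, se) / normal_pdf(observed, predicted_b, se)

# Observed difference 0.22, SE 0.08; predictions: 0.0 (null) vs. 0.8.
lr = likelihood_ratio(observed=0.22, se=0.08, predicted_a=0.0, predicted_b=0.8)
print(f"LR (no difference : believers more reflective) = {lr:.3g}")
```

Changing `se` or the predicted values shows how sensitive the exact ratio is to the specification, which is one more reason not to fuss over its precise size.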
That’s exactly why it’s not a surprise that a predictive model based on CRT scores does no better than a model that just uses the population (or sample) frequency to predict whether any given student (regardless of his or her CRT scores) believes in evolution.
Constructing a Bayesian likelihood ratio here was so much fun that I’m sure you’ll agree we should do it one more time.
In this one, I’m going to re-analyze data from another study I recently did a post on: “Reflective liberals and intuitive conservatives: A look at the Cognitive Reflection Test and ideology,” Judgment and Decision Making, July 2015, pp. 314–331, by Deppe, Gonzalez, Neiman, Jackson Pahlke, the previously mentioned Kevin Smith & John Hibbing.
Here the authors reported data on the correlation between CRT scores and individuals identified with reference to their political preferences. They reported that CRT scores were negatively correlated (p < 0.05) with various conservative position “subscales” in various of their convenience samples, and with a “conservative preferences overall” scale in a stratified nationally representative sample. They held out these results as “offer[ing] clear and consistent support to the idea that liberals are more likely to be reflective compared to conservatives.”
As I pointed out in my earlier post, I thought the authors were mistaken in reporting that their data showed any meaningful correlation—much less a statistically significant one—with “conservative preferences overall” in their nationally representative sample; they got that result, I pointed out, only because they left 2/3 of the sample out of their calculation.
I did point out, too, that the reported correlations seemed way too small, in any case, to support the conclusion that “liberals” are “more reflective” than conservatives. It was Smith’s responses in correspondence that moved me to try to formulate in a more systematic way an answer to the question that a p-value, no matter how minuscule, begs: namely, just “how big” the difference between two groups’ “true” mean CRT scores has to be before one can declare one group to be “more reflective,” “analytical,” “open-minded,” etc. than the other.
Well, let’s use likelihood ratios to measure the strength of the evidence in the data in just the 1/3 of the nationally representative sample that the authors used in their paper.
Once more, I’ll assume that “conservatives” are about average in CRT—0.7.
So again, the "liberal more reflective" hypothesis predicts we should expect to find that liberals will have a mean score 0.8 points higher than the population mean, or 1.5 correct. That’s the minimum difference for group means on CRT necessary to register a difference for a group to be deemed more reflective than another whose scores are close to the population mean.
Again, the “no goddam difference” hypothesis predicts the "null": here no difference whatsoever in mean CRT scores of liberal & conservatives.
By my calculation, in the subsample of the data in question “conservatives” (individuals above the mean on the “conservative positions overall” scale) have a mean CRT of 0.55, SE = 0.08; “liberals” a mean score of 0.73, SE = 0.08.
The estimated difference (w/ rounding) in means is 0.19, SE = 0.09.
So here is the likelihood ratio assessment of the relative support of the evidence for the two hypotheses:
Again, the data are orders of magnitude more consistent with “makes no goddam difference.”
Once more, whether the difference is “5x10^3” or 4.6x10^3 or even 9.7x10^2 or 6.3x10^4 is not important.
What matters is that there’s clearly much much much more reason for treating this data as supporting an inference diametrically opposed to the one drawn by the authors.
Or at least there is if I’m right about how to specify the range of possible observations we should expect to see if the “makes no goddam difference” hypothesis is true and the range of possible observations we should expect to see if the “liberals are more reflective than conservatives” hypothesis is true.
Are those specifications correct?
Maybe not! They're just the best ones I can come up with for now!
If someone sees a problem & better still a more satisfying solution, it would be very profitable to discuss that!
What's not even worth discussing, though, is whether "rejecting the null at p<0.05" is the way to figure out if the data support the strong conclusions these papers purport to draw-- because in fact, that information does not support any particular inference on its own.
The point here isn’t to suggest any distinctive defects in these papers, both of which actually report interesting data.
Again, these are just illustrations of the manifest deficiency of NHT, and in particular the convention of treating “rejection of the null at p < 0.05”—by itself! – as license for declaring the observed data as supporting a hypothesis, much less as “proving” or even furnishing “strong,” “convincing” etc. evidence in favor of it.
And again in applying this critique to these particular papers, and in using Bayesian likelihood ratios to liberate the inferential significance locked up in the data, I’m not doing anything the least bit original!
On the contrary, I’m relying on arguments that were advanced over 50 years ago, and that have been strengthened and refined by myriad super smart people in the interim.
For sure, exposure of the “NHT fallacy” reflected admirable sophistication on the part of those who developed the critique.
But what I hope I’ve been showing in the last couple of posts is that the defects in NHT that these scholars identified are really really easy to understand. Once they’ve been pointed out, any smart middle schooler can readily grasp them!
So what the hell is going on?
I think the best explanation for the persistence of the NHT fallacy is that it is a malignant craft norm.
Treating “rejection of the null at p < 0.05” as license for asserting support of one’s hypothesis is “just the way the game works,” “the way it’s done.” Someone being initiated into the craft can plainly see that in the pages of the leading journals, and in the words and attitudes—the facial expressions, even—of the practitioners whose competence and status are vouched for by all of their NHT-based publications and by the words and attitudes (and even facial expressions) of other certified members of the field.
Most of those who enter the craft will therefore understandably suppress whatever critical sensibilities might otherwise have alerted them to the fallacious nature of this convention. Indeed, if they can’t do that, they are likely to find the path to establishing themselves barred by jagged obstacles.
The way to progress freely down the path is to produce and get credit and status for work that embodies the NHT fallacy. Once a new entrant gains acceptance that way, then he or she too acquires a stake in the vitality of the convention, one that not only reinforces his or her aversion to seriously interrogating studies that rest on the fallacy but that also motivates him or her to evince thereafter the sort of unquestioning, taken-for-granted assent that perpetuates the convention despite its indisputably fallacious character.
And in case you were wondering, this diagnosis of the malignancy of NHT as a craft norm in the social sciences is not the least bit original to me either! It was Rozeboom’s diagnosis over 50 yrs ago.
So I guess we can see it’s a slow-acting disease. But make no mistake, it’s killing its host.
Cohen, J. The Earth is Round (p < .05). Am Psychol 49, 997 - 1003 (1994).
Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. Psych Rev 70, 193 - 242 (1963).
Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).
Goodman, S.N. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of internal medicine 130, 1005-1013 (1999a).
Goodman, S.N. Towards Evidence-Based Medical Statistics. 1: The P Value Fallacy. Ann Int Med 130, 995 - 1004 (1999b).
Rozeboom, W.W. The fallacy of the null-hypothesis significance test. Psychological bulletin 57, 416 (1960).
Gigerenzer, G. Mindless statistics. Journal of Socio-Economics 33, 587-606 (2004).
Identity-protective cognition and accuracy
Identity-protective cognition is a form of motivated reasoning—an unconscious tendency to conform information processing to some goal collateral to accuracy (Kunda, 1990). In the case of identity-protective cognition, that goal is protection of one’s status within an affinity group whose members share defining cultural commitments.
Sometimes (for reasons more likely to originate in misadventure than conscious design) positions on a disputed societal risk become conspicuously identified with membership in competing groups of this sort. In those circumstances, individuals can be expected to attend to information in a manner that promotes beliefs that signal their commitment to the position associated with their group (Sherman & Cohen, 2006; Kahan, 2015b).
We can sharpen understanding of identity-protective reasoning by relating this style of information processing to a nuts-and-bolts Bayesian one. Bayes’s Theorem instructs individuals to revise the strength of their current beliefs (“priors”) by a factor that reflects how much more consistent the new evidence is with that belief being true than with it being false. Conceptually, that factor—the likelihood ratio—is the weight the new information is due. Many cognitive biases (e.g., base rate neglect, which involves ignoring the information in one’s “priors”) can be understood to reflect some recurring failure in people’s capacity to assess information in this way.
That’s not quite what’s going on, though, with identity-protective cognition. The signature of this dynamic isn’t so much the failure of people to “update” their priors based on new information but rather the role that protecting their identities plays in fixing the likelihood ratio they assign to new information. In effect, when they display identity-protective reasoning, individuals unconsciously adjust the weight they assign to evidence based on its congruency with their group’s position (Kahan, 2015a).
If, e.g., they encounter a highly credentialed scientist, they will deem him an “expert” worthy of deference on a particular issue—but only if he is depicted as endorsing the factual claims on which their group’s position rests (Fig. 1) (Kahan, Jenkins-Smith, & Braman, 2011). Likewise, when shown a video of a political protest, people will report observing violence warranting the demonstrators’ arrest if the demonstrators’ cause was one their group opposes (restricting abortion rights; permitting gays and lesbians to join the military)—but not otherwise (Kahan, Hoffman, Braman, Evans, & Rachlinski, 2012).
In fact, Bayes’s Theorem doesn’t say how to determine the likelihood ratio—only what to do with the resulting factor: multiply one’s prior odds by it. But in order for Bayesian information processing to promote accurate beliefs, the criteria used to determine the weight of new information must themselves be calibrated to truth-seeking. What those criteria are might be open to dispute in some instances. But clearly, whose position the evidence supports—ours or theirs?—is never one of them.
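The odds form of Bayes’s Theorem described above can be sketched in a few lines. The function names and the numbers (even prior odds, a likelihood ratio of 4) are hypothetical and chosen only for illustration.

```python
def update_odds(prior_odds, likelihood_ratio):
    """Bayes's Theorem in odds form: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favor of a proposition to a probability."""
    return odds / (1.0 + odds)

prior_odds = 1.0         # hypothetical: even odds that the proposition is true
likelihood_ratio = 4.0   # hypothetical: evidence 4x more consistent with "true"

posterior_odds = update_odds(prior_odds, likelihood_ratio)
print(posterior_odds, odds_to_probability(posterior_odds))
```

Note that the theorem takes `likelihood_ratio` as an input. Nothing in the update rule itself says where that number should come from, which is exactly the opening that identity-protective reasoning exploits.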
The most persuasive demonstrations of identity-protective cognition show that individuals opportunistically alter the weight they assign one and the same piece of evidence based on experimental manipulation of the congruence of it with their identities. This design is meant to rule out the possibility that disparate priors or pre-treatment exposure to evidence is what’s blocking convergence when opposing groups evaluate the same information (Druckman, 2012).
But if this is how people assess information outside the lab, then opposing groups will never converge, much less converge on the truth, no matter how much or how compelling the evidence they receive. Or at least they won’t so long as the conventional association of positions with loyalty to opposing identity-defining groups remains part of their “objective social reality.”
Frustration of truth-convergent Bayesian information processing is the thread that binds together the diverse collection of cognitive biases of the bounded-rationality paradigm. Identity-protective cognition, we’ve seen, frustrates truth-convergent Bayesian information processing. Thus, assimilation of identity-protective reasoning into the paradigm—as has occurred within both behavioral economics (e.g., Sunstein, 2006, 2007) and political science (e.g., Lodge & Taber, 2013)—seems perfectly understandable.
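A toy simulation can make the non-convergence point concrete. It is only a sketch under stated assumptions: every piece of evidence has the same truth-calibrated likelihood ratio (3 in favor of a hypothesis H), and an identity-protective agent gives full weight to evidence congruent with its group’s position while assigning a likelihood ratio of 1 (no weight at all) to incongruent evidence. That discounting rule is hypothetical, a stand-in for whatever people actually do.

```python
def posterior_odds(prior_odds, evidence_lrs, group_favors_h=None):
    """Odds on H after updating on each likelihood ratio in `evidence_lrs`.

    group_favors_h=None  -> truth-seeking agent: uses each calibrated LR as is.
    group_favors_h=True/False -> identity-protective agent (hypothetical rule):
        evidence that cuts against the group's position is assigned LR = 1.
    """
    odds = prior_odds
    for lr in evidence_lrs:
        if group_favors_h is not None:
            congruent = (lr > 1) == group_favors_h
            if not congruent:
                lr = 1.0  # incongruent evidence is given no weight
        odds *= lr
    return odds

evidence = [3.0] * 10  # ten pieces of evidence, each favoring H three to one

truth_seeker = posterior_odds(1.0, evidence)
pro_h_group = posterior_odds(1.0, evidence, group_favors_h=True)
anti_h_group = posterior_odds(1.0, evidence, group_favors_h=False)
print(truth_seeker, pro_h_group, anti_h_group)
```

Both groups start from the same prior and see the same evidence, yet the group whose identity is tied to rejecting H ends exactly where it began. Adding more evidence to the list widens the gap rather than closing it.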
Understandable, but wrong!
The bounded-rationality paradigm rests on a particular conception of dual-process reasoning. This account distinguishes between an affect-driven, “heuristic” form of information processing, and a conscious, “analytical” one. Both styles—typically referred to as System 1 and System 2, respectively—contribute to successful decisionmaking. But it is the limited capacity of human beings to summon System 2 to override errant System 1 intuitions that generates the grotesque assortment of mental miscues—the “availability effect,” “hindsight bias,” the “conjunction fallacy,” “denominator neglect,” “confirmation bias”—on display in decision science’s benighted picture of human reason (Kahneman & Frederick, 2005).
It stands to reason, then, that if identity-protective cognition is properly viewed as a member of the bounded-rationality menagerie of biases, it, too, should be most pronounced among people (the great mass of the population) disposed to rely on System 1 information processing. This assumption is commonplace in the work reflecting the bounded-rationality paradigm (e.g., Lilienfeld, Ammirati, & Landfield, 2009; Westen, Blagov, Harenski, Kilts, & Hamann, 2006).
But actual data are to the contrary. Observational studies consistently find that individuals who score highest on the Cognitive Reflection Test and other reliable measures of System 2 reasoning are not less polarized but more so on facts relating to divisive political issues (e.g., Kahan et al., 2012).
Experimental data support the inference that these individuals use their distinctive analytic proficiencies to form identity-congruent assessments of evidence. When assessing quantitative data that predictably trips up those who rely on System 1 processing, individuals disposed to use System 2 are much less likely to miss information that supports their groups’ position. When the evidence contravenes their group’s position, these same individuals are better able to explain it away (Kahan, Peters, Dawson, & Slovic, 2013).
Another study that fits this account addresses the tendency of partisans to form negative impressions of their opposing number (Fig. 2). In the study, subjects selectively credited or dismissed evidence of the validity of the CRT as an “open-mindedness” test depending on whether the subjects were told that individuals who held their political group’s position on climate change had scored higher or lower than those who held the opposing view. Already large among individuals of low to modest cognitive reflection, this effect was substantially more pronounced among those who scored the highest on the CRT (Kahan, 2013b).
The tragic conflict of expressive rationality
As indicated, identity-protective reasoning is routinely included in the roster of cognitive mechanisms that evince bounded rationality. But where an information-processing dynamic is consistently shown to be magnified, not constrained, by exactly the types of reasoning proficiencies that counteract the mental pratfalls associated with heuristic information processing, then one should presumably update one’s classification of that dynamic as a “cognitive bias.”
In fact, the antagonism between identity-protective cognition and perceptual accuracy is not a consequence of too little rationality but too much.
Nothing an ordinary member of the public does as consumer, as voter, or participant in public discourse will have any effect on the risk that climate change poses to her or anyone else. Same for gun control, fracking, and nuclear waste disposal: her actions just don’t matter enough to influence collective behavior or policymaking.
But given what positions on these issues signify about the sort of person she is, adopting a mistaken stance on one of these in her everyday interactions with other ordinary people could expose her to devastating consequences, both material and psychic. It is perfectly rational under these circumstances to process information in a manner that promotes formation of the beliefs on these issues that express her group allegiances, and to bring all her cognitive resources to bear in doing so.
Of course, when everyone uses their reason this way at once, collective welfare suffers. In that case, culturally diverse democratic citizens won’t converge, or converge as quickly, on the significance of valid evidence on how to manage societal risks. But that doesn’t change the social incentives that make it rational for any individual—and hence every individual—to engage information in this way.
Only some collective intervention—one that effectively dispels the conflict between the individual’s interest in forming identity-expressive risk perceptions and society’s interest in the formation of accurate ones—could (Kahan et al., 2012; Lessig, 1995).
Rationality ≠ accuracy (necessarily)
. . . . Obviously, it isn’t possible to assess the “rationality” of any pattern of information processing unless one gets what the agent processing the information is trying to accomplish. Because forming accurate “factual perceptions” is not the only thing people use information for, a paradigm that motivates empirical researchers to appraise cognition exclusively in relation to that objective will indeed end up painting a distorted picture of human thinking.
But worse, the picture will simply be wrong. The body of science this paradigm generates will fail, in particular, to supply us with the information a pluralistic democratic society needs to manage the forces that create the conflict between the stake citizens have in using their reason to know what’s known and using it to be who they are as members of diverse cultural groups (Kahan, 2015b).
Akerlof, G. A., & Kranton, R. E. (2000). Economics and Identity. Quarterly Journal of Economics, 115(3), 715-753.
Anderson, E. (1993). Value in ethics and economics. Cambridge, Mass.: Harvard University Press.
Druckman, J. N. (2012). The Politics of Motivation. Critical Review, 24(2), 199-216.
Kahan, D. M., Peters, E., Wittlin, M., Slovic, P., Ouellette, L. L., Braman, D., & Mandel, G. (2012). The polarizing impact of science literacy and numeracy on perceived climate change risks. Nature Climate Change, 2, 732-735.
Kahneman, D., & Frederick, S. (2005). A model of heuristic judgment. The Cambridge handbook of thinking and reasoning, 267-293.
Kunda, Z. (1990). The Case for Motivated Reasoning. Psychological Bulletin, 108, 480-498.
Lessig, L. (1995). The Regulation of Social Meaning. U. Chi. L. Rev., 62, 943-1045.
Lilienfeld, S. O., Ammirati, R., & Landfield, K. (2009). Giving Debiasing Away: Can Psychological Research on Correcting Cognitive Errors Promote Human Welfare? Perspectives on Psychological Science, 4(4), 390-398.
Lodge, M., & Taber, C. S. (2013). The rationalizing voter. Cambridge ; New York: Cambridge University Press.
Peirce, C. S. (1877). The Fixation of Belief. Popular Science Monthly, 12, 1-15.
Sherman, D. K., & Cohen, G. L. (2006). The Psychology of Self-defense: Self-Affirmation Theory. Advances in Experimental Social Psychology (Vol. 38, pp. 183-242). Academic Press.
Sunstein, C. R. (2006). Misfearing: A reply. Harvard Law Review, 119(4), 1110-1125.
Sunstein, C. R. (2007). On the Divergent American Reactions to Terrorism and Climate Change. Columbia Law Review, 107, 503-557.
Westen, D., Blagov, P. S., Harenski, K., Kilts, C., & Hamann, S. (2006). Neural Bases of Motivated Reasoning: An fMRI Study of Emotional Constraints on Partisan Political Judgment in the 2004 U.S. Presidential Election. Journal of Cognitive Neuroscience, 18(11), 1947-1958.
So, as I promised “yesterday,” here are some additional reflections on the deficiencies of “null hypothesis testing” (NHT).
Actually, my objection is to the convention of permitting researchers to treat “rejection of the null, p < 0.05” as evidence for crediting their study hypotheses.
In one fit-statistic variation or another, “p < 0.05” is the modal “reported result” in social science research.
But the idea that a p-value supports any inference from the data is an out-and-out fallacy of the rankest sort!
Because of measurement error, any value will have some finite probability of being observed whatever the “true” value of the quantity being measured happens to be. Nothing at all follows from learning that the probability of obtaining the precise value observed in a particular sample was less than 5%—or even less than 1% or less than 0.000000001%—on the assumption that the true value is zero or any other particular quantity.
What matters is how much more or less likely the observed result is in relation to one hypothesized true value than another. From that information, we can determine the inferential significance of the data: that is, we can determine whether the data support a particular hypothesis, and if so, how strongly. But if we don’t have that information at our disposal and a researcher doesn’t supply it, then anything the researcher says about his or her data is literally meaningless.
This is likely to seem obvious to most of the 14 billion readers of this blog. It is--thanks to a succession of super smart people who've helped to spell out this "NHT fallacy" critique (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999, 1999; Gigerenzer 2004).
As these critics note, though, the problem with NHT is that it supplies a mechanical testing protocol that elides these basic points. Researchers who follow the protocol can appear to be furnishing us with meaningful information even if they are not.
Or worse, they can declare that a result that is “significant at p < 0.05” supports all manner of conclusions that it just doesn’t support—because as improbable as it might have been that the reported result would be observed if the “true” value were zero, the probability of observing such a result if the researcher’s hypothesis were true is even smaller.
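To make that last point concrete, here is a minimal sketch with made-up numbers (they come from no actual study). It assumes the test statistic is normally distributed with unit variance when measured in standard-error units, that the null predicts an effect of 0, and that the researcher’s hypothesis predicts an effect 5 standard errors from zero. The observed effect is set at 2 standard errors.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def likelihood_ratio_null_vs_alt(z_obs, z_alt, z_null=0.0):
    """P(observation | null) / P(observation | alternative).

    Both hypotheses are modeled as unit-variance normals in SE units
    (an assumption of this sketch), centered on z_null and z_alt.
    """
    log_lr = 0.5 * ((z_obs - z_alt) ** 2 - (z_obs - z_null) ** 2)
    return math.exp(log_lr)

z_obs = 2.0  # hypothetical observed effect, in standard-error units
z_alt = 5.0  # hypothetical effect predicted by the researcher's hypothesis

p = two_sided_p(z_obs)
lr = likelihood_ratio_null_vs_alt(z_obs, z_alt)
print(f"p = {p:.4f}; the observation is {lr:.1f}x more probable under the null")
```

In this hypothetical the result clears the p < 0.05 bar and yet is more probable under the null than under the hypothesis it is supposed to support. The p-value alone cannot reveal that, because it never consults the second hypothesis.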
2. This straw man has legs
I know: you think I’m attacking a straw man.
I might be. But that straw man publishes a lot of studies. Let me show you an example.
In one recent paper--one reporting the collection of a trove of interesting data that definitely enrich scholarly discussion-- a researcher purported to test the “core hypothesis” that “analytic thinking promotes endorsement of evolution.”
That researcher, a very good scholar, reasoned that if this was so, “endorsement of evolution” ought to be correlated with “performance on an analytic thinking task.” The task he chose was the Cognitive Reflection Test (Frederick 2005), the leading measure of the capacity and motivation of individuals to use conscious, effortful “System 2” information processing rather than intuitive, affect-driven “System 1” processing.
After administering a survey to a sample of University of Kentucky undergraduates, the researcher reported finding the predicted correlation between the subjects' CRT scores and their responses to a survey item on beliefs in evolution (p < 0.01). He therefore concluded:
If you are nodding your head at this point, you really shouldn’t be. This is not nearly enough information to know whether the author’s data support any of the inferences he draws.
In fact, they demonstrably don’t.
Here is a model in which belief in science's understanding of evolution (i.e., one that doesn't posit "any supreme being guid[ing] ... [it] for the purpose of creating humans') is regressed on the CRT scores of the student-sample respondents:
The outcome variable is the probability that a student will believe in evolution.
If, as the author concludes, “analytic thinking consistently predicts endorsement of evolution,” then we should be able to use this model to, well, predict whether subjects in the sample believe in evolution, or at least to predict that with a higher degree of accuracy than we would be able to without knowing the subjects’ CRT scores.
But we can’t.
Yes, just as the author reported, there is a positive & significant correlation coefficient for CRT.
But look at the "Count" & "Adjusted Count" R^2s.
The first reports the proportion of subjects whose “belief in evolution” was correctly predicted (based on whether the predicted probability for them was > or < 0.50): 62%.
That's exactly the proportion of the sample that reports not to believe in evolution.
As a result, the "adjusted count R^2" is "0.00." This statistic reflects the proportion of correct predictions the model makes in excess of the proportion one would have made by just predicting the most frequent outcome in the sample for all the cases.
Imagine a reasonably intelligent person were offered a prize for correctly “predicting” any study respondent’s “beliefs” knowing only that a majority of the sample purported not to accept science's account of the natural history of human beings. Obviously, she’d “predict” that any given student “disbelieves” in evolution. This “everyone disbelieves” model would have a predictive accuracy rate of 62% were it applied to the entire sample.
Knowing each respondent's CRT score would not enable that person to predict “beliefs” in evolution with any greater accuracy than that! The students’ CRT scores, in other words, are useless, predictively speaking.
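The two statistics are easy to compute by hand. The sketch below uses the standard definitions (count R^2 is the share of correct predictions; adjusted count R^2 is the share of correct predictions beyond those obtained by always guessing the modal outcome) and a hypothetical 100-person sample constructed to mirror the 62/38 split described above. It is not the study’s data.

```python
from collections import Counter

def count_r2(actual, predicted):
    """Proportion of cases the model classifies correctly."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def adjusted_count_r2(actual, predicted):
    """Correct predictions in excess of always guessing the modal outcome,
    as a share of the cases the modal guess would get wrong."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    n_modal = Counter(actual).most_common(1)[0][1]
    return (correct - n_modal) / (len(actual) - n_modal)

# Hypothetical sample: 0 = disbelieves in evolution, 1 = believes.
actual = [0] * 62 + [1] * 38
# A model whose predicted probability never exceeds 0.50 predicts 0 for everyone.
predicted = [0] * 100

print(count_r2(actual, predicted))           # 0.62
print(adjusted_count_r2(actual, predicted))  # 0.0
```

A model has to classify at least some of the minority cases correctly, without giving back as many majority cases, before the adjusted figure rises above zero.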
Here's a classification table that helps us to see exactly what's happening:
The CRT predictor, despite being "positive" & "significant," is so weak that the regression model that included it just threw up its hands and defaulted to the "everyone disbelieves” strategy.
The reason the “significant” difference in the CRT scores of believers & nonbelievers in the sample doesn’t support the author's conclusion-- that “analytic thinking consistently predicts endorsement of evolution”--is that the size of the effect isn’t nearly as big as it would have to be to furnish actual evidence for his hypothesis (something that one can pretty well guess is the case by just looking at the raw data).
Indeed, as the analysis I’ve just done illustrates, the observed effect is actually more consistent with the prediction that “CRT makes no goddam difference” in what people say they believe about the natural history of human beings.
Why the hell (excuse my French) would we expect any other result? As I’ve pointed out 17,333,246 times, answers to this facile survey question do not reflect respondents' science comprehension; they express their cultural identity!
But that's not a very good reply. Empirical testing is all about looking for surprises, or at least holding oneself open to the possibility of being surprised by evidence that cuts against what one understands to be the truth.
That didn't happen, however, in this particular case.
Actually, I should point out the author constructs two separate models: one relating CRT to the probability that someone will believe in “young earth creationism” as opposed to “evolution according to a divine plan”—something akin to “intelligent design”; and another relating CRT to the probability that someone will believe in “young earth creationism” as opposed to “evolution without any divine agency”—science’s position. It seems odd to me to do that, given that the author's theory was that “analytic thinking tends to reduce belief in supernatural agents.”
So my model just looks to see whether CRT scores predict that someone believes in science’s view of evolution—man evolves without any guidance from or plan by God—vs. belief in any alternative account. That’s why there is a tiny discrepancy between my logit model’s "odds ratio" coefficient for CRT (OR = 1.23, p < 0.01) and the author’s (OR = 1.28, p < 0.01).
But it doesn’t matter. The CRT scores are just as useless for predicting simply whether someone believes in “young earth” creationism versus either “intelligent design” or the modern synthesis. Thirty-three percent of the author’s Univ. Ky undergrad sample reported believing in “young earth creationism.” A model that regresses that “belief” on CRT classifies everyone in the sample as rejecting that position, and thus gets a predictive accuracy rate of 67%.
3. What’s the question?
So there you go: a snapshot of the pernicious vitality of the NHT fallacy in action. A researcher who has in fact collected some very interesting data announces empirical support for a bunch of conclusions that aren’t supported by them. What licenses him to do so is a “statistically significant” difference from the null value (zero difference in mean CRT scores), a difference that turns out to be way too small to support his hypothesis.
The relevant question, inferentially speaking, is,
How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection do “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they don't?
That’s a super interesting problem, the sort one actually has to use reflection to solve. It's one I hadn't thought hard enough about until engaging the author's interesting study results. I wish the author, a genuinely smart guy, had thought about it in analyzing his data.
I’ll give this problem a shot myself “tomorrow.”
For now, my point is simply that the convention of treating "p < 0.05" as evidence in support of a study hypothesis is what prevents researchers from figuring out what question they should actually be posing to their data.
Cohen, J. The Earth is Round (p < .05). Am Psychol 49, 997 - 1003 (1994).
Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. Psych Rev 70, 193 - 242 (1963).
Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).
Goodman, S.N. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of internal medicine 130, 1005-1013 (1999).
Goodman, S.N. Towards Evidence-Based Medical Statistics. 1: The P Value Fallacy. Ann Int Med 130, 995 - 1004 (1999).
Rozeboom, W.W. The fallacy of the null-hypothesis significance test. Psychological bulletin 57, 416 (1960).
Big showdown between System 1 & System 2 on Thursday. Come on down & root for your favorite team!
My strategy is to talk super fast so that Shane Frederick doesn't have enough time to reflect on the nature of my arguments & spot the holes in them!
For cool David Rand discussion of "intuition," "reflection," & cooperation & related forms of beneficence, see this cool piece from NY Times:
As the 14 billion readers of this blog know, I’m interested in the relationship between cognition and political outlooks. Is there a connection between critical reasoning dispositions and left-right ideology? Does higher cognitive proficiency of one sort or another counteract the tendency of people to construe empirical data in a politically biased way?
The answer to both these questions, the data I’ve collected persuades me, is, No.
But as I explained just the other day, if one gets how empirical proof works, then one understands that any conclusion one comes to is always provisional. What one “believes” about some matter that admits of empirical inquiry is just the position one judges to be most supported by the best available evidence now at hand.
So I was excited to see the paper “Reflective liberals and intuitive conservatives: A look at the Cognitive Reflection Test and ideology,” Judgment and Decision Making, July 2015, pp. 314–331, by Deppe, Gonzalez, Neiman, Jacobs, Pahlke, Smith & Hibbing.
Deppe et al. report the results from a number of studies on critical reasoning and political ideology. The one that got my attention was one in which Deppe et al. reported that they had found “moderately sized negative correlations between CRT scores and conservative issue preferences” in a “nationally representative” sample (pp. 316, 320).
As explained 9,233 times on this blog, the CRT is the standard assessment instrument used to measure the disposition of individuals to engage in effortful, conscious “System 2” information processing as opposed to the intuitive, heuristic “System 1” sort associated with myriad cognitive biases (Frederick 2005).
It was really really important, Deppe et al. recognized, to use a stratified general population sample recruited by valid means to test the relationship between political outlooks and CRT.
Various other studies, they noted, had relied on samples that don’t support valid inferences about the relationship between cognitive style and political outlooks. These included M Turk workers, whose scores on the CRT are unrealistically high (likely b/c they’ve been repeatedly exposed to it); who underrepresent conservatives, and thus necessarily include atypical ones; and who often turn out to be non-Americans disguising their identities (Chandler, Mueller, & Paolacci 2014; Krupnikov & Levine 2014; Shapiro, Chandler, & Mueller 2013).
Other scholars, Deppe et al. noted, have constructed samples from “visitors to a web site” on cognition and moral values who were expressly solicited to participate in studies in exchange for finding out about the relationship between the two in themselves. As a reflective colleague pointed out, this not particularly reflective sampling method is akin to polling ESPN.com visitors to try to figure out what the frequency of “liking football” is among different groups in the general population.
The one study Deppe et al. could find that used a valid general population sample to examine the correlation between CRT scores and right-left political outlooks was one I had done (Kahan 2013). And mine, they noted, had found no meaningful correlation.
Deppe et al. attributed the likely difference in our results to the way in which they & I measured political orientations. I used a composite measure that combined responses to standard, multi-point conservative-liberal ideology and party self-identification measures. But “self-reported ideology,” they observed, “is well-known to be a highly imperfect indicator of individual issue preferences.”
So instead they measured such preferences, soliciting their subjects’ responses to a variety of specific policies, including gay marriage, torture of terrorist subjects, government health insurance, and government price controls (a goody but oldie; “liberal” Richard Nixon was the last US President to resort to this policy).
On the basis of these responses they formed separate “Economic,” “Moral,” and “Punishment” “conservative policy-preference” scales. The latter two, but not the former, had a negative correlation with CRT, as did a respectably reliable scale (α =0.69) that aggregated all of these positions.
Having collected data from a Knowledge Networks sample “to determine if the findings” they obtained with M Turk workers “held up in a more representative sample” (p. 319), they heralded this result as “offer[ing] clear and consistent support to the idea that liberals are more likely to be reflective compared to conservatives.”
That’s pretty interesting!
So I decided I should for sure take the study into account in my own perpetual weighing of the evidence on how critical reasoning relates to political outlooks and comparable indicators of cultural identity.
I downloaded their data from the JDM website with the intention of looking it over and then seeing if I could replicate their findings with nationally representative datasets of my own that had liberal and conservative policy positions and CRT scores.
Well, I was in fact able to replicate the results in the Deppe et al. data.
However, what I ended up replicating were results materially different from what Deppe et al. had actually reported. . . .
Deppe et al. had collected their CRT and political-position data as part of a “priming” experiment. The idea was to see if subjects’ political outlooks became more or less conservative when induced or “primed” to rely either on “reflection,” of the sort associated with System 2 reasoning, or on “intuition,” of the sort associated with System 1.
They thus assigned 2/3 of their subjects randomly to distinct “reflection” and “intuition” conditions. Both were given word-unscrambling puzzles that involved dropping one of five words and using the other four to form a sentence. The sentences that a person could construct in the “reflection” condition emphasized use of reflective reasoning (e.g., “analyze the numbers carefully”; “I think all day”), while those in the “intuition” condition emphasized the use of intuitive reasoning (e.g., “Go with your gut”; “she used her instinct”).
The remaining 1/3 of the sample got a “neutral prime”: a puzzle that consisted of dropping and unscrambling words to form statements having nothing to do with either reflection or intuition (e.g., “the sky is blue”; “he rode the train”).
Deppe et al.’s hypothesis was that “subjects receiving an intuitive prime w[ould] report more conservative attitudes” and those “receiving a reflective prime . . . more liberal attitudes,” relative to those receiving a “neutral prime.”
Well, the experiment didn’t exactly come out as planned. Statistical analyses, they reported (p. 320),
show[ed] no differences in the number of correct CRT answers provided by the subjects between any group, indicating that the priming protocol manipulation . . . failed to induce any higher or lower amounts of reflection. With no differences in thinking style, again unsurprisingly, there were no statistically significant differences between the groups on self-reported ideology or issue attitudes.
But I discovered that the results were actually way more interesting than that!
There may have been “no differences” in the CRT scores and “conservative issue preferences” of subjects assigned to different conditions, but it’s not true there were no differences in the correlation between these two variables in the various conditions: in both the “reflection” and “intuition” conditions, subjects scoring higher on the CRT adopted “significantly” more conservative policy stances than their counterparts in the “neutral priming” condition! By the same token, subjects scoring lower in CRT necessarily became more liberal in their policy stances in the "reflection" & "intuition" conditions.
Wow! That’s really weird!
If one took the experimental effect seriously, one would have to conclude that priming individuals for “reflection” makes those who are the most capable and motivated to use System 2 reasoning (the conscious, effortful, analytic type) become more conservative--and that priming these same persons for “intuition” makes them more conservative too!
Deppe et al. don’t report this result. Likely they concluded, quite reasonably, that this whacky, atheoretical outcome was just noise, and that the only thing that mattered was that the priming experiment just didn’t work (same for the ones they attempted on M Turk workers, and same for a whole bunch of “replications” of classic studies in this genre).
But here’s the rub.
The “moderately sized negative correlation between CRT scores and conservative issue preferences overall” that Deppe et al. report finding in their "nationally representative" sample (p. 319) was based only on subjects in the “neutral prime” condition.
As I just explained, relative to the “neutral priming” condition, there was a positive relationship "between CRT scores and conservative issue preferences overall" in both the “reflection” and “intuition priming” conditions.
If Deppe et al. had included the subjects from the latter two conditions in their analysis of the results of study 2, they wouldn’t have detected any meaningful correlation, positive or negative, “between CRT scores and conservative issue preferences overall” in their critical “more representative sample.”
It doesn’t take a ton of reflection to see why, under these circumstances, it is simply wrong to characterize the results in study 2 as furnishing “correlational evidence to support the hypothesis that higher CRT scores are associated with being liberal.”
For purposes of assessing how CRT and conservatism relate to one another, being assigned to the "neutral priming" condition was no more or less a "treatment" than being assigned to the "intuition" and "reflection" conditions. The subjects in the "neutral prime" condition did a word puzzle, just as the subjects in the other treatments did. Insofar as the experimental assignment didn't generate "differences in the number of correct CRT answers" or in "issue attitudes" between the conditions (p. 320), then either no one was treated for practical purposes or everyone was but in the same way: by being assigned to do a word puzzle that had no effect on ideology or CRT scores.
Of course, the correlations between conservative policy positions and CRT did differ between conditions. As I pointed out, Deppe et al. understandably chose not to report that their “priming” experiment had "caused" individuals high in System 2 reasoning capacity to become more conservative (and those low in System 2 reasoning correspondingly more liberal) both when “primed” for “reflection” and when “primed” for intuition. The more sensible interpretation of their weird data was that the priming manipulation had no meaningful effect on either conservativism or CRT scores.
But if one takes that very reasonable view, then it is unreasonable to treat the CRT-conservatism relationship in the “neutral priming” condition as if it alone were the “untreated” or “true” one.
If the effects of experimental assignments are viewed simply as noise—as I agree they should be!—then the correct way to assess the relationship between CRT & conservatism in study 2 is to consider the responses of subjects from all three conditions.
An alternative that would be weird but at least fully transparent would be to say that “in 2 out of 3 ‘subsamples,’ ” the “more representative sample” failed to “replicate” the negative conservative-CRT correlation observed in their M Turk samples.
But the one thing that surely isn’t justifiable is to divide the sample into 3 & then report the data from the one subsample that happens to support the authors' hypothesis -- that conservatism & CRT are negatively correlated -- while simply ignoring the contrary results in the other two.
I’m 100% sure this wasn’t Deppe et al.’s intent, but by only partially reporting the data from their "nationally representative sample" Deppe et al. have unquestionably created a misimpression. There's just no chance any reader would ever have guessed that the data looked like this given their description of the results—and no way a reader apprised of the real results would ever agree that their "more representative sample" had "replicated" their M Turk sample finding of a “negative correlation between CRT scores and conservative issue preferences overall” (p. 320).
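The pooling argument above can be illustrated with a small sketch: a correlation that is negative in one condition and positive in the other two can vanish once all subjects are analyzed together. The numbers below are toy values invented for the illustration and are not the Deppe et al. data; the helper `pearson_r` and the variable names are likewise hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data invented for illustration; NOT the Deppe et al. data.
# Each condition has four subjects with CRT scores 0-3.
crt = [0, 1, 2, 3]
conservatism = {
    "neutral":    [3.0, 2.0, 1.0, 0.0],      # negative CRT slope
    "reflection": [0.75, 1.25, 1.75, 2.25],  # positive CRT slope
    "intuition":  [0.75, 1.25, 1.75, 2.25],  # positive CRT slope
}

# Correlation computed separately within each condition.
by_condition = {name: pearson_r(crt, ys) for name, ys in conservatism.items()}

# Correlation computed on all subjects from all three conditions.
pooled_crt = crt * len(conservatism)
pooled_cons = [y for ys in conservatism.values() for y in ys]
pooled_r = pearson_r(pooled_crt, pooled_cons)

for name, r in by_condition.items():
    print(f"{name:>10}: r = {r:+.2f}")
print(f"{'pooled':>10}: r = {pooled_r:+.2f}")
```

Reporting only the "neutral" row of output like this would convey a negative relationship that the full sample does not show.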
5. Replicating Deppe et al.
As I said, I was intrigued by Deppe et al.’s claim that they had found a negative correlation between conservative policy positions and CRT scores and wanted to see if I could replicate their finding in my own data set.
It turns out their study didn’t find the negative correlation they reported, though, when one includes responses of the 2/3 of the subjects unjustifiably omitted from their analysis of the relationship between CRT scores and conservative policy positions.
Well, I didn’t find any such correlation either when I performed a comparable data analysis on a large (N = 1600) nationally representative CCP (YouGov) study sample from 2012—one in which subjects hadn’t been assigned to do any sort of word-unscrambling puzzle before taking the CRT.
In my sample, subjects responded to this “issues positions” battery:
The responses formed two distinct factors, one suggesting a disposition to support or oppose legalization of prostitution and legalization of marijuana, and the other a disposition to support or oppose liberal policy positions on the remaining issues except for resumption of the draft, which loaded on neither factor.
Reversing the signs of the factor scores, I suppose one could characterize these as “social” and “economic_plus” conservativism respectively.
Both had very very small but “significant” correlations with CRT.
Not surprisingly, then, these two canceled each other out (r = -0.01, p = 0.80) when one examined “conservative policy positions overall”—i.e., all the policy positions aggregated into a single scale (α = 0.80).
That is exactly what I found, too, when I included the 2/3 of the subjects that Deppe et al. excluded from their report of the correlation between CRT and conservative policy positions in Study 2. That is, if one takes their conservative subdomain scales as Deppe et al. formed them, there is a small negative correlation between CRT and “Punishment” conservativism ( r = -0.13, p < 0.01) but a small positive one (r = 0.17, p < 0.01) between CRT and “Economic conservativism.”
There is another, even smaller negative correlation between CRT and the “Moral” conservative policy position scale (r = - 0.08, p = 0.08).
That—and not any deficiency in conventional left-right ideology measures (ones routinely used by the “neo-authoritarian personality” scholars (Jost et al 2003) that Deppe et al. cite their own study as supporting)— also explains why there is zero correlation between CRT and liberal-conservative ideology and partisan self-identification.
In any event, when one simply looks at all the data in a fair-minded way, one is left with nothing—and hence nothing that supplies anyone with any reason to revise his or her views on the relationship between political outlooks and critical reasoning capacities.
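One way to put the correlations reported above in perspective is to square them, since r² is the standard measure of the proportion of variance two variables share. The sketch below does only that arithmetic on the coefficients quoted in this section; the function name `shared_variance` and the dictionary labels are my own.

```python
# Correlations between CRT and conservative policy scales, as reported above.
reported_r = {
    "Punishment (Deppe et al. scale, full sample)": -0.13,
    "Economic (Deppe et al. scale, full sample)": 0.17,
    "Moral (Deppe et al. scale, full sample)": -0.08,
    "All positions aggregated (CCP sample)": -0.01,
}

def shared_variance(r):
    """Proportion of variance two variables share, given their correlation r."""
    return r ** 2

for label, r in reported_r.items():
    print(f"{label}: r = {r:+.2f}, r^2 = {shared_variance(r):.4f}")
```

Even the largest of these coefficients, r = 0.17, corresponds to an r² of just under 0.03.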
6. Yucky NHT--again
One last point, again on the vices of “null hypothesis testing.”
Because they were so focused on their priming experiment non-result, I’m sure it just didn’t occur to Deppe et al. that it made no sense for them to exclude 2/3 of their sample when computing the relationship between conservativism and CRT scores in Study 2.
But here’s something I think they really should have thought a bit more about. . . . Even if the results in their study were exactly as they reported, the correlations were so trivially small that they could not, in my view, reasonably support a conclusion so strong (not to mention so clearly demeaning for 50% of the U.S. population!) as
We find a consistent pattern showing that those more likely to engage in reflection are more likely to have liberal political attitudes while those less likely to do so are more likely to have conservative attitudes....
...The results of the studies reported above offer clear and consistent support to the idea that liberals are more likely to be reflective compared to conservatives....
I’ll say more about that “tomorrow,” when I return to a theme briefly touched on a couple days ago on the common NHT fallacy that statistical “significance” conveys information on the weight of the evidence in relation to a study hypothesis.
Chandler, J., Mueller, P. & Paolacci, G. Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior research methods 46, 112-130 (2014).
Deppe, K.D., Gonzalez, F.J., Neiman, J.L., Jacobs, C., Pahlke, J., Smith, K.B. & Hibbing, J.R. Reflective liberals and intuitive conservatives: A look at the Cognitive Reflection Test and ideology. Judgment and Decision Making 10, 314-331 (2015).
Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).
Jost, J.T., Glaser, J., Kruglanski, A.W. & Sulloway, F.J. Political Conservatism as Motivated Social Cognition. Psych. Bull. 129, 339-375 (2003).
Krupnikov, Y. & Levine, A.S. Cross-Sample Comparisons and External Validity. Journal of Experimental Political Science 1, 59-80 (2014).
So Will Gervais has a very artful response to my post on his evolution-CRT paper.
The gist of it is that I mischaracterized his views -- that I was addressing some other "Will Gervais," who subscribes to positions wholly unrelated to his.
But I have to say that I find Will's eagerness to distance himself from the position I attributed to him perplexing.
Gervais (I think it was him!) wrote in Cognition,
Many supernatural beliefs come easily to people, perhaps because they are supported by a variety of core intuitive processes. As with creationism, reliably developing intuitions support the mental representation of supernatural agents, such as God. However, dual process approaches to cognition suggest that at times people are able to analytically inhibit or override their intuitions.
[P]eople who are more willing or able to engage analytic thinking might be more likely to endorse evolution than people who tend to trust their intuitions. If true, then measures of analytic thinking should predict greater endorsement of evolution. In the present paper, two large studies tested this core hypothesis.
He concludes that his data support this conjecture:
Two studies revealed that—consistent with dual process approaches to cognition in general, and supernatural cognition in particular—an analytic cognitive style predicts increased endorsement of evolution. Reliably developing intuitions may give creationist views an early cognitive advantage. This early advantage also is likely bolstered by early enculturation advantages for creationist, rather than evolutionary, concepts in many cultural contexts. However, individuals who are better able to analytically control their thoughts are more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species.
Re-analyzing his data, and primarily just showing what the actual raw data look like, I argued that the results of his study didn't support his hypothesis. That they didn't come anywhere close to supporting it.
The impact of the disposition to rely on "analytic" as opposed to "intuitive" thinking (measured by the CRT) was "statistically significant" but practically irrelevant. Even the most "analytic" thinkers in Gervais's sample did not endorse a conception of evolution free of divine agency--i.e., did not accept science's own conception of evolution as reflected in the modern synthesis.
The "Will Gervais" who wrote the very interesting Cognition paper states "analytic thinking consistently predicts endorsement of evolution."
But it doesn't. The (very modest incremental) effect of CRT on increased endorsement of evolution was confined to relatively non-religious subjects. Among relatively religious individuals, those who displayed the highest degree of cognitive reflection weren't any more likely to endorse science's account of the natural history of human beings than ones who scored the lowest.
That's not what we'd expect to see if in fact disbelief in evolution reflected a deficit in the capacity and motivation to engage in System 2 reasoning.
This result is consistent, however, with an alternative hypothesis. At least modestly supported by existing research, this rival position denies that cognitive reflection is something antagonistic to formation of and persistence in culturally identity-defining beliefs that are opposed to scientific evidence.
On the contrary, according to this theory, individuals will use all of the cognitive resources at their disposal to form and persist in beliefs that express their cultural identities on facts that come to symbolize their group allegiances. We should thus expect those most proficient in conscious, effortful, "System 2" analytic reasoning to be even more divided on issues like climate change & evolution than those inclined to rely on "intuitive" System 1 reasoning.
Gervais's data lends more support to that hypothesis than to what he describes as his own "core hypothesis": that "measures of analytic thinking should predict greater endorsement of evolution."
I'm pretty sure that's all I said in my post, so I'm confused about why Gervais thinks I was mischaracterizing him (maybe he was blogging about another "Dan Kahan"?!).
Gervais complains that the media mischaracterized his study, too. So I took a look at the very impressive volume of press coverage the Cognition study generated.
For sure the media can get things horribly wrong, particularly when a researcher is reporting on how cognitive biases can influence perceptions of disputed issues in science.
But here, I think the media got it right. Or at least they accurately reported the finding that the "Will Gervais" who authored the article in Cognition unambiguously purported to make: "individuals who are more prone and/or able to engage in analytic thinking to override their intuitions were more likely to endorse evolution."
So I'm really curious now to know who that "Will Gervais" is. I'd also like to know what the Will Gervais who responded to me in his blog post thinks about that other Will Gervais' Cognition study; I gather he (the blog-post author Gervais) is largely in agreement with me that the Cognition study drew conclusions not supported by the data that Gervais (not sure at this point which one) uploaded to the Cognition site.
Finally and most important of all, I'd really really like to know what the Gervais who wrote the Cognition article has to say in response to the substance of points I made.
The questions the study addressed are really interesting & important. They are also hard; he might point out that there's something I missed--or some additional insight to be gained from the data on the relative strengths of his hypothesis and mine--in which case, I'd like to know that!
I hope that Will Gervais joins the discussion, too.
(Note: I'm closing off comments here; readers should post their responses in the comment thread for my original post-- a more sensible place, I think, for discussion. By all means respond if you have thoughts!)
I had some correspondence off-line with loyal listener @Steve (aka @sjgenco) about the classic "what does a valid measure of climate-change risk-perceptions look like graph?" Inspired by loyal listener @FrankL (now that they've finally discovered "missing Malaysia Airlines Flight MH370"--or at least a piece of it--maybe someone will find @FrankL, or at least a piece of him, too), the WDVMCCRLLG graphic has of course achieved iconic status and is pretty much ubiquitous in popular culture.
But it is pretty darn old. Isn't it time for something new? Can't we do better?
But everything, no matter how wonderful, admits of incremental improvement as human knowledge continues to expand as a result of science and improved sports drink formulas.
In response to @Steve's inquiry, I revealed the secret formula for generating the graphic. When Steve said he wasn't enamored of "jitters" as a way to handle overplotting & preferred "bubbles" scaled to reflect observation densities, I directed @Steve to a CCP dataset he could use (one posted with "codebook" the last time the CCP blog was the site for a furious display of graphic genius on the part of @thompn4) to perfect his own improvements.
Here's what he wrote back:
Hi Dan,
I've been playing around with jitters in R. I like your Gervais jitters. Keeping the clouds more separate helps. That's harder to do when your x-var is continuous, like your libcon variable in your "challenge" dataset.
Your dataset was like catnip so I've squandered a couple of days trying to brush up on my R to see if I could implement my bubble plot idea with your data. For what it's worth, I seem to have succeeded so I thought I'd forward my results. (I use RStudio, btw, I highly recommend it.)
First, I was able to replicate your colored jitter charts in R (seems to require less code than in stata). Here's gwrisk by libcon (making the points 50% transparent also helps highlight the clustering imho):
When I figured out how to put bubbles representing the frequency of responses around each datapoint on the same plot, it looked like this:
It does show the densities nicely, I think. For comparison, here's the bubble plot for scicomp by gwrisk:
You can really see that scicomp clusters in the middle vs. libcon, and how those densities are going to generate a flat regression.
You can also combine the two plots, which is kind of interesting:
Note how the jittering on libcon stretches out the values along the x-axis. There actually aren't any "real" values above 2 or below -2.
I've attached a PPT with all my results, a commented R script for running the plots, and the Rdata image I created for inputting the data.
It was a good excuse for digging into R again.
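For readers who haven't used jitter, here is a minimal sketch of the idea in Python (not the R code @Steve used, which isn't reproduced in this post). It adds small uniform noise to each value so that identical points don't sit on top of one another. The function `jitter`, its `width` parameter, and the scores are all hypothetical and are not drawn from the CCP dataset.

```python
import random

def jitter(values, width, seed=0):
    """Add uniform noise in [-width, +width] to each value to reduce overplotting."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical scores on a scale bounded at -2 and 2 (not the CCP dataset).
scores = [-2.0, -2.0, -1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 2.0]
jittered = jitter(scores, width=0.25)

print("real range:    ", min(scores), max(scores))
print("jittered range:", round(min(jittered), 3), round(max(jittered), 3))
```

Because the noise is symmetric, points at the ends of the scale can be drawn outside the real range, which is the stretching effect noted in the email.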
So what do people think? Time to retire WDVMCCRLLG? Time to adopt one of @Steve's alternatives as the new symbol of the Un-United States of Risk Perception?
Voice your opinion --as with everything else relating to this blog, matters will be decided by a democratic vote of the site's 14 billion regular readers -- and by all means try your own hand at devising a graphic that conveys the information in WDVMCCRLLG in an even more compelling, cool way!
And if you want, you can go back to @thompn4's project to create the perfect 3D graphic presentation that incorporates in addition the impact of science comprehension in magnifying polarization over climate change risk.
I'd offer one of our standard CCP prizes, but obviously the fame of being the originator of the successor of WDVMCCRLLG is incentive enough!
@Steve has formulated some comments & additional cool graphics in response to the conversation. Here they are:
Using the square-root of the circle sizes, as @Paul Matthews suggests, does make the range of sizes less extreme. I think this is a good adjustment.
I agree that the rainbow color set is a bit "muppets", but I was trying to work in the established idiom. :)
There is a nice color palette package for R called RColorBrewer that provides a bunch of palettes to choose from, including sequential (light to dark), diverging (light in the middle, contrasting darks at the extremes), and qualitative (no sequencing implied, just sets of related colors). The graphs below use the qualitative palette "Set3".
On Paul's point about the arbitrary binning of the continuous libcon variable, that's definitely a trade-off. I think it is ameliorated if the raw data is displayed underneath the circles. I could also imagine the circles being extended into a confidence-range kind of overlay, creating a more continuous representation of the densities. As a general point, I think the arbitrary binning of the circles is less distorting of the underlying data than the jittering effect of extending the apparent range of the data points.
The basic idea behind the bubble graph is to emphasize the pattern of densities across the tables. Visually, it does this very well for my eye (even better with the sqrt transform), better than the jitter. It also has the virtue of continuing to tell its story when the image gets very small, and that is useful for eye-ball comparisons, such as seeing immediately how and where gwrisk is highly skewed across the libcon divide, while nukerisk is not:
Also, other things "jump out" in this depiction of the data. For example, you can easily see that people are much more willing to give gwrisk a "zero" than they are nukerisk. And you can also spot things like that little cluster at gwrisk=3 among the lower science comprehension folks. Perhaps "3" is a good compromise when you don't really have a good basis for an opinion?
One final note. I tried the sequential palettes, thinking they would make a good fit with the ordinal nature of the risk variables, but I found that the light colors toward the bottom tended to obscure the densities down there, compared to darker colors toward the top. This is especially true when the image is small:
Although I do like the "armageddon-like" quality of the "Reds" palette!
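The data preparation behind the bubble plots discussed above (binning the continuous variable, counting responses per cell, and taking square roots of the counts) can be sketched roughly as follows. This is a Python illustration under my own assumptions, not @Steve's R script: the function `bubble_sizes`, the `bin_width` parameter, and the observations are hypothetical. How a "size" maps to a circle's radius or area depends on the plotting library.

```python
from collections import Counter
from math import sqrt

def bubble_sizes(xs, ys, bin_width):
    """Count observations per (x bin, y value) cell and return sqrt-scaled sizes.

    x is continuous, so it is rounded to the nearest multiple of `bin_width`;
    y is assumed to be discrete already (e.g. a risk rating).
    """
    cells = Counter((round(x / bin_width) * bin_width, y) for x, y in zip(xs, ys))
    return {cell: sqrt(count) for cell, count in cells.items()}

# Hypothetical observations (not the CCP dataset).
libcon = [-1.1, -0.9, -1.0, -1.05, 0.1, 0.9, 1.0, 1.1, 1.2]
gwrisk = [7, 7, 7, 7, 4, 1, 1, 1, 6]

sizes = bubble_sizes(libcon, gwrisk, bin_width=1.0)
for (x_bin, y), size in sorted(sizes.items()):
    print(f"libcon bin {x_bin:+.0f}, gwrisk {y}: bubble size {size:.2f}")
```

The square root compresses the range: a cell with four observations gets a size only twice that of a cell with one.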