
"Inherent internal contradictions" don't cause bad institutions to collapse; they just suck ... "Rules of evidence are impossible," part 3 (another report for Law & Cognition seminar)

Nope. Can't be done. Impossible. Time for part 3 of this series: Are Rules of Evidence Impossible?

The answer is yes, as I said at the very beginning.

But I didn’t say why & still haven’t.

Instead, I spent the first two parts laying the groundwork necessary for explanation.  Maybe you can build the argument on top of it yourself at this point?! If so, skip ahead to “. . . guess what?”—or even skip the rest of this post altogether & apply your reason to something likely to teach you something new!

But in the event you can’t guess the ending, or simply need your “memory refreshed” (see Fed. R. Evid. 612), a recap:

Where were we? In the first part, I described a conception of the practice of using “rules of evidence”—the Bayesian Cognitive Correction Model (BCCM). 

BCCM conceives of rules of evidence as instruments for “cognitively fine tuning” adjudication. By selectively admitting and excluding items of proof, courts can use the rules to neutralize the accuracy-diminishing impact of one or another form of biased information processing--from identity-protective reasoning to the availability effect, from hindsight bias to baserate neglect, etc.  The threat these dynamics pose to accurate factfinding is their tendency to induce the factfinder to systematically misestimate the weight, or in Bayesian terms the “likelihood ratio” (LR), to be assigned items of proof (Kahan 2015).

In part 2, I discussed a cognitive dynamic that has that sort of consequence: “coherence based reasoning” (CBR).

Under CBR (Simon 2004; Simon, Pham, Quang & Holyoak 2001; Carlson & Russo 2001), the factfinder’s motivation to find “coherence” in the trial proof creates a looping feedback effect.

Once the factfinder forms the perception that the accumulated weight of the evidence supports one side, he begins to inflate or discount the weight of successive items of proof as necessary to conform them to that position.  He also turns around and revisits already-considered items of proof and reweights them to make sure they fit that position, too. 

His reward is an exaggerated degree of confidence in the correctness of that outcome—and thus the peace of mind that comes from never ever having to worry that maybe, just maybe, he got the wrong answer.

The practical consequences are two.  First, by virtue of the exaggerated certainty the factfinder has in the result, he will sometimes rule in favor of a party that hasn’t carried its burden under a heightened standard of proof like, say, “beyond a reasonable doubt,” which reflects the law’s aversion to “Type 1” errors when citizens’ liberty is at stake.

Second, what position the factfinder comes to be convinced is right will be arbitrarily sensitive to the order of proof.  The same strong piece of evidence that a factfinder dismisses as inconsistent with what she is now committed to believing is true could have triggered a “likelihood ratio cascade” in exactly the opposite direction had that item of proof appeared “sooner”--in which case the confidence it instilled in its proponent's case would have infected the factfinder's evaluation of all the remaining items of proof.

If you hung around after class last time for the “extra credit”/“optional” discussion, I used a computer simulation to illustrate these chaotic effects, and to show why we should expect their accuracy-eviscerating consequences to be visited disproportionately on innocent defendants in criminal proceedings.

This is definitely the sort of insult to rational truth-seeking that BCCM was designed to rectify!

But guess what?

It can’t! The threat CBR poses to accuracy is one the BCCM conception of “rules of evidence” can’t possibly counteract!

As I explained in part 1, BCCM consists of three basic elements:

  1. Rule 401, understood as a presumption that evidence with LR ≠ 1 is admissible (Lempert 1977);

  2. a conception of “unfair prejudice” under Rule 403 that identifies it as the tendency of a piece of relevant evidence to induce a flesh-and-blood factfinder to assign incorrect LRs to it or other items of proof (Lempert 1977); and
  3. a strategy for Rule 403 weighing that directs the court to exclude “relevant” evidence when the tendency it has to induce the factfinder to assign the wrong LR to that or other pieces of evidence diminishes accurate assessment of the trial proof to a greater extent than constraining the factfinder to effectively treat the evidence in question as having no weight at all, or LR = 1 (Kahan 2010).
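To make element 3 concrete, here is a minimal sketch of the comparison BCCM imagines the judge performing. Everything in it is invented for illustration: the LRs, the predicted "biased" LRs, and the use of distance from the rational factfinder's posterior as a crude proxy for "accuracy-diminishing impact" are my assumptions, not anything BCCM itself specifies.

```python
from math import prod

def posterior(prior_odds, lrs):
    """Posterior probability from prior odds and a list of likelihood ratios."""
    odds = prior_odds * prod(lrs)
    return odds / (1 + odds)

# Hypothetical trial proof (all numbers invented for illustration).
prior_odds = 1.0
other_true = [3.0, 0.5, 2.0]     # true LRs of the rest of the proof
item_true = 4.0                  # true LR of the contested item
other_biased = [4.5, 0.9, 3.0]   # LRs the factfinder is predicted to assign the rest if the item is admitted

truth = posterior(prior_odds, other_true + [item_true])       # what a rational factfinder would conclude
excluded = posterior(prior_odds, other_true + [1.0])          # exclusion: the item is forced to LR = 1
admitted = posterior(prior_odds, other_biased + [item_true])  # admission: the item biases everything else

# Rule 403, BCCM-style: exclude only if exclusion distorts accuracy less than admission does.
print("exclude" if abs(excluded - truth) < abs(admitted - truth) else "admit")
```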

The problem is that CBR injects this “marginal probative value vs. marginal prejudice” apparatus with a form of self-contradiction, both logical and practical.

There isn’t normally any such contradiction. 

Imagine, e.g., that a court was worried that evidence of a product redesign intended to avoid a harmful malfunction might trigger “hindsight bias,” which consists in the tendency to inflate the LRs associated with items of proof that bear on how readily one might have been able to predict the need for and utility of such a design ex ante (Kamin & Rachlinski 1995).  (Such evidence is in theory—but not in practice— “categorically excluded” under Rule 407, when the correction was made after the injury to the plaintiff; but in any case, Rule 407 wouldn’t apply, only Rule 403 would, if the change in product design were made after injuries to third parties but before the plaintiff herself was injured by the original product—even though the same “hindsight bias” risk would be presented).

“All” the judge has to do in that case is compare the marginal accuracy-diminishing impact of [1] giving no weight at all to the evidence (LR = 1) on the "facts of consequence" it should otherwise have made "more probable" (e.g., the actual existence of alternative designs and their cost-effectiveness) and [2] the inflationary effect of admitting it on the LRs assigned to the evidence bearing on every other fact of consequence (e.g., what a reasonable manufacturer would have concluded about the level of risk and feasibility of alternative designs at the time the original product was designed).

A thoughtful person might wonder about the capacity of a judge to make that determination accurately, particularly because weighing the “marginal accuracy-diminishing impact” associated with admission and with exclusion, respectively, actually requires the judge to gauge the relative strength of all the remaining evidence in the case. See Old Chief v. U.S., 519 U.S. 172, 182-85 (1997).

But making such a determination is not, in theory at least, impossible.

What is impossible is doing this same kind of analysis when the source of the “prejudice” is CBR.  When a judge uses BCCM to manage the impact of hindsight bias (or any other type of dynamic inimical to rational information-processing), “marginal probative value” and “marginal prejudice”—the quantities she must balance—are independent.

But when the bias the judge is trying to contain is CBR, “marginal probative value” and “marginal prejudice” are interdependent—and indeed positively correlated.

What triggers the “likelihood ratio cascade” that is characteristic of CBR as a cognitive bias is the correct LR the factfinder assigned to whatever item of proof induced her to form the impression that one side’s position was stronger than the other’s. Indeed, the higher (or lower) the “true” LR of that item of proof, the more confident the factfinder will be in the position that evidence supports, and hence the more biased the factfinder will thereafter be in assessing the weight due other pieces of evidence (or equivalently, the more indifferent she'll become to the risk of erring in the direction of that position (Scurich 2012)).

To put it plainly, CBR creates a war between the two foundational “rules of evidence”: the more relevant evidence is under Rule 401 the more unfairly prejudicial it becomes for purposes of Rule 403.  To stave off the effects of CBR on accurate factfinding, the court would have to exclude from the case the evidence most integral to reaching an accurate determination of the facts.

Maybe an illustration would be useful?

This is one case plucked from the sort of simulation that I ran yesterday:

It shows how, as a result of CBR, a case that was in fact a “dead heat” can transmute into one in which the factfinder forms a supremely confident judgment that the facts supporting one side’s case are “true.”

The source of the problem, of course, is that the very “first” item of proof had LR = 25, initiating a “likelihood ratio cascade” as reflected in the discrepancy between the "true" LRs—tLRs—and "biased" perceived LRs—pLRs—for each subsequent item of proof.

A judge applying the BCCM conception of Rule 403 would thus recognize that "item of proof No. 1" is injecting a huge degree of “prejudice” into the case. She should thus exclude proof item No. 1, but only if she concludes that doing so will diminish the accuracy of the outcome less than preventing the factfinder from giving this highly probative piece of evidence any effect whatsoever.

When the judge engages in this balancing, she will in fact observe that the effect of excluding that evidence distorts the accuracy of the outcome just as much as admitting it does--but in the opposite direction. In this simulated case, assigning item No. 1 an LR = 1—the formal effect of excluding it—now induces the factfinder to conclude that the odds against that party’s position being true are 5.9x10^2:1, or that there is effectively a 0% chance that that party’s case is well-founded.

That’s because the very next item of proof has LR = 0.04 (the inverse of LR = 25), and thus triggers a form of “rolling confirmation bias” that undervalues every subsequent item of proof.

So if the judge were to exclude item No. 1 b/c of its tendency to excite CBR, she’d find that the same issue confronts her again in ruling on a motion to exclude item No. 2.

And guess what? If she assesses the impact of excluding that super probative piece of evidence (one that favored one party’s position 25x more than the other’s), she’ll again find that the “accuracy diminishing impact” of doing so is just as high as that of not excluding it: the remaining evidence in the case is configured so that the factfinder is impelled to a super-confident conclusion in favor of the first party once more!

And so forth and so on.
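Here is a hedged numerical sketch of that whack-a-mole dynamic. The eight LRs below are invented (they multiply out to 1:1, a true dead heat), and the biasing rule is just one assumed implementation of the "CBR factor" I describe in part 2 (the perceived LR of each successive item is shifted toward whichever side the interim odds favor, by a factor of 2 once those odds reach 10:1), so the numbers won't line up with the ones in the illustration above; the self-defeating pattern is the point.

```python
from math import log10

def cbr_odds(lrs, prior_odds=1.0):
    """Sequentially update the odds, letting the current odds bias the perceived LR of the next item.
    Assumed bias rule: perceived LR = true LR * 2**log10(odds)  (x2 at 10:1, /2 at 1:10, x1 at 1:1)."""
    odds = prior_odds
    for lr in lrs:
        odds *= lr * 2 ** log10(odds)
    return odds

# Invented "dead heat": every pro-prosecution item is offset by its mirror image (product of LRs = 1).
proof = [25, 0.04, 5, 0.2, 8, 0.125, 2, 0.5]

print(cbr_odds(proof))               # runaway odds in favor of the side item No. 1 supports
print(cbr_odds([1] + proof[1:]))     # "exclude" item No. 1 (treat it as LR = 1): runaway odds the other way
print(cbr_odds([1, 1] + proof[2:]))  # exclude items No. 1 and No. 2: the next strong item takes over
```

However far down the list the judge goes, there is always some remaining strong item whose perceived weight, or whose forced absence, dominates everything that follows.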

As this illustration should remind you, CBR also has the effect of making outcomes arbitrarily sensitive to the order of proof. 

Imagine item 1 and item 2 had been “encountered” in the opposite “order” (whether by virtue of the point at which they were introduced at trial, the relative salience of them to the factfinder as he or she reflected on the proof as a whole, or the role that post-trial deliberations had in determining the sequence with which particular items of proof were evaluated). 

The factfinder in that case would indeed have formed just as confident a judgment--but one in support of the opposite party:

Again, the judge will be confronted with the question whether the very “first” item of proof—what was item No. 2 in the last version of this illustration—should be excluded under Rule 403. When she works this out, moreover, she’ll end up discovering that the consequence of excluding it is the same as was the consequence of excluding item No. 1—LR = 25—in our alternative-universe version of the case: a mirror-image degree of confidence on the factfinder's part about the strength of the opposing party’s case.  And so on and so forth.

See what’s going on?

The only way for the judge to assure that this case gets decided “accurately” is to exclude every single piece of evidence from the trial, remitting the jury to its priors—1:1—which, by sheer accident, just happened to reflect the posterior odds a “rational factfinder” would have ended up with after fairly assigning each piece of evidence its “true” LR.

Not much point having a trial at all under those circumstances!

Of course, the evidence, when properly considered, might have more decisively supported one side or the other.  But what a more dynamic simulation--one that samples from all the various distributions of case strength one cares to imagine--shows us is that there’s still no guarantee the factfinder would have formed an accurate impression of the strength of the evidence in that circumstance either.

To assure an accurate result in such a case, the judge, under the BCCM conception of the rules of evidence, would still have been obliged to try to deflect the accuracy-vitiating impact of CBR away from the factfinder’s appraisal of the evidence by Rule 403 balancing.

And the pieces of evidence that the judge would be required in such a case to exclude would be the ones most entitled to be given a high degree of weight by a rational factfinder!  The impact of doing so would be to skew consideration of the remainder of the evidence without offsetting exclusions of similarly highly relevant pieces of proof. . . . 

Again, no point in even having  a trial if that’s how things are going to work. The judge should just enter judgment for the party she thinks “deserves” to win.

There is of course no reason to believe a judge could “cognitively fine-tune” a case with the precision that this illustration envisions.  But all that means is that the best a real judge can ever do will always generate an outcome that we have less reason to be confident is “right” than we would have had, had the judge just decided the stupid case herself on the basis of her own best judgment of the evidence.

Of course, why should we assume the judge herself could make an accurate assessment, or reasonably accurate one, of the trial proof?  Won’t she be influenced by CBR too—in a way that distorts her capacity to do the sort of “marginal probative value vs. marginal prejudice” weighing that the BCCM conception of Rule 403 imagines?

If you go down this route, then you again ought to conclude that “rules of evidence are impossible” even without contemplating the uniquely malicious propensities of CBR.  Because if this is how you see things (Schauer 2006), there will be just as much reason to think that the judge’s performance of such balancing will be affected by all the other forms of cognitive bias that she is trying to counteract by use of BCCM’s conception of Rule 403 balancing.

I think that anxiety is in fact extravagant—indeed silly.

There is plenty of evidence that judges, by virtue of professionalization, develop habits of mind that reasonably insulate them from one or another familiar form of cognitive bias when the judges are making in-domain decisions—i.e., engaging in the sort of reasoning they are supposed to as judges (Kahan, Hoffman, et al. in press; Guthrie, Rachlinski & Wistrich 2007).

That’s how professional judgment works generally!

But now that I’ve reminded you of this, maybe you can see what the “solution” is to the “impossibility” of the rules of evidence?

Even a jurist with exquisite professional judgment cannot conceivably perform the kind of “cognitive fine-tuning” envisioned by the “rules of evidence”--the whole enterprise is impossible.

But what makes such fine tuning necessary in the first place is the law’s use of  non-professional decisionmakers divorced from any of the kinds of insights and tools that professional legal truthseekers would actually use.

Jurors aren’t stupid.  They are equipped with all the forms of practical judgment that they need to be successful in their everyday lives.

What's stupid is to think that making reliable assessments of fact in the artificial environment of a courtroom adversarial proceeding is one of the things everyday life equips them to do.

Indeed, it's absurd to think that that environment is conducive to the accurate determination of facts by anyone.

A procedural mechanism that was suited for accurately determining the sorts of facts relevant to legal determinations would have to look different from anything we see in everyday life, b/c making those sorts of determinations isn't something that everyday life requires.

No more than having to practice medicine, repair foreign automobiles, or write publicly accessible accounts of relativity is (btw, happy birthday Die Feldgleichungen der Gravitation).

Ordinary, sensible people rely on professionals -- those who dedicate themselves to acquiring expert knowledge and corresponding forms of reasoning proficiency -- to perform specialized tasks like these.

The “rules of evidence” are impossible because the mechanism we rely on to determine the “truth” in legal proceedings—an adversary system with lay factfinders—is intrinsically flawed. 

No amount of fine-tuning by “rules of evidence” will  ever make that system capable of delivering the accurate determinations of their rights and obligations that citizens of an enlightened democratic state are entitled to.

We need to get rid of the current system of adjudication and replace it with a professionalized system that avails itself of everything we know about how the world works, including how human beings reason and how they can be trained to reason when doing  specialized tasks.

And we need to replace, too, the system of legal scholarship that generates the form of expertise that consists in being able to tell soothing, tranquilizing, narcotizing just-so stories about how well suited the “adversary system” would be for truth-seeking with just a little bit more "cognitive fine tuning" to be implemented through the rules of evidence.

That element of our legal culture is as antagonistic to the goal of truth-seeking as any of the myriad defects of the adversary system itself. . . .

The end!


Guthrie, C., Rachlinski, J.J. & Wistrich, A.J. Blinking on the bench: How judges decide cases. Cornell Law Rev 93, 1-43 (2007).

Kahan, D.M. The Economics—Conventional, Behavioral, and Political—of "Subsequent Remedial Measures" Evidence. Columbia Law Rev 110, 1616-1653 (2010).

Kahan, D.M., Hoffman, D.A., Evans, D., Devins, N., Lucci, E.A. & Cheng, K. 'Ideology' or 'Situation Sense'? An Experimental Investigation of Motivated Reasoning and Professional Judgment. U. Pa. L. Rev. 164 (in press).

Kahan, D.M. Laws of cognition and the cognition of law. Cognition 135, 56-60 (2015).

Kamin, K.A. & Rachlinski, J.J. Ex Post ≠ Ex Ante - Determining Liability in Hindsight. Law Human Behav 19, 89-104 (1995).

Lempert, R.O. Modeling Relevance. Mich. L. Rev. 75, 1021-57 (1977).

Pennington, N. & Hastie, R. A Cognitive Theory of Juror Decision Making: The Story Model. Cardozo L. Rev. 13, 519-557 (1991).

Schauer, F. On the Supposed Jury-Dependence of Evidence Law. U. Pa. L. Rev. 155, 165-202 (2006).

Scurich, N. The Dynamics of Reasonable Doubt. (Ph.D. dissertation, University of Southern California, 2012). 

Simon, D. A Third View of the Black Box: Cognitive Coherence in Legal Decision Making. Univ. Chi. L.Rev. 71, 511-586 (2004).

Simon, D., Pham, L.B., Le, Q.A. & Holyoak, K.J. The Emergence of Coherence over the Course of Decisionmaking. J. Experimental Psych. 27, 1250-1260 (2001).


Are rules of evidence "impossible"?, part 2 (another report from Law & Cognition seminar)

This is part 2 in a 3-part series, the basic upshot of which is that “rules of evidence” are impossible.

A recap. Last time I outlined a conception of “the rules of evidence” I called the “Bayesian Cognitive Correction Model” or BCCM.  BCCM envisions judges using the rules to “cognitively fine-tune” trial proofs in the interest of simulating/stimulating jury fact-finding more consistent with a proper Bayesian assessment of all the evidence in a case. 

Cognitive dynamics like hindsight bias and identity-protective cognition can be conceptualized as inducing the factfinder to over- or undervalue evidence relative to its “true” weight—or likelihood ratio (LR).  Under Rule 403, judges should thus exclude an admittedly “relevant” item of proof (Rule 401: LR ≠ 1) when the tendency of that item of proof to induce jurors to over- or undervalue other items of proof (i.e., to assign them LRs that diverge from their true values) impedes verdict accuracy more than would constraining the factfinder to assign the item of proof in question no weight at all (LR = 1).

“Coherence based reasoning”—CBR—is one of the kinds of cognitive biases a judge would have to use the BCCM strategy to contain.  This part in the series describes CBR and the distinctive threat it poses to rational factfinding in adjudication.

Today's episode. CBR can be viewed as an information-processing dynamic rooted in aversion to residual uncertainty.

A factfinder, we can imagine, might initiate his or her assessment of the evidence in a reasonably unbiased fashion, assigning modestly probative pieces of evidence more or less the likelihood ratios they are due.

But should she encounter a piece of evidence that is much more consistent with one party’s position, the resulting confidence in that party’s case (a state that ought to be only provisional, in a Bayesian sense) will dispose her to assign the next piece of evidence a likelihood ratio supportive of the same inference—viz., that that party’s position is “true.”  As a result, she’ll be all the more confident in the merit of that party’s case—and thus all the more motivated to adjust the weight assigned the next piece of evidence to fit her “provisional” assessment, and so forth and so on  (Carlson & Russo 2001). 

Once she has completed her evaluation of trial proof, moreover, she will be motivated to revisit earlier-considered pieces of evidence, readjusting the weight she assigned them so that they now fit with what has emerged as the more strongly supported position (Simon, Pham, Quang & Holyoak 2001; Holyoak & Simon 1999; Pennington & Hastie 1991). When she concludes, she will necessarily have formed an inflated assessment of the probability of the facts that support the party whose “strong” piece of evidence initiated this “likelihood ratio cascade.”

What does this matter?

Well, to start, in the law, the party who bears the “burden of proof” will often be entitled to win only if she establishes the facts essential to her position to a heightened degree of certainty like “beyond a reasonable doubt.”  One practical consequence of the overconfidence associated with CBR, then, will be to induce the factfinder to decide in favor of a party whose evidence, if evaluated in an unbiased fashion, would not have satisfied the relevant proof standard (Simon 2004).  Indeed, one really cool set of experiments (Scurich 2012) suggests that "coherence based reasoning" effects might actually reflect a dissonance-avoidance mechanism that manifests itself in factfinders reducing the standard of proof after exposure to highly probative items of proof! 

But even more disconcertingly, CBR makes the outcome sensitive to the order in which critical pieces of evidence are considered (Carlson, Meloy & Russo 2006). 

A piece of evidence that merits considerable weight might be assigned a likelihood ratio of 1 or < 1 if the factfinder considers it after having already assigned a low probability to the position it supports.  In that event, the evidence will do nothing to shake the factfinder’s confidence in the opposing position.

But had the factfinder considered that same piece of evidence “earlier”—before she had formed a confident estimation of the cumulative strength of the previously considered proof—she might well have given that piece of evidence the greater weight it was due. 

If that had happened, she would then have been motivated to assign subsequent pieces of proof likelihood ratios higher than they in fact merited. Likewise, to achieve a “coherent” view of the evidence as a whole, she would have been motivated to revisit and revise upward the weight assigned to earlier considered, equivocal items of proof.  The final result would thus have been a highly confident determination in exactly the opposite direction from the one she in fact reached.

This is not the way things should work if one is engaged in Bayesian information processing—or at least any normatively defensible understanding of Bayesian information processing geared to reaching an accurate result!

Indeed, this is the sort of spectacle that BCCM directs the judge to preempt by the judicious use of Rule 403 to exclude evidence the “prejudicial” effect of which “outweighs” its “probative value.”

But it turns out that using the rules of evidence to neutralize CBR in that way is IMPOSSIBLE!

Why? I’ll explain that in Part 3!

# # #

But right now I’d like to have some more, “extra-credit”/“optional” fun w/ CBR! It turns out it is possible & very enlightening to create a simulation to model the accuracy-annihilating effects I described above.

Actually, I’m just going to model a “tame” version of CBR—what Carlson & Russo call “biased predecisional processing.” Basically, it’s the “rolling confirmation bias” of CBR without the “looping back” that occurs when the factfinder decides for good measure to reassess the more-or-less unbiased LRs she awarded to items of proof before she became confident enough to start distorting all the proof to fit one position. 

Imagine that a factfinder begins with the view that the “truth” is equally likely to reside in either party’s case—i.e., prior odds of 1:1. The case consists of eight “pieces” of evidence, four pro-prosecutor (likelihood ratio > 1) and four pro-defendant (likelihood ratio <1). 

The factfinder makes an unbiased assessment of the “first” piece of evidence she considers, and forms a revised assessment of the odds that reflects its “true” likelihood ratio.  As a result of CBR, however, her assessment of the likelihood ratio of the next piece of evidence—and every piece thereafter—will be biased by her resulting perception that one side’s case is in fact “stronger” than the other’s.

To operationalize this, we need to specify a “CBR factor” of some sort that reflects the disposition of the factfinder to adjust the likelihood ratios of successive pieces of proof up or down to match her evolving (and self-reinforcing!) perception of the disparity in the strength of the parties’ cases.

Imagine the factfinder misestimates the likelihood ratio of all pieces of evidence by a continuous amount that results in her over-valuing or under-valuing an item of proof by a factor of 2 at the point she becomes convinced that the odds in favor of one party’s position rather than the other’s position being “true” have reached 10:1.

What justifies selecting this particular “CBR factor”? Well, I suppose nothing, really, besides that it supplies a fairly tractable starting point for thinking critically about the practical upshot of CBR. 

But also, it’s cool to use this function b/c it reflects a “weight of the evidence” metric developed by Turing and Good to help them break the Enigma code! 

For Turing and Good, a piece of evidence with a likelihood ratio of 10 was judged to have a weight of “1 ban.” They referred to a piece of proof with one-tenth that much weight (a likelihood ratio of about 1.26) as a “deciban”—and were motivated to use that as the fundamental unit of evidentiary currency in their code-breaking system based on their seat-of-the-pants conjecture that a “deciban” was the smallest shift in the relative likelihoods of two hypotheses that human beings could plausibly perceive (Good 1985).

So with this “CBR factor,” I am effectively imputing to the factfinder a disposition to “add to” (or subtract from) an item of proof one “deciban”—the smallest humanly discernable “evidentiary weight,” in Turing and Good’s opinion—for every 1-unit increase (1:1 to 2:1; 2:1 to 3:1, etc.) or decrease (1:1 to 1:2; 1:3 to 1:4, etc.) in the “odds” of that party’s position being true.
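In code, one way to implement such a factor looks like this. The exact functional form (2 raised to the log10 of the current odds) is an assumption of mine, pegged to the "factor of 2 once the odds reach 10:1" calibration stated above; other forms, including a strict deciban-per-unit-of-odds version, could be swapped in.

```python
from math import log10

def cbr_factor(current_odds):
    """Assumed CBR multiplier: 1 at odds of 1:1, 2 at 10:1, 4 at 100:1, 1/2 at 1:10, and so on."""
    return 2 ** log10(current_odds)

def perceived_lr(true_lr, current_odds):
    """Biased ('perceived') LR: the true LR shifted toward whichever side the interim odds favor."""
    return true_lr * cbr_factor(current_odds)

print(perceived_lr(3.0, 1))    # unbiased at even odds
print(perceived_lr(3.0, 10))   # doubled once the prosecution is ahead 10:1
print(perceived_lr(3.0, 0.1))  # halved once the defense is ahead 10:1
```

Feeding each successive item of proof through perceived_lr, and updating the running odds with the result, is all the "biased predecisional processing" sketched below does.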

And this figure illustrates the sort of distortion CBR can produce generally:

In the “unbiased” table, “prior” reflects the factfinder’s current estimate of the probability of the “prosecutor’s” position being true, and “post odds” the revised estimate based on the weight of the current “item” of proof, which is assigned the likelihood ratio indicated in the “LR” column.  The “post %” column transforms the revised estimate of the probability of “guilt” into a percentage.

I’ve selected an equal number of pro-prosecution (LR >1) and pro-defense (LR<1) items of proof, and arranged them so they are perfectly offsetting—resulting in a final estimate of guilt of 1:1 or 50%.

In the “coherence based reasoning” table, “tLR” is the “true likelihood ratio” and “pLR” the perceived likelihood ratio assigned the current item of proof. The latter is derived by applying the CBR factor to the former.  When the odds are 1:1, the CBR factor is 1, resulting in no adjustment of the weight of the evidence. But as soon as the odds shift in one party’s favor, the CBR factor biases the assessment of the next item of proof accordingly.

As can be seen, the impact of CBR in this case is to push the factfinder to an inflated estimate of the probability of the prosecution’s position being true, which the factfinder puts at 29:1, or 97%, by the “end” of the case.

But things could have been otherwise. Consider:

I’ve now swapped the “order” of proof items “4” and “8,” respectively.  That doesn't make any difference, of course, if one is "processing" the evidence the way a Bayesian would; but it does if one is CBRing.

The reason is that the factfinder now “encounters” the defendant’s strongest item of proof -- LR = 0.1—earlier than the prosecution’s strongest—LR = 10.0.

Indeed, it was precisely because the factfinder encountered the prosecutor’s best item of proof “early” in the previous case that she was launched into a self-reinforcing spiral of overvaluation that made her convinced that a dead-heat case was a runaway winner for the prosecutor.

The effect when the proof is reordered this way is exactly the opposite: a devaluation cascade that convinces the factfinder that the odds in favor of the prosecutor’s case are infinitesimally small!
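The order effect is easy to reproduce with the same assumed biasing rule. The LRs below are invented; only their arrangement matters (the prosecution's best item comes early in one ordering, the defense's best item early in the other):

```python
from math import log10, prod

def bayes_odds(lrs, prior_odds=1.0):
    """Unbiased Bayesian updating: order of the items is irrelevant."""
    return prior_odds * prod(lrs)

def cbr_odds(lrs, prior_odds=1.0):
    """Biased updating: the current odds shift the perceived LR of the next item."""
    odds = prior_odds
    for lr in lrs:
        odds *= lr * 2 ** log10(odds)   # assumed CBR factor: x2 at 10:1, /2 at 1:10
    return odds

proof = [10, 2, 0.6, 0.1, 1.5, 0.4, 3, 0.5]      # prosecution's best item first, defense's best fourth
swapped = proof[:]
swapped[0], swapped[3] = swapped[3], swapped[0]   # now the defense's best item comes first

print(f"Bayesian: {bayes_odds(proof):.3g} vs {bayes_odds(swapped):.3g}")   # same either way
print(f"CBR:      {cbr_odds(proof):.3g} vs {cbr_odds(swapped):.3g}")       # order flips the outcome
```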

These illustrations are static, and based on “pieces” of evidence with stipulated LRs “considered” in a specified order (one that could reflect the happenstance of when particular pieces register in the mind of the factfinder, or are featured in post-trial deliberations, as well as when they are “introduced” into evidence at trial—who the hell knows!).

But we can construct a simulation that randomizes those values in order to get a better feel for the potentially chaotic effect that CBR injects into evidence assessments. 

The simulation constructs trial proofs for 100 criminal cases, each consisting of eight pieces of evidence. Half of the 800 pieces of evidence reflect LRs drawn randomly from a uniform distribution between 0.05 and 0.95; these are “pro-defense” pieces of evidence. Half reflect LRs drawn randomly from a uniform distribution between 1.05 and 20. They are “pro-prosecution” pieces.

We can then compare the “true” strength of the evidence in the 100 cases —the probability of guilt determined by Bayesian weighting of each one’s eight pieces of evidence—to the “biased” assessment generated when the likelihood ratios for each piece of evidence are adjusted in a manner consistent with CBR.
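A minimal version of that simulation, using the same assumed CBR adjustment rule as above (everything else follows the specification in the preceding paragraphs), might look like this:

```python
import random
from math import log10, prod

random.seed(1)  # any seed; the particular draws don't matter

def case_lrs(n_items=8):
    """Half pro-defense items (LR ~ U(0.05, 0.95)), half pro-prosecution (LR ~ U(1.05, 20)), shuffled."""
    lrs = [random.uniform(0.05, 0.95) for _ in range(n_items // 2)] + \
          [random.uniform(1.05, 20.0) for _ in range(n_items // 2)]
    random.shuffle(lrs)
    return lrs

def bayes_prob(lrs):
    """Probability of guilt from unbiased Bayesian weighting (prior odds 1:1)."""
    odds = prod(lrs)
    return odds / (1 + odds)

def cbr_prob(lrs):
    """Probability of guilt when each successive LR is biased by the interim odds (assumed CBR factor)."""
    odds = 1.0
    for lr in lrs:
        odds *= lr * 2 ** log10(odds)
    return odds / (1 + odds)

cases = [case_lrs() for _ in range(100)]
true_p = [bayes_prob(c) for c in cases]
biased_p = [cbr_prob(c) for c in cases]

def n_close(ps, lo=0.3, hi=0.7):
    """How many cases land in the middling range, i.e., look genuinely 'close.'"""
    return sum(lo < p < hi for p in ps)

print("close cases, Bayesian vs. CBR:", n_close(true_p), n_close(biased_p))
```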

This figure compares the relative distribution of outcomes in the 100 cases:


As one would expect, a factfinder whose evaluation is influenced by CBR will encounter many fewer “close” cases than will one that engages in unbiased Bayesian updating.

This tendency to form overconfident judgments will, in turn, affect the accuracy of case outcomes.  Let’s assume, consistent with the “beyond a reasonable doubt” standard, that the prosecution is entitled to prevail only when the probability of its case being “true” is ≥ 0.95.  In that case, we are likely to see this sort of divergence between outcomes informed by rational information processing and outcomes informed by CBR:


The overall “error rate” is “only” about 0.16.  But there are 7x as many incorrect convictions as incorrect acquittals.  The "false conviction" rate is 0.21, whereas the "false acquittal" rate is 0.04....

The reason for the asymmetry between false convictions and false acquittals is pretty straightforward. In the CBR-influenced cases, there are a substantial number of “close” cases that the factfinder concluded “strongly” supported one side or the other. Which side—prosecution or defendant—got the benefit of this overconfidence was roughly equally divided.  However, a defendant is no less entitled to win when the factfinder assesses the strength of the evidence to be 0.5 or 0.6 than when the factfinder assesses the strength of the evidence as 0.05 or 0.06.  Accordingly, in all the genuinely “close” cases in which CBR induced the factfinder to form an overstated sense of confidence in the weakness of the prosecution’s case, the resulting judgment of “acquittal” was still the correct one.  But by the same token, the result was incorrect in every close case in which CBR induced the factfinder to form an exaggerated sense of confidence in the strength of the prosecution’s case.  The proportion of cases, in sum, in which CBR can generate a “wrong” answer is much higher in ones that defendants deserve to win than in ones in which the prosecution does.

This feature of the model is an artifact of the strong “Type 1” error bias of the “beyond a reasonable doubt” standard.  The “preponderance of the evidence” standard, in contrast, is theoretically neutral between “Type 1” and “Type 2” errors.  Accordingly, were we to treat the simulated cases as “civil” rather than “criminal” ones, the false “liability” outcomes and false “no liability” ones would both be closer to the overall error rate of 16%.

Okay, I did this simulation once for 100 cases.  But let’s do it 1,000 times for 100 cases—so that we have a full-blown Monte Carlo simulation of the resplendent CBR at work!
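Continuing the sketch above, the Monte Carlo version is just a loop around it. (Same assumed CBR factor as before; the particular rates you get depend on that assumption and on the random draws, so don't expect them to match my figures exactly.)

```python
import random
from math import log10, prod

def case_lrs(n_items=8):
    lrs = [random.uniform(0.05, 0.95) for _ in range(n_items // 2)] + \
          [random.uniform(1.05, 20.0) for _ in range(n_items // 2)]
    random.shuffle(lrs)
    return lrs

def bayes_prob(lrs):
    odds = prod(lrs)
    return odds / (1 + odds)

def cbr_prob(lrs):
    odds = 1.0
    for lr in lrs:
        odds *= lr * 2 ** log10(odds)   # assumed CBR factor
    return odds / (1 + odds)

THRESHOLD = 0.95  # "beyond a reasonable doubt" treated as a probability cutoff

def trial_of_100_cases():
    false_conv = false_acq = 0
    for _ in range(100):
        lrs = case_lrs()
        should_convict = bayes_prob(lrs) >= THRESHOLD   # verdict a rational factfinder would reach
        does_convict = cbr_prob(lrs) >= THRESHOLD       # verdict the CBR-influenced factfinder reaches
        false_conv += does_convict and not should_convict
        false_acq += should_convict and not does_convict
    return false_conv, false_acq

results = [trial_of_100_cases() for _ in range(1000)]
print("mean false convictions per 100 cases:", sum(fc for fc, _ in results) / len(results))
print("mean false acquittals per 100 cases: ", sum(fa for _, fa in results) / len(results))
```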

These are the distributions for the “accurate outcome,” “false acquittal,” and “false conviction” rates over 1,000 trials of 100 cases each:

Okay—see you later!


Carlson, K.A. & Russo, J.E. Biased interpretation of evidence by mock jurors. Journal of Experimental Psychology: Applied 7, 91-103 (2001).

Good, I.J. Weight of Evidence: A Brief Survey. in Bayesian Statistics 2: Proceedings of the Second Valencia International Meeting (J.M. Bernardo, et al. eds., 1985).

Holyoak, K.J. & Simon, D. Bidirectional Reasoning in Decision Making by Constraint Satisfaction. J. Experimental Psych. 128, 3-31 (1999).

Kahan, D.M. Laws of cognition and the cognition of law. Cognition 135, 56-60 (2015). 

Pennington, N. & Hastie, R. A Cognitive Theory of Juror Decision Making: The Story Model. Cardozo L. Rev. 13, 519-557 (1991).

Simon, D. A Third View of the Black Box: Cognitive Coherence in Legal Decision Making. Univ. Chi. L.Rev. 71, 511-586 (2004).

Scurich, N. The Dynamics of Reasonable Doubt. (Ph.D. dissertation, University of Southern California, 2012). 

Simon, D., Pham, L.B., Le, Q.A. & Holyoak, K.J. The Emergence of Coherence over the Course of Decisionmaking. J. Experimental Psych. 27, 1250-1260 (2001).

CBR ... frankenstein's monster of law & psychology...



Report from "Law & Cognition" class: Are “rules of evidence impossible”? Part 1 

Well, I didn't do a good job of sharing the to & fro of this semester's Law & Cognition seminar w/ the 14 billion of you who signed up to take the course on-line. I'm happy to refund your enrollment fees--I actually parlayed them into a sum 10^3 x as large by betting incredulous behavioral economists that P(H|HHH) < P(H) when sampling from finite sequences w/o replacement--but stay tuned & I'll try to fill you in over time...

If you’re a Bayesian, you’ll easily get how the Federal Rules of Evidence work. 

But if you accept that “coherence based reasoning” characterizes juries’ assessments of facts (Simon, Pham, Quang & Holyoak 2001; Carlson & Russo 2001), you’ll likely conclude that administering the Rules of Evidence is impossible.

Or so it seems to me.  I’ll explain but it will take some time—about 3 posts’ worth.

The "Rules of Evidence Impossibility Proof"--Paaaaaaart 1!

There are really only two major rules of evidence. There are a whole bunch of others but they are just variations on a theme.

The first is Rule 401, which states that evidence is “relevant” (and hence presumptively admissible under Rule 402) if it “has any tendency to make a fact  [of consequence to the litigation] more or less probable” in the assessment of a reasonable factfinder.

As Richard Lempert observed (1977) in his article Modeling Relevance, Rule 401 bears a natural Bayesian interpretation.

The “likelihood ratio” rendering of Bayes’s Theorem—Posterior odds = Prior odds x Likelihood Ratio—says that one should update one’s existing or “prior” assessment of the probability of some hypothesis (expressed in odds) by a factor that reflects how much more consistent the new information is with that hypothesis than with some rival hypothesis.  If this factor—the likelihood ratio—is greater than one, the probability of the hypothesis increases; if it is less than one, it decreases.

Accordingly, by defining as “relevant” any evidence that gives us reason to treat a “fact of consequence” as “more or less probable,” Rule 401 indicates that evidence should be treated as relevant (and thus presumptively admissible) so long as it has a likelihood ratio different from 1—the factor by which one should revise one’s prior odds when new evidence is equally consistent with the hypothesis and with its negation.
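In code, the updating rule Lempert has in mind is a one-liner (the numbers are invented for illustration):

```python
def update(prior_odds, likelihood_ratio):
    """Bayes's theorem in odds form: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio

# An item of proof three times more consistent with guilt than innocence (LR = 3) moves even odds to 3:1.
print(update(1.0, 3.0))   # 3.0
# An item with LR = 1 leaves the odds exactly where they were -- "irrelevant" in Rule 401 terms.
print(update(1.0, 1.0))   # 1.0
```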


Second is Rule 403, which states that “relevant evidence” should be excluded if its “probative value is substantially outweighed by . . . unfair prejudice.”  Evidence is understood to be “unfairly prejudicial” when (the Advisory Committee Notes tell us) it has a “tendency to suggest decision on an improper basis.” 

There’s a natural Bayesian rendering of this concept, too: because the proper basis for decision reflects the updating of one’s priors by a factor equal to the product of the likelihood ratios associated with all the (independent) items of proof, evidence is prejudicial when it induces the factfinder to weight items of proof in a manner inconsistent with their true likelihood ratios.

An example would be evidence that excites a conscious intention—born perhaps of animus, or alternatively of sympathy—to reach a particular result regardless of the Bayesian import of the proof in the case.

More interestingly, a piece of evidence might be “unfairly prejudicial” if it triggers some unconscious bias that skews the assignment of the likelihood ratio to that or another piece of evidence (Gold 1982).

E.g., it is sometimes said (I think without much basis) that jurors “overvalue” evidence of character traits—that is, that they assign to a party’s disposition a likelihood ratio, or degree of weight, incommensurate with what it is actually due when assessing the probability that the party acted in a manner that reflected such a disposition on a particular occasion (see Fed. R. Evid. 404).

Or the “unfairly prejudicial effect” might consist in the tendency of evidence to excite cognitive dynamics that bias the weight assigned other pieces of evidence (or all of it).  Evidence that an accident occurred, e.g., might trigger “hindsight bias,” causing the factfinder to assign more weight than is warranted to evidence that bears on how readily that accident could have been foreseen before its occurrence (Kamin & Rachlinski 1995).

By the same token, evidence that excites “identity-protective cognition” might unconsciously motivate a factfinder to selectively credit or dismiss (i.e., opportunistically adjust the likelihood ratio of) all the evidence in the case in a manner geared to reaching an outcome that affirms rather than denigrates the factfinder’s cultural identity (Kahan 2015).

Rule 403 directs the judge to weigh probity and prejudice.

Again, there’s a Bayesian rendering: a court should exclude a “relevant” item of proof as “unfairly prejudicial” when the marginal distortion of accuracy associated with the incorrect likelihood ratio that evidence induces a factfinder to assign any item of proof is bigger than the marginal distortion of accuracy associated with constraining the factfinder to assign that item of proof a likelihood ratio of 1, which is the practical effect of excluding it (Kahan 2010).

If you work this out, you’ll see (perhaps counterintuitively, perhaps not!) that courts should be much more reluctant to exclude evidence on Rule 403 grounds in otherwise close cases. As cases become progressively closer, the risk of error associated with under-valuing (by failing to consider) relevant evidence increases faster than the risk of error associated with over-valuing that same evidence: from the point of view of deciding a case, being “overconfident” is harmless so long as one gets the right result. Likewise the risk that admitting "prejudicial" evidence will result in error increases more rapidly as the remaining proof becomes weaker: that's the situation in which a factfinder is most likely to decide for a party that she wouldn't have but for her biased over-valuing of the offending piece of evidence (Kahan 2010).
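Here is a toy version of that claim (the numbers, and the bare-bones threshold model of "error," are mine, not anything in Kahan (2010)): one contested item with a true LR of 4 that the factfinder would overvalue at 8, weighed against remaining proof of varying strength and a 0.95 decision threshold.

```python
def prob(odds):
    return odds / (1 + odds)

TRUE_LR, BIASED_LR, THRESHOLD = 4.0, 8.0, 0.95   # invented values for illustration

for rest_odds in (1.0, 3.0, 6.0, 12.0, 30.0):            # strength of the rest of the case
    rational = prob(rest_odds * TRUE_LR) >= THRESHOLD     # verdict if the item got its true weight
    admit = prob(rest_odds * BIASED_LR) >= THRESHOLD      # verdict if admitted and overvalued
    exclude = prob(rest_odds * 1.0) >= THRESHOLD          # verdict if excluded (treated as LR = 1)
    print(rest_odds,
          "admit errs  " if admit != rational else "admit ok    ",
          "exclude errs" if exclude != rational else "exclude ok  ")
```

Admission misfires only where the rest of the proof is too weak to convict on its own merits (the overvalued item drags the case over the threshold); exclusion misfires precisely in the close cases, where the item's true weight is what the correct verdict turns on.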

For an alternative analysis, consider Friedman (2003); I think he's wrong but for sure maybe I am! You tell me!

The point is how cool it is--how much structure & discipline it adds to the analysis--to conceptualize Rules of Evidence as an instrument for closing the gap between what a normatively desirable Bayesian assessment of trial proof would yield and what a psychologically realistic account of human information processing tells us to expect (someday, of course, we'll replace human legal decisionmakers with AI evidence-rule robots! but we aren't quite there yet ...).

Let's call this approach to understanding/perfecting evidence law the "Bayesian Cognitive Correction Model" (BCCM).

But is BCCM itself psychologically realistic?  

Is it plausible to think a court can reliably “maximize” the accuracy of adjudication by this sort of cognitive fine-tuning of the trial proof?

Not if you think that coherence-based reasoning  (CBR) is one of the reasoning deficiencies that a court needs to anticipate and offset by this strategy.

I’ll describe how CBR works in part 2 of this series—and then get to the “impossibility proof” in part 3!


Carlson, K.A. & Russo, J.E. Biased interpretation of evidence by mock jurors. Journal of Experimental Psychology: Applied 7, 91-103 (2001).

Friedman, R.D. Minimizing the Jury Over-valuation Concern. Mich. State L. Rev. 2003, 967-986 (2003).

Gold, V.J. Federal Rule of Evidence 403: Observations on the Nature of Unfairly Prejudicial Evidence. Wash. L. Rev. 58, 497 (1982).

Kahan, D.M. The Economics—Conventional, Behavioral, and Political—of "Subsequent Remedial Measures" Evidence. Columbia Law Rev 110, 1616-1653 (2010).

Kahan, D.M. Laws of cognition and the cognition of law. Cognition 135, 56-60 (2015).

Kamin, K.A. & Rachlinski, J.J. Ex Post ≠ Ex Ante - Determining Liability in Hindsight. Law Human Behav 19, 89-104 (1995).

Lempert, R.O. Modeling Relevance. Mich. L. Rev. 75, 1021-57 (1977).

Simon, D., Pham, L.B., Le, Q.A. & Holyoak, K.J. The Emergence of Coherence over the Course of Decisionmaking. J. Experimental Psych. 27, 1250-1260 (2001).


My remote post-it notes for my HLS African-American teachers


ISO: A reliable & valid public "science literacy" measure

From revision to “Ordinary Science Intelligence”: A Science-Comprehension Measure for Study of Risk and Science Communication, with Notes on Evolution and Climate Change . . . .

 2. What and why?

The validity of any science-comprehension instrument must be evaluated in relation to its purpose. The quality of the decisions ordinary individuals make in myriad ordinary roles—from consumer to business owner or employee, from parent to citizen—will depend on their ability to recognize and give proper effect to all manner of valid scientific information (Dewey 1910; Baron 1993). It is variance in this form of ordinary science intelligence—and not variance in the forms or levels of comprehension distinctive of trained scientists, or the aptitudes of prospective science students—that OSI_2.0 is intended to measure.

This capacity will certainly entail knowledge of certain basic scientific facts or principles. But it will demand as well various forms of mental acuity essential to the acquisition and effective use of additional scientific information. A public science-comprehension instrument cannot be expected to discern proficiency in any one of these reasoning skills with the precision of an instrument dedicated specifically to measuring that particular form of cognition. It must be capable, however, of assessing the facility with which these skills and dispositions are used in combination to enable individuals to successfully incorporate valid scientific knowledge into their everyday decisions.

A valid and reliable measure of such a disposition could be expected to contribute to the advancement of knowledge in numerous ways. For one thing, it would facilitate evaluation of science education across societies and within particular ones over time (National Science Board 2014). It would also enable scholars of public risk perception and science communication to more confidently test competing conjectures about the relevance of public science comprehension to variance in—indeed, persistent conflict over—contested risks, such as climate change (Hamilton 2011; Hamilton, Cutler & Schaefer 2012), and controversial science issues such as human evolution (Miller, Scott & Okamoto 2006). Such a measure would also promote ongoing examination of how science comprehension influences public attitudes toward science more generally, including confidence in scientific institutions and support for governmental funding of basic science research (e.g., Gauchat 2011; Allum, Sturgis, Tabourazi, & Brunton-Smith 2008). These results, in turn, would enable more critical assessments of the sorts of science competencies that are genuinely essential to successful everyday decisionmaking in various domains—personal, professional, and civic (Toumey 2011).

In fact, it has long been recognized that a valid and reliable public science-comprehension instrument would secure all of these benefits. The motivation for the research reported in this paper is widespread doubt among scholars that prevailing measures of public “science literacy” possess the properties of reliability and validity necessary to attain these ends (e.g., Stocklmayer & Bryant 2012; Roos 2012; Guterbock et al. 2011; Pardo & Calvo 2004). OSI_2.0 was developed to remedy these defects.

The goal of this paper is not only to apprise researchers of OSI_2.0’s desirable characteristics in relation to other measures typically featured in studies of risk and science communication. It is also to stimulate these researchers and others to adapt and refine OSI_2.0, or simply devise a superior alternative from scratch, so that researchers studying how risk perception and science communication interact with science comprehension can ultimately obtain the benefit of a scale more distinctively suited to their substantive interests than are existing ones.


Allum, N., Sturgis, P., Tabourazi, D. & Brunton-Smith, I. Science knowledge and attitudes across cultures: a meta-analysis. Public Understanding of Science 17, 35-54 (2008).

Baron, J. Why Teach Thinking? An Essay. Applied Psychology 42, 191-214 (1993).

Dewey, J. Science as Subject-matter and as Method. Science 31, 121-127 (1910).

Gauchat, G. The cultural authority of science: Public trust and acceptance of organized science. Public Understanding of Science 20, 751-770 (2011).

Hamilton, L.C. Education, politics and opinions about climate change evidence for interaction effects. Climatic Change 104, 231-242 (2011).

Hamilton, L.C., Cutler, M.J. & Schaefer, A. Public knowledge and concern about polar-region warming. Polar Geography 35, 155-168 (2012).

Miller, J.D., Scott, E.C. & Okamoto, S. Public acceptance of evolution. Science 313, 765 (2006).

National Science Board. Science and Engineering Indicators, 2014 (National Science Foundation, Arlington, Va., 2014).

Pardo, R. & Calvo, F. The Cognitive Dimension of Public Perceptions of Science: Methodological Issues. Public Understanding of Science 13, 203-227 (2004).

Roos, J.M. Measuring science or religion? A measurement analysis of the National Science Foundation sponsored science literacy scale 2006–2010. Public Understanding of Science (2012).

Stocklmayer, S.M. & Bryant, C. Science and the Public—What should people know? International Journal of Science Education, Part B 2, 81-101 (2012).


The "living shorelines" science communication problem: individual cognition situated in collective action

Extending its Southeast Florida Evidence-based Science Communication Initiative, CCP is embarking on a field-research project on "living shoreline" alternatives/supplements to "hardened armoring" strategies for offsetting the risks of rising sea levels. The interesting thing about the project (or one of the billion interesting things about it) is that it features the interaction of knowledge and expectations.

"Living shorelines" offer the potential for considerable collective benefits.  But individuals who learn of these potential benefits will necessarily recognize that the benefit they can expect to realize from taking or supporting action to implement this strategy is highly contingent on the intention of others to do the same. Accordingly, "solving" this "communication problem" necessarily involves structuring acommunication process in which parties learn simultaneously about both the utility of "living shorelines" and the intentions of other parties to contribute to implementing them.

The project thus highlights one of the central features of the "science of science communication" as a "new political science": its focus not only on promoting clarity of exposition and public comprehension but on attending as well to the myriad social processes by which members of the public come to know what's known by science and give it due effect in their lives.

Elevating “Living Shorelines” with Evidence-based Science Communication

1. Overview. The urgency of substantial public investments to offset the impact of rising sea levels associated with climate change is no longer a matter of contention for coastal communities in Florida.  What remains uncertain is only the precise form of such undertakings.

This project will use evidence-based science communication to enrich public engagement with “living shoreline” alternatives (e.g., mangrove habitats, oyster beds, dune and wetland restoration) to “hardened armoring” strategies (concrete seawalls, bunkers, etc.). “Living shorelines” offer comparable protection while avoiding negative environmental effects--beachfront erosion, the loss of shoreline vegetation, resulting disruption of natural ecosystems, and visual blight—that themselves diminish community wellbeing.  The prospect that communities in Southern Florida will make optimal use of “living shorelines,” however, depends on cultivating awareness of their myriad benefits among a diffuse set of interlocking public constituencies.  The aim of the proposed initiative is to generate the forms of information and community interactions necessary to enable “living shorelines” to assume the profile they should in ongoing democratic deliberations over local climate adaptation. . . .

3. Raising the profile of “living shorelines.” There are numerous living shoreline” alternatives to hardened armoring strategies. Mangroves—densely clumped shrubs of thick green shoots atop nests of partially submerged roots—have traditionally combatted the impact of rising sea levels by countering erosion and dissipating storm surges. Coral reefs furnish similar protection. Sand dunes provide a natural fortification, while wetland restorations create a buffer. There are also many “hybrid” strategies such as rutted walls congenial to vegetation, and rock sills supportive of oyster beds.  These options, too, reduce reliance on the forms of hardened armoring that impose the greatest ecological costs.

As a policy option, however, living shoreline strategies face two disadvantages. The first is the longer time horizon for return on investment. A concrete seawall begins to generate benefits immediately, while natural-shoreline alternatives attain maximum benefit only after a period of years.  This delay in value is ultimately offset by the need to augment or replace hardened armoring as sea levels continue to rise; natural barriers “rise” along with the sea level and thus have a longer lifespan. However, the natural bias of popular political processes to value short-term over long-term gains and to excessively discount future costs handicaps “living shorelines” relative to their competitors.

The second is the diffuse benefits that living shorelines confer. Obviously, they protect coastal property residents. But they also confer value on a wide range of third parties—individuals who enjoy natural beach habitats, but also businesses such as tourism and fishing that depend on the ecological systems disrupted by armoring.

In addition, the value of coastal property will often be higher in a community that makes extensive use of “living shorelines,” which tend to be more aesthetically pleasing than concrete barriers and bunkers.  But the individual property owner who invests in erecting and maintaining a living shoreline alternative won’t enjoy this benefit unless other owners in his or her residential area take the same action.  As with any public good, the private incentive to contribute will lag behind the social benefit.

The remedy for overcoming these two challenges is to simultaneously widen and target public appreciation of the benefits of natural shoreline protections. The constituencies that would enjoy the externalized benefits of natural shoreline strategies—particularly the commercial ones—must be alerted to the stake they have in the selection of this form of coastal property protection.  Likewise, business interests, including construction firms, must be furnished with a vivid appreciation of the benefits they could realize by servicing the demand for “living shorelines” protections, including both their creation and their maintenance.  Recognizing that local coastal property owners lack adequate incentives to invest in natural coastline protections on their own, these interests could be expected to undertake the burden of advocating supplemental public investments. The voice of these groups in public deliberations will help to offset the natural tendency of democratic processes to overvalue short- over longer-term interests—as would the participation of financial institutions and other actors that naturally discount the current value of community assets and businesses appropriately based on the anticipated need for future infrastructure support. The prospect of public subsidies can in turn be used to reinforce the incentives of local property owners, whose consciousness of the prospect of widespread use of natural shoreline protections will supply them with motivation to support public provisioning and to make the personal investments necessary to implement this form of climate adaptation.

The project is geared toward stimulating these processes of public engagement.  By furnishing the various constituencies involved with the forms of information most suited to enabling their recognition of the benefits of natural shoreline strategies, the project will elevate the profile of this strategy and put it on an equal footing with hardened armoring in public deliberations aimed at identifying the best, science-informed policies for protecting communities from rising sea levels and other climate impacts.

4.  Evidence-based science communication and living shorelines. . . . .

[T]he challenge of elevating the profile of "living shorelines" features the same core structural elements that have been the focus of CCP's science-communication support research on behalf of the Southeast Florida Regional Climate Compact. Science communication, this work suggests, should be guided by a "multi-public" model.  First are proximate information evaluators: typically government decisionmakers, whose primary focus is on the content of policy-relevant science. Next are intermediate evaluators, consisting largely of organized nongovernmental groups—including ones representing formal and informal networks of local businesses, local property owners, and environmental and conservation organizations—whose focus is primarily on how proposed policies affect their distinctive goals and interests. Finally there are remote evaluators: ordinary citizens, whose engagement with policy deliberations is only intermittent and who use heuristic strategies to assure themselves of the validity of the science that informs proposed policies.

The current project will use this model to guide development of communication materials suited to the public constituencies whose engagement is essential to elevating the deliberative profile of “living shorelines.”  Proximate evaluators here comprise the government officials—mainly county land use staff but also elected municipal officials—and also homeowners, including homeowner associations, in a position to make personal investments in “living shorelines” protections. With respect to these information consumers, the project would focus on maximizing comprehension  of the information embodied in TNC’s computer simulations. Existing research identifies systematic differences in how people engage quantitative information. Experimental studies would be conducted to fashion graphic presentation modes that anticipate these diverse information-processing styles.

The intermediate evaluators in this context consist of the wide range of private groups that stand to benefit indirectly from significant investment in "living shorelines."  These groups will be furnished information in structured deliberations that conform to validated protocols for promoting open-minded engagement with scientific information. 

These sessions, moreover, will themselves be used to generate materials that can be used to develop information appropriate for remote evaluators. Research conducted by CCP in field-based science communication initiatives suggests that the most important cue that ordinary citizens use to assess major policy proposals is the position of other private citizens whom they view as socially competent and informed and whose basic outlooks they share.  In particular, the attitude that these individuals evince through their words and actions vouches for the validity of policy-relevant science that ordinary members of the public do not have either the time or expertise to assess on their own.

From experience in previous evidence-based science communication projects, CCP has learned that interactions taking the form of the proposed structured deliberations among intermediate evaluators furnish a rich source of content for fashioning materials that can be used to perform this vouching function.  The participants in such deliberations are highly likely to possess the characteristics and backgrounds associated with the socially competent, knowledgeable sources whose vouching for policy-relevant science helps orient ordinary citizens.

Moreover, the participants in such sessions are likely to be socially diverse.  This feature is highly desirable because, as work in and outside the lab confirms, the identity of the individuals who perform this critical vouching function varies across diverse cultural subcommunities. In addition, being able to see individuals who perform this role within one community deliberating constructively with their counterparts in others assures ordinary citizens from all of these communities that positions on the issue at hand are not associated with membership in competing cultural groups. This effect, CCP field research suggests, has been instrumental to the success of the diverse member communities of the Southeast Florida Regional Climate Compact in protecting their deliberations from the influences that polarize citizens generally over climate change science.

Accordingly, using methods developed in earlier field work, CCP will use the intermediate evaluator deliberations to develop video and other materials that can be used to test how members of the public react as they learn about “living shorelines” as a policy option for their communities. The results of such tests can then be incorporated into communication materials geared to generating positive, self-reinforcing forms of interactions among the members of those communities.

Finally, evidence of the positive interactions of all these groups can be used to help form the state of shared expectations necessary to assure that “living shorelines” receive attention in public deliberation commensurate with the value they can confer on the well-being of communities that use this option. . . .


CCP Lab Meeting # 9073 ... 


Another day, another lecture

This one at Annenberg Public Policy Center last week, to discuss progress in one of our collaborative initiatives: evidence-based science documentary filmmaking.

We got to talk about the Pakistani Dr & Kentucky Farmer, of course, and also how much Krista would like a cool documentary on evolution.

Slides here.


Making sense of the " 'hot hand fallacy' fallacy," part 1

It never fails! My own best efforts (here & here) to explain the startling and increasingly notorious paper by Miller & Sanjurjo have prompted the authors to step forward and try to restore the usual state of perfect comprehension enjoyed by the 14.3 billion regular readers of this blog. They have determined, in fact, that it will take three separate guest posts to undo the confusion, so apparently I've carried out my plan to a [GV]T. 

As cool as the result of the M&S paper is, I myself remain fascinated by what it tells us about cognition, particularly among those with exquisitely fine-tuned statistical intuitions.  How did the analytical error they uncovered in the classic "hot hand fallacy" studies remain undetected for some thirty years, and why does it continue to provoke stubborn resistance on the part of very very smart people??  To Miller & Sanjurjo's credit, they have happily and persistently shouldered the immense burden of explication necessary to break the grip of the pesky intuition that their result "just can't be right!"

 Joshua B. Miller & Adam Sanjurjo

Thanks for the invitation to post here Dan!

Here’s our plan for the upcoming 3 posts:

  1.  Today’s plan: A bit of the history of the hot hand fallacy, then clearly stating the bias we find, explaining why it invalidates the main conclusion of the original hot hand fallacy study (1985), and further, showing that correcting for the bias flips the conclusion of the original data, so that it now can be used as evidence supporting the existence of meaningfully large hot hand shooting.

  2. Next post: Provide a deeper understanding of how the bias emerges.

  3. Final post: Go deeper into potential implications for research on the hot hand effect, hot hand beliefs, and the gambler’s fallacy.

Part I

In the seminal hot hand fallacy paper, Gilovich, Vallone and Tversky (1985; "GVT"; also see the 1989 Tversky & Gilovich "Cold Facts" summary paper) set out to conduct a truly informative scientific test of hot hand shooting. After studying two types of in-game shooting data, they conducted a controlled shooting study (experiment) with the Cornell University men's and women's basketball teams. This was an effective "...method for eliminating the effects of shot selection and defensive pressure" that were present as confounds in their analysis of game data (we will return to the issue of game data in a follow-up post; for now click to the first page of Dixit & Nalebuff's 1991 classic book "Thinking Strategically", and this comment on Andrew Gelman's blog).  While the common use of the term "hot hand" shooting is vague and complex, everybody agrees that it refers to a temporary elevation in a player's ability, i.e. the probability of a successful shot.  Because the hot state is unobservable to the researcher (though perhaps not to the player, teammate, or coach!), we cannot simply measure a player's probability of success in the hot state; we need an operational definition.  A natural idea is to take a streak of sufficient length as a good signal of whether or not a player is in the hot state, and to define a player as having the hot hand if his/her probability of success is greater after a streak of successful shots (hits) than after a streak of unsuccessful shots (misses).  GVT designed a test for this.

Adam Sanjurjo enjoying snacks in green room before Oprah Winfrey show appearance

Suppose we wanted to test whether Stephen Curry has the hot hand; how would we apply GVT's test to Curry?  The answer is that we would have Curry attempt 100 shots at locations from which he is expected to have a 50% chance of success (like a coin).  Next, we would calculate Curry's field goal percentage on those shots that immediately follow a streak of successful shots (hits), and test whether it is bigger than his field goal percentage on those shots that immediately follow a streak of unsuccessful shots (misses); the larger the difference that we observe, the stronger the evidence of the hot hand.  GVT performed this test on the Cornell players, and found that this difference in field goal percentages was statistically significant for only one of the 26 players (two sample t-test), which is consistent with the chance variation that the coin model predicts.

Now, one can ask oneself: if Stephen Curry doesn’t get hot, that is, for each of his 100 shot attempts he has exactly a 50% chance of hitting his next shot, then what would I expect his field goal percentage to be when he is on a streak of three (or more) hits? Similarly, what would I expect his field goal percentage to be when he is on a streak of three (or more) misses?

Following GVT’s analysis, one can form two groups of shots:

Group “3hits”: all shots in which the previous three shots (or more) were a hit,

Group “3misses”: all shots in which the previous three shots (or more) were a miss,

M&S working paper (5000th printing; currently sold out)

From here, it is natural to reason as follows: if Stephen Curry always has the same chance of success, then he is like a coin, so we can consider each group of shots as independent; after all, each shot has been assigned at random to one of three groups: "3hits," "3misses," or neither.  So far this reasoning is correct.  Now, GVT (implicitly) took this intuitive reasoning one step further: because all shots, which are independent, have been assigned at random to each of the groups, we should expect the field goal percentages to be the same in each group.  This is the part that is wrong.

Joshua Miller, in Las Vegas after winning $5 million from economists who accepted his challenge to bet against P(H|HHH) < P(H) when sampling from finite sequence of coin tosses

Where does this seemingly fine thinking go wrong?  The first clue that there is a problem is that the variable that is being used to assign shots to groups is also showing up as a response variable in the computation of the field goal percentage, though this does not fully explain the problem.  The key issue is that there is a bias in how shots are being selected for each group.  Let's see this by first focusing on the "3hits" group. Under the assumptions of GVT's statistical test, Stephen Curry has a 50% chance of success on each shot, i.e. he is like a coin: heads for hit, and tails for miss.  Now, suppose we plan on flipping a coin 100 times, then selecting at random among the flips that are immediately preceded by three consecutive heads, and finally checking to see if the flip we selected is a heads, or a tails. Now, before we flip, what is the probability that the flip we end up selecting is a heads?  The answer is that this probability is not 0.50, but 0.46!  Herein lies the selection bias.  The flips that are being selected for analysis are precisely the flips that are immediately preceded by three consecutive heads.  Now, returning to the world of basketball shots, this way of selecting shots for analysis implies that for the "3hits" group, there would be a 0.46 chance that the shot we are selecting is a hit, and for the "3misses" group, there would be a 0.54 chance that the shot we are selecting is a hit.
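
For readers who would rather check that number than take it on faith, here is a minimal Monte Carlo sketch of the coin-flip thought experiment just described (illustrative only—the Python code, function name, and trial count are not from GVT or from the M&S paper):

import random

def simulate_streak_selection(n_flips=100, streak=3, n_trials=50_000, seed=1):
    """Estimate the chance that a flip chosen from among those immediately
    preceded by `streak` straight heads is itself a heads.

    For each simulated sequence of fair flips, record the proportion of heads
    among the qualifying flips; sequences with no qualifying flip are
    discarded, as in the thought experiment above."""
    random.seed(seed)
    per_sequence = []
    for _ in range(n_trials):
        flips = [random.random() < 0.5 for _ in range(n_flips)]   # True = heads
        eligible = [flips[i] for i in range(streak, n_flips)
                    if all(flips[i - streak:i])]                  # preceded by HHH
        if eligible:
            per_sequence.append(sum(eligible) / len(eligible))
    return sum(per_sequence) / len(per_sequence)

print(round(simulate_streak_selection(), 3))   # comes out near 0.46, not 0.50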

Therefore, if Stephen Curry does not get hot, i.e. if he always has a 50% chance of success for the 100 shots we study, we should expect him to shoot 46% after a streak of three or more hits, and 54% after a streak of three or more misses.  This is the order of magnitude of the bias that was built into the original hot hand study, and this is the bias that is depicted in Figure 2 on page 13 of our new paper; a simpler version of this figure is below. This bias is large in basketball terms: a difference of more than 8 percentage points is nearly the difference between the median NBA three-point shooter and the very best.   Another way to look at this bias is to imagine what would happen if we were to invite 100 players to participate in GVT's experiment, with each player shooting from positions in which the chance of success on each shot were 50%.  For each player, check whether his/her field goal percentage after a streak of three or more hits is higher than his/her field goal percentage after a streak of three or more misses.  For how many players should we expect this to be true? Correct answer: 40 out of 100 players. 
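
The same sort of sketch can be run at the player level (again, purely illustrative code, not the analysis in the paper): simulate many 100-shot, 50% shooters and compare each one's percentage after streaks of three or more hits with the percentage after streaks of three or more misses. The first two numbers it prints should sit near 46% and 54%; the third—the share of simulated players who shoot better after hit streaks than after miss streaks—depends somewhat on how ties and players without qualifying streaks are handled, so treat it as a ballpark check against the 40-out-of-100 figure rather than an exact replication.

import random

def simulate_players(n_players=20_000, n_shots=100, streak=3, seed=2):
    """Simulate 50% shooters and compare each player's FG% after runs of
    `streak`+ hits with his/her FG% after runs of `streak`+ misses."""
    random.seed(seed)
    after_hits, after_misses, n_better, n_counted = 0.0, 0.0, 0, 0
    for _ in range(n_players):
        shots = [random.random() < 0.5 for _ in range(n_shots)]   # True = hit
        hit_grp = [shots[i] for i in range(streak, n_shots) if all(shots[i - streak:i])]
        miss_grp = [shots[i] for i in range(streak, n_shots) if not any(shots[i - streak:i])]
        if hit_grp and miss_grp:   # skip players lacking a qualifying streak of either kind
            p_hit = sum(hit_grp) / len(hit_grp)
            p_miss = sum(miss_grp) / len(miss_grp)
            after_hits += p_hit
            after_misses += p_miss
            n_better += p_hit > p_miss
            n_counted += 1
    return after_hits / n_counted, after_misses / n_counted, n_better / n_counted

# prints roughly (0.46, 0.54, ...): average FG% after hit streaks, average FG%
# after miss streaks, and the share of players who shoot better after hit streaks
print(simulate_players())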

Adam Sanjurjo Hermida, professional tennis player currently ranked 624th in world. Very hot hand predicted by M&S sometime in April 2016

This selection bias is large enough to invalidate the main conclusion of GVT's original study, without having to analyze any data.  However, beyond this "negative" message, there is also a way forward.  Namely, we can re-analyze the original Cornell dataset, but in a way invulnerable to the bias.  It turns out that when we do this, we find considerable evidence of the hot hand in this data. First, if we look at Table 4 in GVT (page 307), we see that, on average, players shot around 3.5 percentage points better when on a hit streak of three or more shots, and that 64% of the players shot better when on a hit streak than when on a miss streak. While GVT do not directly analyze these summary averages, given our knowledge of the bias, they are telling (in fact, you can do much more with Table 4; see Kenny LJ responding to his own question here).  With the correct analysis (described in the next post), there is statistically significant evidence of the hot hand in the original data set, and, as can be seen in Table 2 on page 23 of our new paper, the point estimate of the average hot hand effect size is large (further details in our "Cold Shower" paper here). If one adjusts for the bias, what one now finds is that: (1) hitting a streak of three or more shots in a row is associated with an expected 10 percentage point boost in a player's field goal percentage, (2) 76% of players have a higher field goal percentage when on a hit vs. miss streak, and (3) 4 out of 26 players have a large enough effect to be individually significant by conventional statistical standards (p<.05), which is itself a statistically significant count of significant effects, by conventional standards. 

In a later post, we will return to the details of GVT's paper and talk about the evidence for the hot hand found across other datasets. If you prefer not to wait, please take a look at our Cold Shower paper and related comments on Gelman's blog.

In the next installment, we will discuss the counter-intuitive probability problem that reveals the bias, and explain what is driving the selection bias there.  We will then discuss some common misconceptions about the nature of the selection bias, and some very interesting connections with classic probability paradoxes.


Weekend update: talking it up & listening too

Reports on road shows:

1. Carnegie Mellon PCR series:

Great event! Passionate, curious, excited audience eager to contribute to the project of fixing the science communication problem.

This is the future of the Liberal Republic of Science: a society filled with culturally diverse citizens whose common interest in enjoying the benefit of all the knowledge their way of life makes possible is secured by scientists, science communication professionals, educators, and public officials using and extending the "new political science" of science communication.


Slides here.

2. 10th Annual Conference on Empirical Legal Studies:

I did a presentation on "'Ideology' or 'Situation Sense?'," the CCP study on the interaction of cultural worldviews and legal reasoning in members of the public, law students, lawyers & judges, respectively.  Lots of great feedback.

Slides here.

A small selection of other papers definitely worth taking a look at (a very frustrating element of a conference like this is having to choose between concurrent sessions featuring really interesting stuff):

Chen, Moskowitz & Shue, Decision-Making Under the Gambler's Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires
Thorley, Green et al., Please Recuse Yourself: A Field Experiment Exploring the Relationship between Campaign Donations and Judicial Recusal
MacDonald, Fagan & Geller, The Effects of Local Police Surges on Crime and Arrests in New York City
Ramseyer, Nuclear Power and the Mob: Extortion and Social Capital in Japan
Scurich, Jurors’ Presumption of Innocence: Impact on Cumulative Evidence Evaluation and Verdicts
Sommers, Perplexing Public Attitudes Toward Consent: Implications for Sex, Law, and Society
Robertson, 535 Felons? An Empirical Investigation into the Law of Political Corruption 
Baker & Malani, Do Judges Really Care About Law? Evidence from Circuit Split Data 



On the road *again*...

talk today at CMU, 5:30:




What's the deal w/ Norwegian public opinion on climate change?? What's the deal with ours?

Was just reading a really cool article: Aasen, M., The polarization of public concern about climate change in Norway, Climate Policy (2015), advance online publication.

Constructing Individualism and Egalitarianism scales with items from Norwegian Gallup polls conducted between 2003 and 2011, Aasen does find that both dispositions predict differences in concern w/ climate change -- less for the former, more for the latter.  

Climate change concern was measured with the single item 'How concerned are you about climate change?' The response categories were 'Quite concerned', 'Very concerned', 'A little concerned', and 'Not at all concerned.' Assuming, as seems certain!, that Norwegians have attitudes about climate change, it's pretty safe to expect a single item like this to tap into them in the same way that the Industrial Strength Risk Perception Measure would.  Aasen likely handicapped her detection of the strength of the influences she measured, however, by dichotomizing this measure ("Quite concerned" & "Very concerned" vs. "A little concerned" & "Not at all") rather than treating it as a 4-point ordinal one.
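
To see the dichotomization point concretely, here's a minimal simulation (my own toy illustration—the effect size, cutpoints, and sample size are invented, not Aasen's data or model): coarsening a latent "concern" variable into a 4-point ordinal item preserves most of its association with a worldview score, while collapsing it into a binary "concerned"/"not concerned" split shrinks the observed correlation further.

import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Invented-for-illustration worldview score and the latent "concern" it partly drives
egalitarianism = rng.normal(size=n)
latent_concern = 0.4 * egalitarianism + rng.normal(size=n)

# A 4-point ordinal item (cutpoints placed, arbitrarily, at the quartiles) ...
ordinal_item = np.digitize(latent_concern, np.quantile(latent_concern, [0.25, 0.5, 0.75]))
# ... versus the same item dichotomized ("concerned" vs. "not concerned")
binary_item = (latent_concern > np.median(latent_concern)).astype(float)

print(np.corrcoef(egalitarianism, ordinal_item)[0, 1])   # retains most of the association
print(np.corrcoef(egalitarianism, binary_item)[0, 1])    # smaller: detection handicapped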

Aasen's "individualism" scale was apparently substantially more reliable than her "egalitarianism" one  (the α's are reported as "> 0.70" and "> 0.30," respectively).  But assuming the indicators have the requisite relationship with the underlying disposition, low reliability doesn't bias results; it just attenuates the strength of them.

So it's pretty cool to now see evidence of the same sorts of cultural divisions in Norway as we see in the US (Kahan et al. 2012), UK (Kahan et al. 2015), Australia (Guy et al. 2014), & Switzerland (Shi et al. 2015), etc.  Maybe Aasen will follow up by adapting the "cultural cognition worldview" scales for a Norwegian sample!

But what really got my attention was the overall level of concern in the sample:

Yes, "individualism" and "Hierarchy" (the attitude opposite in valence to "egalitarianism") predict a steeper decline in concern after 2007, and obviously explain a lot more variance in 2011 than in 2003.

But look, first, at how modest "concern" was even for the most "egalitarian" and "communitarian" (opposite of individualistic) respondents; and, second, at the universality of the decline in concern since 2007.


The climate-concern item seems to be the international equivalent of a Gallup item that asks U.S. respondents "how worried" they are about "global warming" or "climate change" ("great deal," "fair amount," "only a little," or "not at all").  Here's what U.S. responses (combining the equivalent response categories) look like (with the period that overlaps w/ Aasen's data bounded by dotted lines):


You can see that the divide along "individualist-communitarian" and "egalitarian-hierarchy" lines in Norway is less extreme than the Democrat-Republican one in the U.S.  Actually, if we had data for the U.S. respondents' cultural worldviews, the greater degree of polarization in the U.S. would be shown to be even more substantial. 

But again, that's not as intriguing to me as what the data show about the relative levels of "concern"/"worry" in the two nations.  The U.S. population is not particularly "worried" on average, but apparently Norwegians are even less "concerned," as can be seen in this composite graphic, which charts the corresponding sets of responses for both nations, respectively, in the years for which there are data (note: Aasen supplied me with the Norwegian means; this Figure supersedes a slightly but not materially different one reflecting estimates from the model presented in the paper):

The trends are very comparable, and maybe the question wording or some cross-cultural exchange rate in how respondents indicate their attitudes explains the gap.

But clearly (by this measure at least) Norway is not more concerned than the U.S., which according to common wisdom "leads the world in climate denial."  

Indeed, the segment of society most culturally predisposed to worry about climate change in Norway is no more concerned than the "average" American.

So what's going on in that country?!

Maybe we can entice Aasen into a guest post.  I've already offered her the standard MOP$50,000.00 fee (payable in future stock options in CCP, Inc.), but I'm confident she, like other guests, will waive the fee to affirm that enlarging human knowledge is her only motivation for being a scholar (of course, there is still ambiguity, given the fame & celebrity endorsements, particularly in Macao, that come with being a CCP Blog guest poster).

We'll see what she says!

But in the meantime, this very interesting & cool paper supplies material for a fresh lesson about the dangers of "selecting on the dependent variable" in the science of science communication: if one tests one's theory of U.S. public opinion on climate change by considering only how well it "fits" the data in the U.S., then obviously one will be excluding the possibility of observing both comparable states of public opinion in societies where the asserted explanation ("balanced media norms," a creeping public "anti-science" sensibility, Republican brains, etc.) doesn't apply and divergent states of public opinion in societies in which the asserted explanation applies just as well (Shehata & Hopmann 2012).


Aasen, M. The polarization of public concern about climate change in Norway. Climate Policy (2015), advance online publication.

Guy, S., Kashima, Y., Walker, I. & O'Neill, S. Investigating the effects of knowledge and ideology on climate change beliefs. European Journal of Social Psychology 44, 421-429 (2014).

Kahan, D.M., Jenkins-Smith, H., Tarantola, T., Silva, C. & Braman, D. Geoengineering and Climate Change Polarization: Testing a Two-Channel Model of Science Communication. Annals of the American Academy of Political and Social Science 658, 192-222 (2015).

Kahan, D.M., Peters, E., Wittlin, M., Slovic, P., Ouellette, L.L., Braman, D. & Mandel, G. The polarizing impact of science literacy and numeracy on perceived climate change risks. Nature Climate Change 2, 732-735 (2012).

Shehata, A. & Hopmann, D.N. Framing Climate Change: a Study of US and Swedish Coverage of Global Warming. Journalism Studies 13, 175-192 (2012).

Shi, J., Visschers, V.H.M. & Siegrist, M. Public Perception of Climate Change: The Importance of Knowledge and Cultural Worldviews. Risk Analysis (2015), advance online publication.



Is there diminishing utility in the consumption of the science of science communication?

Apparently not!

Or at least not at Cornell University, where I gave 3 lectures Thurs. & had follow up meetings w/ folks Friday.

This is a university that gets the importance of integrating the practice of science and science-informed policymaking with the science of science communication.  The number of scholars across various departments in both the natural and social sciences who are applying themselves to this objective in their scholarship and pedagogy is pretty amazing.

Brief report:

No. 1 was a talk for the Global Leadership Fellows affiliated with the Cornell Alliance for Science ("a global initiative for science-based communication").  B/c the Fellows--an amazingly smart & talented group of science communication professionals & students--were going to tail me for the rest of the day, I thought I should pose a couple of questions that they could think about & that I'd answer in later lectures. Of course, I asked them for their own answers in the meantime. Since their answers were, predictably, better than the ones I was going to give, I just substituted theirs for mine later in the day--who would notice, right?

The questions were:

1. Do U.S. farmers believe in climate change? &

2. Do evolution non-believers enjoy watching documentaries on human evolution?

The Fellows were very curious about these issues.

Slides here.

No. 2 was a lecture to the class "The GMO Debate: Science, Society, and Global Impacts."  The title of my talk was "Are GMOs toxic for the science communication environment? Vice versa?"  I think I might have been the first person to break the news to them that there isn't any public contestation over GM foods in the U.S.

Slides here.

No. 3 was a public lecture.  Discussed the "science communication measurement problem," "the disentanglement principle," and "cognitive dualism & communicative pluralism."

Slides here.


Can I make you curious about science curiosity? . . .

If so, then maybe you'll stay tuned. An excerpt from something I'm working on:

. . . . As conceptualized here, science curiosity is not a transient state (see generally Loewenstein 1994), but instead a general disposition, variable in intensity across persons, that reflects the motivation to seek out and consume scientific information for personal pleasure.

A valid measure of this disposition could be expected to make myriad contributions to knowledge.  Such an instrument could be used to improve science education, for example, by facilitating investigation of the forms of pedagogy most likely to promote the development of science curiosity and harness it to promote learning (Blalock, Lichtenstein, Owen & Pruski 2008).  A science curiosity measure could likewise be used by science journalists, science filmmakers, and similar professionals to perfect the appeal of their work to those individuals who value it the most (Nisbet & Aufderheide 2009). Those who study the science of science communication (Fischhoff & Scheufele 2013; Kahan 2015) could also use a science curiosity measure to deepen their understanding of how public interest in science shapes the responsiveness of democratically accountable institutions to policy-relevant evidence.

Indeed, the benefits of measuring science curiosity are so numerous and so substantial that it would be natural to assume researchers must have created such a measure long ago.  But the plain truth is that they have not.  “Science attitude” measures abound. But every serious attempt to assess their performance has concluded that they are psychometrically weak and, more importantly, not genuinely predictive of what they are supposed to be assessing—namely, the disposition to seek out and consume scientific information for personal satisfaction.

We report the results of a research measure consciously designed to remedy this deficit....


Blalock, C.L., Lichtenstein, M.J., Owen, S., Pruski, L., Marshall, C. & Toepperwein, M. In Pursuit of Validity: A comprehensive review of science attitude instruments 1935–2005. International Journal of Science Education 30, 961-977 (2008).

Fischhoff, B. & Scheufele, D.A. The science of science communication. Proceedings of the National Academy of Sciences 110, 14031-14032 (2013).

Loewenstein, G. The psychology of curiosity: A review and reinterpretation. Psychological bulletin 116, 75 (1994).
Nisbet, M.C. & Aufderheide, P. Documentary Film: Towards a Research Agenda on Forms, Functions, and Impacts. Mass Communication and Society 12, 450-456 (2009).







Coming soon ... the Science Curiosity Index/Ludwick Quotient

Been busy at work on the CCP "Evidence-based Science Filmmaking Initiative" (ESFI), and hence neglecting the 14 billion readers of this blog... Sorry!

Am hoping what we will have to say on the progress we've been making will compensate.  More on that soon-- very soon.

But just to feed you enough information to prevent utter starvation, the coolest thing so far is a behaviorally validated Science Curiosity Index (SCI), which measures the disposition to seek out & consume science information for personal satisfaction.  It's amazing what science curiosity—which is definitely not the same thing as the science-comprehension disposition measured by Ordinary Science Intelligence—tells us about how people process information about contested science issues.

Some of us in the lab have taken to calling the SCI measure the "Ludwick Quotient" (LQ).

But more soon-- very soon, I promise!


New paper: Expressive rationality & misperception of facts

A comment on Lee Jussim's Social Perception and Social Reality: Why Accuracy Dominates Bias and Self-Fulfilling Prophecy (Oxford 2012).


This comment uses the dynamic of identity-protective cognition to pose a friendly challenge to Jussim (2012). The friendly part consists of an examination of how this form of information processing, like many of the ones Jussim describes, has been mischaracterized in the decision science literature as a “cognitive bias”: in fact, identity-protective cognition is a mode of engaging information rationally suited to the ends of the agents who display it. The challenging part is the manifest inaccuracy of the perceptions that identity-protective cognition generates. At least some of the missteps induced by the “bounded rationality” paradigm in decision science reflect its mistaken assumption that the only thing people use their reasoning for is to form accurate beliefs. Jussim’s critique of the bounded-rationality paradigm, the comment suggests, appears to rest on the same mistaken equation of rational information processing with perceptual accuracy.


"I was wrong?! Coooooooooool!"

Okay—now here’s a model for everyone who aspires to cultivate the virtues that signify a genuine scholarly disposition.

As discussed previously (here & here), a pair of economists have generated quite a bit of agitation and excitement by exposing an apparent flaw in the methods of the classic “hot hand fallacy” studies.

These studies purported to show that, contrary to popular understanding not only among sports fans but among professional athletes and coaches, professional basketball players do not experience "hot streaks," or periods of above-average performance longer in duration than one would expect to see by chance.  The papers in question have for thirty years enjoyed canonical status in the field of decision science research as illustrations of the inferential perils associated with the propensity of human beings to look for and see patterns in independent events.

Actually, the reality of that form of cognitive misadventure isn’t genuinely in dispute.  People are way too quick to discern signal in noise.

But what is open to doubt now is whether the researchers  used the right analytical strategy in testing whether this mental foible is the source of the widespread impression that professional basketball players experience "hot hands."

I won’t rehearse the details—in part to avoid the amusingly embarrassing spectacle of trying to make intuitively graspable a proof that stubbornly assaults the intuitions of highly numerate persons in particular—but the nub of the  proof supplied by the challenging researchers, Joshua Miller & Adam Sanjurjo, is that the earlier researchers mistakenly treated “hit” and “missed” shots as recorded in a previous, finite sequence of shots as if they were independent. In fact, because the proportion of “hits” and “misses” in a past sequence is fixed, strings of “hits” should reduce the likelihood of subsequent “hits” in the remainder of the sequence. Not taking this feature of sampling without replacement into account caused the original “hot hand fallacy” researchers to miscalculate the “null" in a manner that overstated the chance probability that a player would hit another shot after a specified string of hits....
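
(If you want to convince yourself of the underlying phenomenon without wading through the paper, here is a toy version—my own illustration, not M&S's, and using streaks of one heads rather than three to keep things small: enumerate all eight equally likely sequences of three fair coin flips, and for each one compute the share of heads among the flips that immediately follow a heads, discarding the two sequences that have no such flip. The average works out to 5/12, not 1/2.)

from itertools import product

# All 2**3 equally likely sequences of three fair coin flips. For each one,
# take the flips that immediately follow a heads and record the share of
# them that are heads; the two sequences with no such flip are thrown out.
proportions = []
for seq in product("HT", repeat=3):
    follows_heads = [seq[i] for i in range(1, 3) if seq[i - 1] == "H"]
    if follows_heads:
        proportions.append(follows_heads.count("H") / len(follows_heads))

# Intuition says the average should be 1/2; in fact it is 5/12.
print(sum(proportions) / len(proportions))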

Bottom line is that the data in the earlier studies didn’t convincingly rule out the possibility that basketball players’ performances did indeed display the sort of “streakiness” that defies chance expectations and supports the “hot hand” conjecture.

But in any case . . . the point of this update is to call attention to the truly admirable and inspiring reaction of the original researchers to the news that their result had been called into question in this way.

As I said, the “hot hand fallacy” studies are true classics. One could understand if those who had authored such studies would react defensively (many others who have been party to celebrating the studies for the last 30 yrs understandably have!) to the suggestion that the studies reflect a methodological flaw, one that itself seems to reflect the mischief of an irresistible but wrong intuition about how to distinguish random from systematic variations in data.

But instead, the reaction of the lead researcher to the M&S result, Tom Gilovich, is: “Coooool!!!!!!!!”

“Unlike a lot of stuff that’s come down the pike since 1985,” Gilovich was quoted as saying in a Wed. Wall Street Journal piece,

this is truly interesting," Gilovich said. "What they discovered is correct." Whether the real effect is "so small that the original conclusion stands or needs to be modified," he said, "is what needs to be determined."

The article goes on to report that Gilovich, along with others, is now himself contemplating re-analyses and new experiments to try to do exactly that.

In a word, Gilovich, far from having his nose bent out of joint by the M&S finding, is excited that a truly unexpected development is now furnishing him and others with a chance to resume investigation of an interesting and complex question.

I bet, too, that at least part of what intrigues Gilovich is how a mistake like this could have evaded the attention of decision scientists for this long–-and why even now the modal reaction among readers of the M&S paper is "BS!!" It takes about 45.3 (± 7) readings to really believe M&S's proof, and even then the process has to be repeated at weekly intervals for a period of two months before the point they are making itself starts to seem intuitive enough to have the ring of truth.

But the point is, Gilovich, whose standing as a preeminent researcher is not diminished one iota by this surprising turn in the scholarly discussion his work initiated, has now enriched us even more by furnishing us with a compelling and inspiring example of the mindset of a real scholar!

Whatever embarrassment he might have been expected to experience (none is warranted in my view, nor evident in the WSJ article) is dwarfed by his genuine intellectual excitement over a development that is truly cool & interesting—both for what it teaches us about a particular problem in probability and for the opportunity it furnishes to extend examination into human psychology (here, the distinctive vulnerability to error that likely is itself unique to people with intuitions fine-tuned to avoid making the mistakes that intuitions characteristically give rise to when people try to make sense of randomness).

I’m going to try to reciprocate the benefit of the modeling of scholarly virtue Gilovich is displaying by owning up to, and getting excited about, as many mistakes in my own previous work as I can find! 



Why do we seem to agree less & less as we learn more & more-- and what should we do about that?

from correspondence ...

Dear Prof Kahan,
I’m working on an article describing how our ideologies skew our ability to deal with the facts, no matter how true/scientifically sound they are. While researching this, I (obviously ;) landed upon your research. I’ve been eagerly reading papers and posts on Cultural Cognition –site, but  there are couple of things I’m still unsure of. Namely:
1 How does cultural cognition differ from motivated reasoning? Or is the latter included in the former; thus motivated reasoning is merely cultural cognition ”in action”?
An account here
Also see this 
2 Are smart people more prone to twist given facts so that they fit into their existing beliefs/values? Or are intelligent persons just more skillful in this process...?  
I think the latter.  That is, I don't think the reason various critical reasoning proficiencies magnify cultural cognition is that they are correlated with a greater stake in, or unconscious motivation to form, identity-protective beliefs; individuals who are better than average in critical reasoning aren't more partisan or more intensely partisan when one measures those things in them. I think they are just better at doing what people naturally do with information that helps them to form "beliefs" that express who they are.  Our motivated numeracy paper is in line w/ that interpretation.
3 Is motivated reasoning an unconscious reaction? Do we know we do it? Does everybody do it, even the ones who try not to?
That's the theory, & I believe the evidence supports it; well-designed experiments have for sure connected motivated-reasoning dynamics to unconscious processes.
Knowing doesn't seem to help, no. One can't "observe" the effect of the dynamic in oneself, much less control it. I'm sure, though, that one can behave in ways that anticipate the effect -- trying to manage the conditions under which one examines information, & also being conscious when an issue is of the sort about which one's beliefs might well have been influenced in this way & taking that into account in acting 
4 If motivated reasoning is unconscious (= automatic), how on earth do we stop it? Can we?
I have to confess this whole phenomenon bothers me to the bone, both as a human being and (especially) as a science journalist. How can we, how can anyone, promote rational ideas or actions or work towards the kind of society s/he thinks is worthwhile, if s/he doesn't first know how things are, and thus is able to take in the facts?
The only grounds any of us ever has for confidence in our perception of what is in fact known to science is the reliability of the faculties we use to recognize who knows what about what.  Those faculties are vulnerable to disruption by one or another form of social pathology.  We can attend to those pathologies; we all have an interest in that no matter what our cultural worldviews or our positions on particular issues. 
I would appreciate enormously, if you found a minute answering me.
With kind regards,
By enabling free and reasoning people to understand what science can teach us about how members of a pluralistic liberal democratic society come to know the vast amount of scientific knowledge that their way of life makes possible, you are a critical part of the solution. Thanks, & good luck w/ your story. 




Am I doing the right thing? . . . The “chick-sexing” disanalogy

Okay, here’s a set of reflections that seem topical as another school year begins.

The reflections can be structured with reference to a question:

What’s the difference between a lawyer and a chick sexer?

It’s not easy, at first, to figure out what they have in common.  But once one does, the risk that one won’t see what distinguishes them is much bigger, in actuarial and consequential terms.

I tell people about the link between them all the time—and they chuckle.  But in fact, I spend hours and hours and hours per semester eviscerating comprehension of the critical distinction between them in people who are filled with immense intelligence and ambition, and who are destined to occupy positions of authority in our society.

That fucking scares me.

Anyway, the chick sexer is the honey badger of cognitive psychology: relentlessly fascinating, and adorable. But because cognitive psychology doesn’t have nearly as big a presence on Youtube as do amusing voice-overs of National Geographic wildlife videos, the chick sexer is a lot less famous. 

So likely you haven’t heard of him or her.

But in fact the chick sexer plays a vital role in the poultry industry. It’s his or her responsibility to separate the baby chicks, moments after birth, on the basis of gender.

The females are more valuable, at least from the point of view of the industry. They lay eggs.  They are also plumper and juicier, if one wants to eat them. Moreover, the stringy scrawny males, in addition to being not good for much, are ill-tempered & peck at the females, steal their food, & otherwise torment them.

So the poultry industry basically just gets rid of the males (or the vast majority of them; a few are kept on and lead a privileged existence) at the soonest opportunity—minutes after birth.

The little newborn hatchlings come flying (not literally; chickens can’t fly at any age) down a roomful of conveyor belts, 100’s per minute. Each belt is manned (personed) by a chick sexer, who deftly plucks (as in grabs; no feathers at this point) each chick off the belt, quickly turns him/her over, and in a split second determines the creature’s gender, tossing the males over his or her shoulder into a “disposal bin” and gently setting the females back down to proceed on their way.

They do this unerringly—or almost unerringly (99.99% accuracy or whatever).

Which is astonishing. Because there's no discernible difference, or at least none that anyone can confidently articulate, in the relevant anatomical portions of the minutes-old chicks.

You can ask the chick sexer how he or she can tell the difference.  Many will tell you some story about how a bead of sweat forms involuntarily on the male chick beak, or how he tries to distract you by asking for the time of day or for a cigarette, or how the female will hold one’s gaze for a moment longer or whatever. 

This is all bull/chickenshit. Or technically speaking, “confabulation.”

Indeed, the more self-aware and honest members of the profession just shrug their shoulders when asked what it is that they are looking for when they turn the newborn chicks upside down & splay their little legs.

But while we don’t know what exactly chicksexers are seeing, we do know how they come to possess their proficiency in distinguishing male from female chicks: by being trained by a chick-sexing grandmaster.

For hours a day, for weeks on end, the grandmaster drills the aspiring chick sexers with slides—“male,” “female,” “male,” “male,” “female,” “male,” “female,” “female”—until they finally acquire the same power of discernment as the grandmaster, who likewise is unable to give a genuine account of what that skill consists in.

This is a true story (essentially).

But the perceptive feat that the chick sexer is performing isn’t particularly exotic.  In fact, it is ubiquitous.

What the chick sexer does to discern the gender of chicks is an instance of pattern recognition.

Pattern recognition is a cognitive operation in which we classify a phenomenon by rapidly appraising it in comparison to large stock of prototypes acquired by experience.

The classification isn’t made via conscious deduction from a set of necessary and sufficient conditions but rather tacitly, via a form of perception that is calibrated to detect whether the object possesses a sufficient number of the prototypical attributes—as determined by a gestalt, “critical mass” intuition—to count as an instance of it.

All manner of social competence—from recognizing faces to reading others' emotions—depends on pattern recognition.

But so do many specialized ones. What distinguishes a chess grandmaster from a modestly skilled amateur player isn't her capacity to conjure and evaluate a longer sequence of potential moves but rather her ability to recognize favorable board positions based on their affinity to a large stock of ones she has determined by experience to be advantageous.

Professional judgment, too, depends on pattern recognition.

For sure, being a good physician requires the capacity and willingness to engage in conscious and unbiased weighing of evidence diagnostic of medical conditions. But that's not sufficient; unless the doctor includes only genuinely plausible illnesses in her set of maladies worthy of such investigation, the likelihood that she will either fail to test for the correct one or fail to identify it soon enough to intervene effectively will be too high.

Expert forensic auditors must master more than the technical details of accounting; they must acquire a properly calibrated capacity to recognize the pattern of financial irregularity that helps them to extract evidence of the same from mountains of business records.

The sort of professional judgment one needs to be a competent lawyer depends on a properly calibrated capacity for pattern recognition, too.

Indeed, this was the key insight of Karl Llewellyn.  The most brilliant member of the Legal Realist school, Llewellyn observed that legal reasoning couldn’t plausibly be reduced to deductive application of legal doctrines. Only rarely were outcomes uniquely determined by the relevant set of formal legal materials (statutes, precedents, legal maxims, and the like).

Nevertheless, judges and lawyers, he noted, rarely disagree on how particular cases should be resolved. How this could be so fascinated him!

The solution he proposed was professional "situation sense": a perceptive faculty, acquired by education and experience, that enabled lawyers to reliably appraise specific cases with reference to a stock of prototypical "situation types," the proper resolution of which was governed by shared apprehensions of "correctness" instilled by the same means.

This feature of Llewellyn's thought—the central feature of it—is weirdly overlooked by many scholars who characterize themselves as "realists" or "New Realists," and who think that Llewellyn's point was that because there's no "determinacy" in "law," judges must be deciding on the basis of "political" sensibilities of the conventional "left-right" sort, generating differences in outcome across judges of varying ideologies. 

It’s really hard to get Llewellyn more wrong than that!

Again, his project was to identify how there could be pervasive agreement among lawyers and judges on what the law is despite its logical indeterminacy. His answer was that members of the legal profession, despite heterogeneity in their “ideologies” politically understood, shared a form of professionalized perception—“situation sense”—that by and large generated convergence on appropriate outcomes the coherence of which would befuddle non-lawyers.

Llewellyn denied, too, that the content of situation sense admitted of full specification or articulation. The arguments that lawyers made and the justifications that judges give for their decisions, he suggested, were post hoc rationalizations.  

Does that mean that, for Llewellyn, legal argument is purely confabulatory? There are places where he seems to advance that claim.

But the much more intriguing and I think ultimately true explanation he gives for the practice of reason-giving in lawyerly argument (or just for lawyerly argument) is its power to summon and focus “situation sense”: when effective, argument evokes both apprehension of the governing “situation” and motivation to reach a situation-appropriate conclusion.

Okay. Now what is analogous between lawyering and chick-sexing should be readily apparent.

The capacity of the lawyer (including the one who is a judge) to discern "correct" outcomes as she grasps and manipulates indeterminate legal materials is the professional equivalent of—and involves the exercise of the same cognitive operation as—the chick sexer's power to apprehend the gender of the day-old chick from inspection of its fuzzy, formless genitalia.

In addition, the lawyer acquires her distinctive pattern-recognition capacity in the same way the chick sexer acquires his: through professional acculturation.

What I do as a trainer of lawyers is analogous to what the chicksexer grandmaster does.  “Proximate causation,” “unlawful restraint of trade,” “character propensity proof/permissible purpose,” “collateral (not penal!) law”—“male,” “male,” “female,” “male”: I bombard my students with a succession of slides that feature the situation types that stock the lawyer’s inventory, and inculcate in students the motivation to conform the results in particular cases to what those who practice law recognize—see, feel—to be the correct outcome.

It works. I see it happen all the time. 

It’s quite amusing. We admit students to law school in large part because of their demonstrated proficiency in solving the sorts of logic puzzles featured on the LSAT. Then we torment them, Alice-in-Wonderland fashion, by presenting to them as “paradigmatic” instances of legal reasoning outcomes that clearly can’t be accounted for by the contorted simulacra of syllogistic reasoning that judges offer to explain them. 

They stare uncomprehendingly at written opinions in which a structural ambiguity is resolved one way in one statute and the opposite way in another--by judges who purport to be following the “plain meaning” rule.

They throw their hands up in frustration when judges insist that their conclusions are logically dictated by patently question-begging standards  (“when the result was a reasonably foreseeable consequence of the defendant’s action. . .  “) that can be applied only on the basis of some unspecified, and apparently not even consciously discerned, extra-doctrinal determination of the appropriate level of generality at which to describe the relevant facts.

But the students do learn—that the life of the law is not “logic” (to paraphrase, Holmes, a proto-realist) but “experience,” or better, perception founded on the “experience” of becoming a lawyer, replete with all the sensibilities that being that sort of professional entails.

The learning is akin to the socialization process that the students all experienced as they negotiated the path from morally and emotionally incompetent child to competent adult. Those of us who are already socially competent model the right reactions for them in our own reactions to the materials—and in our reactions to the halting and imperfect attempts of the students to reproduce it on their own. 

“What,” I ask in mocking surprise, “you don’t get why these two cases reached different results in applying the ‘reasonable foreseeability’ standard of proximate causation?” 

Seriously, you don’t see why, for an arsonist to be held liable for causing the death of firefighters, it's enough to show that he could ‘reasonably foresee’ 'death by fire,' whether or not he could foresee  ‘death by being trapped by fires travelling the particular one of 5x10^9 different paths the flames might have spread through a burning building'?! But why ‘death by explosion triggered by a spark emitted from a liquid nitrate stamping machine when knocked off its housing by a worker who passed out from an insulin shock’—and not simply 'death by explosion'—is what must be "foreseeable" to a manufacturer (one warned of explosion risk by a safety inspector) to be convicted for causing the death of employees killed when the manufacturer’s plant blew up? 

"Anybody care to tell Ms. Smith what the difference is,” I ask in exasperation.

Or “Really,” I ask in a calculated (or worse, in a wholly spontaneous, natural) display of astonishment,

you don't see why someone's ignorance of what's on the 'controlled substance' list doesn't furnish a "mistake of law" defense (in this case, to a prostitute who hid her amphetamines in tin foil wrap tucked in her underwear--is that where you keep your cold medicine or ibuprofen?! Ha ha ha ha ha!!), but why someone's ignorance of the types of "mortgage portfolio swaps" that count as loss-generating "realization events" under IRS regs (the sort of tax-avoidance contrivance many of you will be paid handsomely by corporate law firm clients to do) does furnish one? Or why ignorance of the criminal prohibition on "financial structuring" (the sort of stratagem a normal person might resort to to hide assets from his spouse during a divorce proceeding) furnishes a defense as well?!

Here Mr. Jones: take my cellphone & call your mother to tell her there’s serious doubt about your becoming a lawyer. . . .

This is what I see, experience, do.  I see my students not so much “learning to think” like lawyers but just becoming them, and thus naturally seeing what lawyers see.

But of course I know (not as a lawyer, but as a thinking person) that I should trust how things look and feel to me only if corroborated by the sort of disciplined observation, reliable measurement, and valid causal inference distinctive of empirical investigation.

So, working with collaborators, I design a study to show that lawyers and judges are legal realists—not in the comic-book “politicians in robes” sense that some contemporary commentators have in mind but in the subtle, psychological one that Llewellyn actually espoused.

Examining a pair of genuinely ambiguous statutes, members of the public predictably conform their interpretation of them to outcomes that gratify their partisan cultural or political outlooks, polarizing in patterns the nature of which are dutifully obedient to experimental manipulation of factors extraneous to law but very relevant indeed to how people with those outlooks think about virtue and vice.

But not lawyers and judges: they converge on interpretations of these statutes, regardless of their own cultural outlooks and regardless of experimental manipulations that vary which outcome gratifies those outlooks.

They do that not because they, unlike members of the public, have acquired some hyper-rational information-processing capacity that blocks out the impact of "motivated reasoning": the lawyers and judges are just as divided as members of the public, on the basis of the same sort of selective crediting and discrediting of evidence, on issues like climate change and the legalization of marijuana and prostitution.

Rather the lawyers and judges converge because they have something else that members of the public don’t: Llewellyn’s situation sense—a professionalized form of perception, acquired through training and experience, that reliably fixes their attention on the features of the “situation” pertinent to its proper legal resolution and blocks out the distracting allure of features of it that might be pertinent to how a non-lawyer—i.e., a normal person, with one or another kind of “sense” reliably tuned to enabling them to be a good member of a cultural group on which their status depends . . . .

So, that’s what lawyers and chick sexers have in common: pattern recognition, situation sense, appropriately calibrated to doing what they do—or in a word professional judgment.

But now, can you see what the chick sexer and the lawyer don’t have in common?

Perhaps you don’t; because even in the course of this account, I feel myself having become an agent of the intoxicating, reason-bypassing process that imparting “situation sense” entails.

But you might well see it—b/c here all I’ve done is give you an account of what I do as opposed to actually doing it to you.

We know something important about the chick sexer’s judgment in addition to knowing that it is an instance of pattern recognition: namely, that it works.

The chick sexer has a mission in relation to a process aimed at achieving a particular end.  That end supplies a normative standard of correctness that we can use not only to test whether chick sexers, individually and collectively, agree in their classifications but also whether they are classifying correctly.

Obviously, we’ll have to wait a bit, but if we collect rather than throw half of them away, we can simply observe what gender the baby chicks classified by the sexer as “male” and “female” grow up to be.

If we do that test, we’ll find out that the chick sexers are indeed doing a good job.

We don’t have that with lawyers’ or judges’ situation sense.  We just don’t.

We know they see the same thing; that they are, in the astonishing way that fascinated Llewellyn, converging in their apprehension of appropriate outcomes across cases that “lay persons” lack the power to classify correctly.

But we aren’t in a position to test whether they are seeing the right thing.

What is the goal of the process the lawyers and judges are involved in?  Do we even agree on that?

I think we do: assuring the just and fair application of law.

That’s a much more general standard, though, than “classifying the gender of chicks.”  There are alternative understandings of “just” and “fair” here.

Actually, though, this is still not the point at which I’m troubled.  Although for sure I think there is heterogeneity in our conceptions of the “goals” that the law aims at, I think they are all conceptions of a liberal political concept of “just” and “fair,” one that insists that the state assume a stance of neutrality with respect to the diverse understandings of the good life that freely reasoning individuals (or more accurately groups of individuals) will inevitably form.

But assuming that this concept, despite its plurality of conceptions, has normative purchase with respect to laws and applications of the same (I believe that; you might not, and that’s reasonable), we certainly don’t have a process akin to the one we use for chick sexers to determine whether lawyers and judges’ situation sense is genuinely calibrated to achieving it.

Or if anyone does have such a process, we certainly aren’t using it in the production of legal professionals.

To put it in terms used to appraise scientific methods, we know the professional judgment of the chick sexer is not only reliable—consistently attuned to whatever it is that appropriately trained members of their craft are unconsciously discerning—but also valid: that is, we know that the thing the chick sexers are seeing (or measuring, if we want to think of them as measuring instruments of a special kind) is the thing we want to ascertain (or measure), viz., the gender of the chicks.

In the production of lawyers, we have reliability only, without validity—or at least without validation.  We do successfully (remarkably!) train lawyers to make out the same patterns when they focus their gaze at the “mystifying cloud of words” that Cardozo identified the law as comprising. But we do nothing to assure that what they are discerning is the form of justice that the law is held forth as embodying.

Observers fret—and scholars using empirical methods of questionable reliability and validity purport to demonstrate—that judges are mere “politicians in robes,” whose decisions reflect the happenstance of their partisan predilections.

That anxiety that judges will disagree based on their “ideologies” bothers me not a bit.

What does bother me—more than just a bit—is the prospect that the men and women I’m training to be lawyers and judges will, despite the diversity of their political and moral sensibilities, converge on outcomes that defy the basic liberal principles that we expect to animate our institutions.

The only thing that I can hope will stop that from happening is for me to tell them that this is how it works.  Because if it troubles me, I have every reason to think that they, as reflective decent people committed to respecting the freedom & reason of others, will find some of this troubling too.

Not so troubling that they can’t become good lawyers. 

But maybe troubling enough that they won't stop being reflective moral people in their careers as lawyers; troubling enough so that if they find themselves in a position to do so, they will enrich the stock of virtuous-lawyer prototypes that populate our situation sense by doing something that they, as reflective, moral people—“conservative” or “liberal”—recognize is essential to reconciling being a “good lawyer” with being a member of a profession essential to the good of a liberal democratic regime.

That can happen, too.


How big a difference in mean CRT scores is "big enough" to matter? or NHT: A malignant craft norm, part 2

1.   Now where was I . . . ? 

Right . . . So yesterday I posted part I of this series, which is celebrating the bicentennial, or perhaps it’s the tricentennial--one loses track after a while--of the “NHT Fallacy” critique.

The nerve of it is that “rejection of the null [however it is arbitrarily defined] at p < 0.05 [or p < 10^-50 or whatever]” furnishes no inferentially relevant information in hypothesis testing. To know whether an observation counts as evidence in support of a hypothesis, the relevant information is not how likely we were to observe a particular value if the “null” is true but how much more or less likely we were to observe that value if one hypothesized “true” value is correct than if another hypothesized “true” value is correct (e.g., Rozeboom 1960; Edwards, Lindman & Savage 1963; Cohen 1994; Goodman 1999a; Gigerenzer 2004). 
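Put in symbols (my own shorthand, not any particular critic's notation), the inferentially relevant quantity is the likelihood ratio

\[
\mathrm{LR} \;=\; \frac{P(x \mid H_1)}{P(x \mid H_2)},
\]

i.e., the probability of observing the value x if one hypothesized "true" value is correct relative to the probability of observing it if the other is. The p-value (the probability of a result at least as extreme as x if the "null" is true) simply isn't a term in that ratio.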

Actually, I’m not sure when the first formulation of the critique appeared.  Amusingly, in his 1960 classic The Fallacy of the Null-hypothesis Significance Test, Rozeboom apologetically characterized his own incisive attack on the inferential barrenness of NHT as “not a particularly original view”!

The critique has been refined and elaborated many times, in very useful ways, since then, too.  Weirdly, the occasion for so many insightful elaborations has been the persistence of NHT despite the irrefutable proofs of those critiquing it.

More on that in a bit, but probably the most interesting thing that has happened in the career of the critique in the last 50 yrs. or so has been the project to devise tractable alternatives to NHT that really do quantify the evidentiary weight of any particular set of data. 

I’m certainly not qualified to offer a reliable account of the intellectual history of using Bayesian likelihood ratios as a test statistic in the social sciences (cf. Good). But the utility of this strategy was clearly recognized by Rozeboom, who observed that the inferential defects in NHT could readily be repaired by analytical tools forged in the kiln of “the classic theory of inverse probabilities.”

The “Bayes Factor” (actually, “the” misleadingly implies that there is only one variant of it) is the most muscular, deeply theorized version of the strategy. 

But one can, I believe, still get a lot of mileage out of less technically elaborate analytical strategies using likelihood ratios to assess the weight of the evidence in one’s data (e.g., Goodman, 1999b). 

For many purposes, I think, the value of using Bayesian likelihood ratios is largely heuristic: having to specify the predictions that opposing plausible hypotheses would generate with respect to the data, and to formulate an explicit measure of the relative consistency of the observed outcome with each, forces the researcher to do what the dominance of NHT facilitates the evasion of: the reporting of information that enables a reflective person to draw an inference about the weight of the evidence in relation to competing explanations of the dynamic at issue. 

That’s all that’s usually required for others to genuinely learn from and critically appraise a researcher’s work. For sure there are times when everything turns on how precisely one is able to estimate some quantity of interest, where key conceptual issues about how to specify one or another parameter of a Bayes Factor will have huge consequences for interpretation of the data.

But in lots of experimental models, particularly in social psychology, it’s enough to be able to say “yup, that evidence is definitely more consistent—way more consistent—with what we’d expect to see if H1 rather than H2 is true”—or instead, “wait a sec, that result is not really any more supportive of that hypothesis than this one!” In which case, a fairly straightforward likelihood ratio analysis can, I think, add a lot, and even more importantly avoid a lot of the inferential errors that accompany permitting authors to report “p < 0.05” and then make sweeping, unqualified statements not supported by their data.
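To make that concrete, here is a minimal sketch of the kind of comparison I have in mind (the numbers are toy values of my own, not taken from either of the studies discussed below; the helper simply models each hypothesis's prediction as a normal density centered on the effect it predicts):

```python
from scipy.stats import norm

def likelihood_ratio(observed, h1_pred, h2_pred, se):
    """How many times more probable the observed value is under H1 than
    under H2, treating each hypothesis's prediction as a normal density
    centered on its predicted value with standard error `se`."""
    return norm.pdf(observed, loc=h1_pred, scale=se) / \
           norm.pdf(observed, loc=h2_pred, scale=se)

# Toy example: an observed effect of 0.20 with a standard error of 0.08
# "rejects the null" at p < 0.05 ...
observed, se = 0.20, 0.08
p_value = 2 * norm.sf(observed / se)   # two-tailed, ~0.012

# ... and yet, relative to a hypothesis that predicted an effect of 0.8,
# that same observation is overwhelmingly MORE consistent with "no effect."
lr = likelihood_ratio(observed, h1_pred=0.0, h2_pred=0.8, se=se)
print(f"p = {p_value:.3f}; LR (no effect vs. big effect) = {lr:.2e}")
```

That, in a nutshell, is how a result can be "statistically significant" and still count as evidence against the researcher's hypothesis relative to a meaningful alternative.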

That’s exactly the misadventure, I said “yesterday,” that a smart researcher experienced with NHT.  That researcher found a “statistically significant” correlation (i.e., rejection of the “null” at p < 0.0xxx) between a sample of Univ of Ky undergraduates’ CRT scores (Frederick 2005) and their responses to a standard polling question on “belief in” evolution; he then treated that as corroboration of his hypothesis that “individuals who are better able to analytically control their thoughts are more likely” to overcome the intuitive attraction of the idea that “living things, are ... intentionally designed by some external agent” to serve some “function and purpose,” and thus “more likely to eventually endorse evolution’s role in the diversity of life and the origin of our species."

But as I pointed out, the author’s data, contrary to his assertion, unambiguously didn’t support that hypothesis.

Rather than showing that “analytic thinking consistently predicts endorsement of evolution,” his data demonstrated that knowing the study subjects’ CRT scores furnished absolutely no predictive insight into their "evolution beliefs."  The CRT predictor in the author’s regression model was “statistically significant” (p < 0.01), but was way too small in size to outperform a “model” that simply predicted “everyone” in the author’s sample—regardless of their CRT score—rejected science’s account of the natural history of human beings.  

(Actually, there were even more serious—or maybe just more interesting—problems having to do with the author’s failure to test the data's relative support for a genuine alternative about how cognitive reflection relates to "beliefs" in evolution: by magnifying the opposing positions of groups for whom "evolution beliefs" have become (sadly, pointlessly, needlessly) identity defining. But I focused “yesterday” on this one b/c it so nicely illustrates the NHT fallacy.)

Had he asked the question that his p-value necessarily doesn’t address—how much more consistent is the data with one hypothesis than another—he would have actually found out that the results of his study were more consistent with the hypothesis that “cognitive reflection makes no goddam difference” in what people say when they answer a standard “belief in evolution” survey item of the sort administered by Gallup or Pew.

The question I ended on, then, was,

How much more or less probable is it that we’d observe the reported difference in believer-nonbeliever CRT scores if differences in cognitive reflection do “predict” or “explain” evolution beliefs among Univ. Ky undergrads than if they don't?

That’s a very complicated and interesting question, and so now I’ll offer my own answer, one that uses the inference-disciplining heuristic of forming a Bayesian likelihood ratio.

2 provisos:

1. Using a Bayesian likelihood ratio is not, in my view, the only device that can be used to extract from data like these the information necessary to form cogent inferences about the support of the data for study hypotheses.  Anything that helps the analyst and reader gauge the relative support of the data for the study hypothesis in relation to a meaningful alternative or set of meaningful alternatives can do that.

Often it will be *obvious* how the data do that, given the sign of the value observed in the data or the size of it in relation to what common understanding tells one the competing hypotheses would predict.

But sometimes those pieces of information might not be so obvious, or might be open to debate. Or in any case, there could be circumstances in which extracting the necessary information is not so straightforward and in which a device like forming a Bayesian likelihood ratio in relation to the competing hypotheses helps, a lot, to figure out what the inferential import of the data is.

That's the pragmatic position I mean to be staking out here in advocating alternatives to the pernicious convention of permitting researchers to treat "p < 0.05" as evidence in support of a study hypothesis.

2. My "Bayesian likelihood ratio" answer here is almost surely wrong! 

But it is at least trying to answer the right question, and by putting it out there, maybe I can entice someone else who has a better answer to share it.

Indeed, it was exactly by enticing others into scholarly conversation that I came to see what was cool and important about this question.   Without implying that they are at all to blame for any deficiencies in this analysis, it’s one that emerged from my on-line conversations with Gordon Pennycook, who commented on my original post on this article, and my off-line ones with Kevin Smith, who shared a bunch of enlightening thoughts with me in correspondence relating to a post that I did on an interesting paper that he co-authored.

2.   What sorts of differences can the CRT reliably measure? 

Here’s the most important thing to realize: the CRT is friggin hard!

It turns out that the median score on the CRT, a three-question test, is zero when administered to the general population.  I kid you not: studies w/ general population samples (not student samples, or M Turk ones, or samples recruited from visitors to a website that offers to furnish study subjects with information on the relationship between their moral outlooks and their intellectual styles) show that 60% of the subjects can't get a single answer correct.

Hey, maybe 60% of the population falls short of the threshold capacity in conscious, effortful information processing that critical reasoning requires.  I doubt that but it's possible.

What that means, though, is that if we use the CRT in a study (as it makes a lot of sense to do; it’s a pretty amazing little scale), we necessarily can't get any information from our data on differences  in cognitive reflection among a group of people comprising 60% of the population.   Accordingly, if we had two groups neither of whose mean scores were appreciably above the "population mean," we'd be making fools of ourselves to think we were observing any real difference: the test just doesn't have any measurement precision or discrimination at that "low" a level of the latent disposition.

We can be even more precise about this -- and we ought to be, in order to figure out how "big" a difference in mean CRT scores would warrant saying stuff like "group x is more reflective than group y" or "differences in cognitive reflection 'predict'/'explain' membership in group x as opposed to y...."

Using item response theory, which scores the items on the basis of how likely a person with any particular level of the latent disposition (theta) is to get that particular item correct, we can assess the measurement precision of an assessment instrument at any point along theta.  We can express that measurement precision in terms of a variable "reliability coefficient," which reflects what fraction of the differences in individual test scores in that vicinity of theta is attributable to "true differences" & how much to measurement error.
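For the mechanics-minded, here's a minimal sketch of that calculation for a 2PL-style item response model (the discrimination and difficulty values below are placeholders I made up for illustration; they are not the estimated CRT item parameters):

```python
import numpy as np

# Hypothetical 2PL item parameters (a = discrimination, b = difficulty)
# for a three-item test of hard questions -- placeholders, NOT the actual
# CRT estimates.
ITEMS = [(1.5, 0.8), (1.6, 1.0), (1.4, 1.2)]

def item_information(theta, a, b):
    """Fisher information contributed by one 2PL item at ability level theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(correct | theta)
    return a ** 2 * p * (1.0 - p)

def reliability(theta, items=ITEMS):
    """Approximate reliability at theta: the share of observed-score variance
    attributable to true differences, computed from the test information
    I(theta) on the assumption that theta is scaled to have unit variance."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return info / (info + 1.0)

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}   reliability = {reliability(theta):.2f}")
```

The point of the exercise is just that measurement precision varies along theta, so whether a difference in scores means anything depends on where on the scale the groups being compared actually sit.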

Here's what we get for CRT (based on a general population sample of about 1800 people):

The highest degree of measurement precision occurs around +1 SD, or approximately "1.7" answers correct.  Reliability there is 0.60, which actually is pretty mediocre; for something like the SAT, it would be pretty essential to have a reliability of 0.8 along the entire continuum from -2 to +2 SD.  That’s b/c there is so much at stake, both for schools that want to rank students pretty much everywhere along the continuum, and for the students they are ranking. 

But I think 0.60 is "okay" if one is trying to make claims about groups in general & not rank individuals. If one gets below 0.5, though, the correlations between the latent variable & anything else will be so attenuated as to be worthless....

So here are some judgments I'd make based on this understanding of the psychometric properties of CRT:

  • If the "true" mean CRT scores of two groups -- like "conservatives" & "liberals" or "evolution believers" & "disbelievers" -- are both within the red zone, then one has no reasonable grounds for treating the two as different in their levels of reflection: CRT just doesn't have the measurement precision to justify the claim that the higher-scoring group is "more reflective," even if the difference in means is "statistically significant."

  • Obviously, if one group's true mean is in the red zone and another's in the green or yellow, then we can be confident the two really differ in their disposition to use conscious, effortful processing.

  • Groups within the green zone probably can be compared, too.  There's reasonable measurement precision there-- although it's still iffy (alpha is about 0.55 on avg...).

If I want to see if groups differ in their reflectiveness, then, I should not be looking to see if the difference in their CRT scores is "significant at p < 0.05," since that by itself won't support any inferences relating to the hypotheses, given my guidelines above.

If one group has a "true" mean CRT score that is in the "red" zone, the hypothesis that it is less reflective than another group can be supported with CRT results only if the latter group's "true" mean score is in the green zone.

3.  Using likelihood ratios to weigh the evidence on “whose is bigger?” 

So how can we use this information to form a decent hypothesis-testing strategy here?

Taking the "CRT makes no goddam difference" position, I'm going to guess that those who "don't believe" in evolution are pretty close to the population mean of "0.7."  If so, then those who "do believe" will need to have a “true” mean score of +0.5 SD or about "1.5 answers correct" before there is a "green to red" zone differential.

That's a difference in mean score of approximately "0.8 answers correct."

Thus, the "believers more reflective" hypothesis, then, says we should expect to find that believers will have a mean score 0.8 points higher than the population mean, or 1.5 correct.

The “no goddam difference” hypothesis, we’ll posit, predicts the "null": no difference whatsoever in mean CRT scores of the believers & nonbelievers.

Now turning to the data, it turns out the "believers" in the author’s sample had a mean CRT of 0.86, SEM = 0.07.  The "nonbelievers" had a mean CRT score of 0.64, SEM = 0.05.

I calculate the difference as 0.22, SEM = 0.08.

Again, it doesn’t matter that  this difference is “statistically significant”—at p < 0.01 in fact.  What we want to know is the inferential import of this data for our competing hypotheses. Which one does it support more—and how much more supportive is it?

As indicated at the beginning, a  really good (or Good) way to gauge the weight of the evidence in relation to competing study hypotheses is through the use of Bayesian likelihood ratios.  To calculate them, we look at where the observed difference in mean CRT scores falls in the respective probability density distributions associated with the “no goddam difference” and “believers more reflective” hypotheses.
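For concreteness, here's a rough computational sketch of that comparison (the spread I assign to each hypothesis's predictive distribution -- the standard error of the observed difference -- is my own simplifying assumption; the ratio is quite sensitive to that choice, so don't expect the output to match the figure reported below exactly):

```python
import math
from scipy.stats import norm

# Group statistics reported above.
mean_diff = 0.86 - 0.64                   # observed difference: 0.22
se_diff = math.sqrt(0.07**2 + 0.05**2)    # ~0.086 from the rounded SEMs (the post reports 0.08)

# Predicted differences in mean CRT score under the two hypotheses.
H_NO_DIFFERENCE = 0.0      # "no goddam difference"
H_MORE_REFLECTIVE = 0.8    # "believers more reflective" (the red-to-green zone gap)

lr = norm.pdf(mean_diff, H_NO_DIFFERENCE, se_diff) / \
     norm.pdf(mean_diff, H_MORE_REFLECTIVE, se_diff)
print(f"LR in favor of 'no goddam difference': {lr:.2e}")
```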

By comparing how probable it is that we’d observe such a value under each hypothesis, we get the Bayesian likelihood ratio, which is how much more consistent the data are with one hypothesis than the other:

The author’s data are thus roughly 2000 times more consistent with the “no goddam difference” prediction than with the “believers more reflective” prediction.

Roughly! Figuring out the exact size of this likelihood ratio is not important.

All that matters—all I’m using the likelihood ratio, heuristically, to show—is that, given what we know CRT is capable of measuring among groups whose scores are so close to the population mean, the size of the observed difference in mean CRT scores is orders of magnitude more consistent with the “no goddam difference” hypothesis than with the “believers more reflective” hypothesis, notwithstanding its "statistical significance."

That’s exactly why it’s not a surprise that a predictive model based on CRT scores does no better than a model that just uses the population (or sample) frequency to predict whether any given student (regardless of his or her CRT scores) believes in evolution.

Constructing a Bayesian likelihood ratio here was so much fun that I’m sure you’ll agree we should do it one more time. 

In this one, I’m going to re-analyze data from another study I recently did a post on: “Reflective liberals and intuitive conservatives: A look at the Cognitive Reflection Test and ideology,” Judgment and Decision Making, July 2015, pp. 314–331, by Deppe, Gonzalez, Neiman, Jackson Pahlke, the previously mentioned Kevin Smith & John Hibbing.

Here the authors reported data on the correlation between CRT scores and individuals identified with reference to their political preferences.  They reported that CRT scores were negatively correlated (p < 0.05) with various conservative position “subscales” in various of their convenience samples, and with a “conservative preferences overall” scale in a stratified nationally representative sample.  They held out these results as “offer[ing] clear and consistent support to the idea that liberals are more likely to be reflective compared to conservatives.”

As I pointed out in my earlier post, I thought the authors were mistaken in reporting that their data showed any meaningful correlation—much less a statistically significant one—with “conservative preferences overall” in their nationally representative sample; they got that result, I pointed out, only because they left 2/3 of the sample out of their calculation.

I did point out, too, that the reported correlations seemed way too small, in any case, to support the conclusion that “liberals” are “more reflective” than conservatives.  It was Smith’s responses in correspondence that moved me to try to formulate in a more systematic way an answer to the question that a p-value, no matter how minuscule, begs: namely, just “how big” a difference in two groups’ “true” mean CRT scores has to be before one can declare one group to be “more reflective,” “analytical,” “open-minded,” etc. than another.

Well, let’s use likelihood ratios to measure the strength of the evidence in the data in just the 1/3 of the nationally representative sample that the authors used in their paper.

Once more, I’ll assume that “conservatives” are about average in CRT—0.7. 

So again, the "liberal more reflective" hypothesis predicts we should expect to find that liberals will have a mean score 0.8 points higher than the population mean, or 1.5 correct.  That’s the minimum gap in group mean CRT scores necessary for one group to be deemed more reflective than another whose scores are close to the population mean.

Again, the “no goddam difference” hypothesis predicts the "null": here no difference whatsoever in mean CRT scores of liberal & conservatives.

By my calculation, in the subsample of the data in question, “conservatives” (individuals above the mean on the “conservative positions overall” scale) have a mean CRT of 0.55, SE = 0.08; “liberals” a mean score of 0.73, SE = 0.08.

The estimated difference (w/ rounding) in means is 0.19, SE = 0.09.
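Same recipe as before, with the subsample numbers plugged in (again, the spread assigned to each hypothesis's prediction is my own simplifying assumption, so the exact magnitude will differ from the figures quoted just below):

```python
from scipy.stats import norm

# Reported difference in mean CRT between "liberals" and "conservatives"
# in the 1/3 subsample: 0.19, SE = 0.09.
mean_diff, se_diff = 0.19, 0.09

lr = norm.pdf(mean_diff, 0.0, se_diff) / norm.pdf(mean_diff, 0.8, se_diff)
print(f"LR in favor of 'no goddam difference': {lr:.2e}")
```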

So here is the likelihood ratio assessment of the relative support of the evidence for the two hypotheses:

Again, the data are orders of magnitude more consistent with “makes no goddam difference.”

Once more, whether the difference is “5x10^3” or 4.6x10^3 or even 9.7x10^2 or 6.3x10^4 is not important. 

What is important is that there’s clearly much, much more reason for treating this data as supporting an inference diametrically opposed to the one drawn by the authors.

Or at least there is if I’m right about how to specify the range of possible observations we should expect to see if the “makes no goddam difference” hypothesis is true and the range of possible observations we should expect to see if the “liberals are more reflective than conservatives” hypotheses is true. 

Are those specifications correct?

Maybe not!  They're just the best ones I can come up with for now! 


If someone sees a problem & better still a more satisfying solution, it would be very profitable to discuss that! 


What's not even worth discussing, though, is whether "rejecting the null at p < 0.05" is the way to figure out if the data support the strong conclusions these papers purport to draw--because in fact, that information does not support any particular inference on its own.

4.  What to make of this

The point here isn’t to suggest any distinctive defects in these papers, both of which actually report interesting data.

Again, these are just illustrations of the manifest deficiency of NHT, and in particular the convention of treating “rejection of the null at p < 0.05”—by itself! – as license for declaring the observed data as supporting a hypothesis, much less as “proving” or even furnishing “strong,” “convincing” etc. evidence in favor of it.

And again in applying this critique to these particular papers, and in using Bayesian likelihood ratios to liberate the inferential significance locked up in the data, I’m not doing anything the least bit original!

On the contrary, I’m relying on arguments that were advanced over 50 years ago, and that have been strengthened and refined by myriad super smart people in the interim.

For sure, exposure of the “NHT fallacy” reflected admirable sophistication on the part of those who developed the critique. 

But as I hope the last couple of posts have shown, the defect in NHT that these scholars identified is really, really easy to understand. Once it’s been pointed out, any smart middle schooler can readily grasp it!

So what the hell is going on?

I think the best explanation for the persistence of the NHT fallacy is that it is a malignant craft norm.

Treating “rejection of the null at p < 0.05” as license for asserting support of one’s hypothesis is “just the way the game works,” “the way it’s done.” Someone being initiated into the craft can plainly see that in the pages of the leading journals, and in the words and attitudes—the facial expressions, even—of the practitioners whose competence and status is vouched for by all of their NHT-based publications and by the words, attitudes, and even facial expressions of other certified members of the field.

Most of those who enter the craft will therefore understandably suppress whatever critical sensibilities might otherwise have alerted them to the fallacious nature of this convention. Indeed, if they can’t do that, they are likely to find the path to establishing themselves barred by jagged obstacles.

The way to progress freely down the path is to produce and get credit and status for work that embodies the NHT fallacy.  Once a new entrant gains acceptance that way, then he or she too acquires a stake in the vitality of the convention, one that not only reinforces his or her aversion to seriously interrogating studies that rest on the fallacy but that also motivates him or her to evince thereafter the sort of unquestioning, taken-for-granted assent that perpetuates the convention despite its indisputably fallacious character.

And in case you were wondering, this diagnosis of the malignancy of NHT as a craft norm in the social sciences is not the least bit original to me either! It was Rozeboom’s diagnosis over 50 yrs. ago.

So I guess we can see it’s a slow-acting disease.  But make no mistake, it’s killing its host.


Cohen, J. The Earth is Round (p < .05). Am Psychol 49, 997 - 1003 (1994).

Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. Psych Rev 70, 193 - 242 (1963).

Frederick, S. Cognitive Reflection and Decision Making. Journal of Economic Perspectives 19, 25-42 (2005).

Gigerenzer, G. Mindless statistics. Journal of Socio-Economics 33, 587-606 (2004).

Goodman, S.N. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of internal medicine 130, 1005-1013 (1999a).

Goodman, S.N. Towards Evidence-Based Medical Statistics. 1: The P Value Fallacy. Ann Int Med 130, 995 - 1004 (1999b).

Rozeboom, W.W. The fallacy of the null-hypothesis significance test. Psychological Bulletin 57, 416 (1960).
