
So are humans very good at designing computer programs to predict Supreme Court decisions? You tell me!

I’ve posted previously about the quality of “computer models” developed by political scientists for predicting judicial decisions by the U.S. Supreme Court. So this is, in part, an update, in which I report what I’ve learned since then.

As explained previously, the models are integral to the empirical proof that these scholars offer in favor of their hypothesis that judicial decisionmaking generally is driven by “ideology” rather than “law.”

That proof is “observational” in nature—i.e., it relies not on experiments but on correlational analyses that relate case outcomes to various “independent variables” or predictors.  Those predictors, of course, include “ideology” (measured variously by the party of the President who appointed the sitting judges, the composition of the Senate that confirmed them, and, in the case of the Supreme Court, the Justices’ own subsequent voting records), the “statistical significance” of which, “controlling for” the other predictors, is thought to corroborate the hypothesis that judges are indeed relying on “ideology” rather than “law” in making decisions.

Commentators have raised lots of pretty serious objections going to the internal validity of these studies. Among the difficulties are sampling biases arising from the decision of litigants to file or not file cases (Kastellec & Lax 2008), and outcome “coding” decisions that (it is said) inappropriately count as illicit “ideological” influences what actually are perfectly legitimate differences of opinion among judges over which legally relevant considerations should be controlling in particular areas of law (Edwards & Livermore 2008; Shapiro 2009, 2010).

But the main issue that concerns me is the external validity of these studies: they don’t, it seems to me, predict case outcomes very well at all.

That was the point of my previous post.  In it, I noted the inexplicable failure of scholars and commentators to recognize that a computer model that beat a group of supposed “experts” in a widely heralded (e.g., Ayres 2007) contest to predict Supreme Court decisions (Ruger et al. 2004) itself failed to do better than chance.

It’s kind of astonishing, actually, but the reason this evaded notice is that the scholars and commentators either didn’t know or didn’t grasp the significance of the (well known!) fact that the U.S. Supreme Court, which has a discretionary docket, reverses (i.e., overturns the decision of the lower court) in well over 50% of the cases. 

Under these circumstances, it is a mistake (plain and simple) to gauge the predictive power of the model by assessing whether it does better than “tossing a coin.” 

Because it is already known that the process in question disproportionately favors one outcome, the model, to have any value, has to outperform someone who simply chooses the most likely outcome—here, reverse—in all cases (Long 1997; Pampel 2000).
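The point can be made concrete with a toy simulation (a hypothetical sketch; the 72% figure is the reversal rate for the Term discussed below):

```python
import random

# Toy sketch: with a 72% reversal rate, "chance" means always predicting
# the most likely outcome, not flipping a coin.
random.seed(0)
N = 100_000
outcomes = ["reverse" if random.random() < 0.72 else "affirm" for _ in range(N)]

# Strategy 1: always predict the most likely outcome ("reverse").
baseline_acc = sum(o == "reverse" for o in outcomes) / N

# Strategy 2: randomly vary between "reverse" and "affirm" (a fair coin).
coin_acc = sum((random.random() < 0.5) == (o == "reverse") for o in outcomes) / N

print(round(baseline_acc, 2))  # ~0.72: the real benchmark
print(round(coin_acc, 2))      # ~0.50: worse than "chance," properly defined
```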

The greater-than-50% predictive success rate of following that obvious strategy is how well someone could be expected to do by chance. Anyone who randomly varied her decisions between “reverse” and “affirm” would do worse than chance—just like the non-expert law professors who, in the widely (and embarrassingly!) heralded contest, got their asses whupped by the computer, whom I have in fact befriended and learned is named “Lexy.”

The problem, as I pointed out in the post, is that Lexy’s “75%” success rate (as compared to the “experts’” 59%) was not significantly different—practically or statistically (“p = 0.58”)—from the 72% reversal rate for the SCOTUS Term in question.
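That p-value can be reproduced, approximately, from the figures in the post (68 decisions, a 72% reversal rate, and Lexy’s 51 correct predictions) via a normal approximation to the two-sided binomial test—a sketch, not the exact test:

```python
import math

n, p0 = 68, 0.72     # decisions that Term; the Court's reversal rate
lexy_correct = 51    # Lexy's 75% success rate: 51 of 68 decisions

# Normal approximation to a two-sided binomial test of H0: p = 0.72
mean = n * p0
sd = math.sqrt(n * p0 * (1 - p0))
z = (lexy_correct - mean) / sd
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

print(round(p_value, 2))  # ~0.58
```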

A non-expert who had the sense to recognize that she was no expert would have correctly “predicted” 49 of the 68 decisions that year, just two fewer than Lexy managed to predict.

I was moved to write the post by a recent recounting of Lexy’s triumph, but I figured that surely in the intervening years—the contest was 13 yrs ago!—the field would have made some advances.

A couple of scholars in the area happily guided me to a cool working paper by Katz, Bommarito & Blackman (2014), who indeed demonstrate the considerable progress that this form of research has made.

KBB discuss the performance of a model whose name I’ve learned (from communication with that computer, whom I met while playing on-line poker) is Lexy2.

Lexy2 was fed a diet of several hundred cases decided from 1946 to 1953 (her “training set”), and then turned loose to “predict” the outcomes in 7000 more  cases from the years 1953 to 2013 (technically, that’s “retrodiction,” but same thing, since no one “told” Lexy2 how those cases came out before she guessed; they weren’t part of her training set).

Lexy2 got 70% of the case outcomes right over that time. 

KBB, to their credit (and my relief; I found it disorienting, frankly, that so many scholars seemed to be overlooking the obvious failure of Lexy1 in the big “showdown” against the “experts”), focus considerable attention on the difference between Lexy2’s predictive-success rate and the Court’s reversal rate, which they report was 60% over the period in question.

Their working paper (which is under review somewhere and so will surely be even more illuminating when it is published) includes some really cool graphics, two of which I’ve superimposed to illustrate the true predictive value of Lexy2:

As can be seen, variability in Lexy2’s predictive success rate ("KBB" in the graphic) is driven largely by variability in the Court’s reversal rate.

Still, 70% vs. 60% is a “statistically significant” difference—but with 7000+ observations, pretty much anything even 1% different from 60% would be. 

The real question  is whether the 10-percentage-point margin over chance is practically significant. 

(Of course, it's also worth pointing out that trends in the reversal rate should be incorporated into evaluation of Lexy2's performance, so we can be sure her success in periods when reversal might have been persistently less frequent doesn't subsidize predictive failure during periods when the reversal rate was persistently higher. Impossible to say from eyeballing, but it kind of looks like Lexy2 did better before 1988, when the Court still had a considerable mandatory appellate jurisdiction, than it has done with today's wholly discretionary one. But leave that aside for now.)

How should we assess the practical significance of Lexy2's predictive acumen?

If it helps, one way to think about it is that Lexy2 in effect correctly predicted 25% of the cases (her 10-pct-point margin, out of the 40% of “affirmed” cases) that “Mr. Non-expert,” who would wisely have predicted "reverse" in all cases, would have missed. Called the “adjusted count R2,” this is a logistic-regression analogue of R2 for linear regression.
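A minimal sketch of that adjusted count R2 calculation (the figures are the ones KBB report; the variable names are mine):

```python
acc_model = 0.70  # Lexy2's prediction-success rate
acc_base = 0.60   # the always-"reverse" baseline (the reversal rate)

# Adjusted count R^2: the share of the baseline's errors the model corrects
adj_count_r2 = (acc_model - acc_base) / (1 - acc_base)
print(round(adj_count_r2, 2))  # 0.25
```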

But I think an even more interesting way to gauge Lexy2’s performance is to compare it to the original Lexy’s.

As I noted, Lexy didn’t genuinely do better than chance.

Lexy2 did, but the comparison is not really fair to the original Lexy.

Lexy2 got to compete against "Mr. Chance" (the guy who predicts reverse in every case) for 60 terms, during which the average number of decisions was 128 cases as compared to 68 in the single term in which Lexy competed. Lexy2 thus had a much more substantial period to prove her mettle!

So one thing we can do is see how well we'd expect Lexy2 to perform against Mr. Chance in an "average" Supreme Court Term.  

Using the 0.60 reverse rate KBB report for their prediction (or retrodiction) sample & the 0.70 prediction-success rate they report for Lexy2, I simulated 5000 "75-decision" Terms--75 being about average for the modern Supreme Court, which is very lazy in historical terms.
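Here is roughly how such a simulation can be sketched with nothing but the standard library. It treats the two scores as independent binomial draws per Term—a simplifying assumption, since in reality both predictors face the same cases—so the numbers will only approximate those reported below:

```python
import random

random.seed(1)
N_TERMS, N_CASES = 5000, 75
P_REVERSE, P_LEXY2 = 0.60, 0.70

def binom(n, p):
    """Draw one binomial sample as a sum of Bernoulli trials."""
    return sum(random.random() < p for _ in range(n))

wins = 0
margins = []
for _ in range(N_TERMS):
    chance_correct = binom(N_CASES, P_REVERSE)  # Mr. Chance: always "reverse"
    lexy2_correct = binom(N_CASES, P_LEXY2)
    margins.append(lexy2_correct - chance_correct)
    wins += lexy2_correct > chance_correct

print(round(wins / N_TERMS, 2))                          # ~0.88: Lexy2 wins
print(round(sum(m >= 4 for m in margins) / N_TERMS, 2))  # ~0.75: wins by +5 pct or more
```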

Here's a graphic summary of the results:

In the 5000 simulated 75-decision Terms, Lexy2 beats Mr. Chance in 88%. In other words, the odds are a bit better than 7:1 that in a given Term Lexy2 will rack up a score of correct predictions that exceeds Mr. Chance's by at least 1.

But what if we want (for bookmaking purposes, say) to determine the spread—that is, the margin by which Lexy2 will defeat Mr. Chance in a given term? 

Remember that Lexy "won" against Mr. Chance in their one contest, but by a pretty unimpressive 3 percentage points (which with N = 68 was, of course, not even close to "significant"). 

If we look at the distribution of outcomes in 5000 simulated 75-decision terms, Lexy2 beats Mr. Chance by 10% in 50% of the 75-decision terms & fails to beat Mr. Chance by at least 10% in 50%. Not surprising; something would definitely be wrong with the simulation if matters were otherwise! But in any given term, then, Lexy2 is "even money" at +10 pct. 

The odds of Lexy2 winning by 5% or more over Mr. Chance (4 more correct predictions in a 75-decison Term) are around 3:1.  That is, in about 75% (73.9% to be meaninglessly more exact) of the 75-decision Supreme Court Terms, Lexy2 wins by at least +5 pct.   

The odds are about 3:1 against Lexy2 beating Mr. Chance by 15 pct points. 

Obviously the odds are higher than 3:1 that Lexy2 will eclipse the 3-pct-point win eked out by the original Lexy in her single contest with Mr. Chance. The odds of that are, according to this simulation, about 5:1. 

But what if we want to test the relative strength of the competing hypotheses (a) that “Lexy 2 is no better than the original 9001 series Lexy” and (b) that Lexy2 enjoys, oh, a “5-pct point advantage over Lexy” in a 75-decision term? 

To do that, we have to figure out the relative likelihood of the observed data-- that is, the results reported in KBB -- under the competing hypotheses.  Can we do that?  

Well, consider:


Here I've juxtaposed the probability distributions associated with  the "Lexy2 is no better than Lexy" hypothesis and  the "Lexy2 will outperform Lexy by 5 pct points" hypothesis. 

The proponents of those hypotheses are asserting that on average Lexy2 will beat “Mr. Chance” by 3%, Lexy’s advantage in her single term of competition, and 8% (+5% more), respectively, in a 75-decision term. 

Those “averages” are means that sit atop probability distributions characterized by standard errors of 0.09, which by my calculation (corroborated, happily, by the simulation) is the standard error of the difference in success rates between 0.72 or 0.75, on the one hand, and 0.60, on the other, in a 75-decision Term. 

The ratio of the densities at 0.10, the observed data, for the "Lexy2 +5 " hypothesis & the "Lexy2 no better"  hypothesis is 1.2.  That's the equivalent of the Bayesian likelihood ratio, or the factor by which we should update our prior odds of Hypothesis 2 rather than Hypothesis 1 being correct (Goodman 1999, 2005; Edwards, Lindman & Savage 1963; Good 1985). 
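A back-of-the-envelope version of that density ratio, using the means and standard error stated above (my calculation lands nearer 1.3 than 1.2, presumably a matter of rounding somewhere):

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

SE = 0.09        # standard error of the success-rate difference, per above
observed = 0.10  # Lexy2's observed margin over Mr. Chance

# H1 ("Lexy2 no better than Lexy"): expected margin 0.03
# H2 ("Lexy2 +5 over Lexy"):        expected margin 0.08
likelihood_ratio = normal_pdf(observed, 0.08, SE) / normal_pdf(observed, 0.03, SE)
print(round(likelihood_ratio, 1))  # ~1.3
```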

That's close enough to a likelihood ratio of 1 to justify the conclusion that the evidence is really just about as consistent with both hypotheses—“Lexy2 is no better” and “Lexy2 +5 over Lexy.” 

Is this “Bayes factor” (Goodman 1999, 2005) approach the right way to assess things? 

I’m not 100% sure, of course, but this is how I see things for now, subject to revision, of course, if someone shows me that I made a mistake or that there is a better way to think about this problem. 

In any case, the assessment has caused me to revise upward my estimation of the ability of Lexy!  I really have no good reason to think Lexy isn’t just as good as Lexy2.  Indeed, it’s not 100% clear from the graphics in KBB, but it looks to me that Lexy's 75% "prediction success" rate probably exceeded that of Lexy2 in 2002-03, the one year in which Lexy competed! 

At the same time, this analysis makes me think a tad bit less than I initially did of the ability of Lexy2 (& only a tad; it's obviously an admirable thinking machine). 

Again, Lexy2, despite “outperforming” Mr. Chance by +10 pct over 60 terms, shouldn’t be expected to do any better than the original Lexy in any given Term. 

More importantly, being only a 7:1 favorite to beat chance by at least a single decision, & only a 3:1 favorite to beat chance by 4 decisions or more (+5%), in an average 75-decision Term just doesn’t strike me as super impressive. 

Or in any case, if that is what the political scientists’ “we’ve proven it: judges are ideological!” claim comes down to, it’s kind of underwhelming.  

I mean, shouldn’t we see evidence of an effect stronger than that? Especially for the U.S. Supreme Court, which people understandably suspect of being “more political” than all the other courts that political scientists also purport to find are deciding cases on an ideological basis? 

It’s a result that’s sufficiently borderline, I’d say, to need help from another form of testing—like an experiment. 

No empirical method is perfect.  They are all strategies for conjuring observable proxies of processes that in fact we cannot observe directly. 

Accordingly, the only “gold standard,” methodologically speaking, is convergent validity: when multiple (valid) methods reinforce one another, then we can be more confident in all of them; if they don’t agree, then we should be wary about picking just one as better than another. 

The quest for convergent validity was one of the central motivations for our study (discussed in my post “yesterday”), which probes the “ideology thesis”—the political science conclusion based on observational studies—via experimental methods. 

That our study (Kahan, Hoffman, Evans, Lucci, Devins & Cheng in press) came to a result so decidedly unsupportive of the claim that judges are ideologically biased in their reasoning reinforces my conclusion that the evidence observational researchers have come up with so far doesn’t add much to whatever grounds one otherwise would have had for believing that judges are or are not “neutral umpires.” 

But I'm really not sure.  What do you think?


Ayres, I. How computers routed the experts. Financial Times ‘FT Magazine,’ Aug. 31, 2007.

Edwards, H.T. & Livermore, M.A. Pitfalls of empirical studies that attempt to understand the factors affecting appellate decisionmaking. Duke LJ 58, 1895 (2008).

Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. Psych Rev 70, 193 - 242 (1963).

Good, I.J. Weight of evidence: A brief survey. in Bayesian statistics 2: Proceedings of the Second Valencia International Meeting (ed. J.M. Bernardo, M.H. DeGroot, D.V. Lindley & A.F.M. Smith) 249-270 (Elsevier, North-Holland, 1985).

Goodman, S.N. Introduction to Bayesian methods I: measuring the strength of evidence. Clin Trials 2, 282 - 290 (2005).

Goodman, S.N. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of internal medicine 130, 1005-1013 (1999).

Kahan, Hoffman, Evans, Lucci, Devins & Cheng. “Ideology” or “Situation Sense”: An Experimental Investigation of Motivated Reasoning and Professional Judgment. U. Penn. L. Rev. (in press).

Kastellec, J.P. & Lax, J.R. Case selection and the study of judicial politics. Journal of Empirical Legal Studies 5, 407-446 (2008).

Katz, D.M., Bommarito, M.J. & Blackman, J. Predicting the Behavior of the Supreme Court of the United States: A General Approach. Working paper, July 21, 2014. Available at SSRN.

Long, J.S. Regression models for categorical and limited dependent variables (Sage Publications, Thousand Oaks, 1997).

Pampel, F.C. Logistic regression : a primer (Sage Publications, Thousand Oaks, Calif., 2000).

Shapiro, C. Coding Complexity: Bringing Law to the Empirical Analysis of the Supreme Court. Hastings Law Journal 60 (2009).

Shapiro, C. The Context of Ideology: Law, Politics, and Empirical Legal Scholarship. Missouri Law Review 75 (2010).



Reader Comments (4)

Dan -

==> "What do you think?"

I wouldn't pretend to be able to understand this post anywhere near well enough to be able to comment, but I do have a question about this:

==> "Especially for the U.S. Supreme Court, which people understandably suspect of being “more political” than all the other courts that the political scientists also think are deciding cases on an ideological basis? "

Why do people suspect that SCOTUS is any more "political" than other courts? I would think it might be the other way around (perhaps in particular other courts where justices are appointed?) - although the impact of the political biases on SCOTUS would be greater.

April 13, 2015 | Unregistered CommenterJoshua


What don't you understand? It's certainly my goal to make issues like this amenable to critical reflection by any curious and intelligent person; you certainly fit that description. Tell me & I will try to clarify.

April 13, 2015 | Registered CommenterDan Kahan

Dan -

The discussion about statistical analysis is too technical for me to follow. I am limited to a common-sense level of understanding of statistics, on my good days. And I suspect that it would be next to impossible to bridge the more technical issues you're discussing to my level of understanding.

April 13, 2015 | Unregistered CommenterJoshua

@Joshua--then you are just goading me into trying much harder.

April 13, 2015 | Registered CommenterDan Kahan
