## So are humans very good at designing computer programs to predict Supreme Court decisions? You tell me!

I’ve posted previously about the quality of “computer models” developed by political scientists for predicting judicial decisions by the U.S. Supreme Court. So this is, in part, an update, in which I report what I’ve learned since then.

As explained previously, the models are integral to the empirical proof that these scholars offer in favor of their hypothesis that judicial decisionmaking generally is driven by “ideology” rather than “law.”

That proof is “observational” in nature—i.e., it relies not on experiments but on correlational analyses that relate case outcomes to various “independent variables,” or predictors. Those predictors, of course, include “ideology” (measured variously by the party of the President who appointed the sitting judges, the composition of the Senate that confirmed them, and, in the case of the Supreme Court, the Justices’ own subsequent voting records), the “statistical significance” of which, “controlling for” the other predictors, is thought to corroborate the hypothesis that judges are indeed relying on “ideology” rather than “law” in making decisions.

Commentators have raised lots of pretty serious objections going to the *internal validity* of these studies. Among the difficulties are sampling biases arising from the decision of litigants to file or not file cases (Kastellec & Lax 2008), and outcome “coding” decisions that (it is said) inappropriately count as illicit “ideological” influences what actually are perfectly legitimate differences of opinion among judges over which legally relevant considerations should be controlling in particular areas of law (Edwards & Livermore 2008; Shapiro 2009, 2010).

But the main issue that concerns me is the *external validity* of these studies: they don’t, it seems to me, predict case outcomes very well at all.

That was the point of my previous post. In it, I noted the inexplicable failure of scholars and commentators to recognize that a computer model that beat a group of supposed “experts” in a widely heralded (e.g., Ayres 2007) contest to predict Supreme Court decisions (Ruger et al. 2004) itself *failed* to do better than chance.

It’s kind of astonishing, actually, but the reason this evaded notice is that the scholars and commentators either didn’t know or didn’t grasp the significance of the (well known!) fact that the U.S. Supreme Court, which has a discretionary docket, reverses (i.e., overturns the decision of the lower court) in well over 50% of the cases.

Under these circumstances, it is a mistake (plain and simple) to gauge the predictive power of the model by assessing whether it does better than “tossing a coin.”

Because it is already known that the process in question disproportionately favors one outcome, the model, to have any value, has to outperform someone who simply chooses the most likely outcome—here, *reverse*—in all cases (Long 1997; Pampel 2000).

The greater-than-50% predictive success rate of following that obvious strategy is how well someone could be expected to do *by chance*. Anyone who randomly varied her decisions between “reverse” and “affirm” would do *worse than chance*—just like the non-expert law professors who got their asses whupped in the widely (and embarrassingly!) heralded contest by the computer, whom I have in fact befriended and learned is named “Lexy.”
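To make the baseline point concrete, here is a minimal sketch (using an illustrative 72% reversal rate) of the expected accuracy of three naive strategies:

```python
# Expected accuracy of naive prediction strategies when the base rate
# of one outcome ("reverse") is p. Illustrative p = 0.72.
p = 0.72

always_reverse = p                           # pick the modal outcome every time
coin_flip      = 0.5                         # guess 50/50 at random
match_rates    = p * p + (1 - p) * (1 - p)   # guess "reverse" 72% of the time

print(f"always reverse: {always_reverse:.2f}")   # 0.72
print(f"coin flip:      {coin_flip:.2f}")        # 0.50
print(f"match rates:    {match_rates:.2f}")      # 0.60
```

Anyone who randomizes, even in proportion to the base rates, does worse than simply saying “reverse” every time—which is why “better than chance” has to mean beating the base rate, not beating a coin.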

The problem, as I pointed out in the post, is that Lexy’s “75%” success rate (as compared to the “experts’” 59%) was *not* significantly different—practically or statistically (“*p* = 0.58”)—from the *72%* reversal rate for the SCOTUS Term in question.

A non-expert who had the sense *to recognize* that she was no expert would have correctly “predicted” 49 of the 68 decisions that year, just two fewer than Lexy managed to predict.
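For anyone who wants to check the arithmetic, a figure like the quoted “*p* = 0.58” can be roughly reproduced with a two-sided one-sample z-test of Lexy’s 51-of-68 record against the 0.72 reversal rate. A sketch (the exact figure depends on which test one runs):

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, correct, base = 68, 51, 0.72        # Lexy's record vs. the Term's reversal rate
phat = correct / n                     # 0.75
se = sqrt(base * (1 - base) / n)       # standard error under the null
z = (phat - base) / se
p_two_sided = 2 * (1 - norm_cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_two_sided:.2f}")   # p comes out around 0.58
```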

I was moved to write the post by a recent recounting of Lexy’s triumph on 538.com, but I figured that surely in the intervening years—the contest was 13 yrs ago!—the field would have made some advances.

A couple of scholars in the area happily guided me to a cool working paper by Katz, Bommarito & Blackman (2014), who indeed demonstrate the considerable progress that this form of research has made.

KBB discuss the performance of a model, whose name I’ve learned (from communication with that computer, whom I met while playing on-line poker against her) is Lexy2.

Lexy2 was fed a diet of several hundred cases decided from 1946 to 1953 (her “training set”), and then turned loose to “predict” the outcomes in 7000 more cases from the years 1953 to 2013 (technically, that’s “retrodiction,” but same thing, since no one “told” Lexy2 how those cases came out before she guessed; they weren’t part of her training set).

Lexy2 got 70% of the case outcomes right over that time.

KBB, to their credit (and my relief; I found it disorienting, frankly, that so many scholars seemed to be overlooking the obvious failure of Lexy1 in the big “showdown” against the “experts”), focus considerable attention on the difference between Lexy2’s predictive-success rate and the Court’s reversal rate, which they report was 60% over the period in question.

Their working paper (which is under review somewhere and so will surely be even more illuminating still when it is published) includes some really cool graphics, two of which I’ve superimposed to illustrate *the true predictive value* of Lexy2:

As can be seen, variability in Lexy2’s predictive success rate ("KBB" in the graphic) is driven largely by variability in the Court’s reversal rate.

Still, 70% vs. 60% is a “statistically significant” difference—but with 7,000+ observations, pretty much anything even 1% different from 60% would be.

The real question is whether the 10-percentage-point margin over chance is practically significant.

(Of course, it's also worth pointing out that *trends* in the reversal rate should be incorporated into the evaluation of Lexy2's performance, so we can be sure her success in periods when reversal might have been persistently less frequent doesn't compensate for predictive failure during periods when the reversal rate was persistently higher. It's impossible to say from eyeballing, but it kind of looks like Lexy2 did better before 1988, when the Court still had a considerable mandatory appellate jurisdiction, than she has done with today's wholly discretionary one. But leave that aside for now.)

How should we assess the practical significance of Lexy2's predictive acumen?

If it helps, one way to think about it is that Lexy2 in effect correctly predicted “25%” of the cases that “Mr. Non-expert,” who would have wisely predicted “reverse” in *all* cases, would have missed (the 10-pct-point margin divided by the 40% of cases that were affirmed). Called the “adjusted count *R*^{2},” this is a logistic-regression equivalent of *R*^{2} for linear regression.
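A sketch of that arithmetic, assuming the 0.70 and 0.60 figures reported by KBB:

```python
# Adjusted count R^2: the fraction of the errors a modal-category guesser
# ("always reverse") would make that the model manages to correct.
accuracy = 0.70   # Lexy2's prediction-success rate (KBB)
modal    = 0.60   # reversal rate = accuracy of always guessing "reverse"

adj_count_r2 = (accuracy - modal) / (1 - modal)
print(f"adjusted count R^2 = {adj_count_r2:.2f}")   # 0.25
```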

But I think an even more interesting way to gauge Lexy2’s performance is to *compare it* to the original Lexy’s.

As I noted, Lexy didn’t genuinely do better than chance.

Lexy2 *did*, but the comparison is not really fair to the original Lexy.

Lexy2 got to compete against “Mr. Chance” (the guy who predicts reverse in every case) for *60 terms*, during which the *average* number of decisions was 128 cases, as compared to 68 in the single term in which Lexy competed. Lexy2 thus had a much more substantial period to prove her mettle!

So one thing we can do is see how well we'd expect Lexy2 to perform against Mr. Chance in an "average" Supreme Court Term.

Using the 0.60 reversal rate KBB report for their prediction (or retrodiction) sample & the 0.70 prediction-success rate they report for Lexy2, I simulated 5000 “75-decision” Terms—75 being about average for the modern Supreme Court, which is very lazy in historical terms.

Here's a graphic summary of the results:

In the 5000 simulated 75-decision Terms, Lexy2 beats Mr. Chance in 88%. In other words, the odds are a bit better than 7:1 that in a given Term Lexy2 will rack up a score of correct predictions that exceeds Mr. Chance's by *at least 1*.
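For anyone who wants to replicate it, the simulation can be sketched along these lines. Note that it treats Lexy2's correct count and Mr. Chance's as independent binomial draws in each Term, which is a simplifying assumption:

```python
import random

random.seed(1)  # for reproducibility

N_TERMS, N_CASES = 5000, 75
P_LEXY2, P_REVERSE = 0.70, 0.60   # KBB's success rate & reversal rate

def binom(n, p):
    """One binomial draw: number of successes in n independent trials."""
    return sum(random.random() < p for _ in range(n))

wins = 0
for _ in range(N_TERMS):
    lexy2  = binom(N_CASES, P_LEXY2)     # Lexy2's correct predictions
    chance = binom(N_CASES, P_REVERSE)   # Mr. Chance's ("reverse" every time)
    if lexy2 > chance:                   # beats Mr. Chance by at least 1
        wins += 1

print(f"Lexy2 beats Mr. Chance in {wins / N_TERMS:.0%} of simulated Terms")
```

The winning fraction comes out in the high 80s, consistent with the roughly 7:1 odds quoted above.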

But what if we want (for bookmaking purposes, say) to determine the spread—that is, the *margin* by which Lexy2 will defeat Mr. Chance in a given term?

Remember that Lexy "won" against Mr. Chance in their one contest, but by a pretty unimpressive 3 percentage points (which with N = 68 was, of course, not even close to "significant").

If we look at the distribution of outcomes in 5000 simulated 75-decision terms, Lexy2 beats Mr. Chance by more than 10 pct points in 50% of the terms & fails to do so in the other 50%. *Not* surprising; something would definitely be wrong with the simulation if matters were otherwise! But in any given term, then, Lexy2 is "even money" at +10 pct.

The odds of Lexy2 winning *by 5%* or more over Mr. Chance (4 more correct predictions in a 75-decision Term) are around 3:1. That is, in about 75% (73.9%, to be meaninglessly more exact) of the 75-decision Supreme Court Terms, Lexy2 wins by at least +5 pct.

The odds are about 3:1 *against* Lexy2 beating Mr. Chance by 15 pct points.

Obviously the odds are higher than 3:1 that Lexy2 will eclipse the 3-pct-point win eked out by the original Lexy in her single contest with Mr. Chance. The odds of that are, according to this simulation, about 5:1.
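These spread odds can be sanity-checked with a normal approximation to the difference between the two binomial counts. A sketch (the exact figures depend on the continuity correction one uses):

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p_lexy2, p_chance = 75, 0.70, 0.60
mean = n * (p_lexy2 - p_chance)   # expected margin: 7.5 decisions per Term
sd = sqrt(n * p_lexy2 * (1 - p_lexy2) + n * p_chance * (1 - p_chance))

def p_margin_at_least(k):
    """P(Lexy2's correct count exceeds Mr. Chance's by >= k decisions)."""
    return 1 - norm_cdf((k - 0.5 - mean) / sd)   # 0.5 = continuity correction

for pct, k in [(1, 1), (5, 4), (15, 12)]:
    print(f"P(margin >= ~{pct} pct, i.e. {k} decisions): {p_margin_at_least(k):.2f}")
```

This roughly reproduces the simulated figures: high-80s% for winning at all, ~75% for +5 pct, and ~25% (3:1 against) for +15 pct.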

But what if we want to test the *relative strength *of the *competing* hypotheses (a) that “Lexy 2 is no better than the original 9001 series Lexy” and (b) that Lexy2 enjoys, oh, a “5-pct point advantage over Lexy” in a 75-decision term?

To do that, we have to figure out the relative likelihood of the observed data-- that is, the results reported in KBB -- under the competing hypotheses. Can we do that?

Well, consider:

Here I've juxtaposed the probability distributions associated with the "Lexy2 is no better than Lexy" hypothesis and the "Lexy2 will outperform Lexy by 5 pct points" hypothesis.

The proponents of those hypotheses are asserting that *on average* Lexy2 will beat “Mr. Chance” by 3%, Lexy’s advantage in her single term of competition, and 8% (+5% more), respectively, in a 75-decision term.

Those “averages” are means that sit atop probability distributions characterized by standard errors of 0.09, which is, by my calculation (corroborated, happily!, by the simulation), the standard error of the difference in success rates between 0.72 or 0.75, on the one hand, and 0.60, on the other, for a 75-decision Term.

The *ratio of the densities* at 0.10, the observed data, for the "Lexy2 +5 " hypothesis & the "Lexy2 no better" hypothesis is 1.2. That's the equivalent of the Bayesian likelihood ratio, or the factor by which we should update our prior odds of Hypothesis 2 rather than Hypothesis 1 being correct (Goodman 1999, 2005; Edwards, Lindman & Savage 1963; Good 1985).
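The density-ratio calculation can be sketched this way, assuming normal distributions with the means and standard error just described. (With these rounded inputs the ratio comes out a bit above the 1.2 reported in the text, but either way it is close to 1.)

```python
from math import exp, pi, sqrt

def norm_pdf(x, mean, sd):
    """Normal probability density function."""
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

observed = 0.10   # Lexy2's observed margin over Mr. Chance
sd = 0.09         # standard error of the margin in a 75-decision Term

h1 = norm_pdf(observed, 0.03, sd)   # "Lexy2 is no better than Lexy" (+3 pct mean)
h2 = norm_pdf(observed, 0.08, sd)   # "Lexy2 +5 over Lexy" (+8 pct mean)

likelihood_ratio = h2 / h1
print(f"likelihood ratio (H2 vs. H1): {likelihood_ratio:.2f}")   # ~1.3
```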

That's close enough to a likelihood ratio of 1 to justify the conclusion that the evidence is really just about as consistent with both hypotheses—“Lexy2 is no better” and “Lexy2 +5 over Lexy.”

Is this “Bayes factor” (Goodman 1999, 2005) approach the right way to assess things?

I’m not 100% sure, of course, but this is how I see things for now, subject to revision, of course, if someone shows me that I made a mistake or that there is a better way to think about this problem.

In any case, the assessment has caused me to *revise upward* my estimation of the ability of Lexy! I really have no good reason to think Lexy isn’t just as good as Lexy2. Indeed, it’s not 100% clear from the graphics in KBB, but it looks to me that Lexy's 75% "prediction success" rate probably exceeded that of Lexy2 in 2002-03, the one year in which Lexy competed!

At the same time, this analysis makes me think a tad bit less than I initially did of the ability of Lexy2 (& only a tad; it's obviously an admirable thinking machine).

Again, Lexy2, despite “outperforming” Mr. Chance by +10 pct over 60 terms, shouldn’t be expected to do any better than the original Lexy in any given Term.

More importantly, being only a 7:1 favorite to beat chance by at least a single decision, & only a 3:1 favorite to beat chance by 4 decisions or more (+5%), in an average 75-decision Term just doesn’t strike me as super impressive.

Or in any case, if *that* is what the political scientists’ “we’ve proven it: judges are ideological!” claim comes down to, it’s kind of underwhelming.

I mean, shouldn’t we see stronger evidence of an effect stronger than *that*? Especially for the U.S. Supreme Court, which people understandably suspect of being “more political” than all the other courts that political scientists also purport to find are deciding cases on an ideological basis?

It’s a result that’s sufficiently borderline, I’d say, to need help from *another* form of testing—like an experiment.

*No* empirical method is perfect. They are all strategies for conjuring observable proxies of processes that in fact we cannot observe directly.

Accordingly, the only “gold standard,” methodologically speaking, is *convergent validity*: when multiple (valid) methods reinforce one another, then we can be more confident in all of them; if they don’t agree, then we should be wary about picking just one as better than another.

The quest for convergent validity was one of the central motivations for *our* study—discussed in my post “yesterday”—which used experimental methods to probe the “ideology thesis,” the political science conclusion based on observational studies.

That our study (Kahan, Hoffman, Evans, Lucci, Devins & Cheng in press) came to a result so decidedly unsupportive of the claim that judges are ideologically biased in their reasoning reinforces my conclusion that the evidence observational researchers have come up with so far doesn’t add much to whatever grounds one otherwise would have had for believing that judges are or are not “neutral umpires.”

But I'm really not sure. What do you think?

**Refs**

Ayres, I. How computers routed the experts. Financial Times ‘FT Magazine,’ Aug. 31, 2007.

Edwards, H.T. & Livermore, M.A. Pitfalls of empirical studies that attempt to understand the factors affecting appellate decisionmaking. *Duke LJ ***58**, 1895 (2008).

Edwards, W., Lindman, H. & Savage, L.J. Bayesian Statistical Inference in Psychological Research. *Psych Rev ***70**, 193 - 242 (1963).

Good, I.J. Weight of evidence: A brief survey. in *Bayesian statistics 2: Proceedings of the Second Valencia International Meeting* (ed. J.M. Bernardo, M.H. DeGroot, D.V. Lindley & A.F.M. Smith) 249-270 (Elsevier, North-Holland, 1985).

Goodman, S.N. Introduction to Bayesian methods I: measuring the strength of evidence. *Clin Trials ***2**, 282 - 290 (2005).

Goodman, S.N. Toward evidence-based medical statistics. 2: The Bayes factor. *Annals of internal medicine ***130**, 1005-1013 (1999).

Kastellec, J.P. & Lax, J.R. Case selection and the study of judicial politics. *Journal of Empirical Legal Studies ***5**, 407-446 (2008).

Katz, D.M., Bommarito, M.J. & Blackman, J. Predicting the Behavior of the Supreme Court of the United States: A General Approach. Working paper (July 21, 2014), available at SSRN: http://ssrn.com/abstract=2463244 or http://dx.doi.org/10.2139/ssrn.2463244.

Long, J.S. *Regression models for categorical and limited dependent variables* (Sage Publications, Thousand Oaks, 1997).

Pampel, F.C. *Logistic regression: a primer* (Sage Publications, Thousand Oaks, Calif., 2000).

Shapiro, C. Coding Complexity: Bringing Law to the Empirical Analysis of the Supreme Court. *Hastings Law Journal ***60** (2009).

Shapiro, C. The Context of Ideology: Law, Politics, and Empirical Legal Scholarship. *Missouri Law Review ***75** (2010).

I have to say, as the day goes by I wonder if I'm giving Lexy2 her due... Given the high reversal rate in the US S Ct, there isn't a lot of room to show predictive proficiency. There's necessarily going to be a high degree of variance in a 75-decision term—just as there is in a 75-hand session of poker (or a 750-hand one, for that matter). If over time she is racking up an avg margin of 7.5 correct predictions per term over Mr. Chance, then that's certainly something.

Anyway, KBB are doing this right. I hope the observational scholars purporting to prove "ideology" is driving lower court decisions are putting their models to a test this genuine; it's what has to be done to earn the claim that a decisionmaking model is truly *predicting* rather than simply being *fit* to the data one is falsely purporting to "predict."