## Special feature: Insights on S. Ct. prediction models from someone who knows what he is talking about

*I did a couple of posts (one here and another here) commenting on the performance of computer models designed to predict the outcomes of Supreme Court cases. Taking the bait, someone who actually knows something about this issue felt obliged to step in and enlighten me, along with the 14 billion regular readers of this blog, 12 billion of whom rely exclusively on the site for information on all subjects. So read and learn! I've already updated my own views on the subject based on the analysis and will have something to say "tomorrow."*

**A Response: Computer Programs and Predicting Supreme Court Decisions**

**Justin Wedeking, University of Kentucky**

In Professor Kahan’s recent post (hereafter Kahan) he tackles two Supreme Court forecasting models. For clarity I’ll use the same labels. The first model – “Lexy” or “Lexy1” – refers to the forecasting challenge from the 2002 Term that pitted “machine” against legal experts (Martin, Quinn, Ruger, and Kim 2004; Ruger et al. 2004). The second model – “Lexy2” – is the recent (and still ongoing) effort by Katz, Bommarito and Blackman (2014).[1] The goal of this “reply” post is to offer some thoughts on Kahan’s critiques, as well as on these forecasting models, that will hopefully reshape how we think about Court forecasts.

There appear to be two main issues in Kahan’s post. First, Kahan’s primary concern appears to be that neither attempt at forecasting true, “out of sample” cases does “very well.” A related and close secondary concern is that this failure to do well is problematic for various scholars’ claims made with respect to what he calls “the ideology thesis” – the claim that judges’ decisions are driven more by their own ideology (or personal policy preferences) than by “the law.” A perceived lack of evidence for “the ideology thesis” is potentially damning for scholars who believe that ideology is a major factor in Supreme Court decision making. Namely, it suggests that we know relatively little about decision making.

With respect to Kahan’s first point, I do not have any strong disagreements but rather three points that suggest more caution is needed before forming conclusions about forecasting models. The rest of the post is divided into three sections:

- In section one, I identify and discuss different criteria for determining when we have a successful prediction;
- In section two, I take a closer look at what is being predicted (i.e., the dependent variable) and offer a few thoughts;
- In the third section, I close with some thoughts about the models and machine learning algorithms used in Lexy1 and Lexy2.

Regarding Kahan’s argument on the ideology thesis, I will save my thoughts for a later date.

**Keep reading (or else you will forever be denied enlightenment!)**

## Reader Comments (6)

This is really interesting; I still have to read it through carefully.

I would like to suggest a clearer definition of the "null" or "naive" model, or maybe it's a new model, the "no specific information" (NSI) model.

The NSI model would use as its input past results only, justice by justice, with no specific information about the nature of the cases. Let's say we can specify each justice's response to a case as either "affirm" (1) or "reverse" (0). Each case will be described by a 9-digit string of 0's and 1's corresponding to each justice's decision on the case. The input to the NSI model is all of the strings from as many previous cases as possible, for a given court makeup (let's say). The decision of the court as a whole can be calculated from these strings.

In other words, the NSI model has absolutely no specific information about any case other than the affirm/reverse history, and no specific information about the case at hand. This history may be very informative in itself, offering various correlations, etc., but in the end, as a predictive tool, it can do no better than to predict "reverse" as the decision of the court as a whole (assuming the court reversed more often than not in the above-mentioned history).
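For concreteness, the NSI model described above can be sketched in a few lines of Python. The case strings below are invented purely for illustration; nothing here is real Court data.

```python
from collections import Counter

def nsi_predict(history):
    """NSI model: given past outcomes as 9-character strings of
    '1' (affirm) and '0' (reverse), one digit per justice, predict
    the historically more frequent whole-court outcome for every
    future case, ignoring all case-specific information."""
    # Whole-court decision: affirm if 5 or more justices vote '1'.
    outcomes = ["affirm" if s.count("1") >= 5 else "reverse" for s in history]
    counts = Counter(outcomes)
    return counts.most_common(1)[0][0]

# Hypothetical history: 7 whole-court reversals, 3 affirmances.
history = ["000000000", "000011000", "111111111",
           "000000011", "111110000", "000000000",
           "010000000", "111111100", "000001111",
           "000000001"]
print(nsi_predict(history))  # -> reverse
```

Because the prediction uses only the aggregate history, it is identical for every out-of-sample case, which is exactly the "always reverse" behavior described above.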

We are not dealing with a naive person here, but rather one who has no specific information about any of the cases. To say that the NSI prediction makes "silly" errors presumes specific information about a case which allows one to define "silly". To say that all the errors will be of the same type even when we know they shouldn't presumes that we have enough specific information to define the meaning of "same type" and "know they shouldn't". When the accuracy of the "expert" predictions is no better than the "silly" errors of the NSI model, what are we really saying? It seems to me what we are saying is that we have nothing quantitative to show, but our results make us feel good, not silly.

In other words, I think Dan's criterion (the NSI prediction) is the right one to use, not the 50-50 model which presumes absolutely no information, not even the past history.

About accuracy, parsimony, etc.

I think predictive accuracy should be the bottom line. Parsimony is nice, and tends to be associated with predictive accuracy and a sense of understanding, but it should not be a goal in itself. In this situation, it is *extremely* easy to confuse parsimony with cultural identity confirmation, so unless we are very certain that our parsimonious model is true and free of cultural bias, it should be avoided like the plague. If our parsimonious model makes us feel good about who we are, be *very* suspicious. If the truth is parsimonious, then aiming for the truth will result in parsimony; if not, then why do it?

Predictive accuracy should be the goal, and this is not the same as accuracy in reproducing past results. The ability to reproduce past results says zero about the model's predictive accuracy. Ability to reproduce past results is a necessary but totally insufficient condition for predictive accuracy. A good test for predictive accuracy is to remove data points from past results, including any influence they have on your predictive algorithm, and test your algorithm for its ability to predict that missing data. Do this for many data points, pairs, triples, etc.

It's like having a bunch of data points that follow an exponential curve with a little noise. You try to fit those data points with a polynomial, and you keep jacking up the order of the polynomial (i.e. the number of adjustable constants) until you get a very good match to the data. The problem is that the polynomial oscillates terribly between the data points and is useless for interpolation or extrapolation. Remove a data point or two and the adjustable constants change dramatically, and the model cannot reproduce the missing data point. If your expert insight tells you it should be an exponential function, then you fit an exponential function, and maybe you need 2 or 3 adjustable constants to reproduce the data, and it passes the missing data test with flying colors. Oh, and parsimony is a happy result. If you have a deep unconscious emotional hatred of exponential functions, you may find another function which works ok, sacrifice some accuracy and go with it, because it seems so obvious and comforting, but it's not right.
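This polynomial-versus-exponential experiment is easy to reproduce. Below is a minimal numpy sketch of the missing-data test: the curve, noise level, and polynomial degree are arbitrary choices, not anything from the SCOTUS models.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)                              # 10 equally spaced points
y = np.exp(0.4 * x) * (1 + 0.05 * rng.standard_normal(10))  # noisy exponential

def loo_error(fit, predict):
    """Leave-one-out test: drop each point, refit, predict it back,
    and return the mean absolute relative error."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        params = fit(x[mask], y[mask])
        errs.append(abs(predict(params, x[i]) - y[i]) / y[i])
    return float(np.mean(errs))

# Overfit: a degree-8 polynomial passes essentially exactly through the
# 9 remaining noisy points, then oscillates wildly at the held-out point.
poly_err = loo_error(lambda xs, ys: np.polyfit(xs, ys, 8),
                     lambda p, xi: np.polyval(p, xi))

# Parsimonious: a 2-constant exponential a*exp(b*x), fit in log space.
exp_err = loo_error(lambda xs, ys: np.polyfit(xs, np.log(ys), 1),
                    lambda p, xi: np.exp(np.polyval(p, xi)))

print(poly_err, exp_err)  # the polynomial's held-out error is far larger
```

The two-constant fit survives the missing-data test; the high-order polynomial, which matched the training points nearly perfectly, does not.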

@FrankL:

Keep the comments coming. I'm learning a lot from them & am sure others are getting (or, if they get to them at some later time, will get) similar value from them.

@FrankL:

Thank you for commenting, and sorry for the delay in responding; here are lengthy responses to your two comments.

First, re the “no specific information” (NSI) model: Putting aside for the moment the question of what your unit of analysis is, while the NSI model does not bring any case specific information (i.e., information that would distinguish one case from another), it does bring information in terms of how each justice and the Court voted. This is essentially the “naïve” model with a bit of injected awareness.

More to the heart of your comment regarding errors of the “same type” – I was referring to the confusion matrix in my post, where we *will* know (when we evaluate the predictions) if we have a “false positive” or a “false negative.” So, when all of the errors are of one type, which is what would happen if one were to employ the “naïve” or NSI model, we will have all “false positive” errors and zero “false negatives.” This degree of imbalance, in a context where the reverse-affirm distribution is approximately 65-35, can and does seem “silly” if we know that any reasonable prediction strategy should contain some false negatives.
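To make the confusion-matrix point concrete, here is a small sketch using a hypothetical 100-case docket with the roughly 65-35 reverse-affirm split mentioned above (the numbers are illustrative, not actual Term data).

```python
def confusion(actual, predicted, positive="reverse"):
    """Count confusion-matrix cells, treating `positive` as the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

# Hypothetical docket with a 65-35 reverse-affirm split.
actual = ["reverse"] * 65 + ["affirm"] * 35
naive = ["reverse"] * 100   # the naive/NSI strategy: always predict "reverse"

tp, fp, fn, tn = confusion(actual, naive)
print(tp, fp, fn, tn)  # 65 35 0 0 -- every error is a false positive
```

The naive strategy is 65% accurate, yet all 35 of its errors land in a single cell of the matrix: zero false negatives, which is the imbalance described above.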

I did not disagree with Kahan on invoking the “naïve” model as a threshold; as I said, that is very common in the literature. However, I felt the “50-50” threshold got a bad rap, and I think it can be appropriate under certain circumstances. In the end, I think picking the right criterion should match your goal, along with your tolerance for the threshold of “victory” when comparing one model over another.

Second, regarding accuracy versus parsimony: We are in agreement that accuracy is the bottom line. I said that as well in my post, and since that is the name of the game in forecasting, it will always be that way. My point was perhaps more subtle, but nonetheless still very important. And I do not think one has to bring in cultural identity confirmation when thinking about parsimony. I can have two competing parsimonious models (irrespective of what the cultural identity and biases are), evaluate their accuracy, and then use the one that is more successful, even if it goes against my cultural identity. Let me explain further why parsimony can be a goal.

Why parsimony? To answer this, I am assuming forecasters want three things: (1) to be accurate; (2) to know why forecasts are accurate (i.e., understand why they are accurate); and (3) to be able to explain to others coherently how accurate predictions are generated.

I don’t think I will get much disagreement on the first assumption. With respect to the second assumption, I think this is pretty reasonable, but it merely rules out predictive strategies that are not based on some sort of systematic explanation that is truly related to the phenomenon we are trying to predict. For example, it would rule out the prediction strategy of predicting “reverse” every time Justice Scalia generates three laughs or more from the audience at oral argument. That strategy may be highly accurate, but is likely only spuriously related to the underlying “signal” that is driving the reverse-affirm decision. The point being, at some point, using the “three laugh” prediction strategy will lead us astray.

The third assumption is the difficult one, in my opinion, but it is necessary if our forecasts are to help us better understand what is going on with the process. If we care about that, then parsimony has to figure into our prediction strategy at some point. This third assumption is one that Lexy2 has a very difficult time satisfying. With over 90 predictors, it is hard to explain to others what is going on or even whether it is capturing the underlying signal. This is not to say their strategy is incorrect; it is perfectly fine and reasonable because they are explicit about emphasizing their goal of accuracy. But it also means that they give up on parsimony as one of their goals. In contrast, Lexy1 was explicit in that it tried to achieve both parsimony and accuracy when competing against experts, and to some extent they made progress towards both goals. However, when we shift the threshold (e.g., moving from “experts” as the comparison to the “naïve” model), then we see that their progress towards the accuracy goal is less realized. In defense of the Lexy1 authors, it was my reading of their study that they wanted to “beat” the experts while having a parsimonious model. Thus, I think both Lexy1 and Lexy2 are “good” models in the sense that they achieved the goals they explicitly set out to accomplish. Whether it is fair to evaluate them on additional criteria imposed after the fact is a different question.

To continue with the importance of this third assumption, the analogy I would use is that of weather forecasters. As many others have noted (e.g., Nate Silver), weather forecasters have improved substantially over the years and now have a high accuracy rate (for things like predicting temperature and rainfall). But when we, as consumers of weather info, get the forecast, we do not necessarily care how the forecaster came up with it (yes, we listen to talk about jet streams and high/low pressure fronts, but that is only the veneer). As *consumers* we only want to know whether we have to bring our umbrella to work. Conversely, if we were all weather forecasters, we would want to know what is going on with the prediction-generating process. In the end, parsimony is just one goal that forecasters can value under certain circumstances. While parsimony and accuracy do not have to involve a tradeoff, they often do, and that was the case in comparing “Lexy1” and “Lexy2.”

A few other related points.

Some of this depends upon what algorithm you choose and how susceptible it is to overfitting, but one common “mistake” in forecasting is a tendency to keep adding predictors to the model because doing so improves accuracy on the “test” data. This approach comes with the risk of setting yourself up for an “overfit” model that may ultimately produce poor accuracy when tested on “out of sample” data. Parsimony helps keep the focus on trying to understand the main driving force behind the process. Overfitting a model makes us aware of the irony of forecasting: striving for the goal of accuracy can sometimes leave you further from your goal than when you started.
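The predictor-piling risk is easy to demonstrate in miniature. In the sketch below (invented data, with a plain least-squares fit standing in for a real learner), adding 40 irrelevant predictors necessarily drives training error down while doing nothing genuine for, and typically hurting, held-out error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
signal = rng.standard_normal(n)                    # one genuinely predictive feature
y = (signal + 0.5 * rng.standard_normal(n) > 0).astype(float)  # binary outcome
noise = rng.standard_normal((n, 40))               # 40 irrelevant "predictors"

train, test = slice(0, 40), slice(40, 60)

def rss(X):
    """Least-squares fit on the training split; return (train, test) squared error."""
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return (float(np.sum((X[train] @ coef - y[train]) ** 2)),
            float(np.sum((X[test] @ coef - y[test]) ** 2)))

small = np.column_stack([np.ones(n), signal])       # 2 parameters
big = np.column_stack([np.ones(n), signal, noise])  # 42 parameters

rss_small, rss_big = rss(small), rss(big)
print(rss_small, rss_big)  # training error falls toward zero; held-out error typically grows
```

Nested least squares guarantees the bigger model never looks worse on the training split, which is exactly why "it improved on the data I have" is such a seductive and dangerous criterion.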

Next, with respect to the value of parsimony, compare the 2002 accuracy rates of “Lexy1” versus “Lexy2” – you will see that they are not that much different. And Kahan’s previous post on this makes this point. But where the two models differ drastically is in one’s ability to explain or tell a coherent story as to *how* and *why* we arrived at the prediction. That is the value of parsimony.

With respect to the second paragraph of your second comment, I am not sure I follow your point. You begin the paragraph by saying “the ability to reproduce past results says zero about the model’s predictive accuracy” but later in the paragraph say “a good test for predictive accuracy is to remove data points from past results…and test your algorithm for its ability to predict that missing data.” I’m not sure you can have it both ways. Regardless, it might be helpful to know that, with machine learning approaches to forecasting (which is what both Lexy1 and Lexy2 are), the process you describe of removing past data points and predicting them is essentially what forecasters are doing when they examine their “test” data (after having “trained” the model) – this is the cross-validation process. Once forecasters are satisfied with this part, they then move on to “out of sample” data (i.e., making predictions about what will happen tomorrow).

One last point re researcher bias: fitting an exponential model is just one of many possible algorithms you could use, and the goal of the cross-validation process is to remove any emotion the researcher might have. It does this because the model is learning from the data, not the researcher per se. But your point is still correct in the sense that the cross-validation process, which can take many different forms, can still allow biases to creep into the model.

@Justin Wedeking - Hi, and thanks for your response. Please forgive the long-winded response, but I think it touches on some good points.

Regarding the "NSI" model and models in general, I'm not sure what you mean by "unit of analysis", but I have been thinking of the decision of the court for a given case as being represented by a 9-digit binary string (D), each digit representing the decision of a particular justice, with 1 representing "affirm" and 0 representing "reverse". I know this is a simplification, no room for recusals, more nuanced decisions, etc., but I just want to see how far it takes me. It seems to me that the output of *any* model for a given case can be expressed as a list of 2^9=512 probabilities, p(D), each giving the probability that a particular 9-digit string will be the outcome of the case. The probabilities, of course, sum to unity. The probability of whole-court affirmation or reversal can be calculated from p(D), as well as various correlations, etc. When you say this is the naive model with a bit of injected awareness, I would say that "bit" is the maximum amount of non-specific awareness attainable from the cases considered.

For the coin-flip model, each of these probabilities is equal to 1/512. The NSI model is built up from n previous cases and each probability is equal to the number of times the particular case outcome occurred divided by the total number of cases. It is not specific, so it does not change when it is applied to any out of sample case. There is a lot of information there, none of it case-specific, but still, as a predictive tool, it yields simply the ratio of whole court reversals divided by the number of cases as the probability that the whole court will reverse on any future case. The coin flip model yields 1/2 for this probability. An "expert" model will generally deliver a different list of 512 probabilities for each case based on some specific information about the case.

I see what you are saying about errors of the same type, and I agree that these results are "silly", but I am trying to understand things in an information-theoretic sense, and maybe I get OCD about what information comes and goes where. Characterizing the confusion matrix as silly is using "expert" information, I just want to keep that clear in my mind.

I am working on the assumption that a successful prediction of the whole court decision is the bottom line, that it constitutes the only "victory" available, it is "what we are betting on". I believe there should be a single-minded emphasis on raising accuracy levels *for out of sample cases*. If we overfit, trying to raise accuracy on the training cases, our single-minded emphasis will declare that model a failure. If we have a non-parsimonious model which beats a parsimonious model on out of sample cases, it is to be preferred, but I would want to analyze that model to perhaps find out a parsimonious reason why it was so good. Is the problem itself complex, requiring a complex model, or is it simple and our model needlessly complex? I would not be as interested in finding out why the parsimonious model was less useful.

From an information viewpoint, any model which cannot beat the NSI model with respect to the whole-court decision is delivering no more predictive information than the NSI model. If we change the definition of victory to include other kinds of predictions regarding the decision, then maybe expert models will do much better. For example, we could develop a non-specific model in which each justice had a 60% probability of voting for reversal, and each justice's decision was independent of any other. This will give the 512 probabilities and will yield the same 70 percent probability of whole-court reversal. But if you ask about correlations between the justices' decisions, it would be a very bad predictor, while the NSI model would be much more informative. The same would be true if we awarded a point for every justice's decision that was correctly predicted as the measure of victory.
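The independent-justice model described above can be computed exactly. The sketch below assumes p = 0.6 per justice and a simple 5-of-9 majority rule, both taken from the paragraph above; the exact whole-court figure works out to roughly 73 percent, in the same ballpark as the number quoted.

```python
from math import comb

def whole_court_reverse_prob(p):
    """P(at least 5 of 9 justices vote to reverse), each justice voting
    to reverse independently with probability p."""
    return sum(comb(9, k) * p**k * (1 - p)**(9 - k) for k in range(5, 10))

# With p = 0.6 the whole-court reversal probability is about 0.733,
# close to the observed base rate, even though every pairwise correlation
# between justices is zero by construction.
print(round(whole_court_reverse_prob(0.6), 3))
```

This is the point of the paragraph above: two models can agree on the whole-court reversal rate while disagreeing completely about bloc voting and other correlations.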

Regarding the second paragraph on the fitting to data, perhaps it wasn't too clear, but the polynomial fit was just an example of overfitting and the use of cross validation to reveal the overfitting. The exponential fit was an example of choosing the "right" model, and how cross validation validated the model and how the underlying parsimony of the process was reflected in the model as a consequence. The big question is, how do we discover that "right" exponential function if the candidates for the fitting function carry a lot of cultural identity baggage? This is what bothers me regarding the SCOTUS predictions.

Regarding this problem of bias and parsimony, as a theoretical physicist, I value parsimony very highly, but I think of it as something to be discovered, rather than a process of picking the most victorious one out of a list of my favorite candidates. One thing I have done is to search for tendencies of justices to vote as a bloc when there is disagreement. (This is a type of correlation about which the NSI model has something significant to say.) My general attitude has been characterized as cynical by those of my friends who value consensus more than I do, so if I were building a specific model, perhaps I would be prone to overvalue the parsimony of these bloc-forming tendencies while, I believe, they would be prone to undervalue it. This is my suspicion regarding parsimony. I prefer an analytic method which allows me to discover parsimony, rather than hoping that I can overcome my bias-induced blindness and equitably test some parsimonious candidates that I don't like. I would be more inclined to trust someone else's results if they were arrived at in the same spirit. Dan is good at this (but, of course, not perfect, in my mind), and he gets a lot of flak as a result.

**Information Theory and SCOTUS prediction**

To summarize, each decision by the court (D) can be represented as a 9-digit binary string, each whole-court decision (d) as a 1-digit binary string. Predictive information (W) is associated with each case, and it is a binary string of m digits. W may be the same for every case, in which case it is non-specific, or it may change for each case, in which case it is specific. The mutual information I(D,W) or I(d,W) will give the amount of predictive information in W (in bits) that can be used to predict D or d. A function F(W) must be devised to actually deliver the prediction, and unless the function is perfect, there will be further loss in predictive ability. It would be a good idea to compare various estimates of the mutual information in the Lexys to determine how much predictive information is available, and then how well the predictive function is delivering this information.

The information entropy of the 512 probabilities for a given case can be calculated, and it measures how "uncertain" or "tentative" a model is. The coin-toss model is maximally uncertain; the entropy is 9 bits of "missing information" or uncertainty. A model which makes a determinate prediction, with 511 of the 512 probabilities equal to zero and the remaining one equal to unity, is minimally uncertain, with an entropy of zero. The NSI model applied to the 2010 court data yields an entropy of about 3.7 bits. The entropy tells you nothing about whether the model is useful or not, so it's not too interesting.
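The two extreme entropy figures quoted above (9 bits for the coin-toss model, 0 bits for a determinate model) are easy to verify with a short sketch:

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1 / 512] * 512          # coin-flip model over 9 binary votes
determinate = [1.0] + [0.0] * 511  # one 9-vote outcome certain

print(entropy_bits(uniform))       # 9.0 bits: maximal uncertainty
print(entropy_bits(determinate))   # 0.0 bits: no uncertainty
```

The NSI figure (about 3.7 bits for the 2010 data) would come out of the same function, fed the 512 empirical frequencies instead of these two extremes.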

The interesting part comes when you have not only a set of case results (D), but a set of allegedly predictive information (W) supplied by "experts". The predictive information can be expressed as a binary string. Like the game of 20 questions, it can be expressed as a yes/no answer to a series of, say, m questions. I doubt that real numbers are a necessary part of the allegedly predictive information, but rather can be replaced by rational numbers, even machine-size, and a rational number is the ratio of two integers, and an integer can be expressed as a binary string. The experts then create a function of W which hopefully is making as full use as possible of the predictive information in W.

So now we conceptually have a population of cases, each having an associated D,W pair. There is a probability associated with each D,W pair. If we choose to use, say, last year's 100 cases, then we are effectively sampling the population 100 times. Regarding the population itself, if we knew the probability of every conceivable D,W pair, we could calculate the mutual information I(D,W). Since we expect W to contain many more bits than D (9 bits), the mutual information will yield the number of bits of information that W has in common with D. If that turns out to be 9 bits, then W will contain enough information to deliver a perfect prediction of D. If it turns out to be zero bits, then W is totally useless as a predictive tool. Let d be the whole-court decision, 0 for reverse, 1 for affirm. Then we can have a population of d,W pairs, and the whole discussion above can be made for this situation as well.
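A plug-in estimate of this mutual information from sampled pairs can be sketched as follows. The (d, W) samples below are invented solely to show the two extreme cases: W fully determining d, and W carrying no information about d.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(d, W) in bits from sampled (d, W) pairs."""
    n = len(pairs)
    pd = Counter(d for d, w in pairs)   # marginal counts of d
    pw = Counter(w for d, w in pairs)   # marginal counts of W
    pdw = Counter(pairs)                # joint counts of (d, W)
    return sum((c / n) * log2((c / n) / ((pd[d] / n) * (pw[w] / n)))
               for (d, w), c in pdw.items())

# Hypothetical samples: W perfectly determines d, so I(d, W) = H(d) = 1 bit here.
perfect = [(0, "a"), (1, "b")] * 50
# Hypothetical samples: d is independent of W, so I(d, W) = 0 bits.
useless = [(0, "a"), (0, "b"), (1, "a"), (1, "b")] * 25

print(mutual_information(perfect))  # 1.0
print(mutual_information(useless))  # 0.0
```

As noted below, with only ~100 real samples such a plug-in estimate would carry substantial error bars, especially for a long W string; this sketch only shows the quantity being estimated.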

There are two steps to prediction - getting a W that contains enough information to deliver a good prediction, and then finding the function of W (F(W)) that actually delivers that prediction. That choice of function must avoid being overfitted, etc. I wonder if the poor results of the Lexys are the result of insufficient information in the predictor information W, or of a poor choice of the function F(W) chosen to deliver the prediction? Probably a combination of both, but an estimate of the mutual information would give a good idea where the problem is.

A further problem is that we don't know the probability of the D,W pairs, we can only estimate them from our 100 samples. So this will involve error bars on everything, and I have not even considered that.

The W does not have to be information specific to a case. In fact, we can set W equal to the 9x100=900 digits of the 100-case sample data, to recover the NSI model. If we are considering whole-court decisions, then there will be a set of 2^901 possible d,W pairs, of which only two are non-zero. If Wo is the 900 digit string of previous case decisions, then only d,W=1,Wo and d,W=0,Wo will have a non zero probability. 1,Wo will have a probability equal to the number of affirmations divided by the number of cases, and 0,Wo will have 1 minus that probability. Choose the one with the highest probability and go with that as your prediction for every case. Indications are that 0,Wo will be the highest (65-70%) resulting in "always reverse" as the prediction. If we are considering the individual decision vector (D), then a lot more things can be discovered.

I don't know if a set of past cases is used in the Lexys as part of W. If it is, and the Lexys yield the same reversal rate as the NSI model, it shows that either the predictor function of W is bad or that all the specific information contained in W is useless for predicting d. I(d,W) needs to be checked. W probably does contain some information for predicting the individual decisions (D), so the mutual information I(D,W) will not be zero. Hopefully it is better than the NSI model. If W does not contain explicitly the 100 case results, perhaps including them would significantly raise the value of I(D,W), pointing to the need to include them (provided the predictive function of W doesn't screw things up).

Errata - "When you say this is the naive model... " should read "When you say the NSI model is the naive model..."