## Using likelihood ratios -- not pee values -- to weigh the evidence on judges & motivated reasoning

But here's another excerpt.

This one shows how we supplemented our use of conventional "statistical significance"/NHT testing of the study results with use of Bayesian likelihood ratios.

We did use the former, but I think the latter are more useful generally for conveying practical strength of evidence & also for assessing the relative plausibility of *competing* hypotheses, an objective central to empirical inquiry for which NHT/"statistical significance" is ill-suited (see Goodman, S.N., Introduction to Bayesian methods I: measuring the strength of evidence, Clin Trials 2, 282-290 (2005); Edwards, W., Lindman, H. & Savage, L.J., Bayesian Statistical Inference in Psychological Research, Psych Rev 70, 193-242 (1963)). Anyone who disagrees that likelihood ratios are cool is a Marxist!

Oh, BTW: "IPCI" refers to "identity-protective cognition impact," which is the average percentage-point difference in the probability that a subject type (judge, lawyer, student, member of the public, or house pet) would find a statutory violation when doing so affirmed rather than defied his or her cultural worldview.

* * *

**c. Judges vs. members of the public using Bayesian methods.** As an alternative to assessing the improbability of the “null hypothesis,” one can use Bayesian methods to assess the strength of the evidence in relation to competing hypothesized IPCIs. Under Bayes’s Theorem the *likelihood ratio* reflects how much more consistent an observed outcome is with one hypothesis than a rival one. It is the factor in proportion to which one should adjust one’s assessment of the relative probability (expressed in odds) of one hypothesis in relation to another.

Imagine, for example, that we are shown two opaque canvas bags, labeled “*B*₁” and “*B*₂,” each of which is filled with marbles (we use “canvas bags” for this example in anticipation of the reasonable concern that Bayes’s Theorem might apply only to marble-filled *urns*). We are not told which is which, but one bag, it is stipulated, contains 75% red marbles and 25% blue, and the other 75% blue and 25% red. We are instructed to “sample” the contents of the bags by drawing one marble from each, after which we should make our best estimate of the probability that *B*₁ is the bag containing mostly *blue* marbles and *B*₂ the one containing mostly red. We extract a blue marble from *B*₁ and a red one from *B*₂.

Bayes’s Theorem furnishes logical instructions on how to use this “new evidence” to revise our estimates of the probability of the hypothesis that *B*₁ is the bag containing mostly *blue* marbles (and hence *B*₂ mostly red). *If we assume that that hypothesis is true*, then the probability that we would have drawn a blue marble from *B*₁ is 3/4 or 0.75, as is the probability that we would have drawn a red marble from *B*₂. The joint probability of these independent events (that is, the probability of the two occurring together, as they did) is 3/4 × 3/4, or 9/16. If we assume that the hypothesis “*B*₁ is the one that contains mostly blue marbles” *is false*, then the joint probability of drawing a blue marble from *B*₁ followed by a red marble from *B*₂ would be 1/4 × 1/4, or 1/16. Other possible combinations of colors could have occurred, of course (indeed, there are four possible combinations for such a trial). But if we were to repeat this “experiment” over and over (with the marbles being replaced and the labels on the bags being randomly reassigned after each trial), then we would expect the sequence “blue, red” to occur *nine times more often* when the bag containing mostly blue marbles is the one labeled “*B*₁” than when it is the bag labeled “*B*₂.” Because “blue, red” is the outcome we observed in our trial, we should revise our estimate of the probability of the hypothesis “*B*₁ contains mostly blue marbles” by a factor of 9: from odds of 1:1 (50%) to 9:1 (90%).
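The arithmetic of the marble example can be checked with a short sketch. The Python below is an illustration added to this excerpt, not part of the study's analysis; the function and variable names are made up, and the simulation's trial count and seed are arbitrary assumptions.

```python
import random
from fractions import Fraction

P_MAJORITY = Fraction(3, 4)  # chance of drawing the majority color from a bag
P_MINORITY = Fraction(1, 4)  # chance of drawing the minority color

# Probability of "blue from B1, red from B2" under each hypothesis.
p_if_true = P_MAJORITY * P_MAJORITY    # B1 mostly blue, B2 mostly red: 9/16
p_if_false = P_MINORITY * P_MINORITY   # B1 mostly red, B2 mostly blue: 1/16

likelihood_ratio = p_if_true / p_if_false               # 9
prior_odds = Fraction(1, 1)                             # no idea which bag is which
posterior_odds = prior_odds * likelihood_ratio          # 9:1
posterior_prob = posterior_odds / (1 + posterior_odds)  # 9/10


def simulate(trials=200_000, seed=0):
    """Repeat the experiment with labels randomly reassigned each trial.

    Returns how many times "blue, red" was drawn when B1 really was the
    mostly-blue bag, and how many times when it was not.
    """
    rng = random.Random(seed)
    hits_true = hits_false = 0
    for _ in range(trials):
        b1_mostly_blue = rng.random() < 0.5
        p_blue_b1 = 0.75 if b1_mostly_blue else 0.25
        p_red_b2 = 0.75 if b1_mostly_blue else 0.25
        drew_blue_red = rng.random() < p_blue_b1 and rng.random() < p_red_b2
        if drew_blue_red:
            if b1_mostly_blue:
                hits_true += 1
            else:
                hits_false += 1
    return hits_true, hits_false


if __name__ == "__main__":
    print("likelihood ratio:", likelihood_ratio)
    print("posterior probability:", float(posterior_prob))
    hits_true, hits_false = simulate()
    print("simulated ratio:", hits_true / hits_false)
```

The exact calculation uses fractions so that 9/16 and 1/16 stay exact. The simulated ratio should land close to 9 but will wobble from run to run if the seed is changed.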

We can use precisely the same logic to assess the relative probability of hypothesized judge and public IPCIs. In effect, one can imagine each subject-type as an opaque vessel containing some propensity to engage in identity-protective cognition. The strengths of those propensities (the subject types’ “true” IPCIs) are not amenable to direct inspection, but we can sample observable manifestations of them by performing this study’s statutory interpretation experiment. Calculating the relative likelihood of the observed results under competing hypotheses, we can construct a likelihood ratio that conveys how much more consistent the evidence is with one hypothesized subject-type IPCI than with another.

Figure 8 illustrates the use of this method to test two competing hypotheses about the public’s “true” IPCI: that members of the public would be 25 percentage points more likely to find a violation when doing so is culturally affirming, and alternatively that they would be only 15 percentage points more likely to do so. To make the rival hypotheses commensurable with the study results, we can represent each as a probability distribution with the predicted IPCI as its mean and a standard error equivalent to the one observed in the experimental results. Within any *one* such distribution, the relative probability of alternative IPCIs (e.g., 15% and 25%) can be determined by assessing their relative “heights” on that particular curve. Likewise, the relative probability of observing any particular IPCI under alternative distributions can be determined by comparing the *ratio of the heights* for the probability density distributions in question.

The public IPCI was 22%. Observing such a result (or any in close proximity to it) is *eight times more likely* under the more extreme “public IPCI = 25%” hypothesis than it is under the more modest “public IPCI = 15%” hypothesis (Figure 8). This is the Bayesian likelihood ratio, or the factor in proportion to which one should modify one’s assessment of the relative probability that the “true” public IPCI is 25 as opposed to 15 percentage points.
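The “ratio of heights” comparison can be sketched in a few lines. This is an illustration, not the paper's code. It assumes each hypothesis is represented by a normal curve, and the standard error of 3.1 percentage points is a made-up value chosen only because it yields a ratio near eight; this excerpt does not report the actual standard error.

```python
import math


def normal_density(x, mean, se):
    """Height of a normal curve with the given mean and standard error at x."""
    z = (x - mean) / se
    return math.exp(-0.5 * z * z) / (se * math.sqrt(2 * math.pi))


def likelihood_ratio(observed, mean_a, mean_b, se):
    """How much more consistent `observed` is with hypothesis A than with B."""
    return normal_density(observed, mean_a, se) / normal_density(observed, mean_b, se)


OBSERVED_PUBLIC_IPCI = 22.0  # percentage points, from the text
ASSUMED_SE = 3.1             # hypothetical; not reported in this excerpt

lr = likelihood_ratio(OBSERVED_PUBLIC_IPCI, mean_a=25.0, mean_b=15.0, se=ASSUMED_SE)
print(f"LR (25% vs. 15%): {lr:.1f}")
```

Because both curves share the same standard error, the ratio depends only on how far the observed value sits from each hypothesized mean. An observation exactly halfway between the two (20%) would give a ratio of 1, favoring neither.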

We will use the same process to assess the weight of four competing hypotheses about the vulnerability of judges to identity-protective cognition. The first is that judges will be “unaffected” (IPCI = 0%). This prediction, of course, appears similar to the “null hypothesis.” But whereas “null hypothesis testing” purports to specify only whether the null hypothesis can be “rejected,” Bayesian methods can be used to obtain a genuine assessment of the strength of the evidence in support of there being “no effect” if that is a genuine hypothesis of interest, as it is here. The remaining three hypotheses, the plausibility of which will be tested *relative* to the “IPCI = 0%” hypothesis, are that judges will be “just as affected as the public” (IPCI = 22%); that judges will be moderately affected (IPCI = 10%); and that judges will be affected to only a comparatively mild degree (IPCI = 5%).

The results are reflected in Figure 9. Not surprisingly, the experimental data are much more supportive of the first hypothesis (that judges would be unaffected by the experimental manipulation) than of the second (that they would be “just as affected as the public”). Indeed, because the probability that we would have observed the actual experimental result if the latter hypothesis were true is astronomically low, there is little practical value in assigning a likelihood ratio to how much more strongly the evidence supports the hypothesis that judges were “unaffected” by the experimental manipulation.

Of course, members of the public were influenced by their cultural predispositions to a strikingly large extent. To learn that the evidence strongly disfavors the inference that judges are *that* biased does not in itself give us much insight into whether judges possess the capacity for impartial decisionmaking that their duties demand. It was precisely for that reason that less extreme IPCIs were also hypothesized.

Even those predictions, however, proved to be less supported by the evidence than was the hypothesis that judges would be unaffected by identity-protective reasoning. The evidence was *20 times* more consistent with the “judge IPCI = 0” hypothesis than the “judge IPCI = 10%” hypothesis. The weight of the evidence was not as decided but still favored—by a factor of about three—the “judge IPCI = 0” hypothesis over the “judge IPCI = 5%” hypothesis (Figure 9).
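Because a likelihood ratio is a multiplier on prior odds, what it implies depends on where one starts. A rough sketch, using the ratios of 20 and about 3 reported above and some arbitrary illustrative priors (the priors and the helper name `update` are assumptions, not values or code from the study):

```python
def update(prior_prob, likelihood_ratio):
    """Posterior probability of hypothesis A after multiplying prior odds by the LR."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Likelihood ratios favoring "judge IPCI = 0" over each rival, as reported in the text.
REPORTED_LRS = {"IPCI = 10%": 20.0, "IPCI = 5%": 3.0}

# Hypothetical prior probabilities that "judge IPCI = 0" rather than the rival is true.
ILLUSTRATIVE_PRIORS = [0.10, 0.50, 0.90]

for rival, lr in REPORTED_LRS.items():
    for prior in ILLUSTRATIVE_PRIORS:
        print(f"vs. {rival}: prior {prior:.2f} -> posterior {update(prior, lr):.2f}")
```

Each comparison is pairwise: someone who starts at even odds between “IPCI = 0” and a given rival ends up at roughly 95% for “IPCI = 0” against the 10% rival and 75% against the 5% rival, while a skeptic who starts lower ends up lower.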

## Reader Comments (13)

@Dan - You know my mindset is treating each justice's set of n decisions as an n-dimensional vector V, 1 for affirm, 0 for reverse. Then I look for the distance between two justices' vectors V1 and V2, and try to come up with, say, a 1 or 2 or 3 dimensional space that positions the justices such that these distances are roughly preserved. There is clearly a bunching in 1-dimension, reflecting what is usually seen as the "left/right" division of the court. 2 dimensions gives some more nuanced bunching along another axis, and that's about it. 3 dimensions gives little more information. This is pretty much independent of the analysis method used (e.g. Factor analysis vs My Analysis, etc.)

I am struggling to understand what you wrote, trying to shoe-horn it into my mindset. If the judges are unaffected by identity-protective reasoning (IPR), then the left/right bunching must be due to something else. I mean, if the justices were unaffected by anything but the "truth", then every decision would be unanimous.

I guess I am not clear on what the marbles are in your example.

@FrankL

Good point-- we discuss this in paper.

Judges don't have to agree on outcome -- just not disagree *b/c* of cultural values. Analogously, 2 open-minded evidence evaluators might disagree; uninfluenced by ideological bias, they still might disagree b/c of different priors, imprecision of their own capacity to assess probative weight etc.

In fact, there's not a lot of variance in judges' views in this problem. But again, the thesis isn't that professionals will always agree; it is that they will focus on what is "pertinent" to their form of expert decisionmaking and disregard what is not, or at least ignore the non-pertinent influences associated with identity protection.

As for "dimensions," take a look at paper. There were 2 problems, each of which created motivated-reasoning pressure that was most extreme between 2 of 4 "types" in the cultural cognition 2x2. For that purpose, polarization is 1-dimensional in each problem.

We could have used left-right.

We would have observed polarization among members of public, although not as much, b/c left-right, while correlated with our measures, is not nearly as discerning of variance.

But we would have observed none among judges.

The result is admittedly at odds with conclusion from observational studies finding "ideology" explains judicial opinions. The paper was motivated in part by methodological criticisms of those studies.

IPCI is a composite index (an average, in effect) of effects from the 2 experiments.

"Ratio of marbles" in bags is equivalent to "magnitude of IPCI" or "propensity for bias" in types of decisionmakers. The outcome for public is consistent with hypothesis of "lots of blue to red" (whatever) or "big IPCI" in public; outcome for judges is highly inconsistent with hypothesis "lots of blue to red" or "big IPCI."

@FrankL:

Another thing occurs-- something we should add to paper.

I don't think the results of this experiment *necessarily* contradict any model that predicts case outcomes on basis of covariance of S Ct or other appellate court judges.

But the question then is: what does that covariance reflect? Is it a latent "ideological predisposition" that motivates the judges to rely on values extrinsic to law? Or a latent "jurisprudential orientation" that motivates judges to perceive similar legally *relevant* considerations as being of dispositive weight?

This is one of my concerns about the observational study models. They are confounding "ideology" -- or a disposition to rely on extralegal considerations -- with "jurisprudence" -- a disposition that might vary across judges but that affects only *which* legally *relevant* considerations matter most to them.

The experiment was designed to "tempt" subjects w/ considerations extrinsic to law but very pertinent to their ideologies or cultural outlooks.

Ordinary folks took the bait. Not the judges.

@Dan - Then why do the judges separate according to common notions of left/right?

Am I right here? - You are saying different justices put different weights on legally relevant considerations, and those weights tend to reflect their "ideology", causing the familiar separation. You are saying the study shows that they are hardly at all influenced by legally irrelevant considerations, unlike the general public. Is that a fair summary?

Regarding the marbles, ok, ratio of marbles is IPCI - propensity for bias. So when you pull a marble out of a bag, its "color" is analogous to "bias". You say "but we can sample observable manifestations of them (color ratios of marbles, or IPCI) by performing this study’s statutory interpretation experiment." but after that, it seems you present the results of the experiments. I am not clear about what that experiment is. In the marble case, the experiment is that you pull a marble out of a bag and determine its color. I suppose you pull a person out of the justice bag (only 9, pull them all) or the general public bag and determine their bias? How is that done?

@FrankL:

1. In this study, the judges don't polarize. They don't split left-right (one can infer from correlations between worldviews, which is what we measured, and political outlooks).

2. The observational evidence that scholars adduce to show that judges divide politically is very weak: it reflects outcome measures that confound extralegal "ideological" considerations with jurisprudential differences intrinsic to law; rests on serious selection biases; and turns out to have little to no predictive power. In my last msg I suggested that one might see judges who one characterizes as "left" disagreeing with ones who are "right" on issues of law that *legitimately* require decisionmakers to make value judgments; but it would be misleading to describe that as judges relying on "ideology, not law" -- although that is in fact a mistake that many political science researchers don't seem to care about.

3. If you look at the paper or the previous post, you'll see that what we do is measure how much more likely subjects of a particular worldview are to favor finding a legal violation when doing so affirms their values rather than disappoints their values. The strength of that tendency is analogous to the "proportion of blue marbles" in the bag; our experimental sampling supports inference that the strength is high for members of the public but low (or zero) for judges.

I don't get it. Justices do polarize - knowing the vote of Antonin Scalia, you can predict significantly better than chance (especially if the decision was not unanimous) what the vote of Clarence Thomas will be (agree with Scalia), and what the vote of Elena Kagan will be (disagree with Scalia).

Yes, the justices have "wiggle room", no case is cut and dried or the decisions would always be unanimous. They can (properly) weigh different factors in the case differently, and when they do, Scalia and Thomas tend to agree on the weighting schedule, whatever it is, while Kagan and Sotomayor tend to agree with each other, but disagree with Scalia & Thomas.

The general public, I guess you are saying, make choices that are based on legally irrelevant considerations, the justices do not.

This is all said without characterizing the weighting choices as "left" or "right".

@FrankL:

1. On predicting better than chance, consider how well Lexy does. I think the observational literature has clearly *not* been able to show that its understanding of "ideology" helpfully predicts outcomes, even in S Ct!

2. The claim "judges disagree b/c they disagree about the law" would be uninteresting, right? The question is whether the claim that they disagree about law *b/c of ideology* is actually a different claim from that. The "ideology thesis" confounds disagreements about law arising from jurisprudential visions that have a relationship to political theories, on one hand, w/ disagreements about not-law ideological considerations, on other. The paper discusses this; the experiment is designed to illustrate the difference. Read pp. 6-10 & 21-24 & tell me if that helps sharpen the point; it is an important one.

3. The "especially in nonunanimous cases..." is great example of how the political science literature selects on the dependent variable. Yes, if one explains the cases that fit one's explanation & ignores the ones that don't, you get a high "R^2."

@Dan - Do I have this right? We are asking two different questions here. You are asking "Do judges determine outcomes based on extra-legal considerations reflecting their cultural identity group?" and the answer is "the results are consistent with the hypothesis of no".

My question is "when the judges *do* disagree, do they disagree in a predictable manner?" and the answer I get is *not* consistent with the hypothesis of no. Roughly speaking, they form two groups in which the members of each group agree with each other and disagree with the others.

When I see Scalia and Thomas in one group and Sotomayor and Kagan in the other, I think to myself that it reflects the usual left/right division and that the reasons for the division are probably due to their membership in a cultural affinity group. This sort of division occurs year after year. This does not imply that they are making decisions based on extra-legal considerations. They have wiggle room inside the constraints of legal considerations, maybe they wiggle towards their favorite group. It does not imply that they always do that. There is a definite possibility that in the vast majority of unanimous decisions, the judges had wiggle room yet refused to wiggle towards their cultural identity group. My whole thinking in this paragraph is not supported or even approached by my analysis of the data, it's not scientific, just musing. Your data is more discerning.

I would be curious to know, using your data, what the situation is when judges disagree on a particular case. *When* they disagree, do they divide in a predictable way? In particular, do they divide according to their cultural predispositions in a statistically significant way?

With regard to Lexy, the same applies. The question is "can you predict each SCOTUS justice's decision on a particular case?". The fact that Lexy gets it right 70% of the time, as does the simple "always reverse" model while the experts get it right 59% of the time certainly shows the experts have dropped the ball. The question of correlations in the case of disagreement would be interesting to answer. How did the experts and Lexy do in predicting correlations, given that there is a split decision? The "always reverse" model fails here.

FrankL -

==> "I would be curious to know, using your data, what the situation is when judges disagree on a particular case. *When* they disagree, do they divide in a predictable way? In particular, do they divide according to their cultural predispositions in a statistically significant way?"

What I still get hung up on was how the two camps on SCOTUS seemed to switch views on states' rights in Bush v. Gore - as an example of what you're outlining here. But it may be an outlier.

Hi Joshua - Yes, if we were to look for politically motivated reasoning, that would be the first one I would look at. Dan's question would be "did they step out of legal bounds?" and mine would be "given that they disagreed, were the members of the two camps predictable?". I have no answer to the first but given Dan's results, the answer is likely not, but the answer to the second is obviously yes.

That case annoyed me in so many ways. The hypocrisy of both sides was so front and center. In the end it was decided on the basis of "practical tie, favor of the people with the most political power". I guess it could be worse. If the roles and power of the democrats and republicans had been reversed, I am certain each party's arguments would have reversed as well, and the democrats would have won. An outlier? I think it was an outlier of a case, and given that, the results were not enormously surprising.

@FrankL:

Your suggestion that I am focusing on whether "law or extralegal" factors influence judges & you "what explains differences among judges when they disagree" is very helpful, yes.

I agree it's interesting to be able to explain systematic disagreement as well as convergence (actually, one can do one w/o other!).

But if the hypothesis for "disagreement" is "ideology," there are two *very important* issues to be concerned about in the "ideology explains disagreement" project, both of which I've alluded to.

The first is what "counts" as "ideology" for purposes of the hypothesis.

The scholars who want us to believe they are saying something important when they say "ideology" explains variance in judicial decisions *purport* to be identifying "ideology" as an influence extrinsic to and opposed to "the law."

But in fact their measures aren't suited for that; they are "counting" as ideological differences ones that in fact reflect differences in the jurisprudential theories essential to animate highly general concepts like "free speech," "unfair restraint of trade," "equal protect," etc.

The differences in jurisprudential theories in question for sure have points of contact w/ right-left political outlooks and are likely for that reason to correlate w/ valid measures of the political outlooks of the judges who espouse them. But in those instances, the "disagreement" that is being explained is *not* one relating to resort to extralegal considerations.

In effect, the theory amounts to saying, "judges disagree about the law b/c they disagree about the law." That is not (or not very often) a very interesting thing to say. We know it already; the judges in fact *tell us* that that is what's going on in the reasoning they set forth in their opinions. And it most certainly *does not* support carrying on as if "judges are politicians in robes" etc.

The experiment we did was designed to see what happens when one is very clear to distinguish extra-legal ideological factors from ones intrinsic to law. Well .. we see a very remarkable resistance to the impact of such factors.

The second very very very serious concern for the "ideology explains disagreement" account is selection on the dependent variable. If that hypothesis means ideology (however defined) *causes* disagreement, then it is in fact a logical mistake to exclude from the sample cases in which there is agreement. The reason is simple: if the hypothesis is that ideology "causes" disagreement, then the existence of agreement in cases in which ideology predicts disagreement *falsifies* the hypothesis.

Clearly, many many scholars who purport to be showing us evidence that "ideology" explains cases are making this mistake. They are ignoring that "liberal" and "conservative" judges agree in many many many cases that by the "ideology thesis" scholars' own outcome-classification scheme pit "liberal" vs. "conservative" outcomes against each other. If judges were deciding on the basis of "ideology" as those scholars understand it, there wouldn't be agreement in those cases. So whatever it is the scholars are telling us by confining their attention to "disagreement" cases, it's *not* what factors *cause* outcomes generally.

I don't myself see our study as "explaining" the S Ct in particular (and for sure not trying to explain whether any particular case, like Bush v. Gore, was decided either correctly or free of the influence of rank, non-legal ideological considerations).

I would say that we should expect a disproportionate number of their cases to pit opposing *jurisprudential* visions against one another-- so if one makes the mistake of conflating extralegal ideological & jurisprudential influences, one will definitely be able to inflate one's invalid stock of evidence the most effectively by looking at the S Ct.

But the "ideology thesis" is a general one and is used to explain patterns of decisions in all manner of court.

But the fact that the proponents of the "ideology thesis," in addition to making the sorts of mistakes I'm describing, *haven't* been able to predict outcomes in the S Ct -- the tribunal most favorable to their position -- better than chance w/ their "ideology" models is pretty darn telling.

That said, I think Lexy2 is not bad-- indeed, reflects an admirable and serious degree of reflection & craft that is missing from most of the "ideology thesis" scholarship.

The one thing to realize, of course, is that Lexy2's "ideology" related variables -- in addition to confounding extra-legal and jurisprudential influences (my first point) -- actually account for only a *fraction* of the predictive value of the model. The model uses other things -- like what "lower court" (some get reversed more often than others) & "month of decision" in the Term (the reversal rate varies over course of term etc). To the extent that *those* variables are contributing to the predictive success of the model, they are *contradicting* the "ideology hypothesis."

If one wanted, one could treat this whole exercise as one about what counts as *valid* empirical scholarship. Because there is a lot of bad bad bad scholarship being produced by those who claim "ideology" drives judicial decisionmaking.

But in fact the question whether "ideology" drives judicial decisionmaking is very important as a practical matter.

So let's use valid methods -- ones that appropriately specify the variables of interest in relation to what various hypotheses are said to mean, & avoid obvious sampling biases (of which "selecting on dependent variable" is only one that is hobbling this scholarship) -- and figure out what's what.

And certainly, let's not invoke particular cases or "what everyone knows" or "common sense" or "personal experience" etc. as responses when someone points out a defect in methods.

For sure, I'm *not* talking about you here. You both know what you are doing and engage in open, critical reflection about it. I'm talking about what proponents of the "ideology thesis" routinely do. When they fall back on this silly defensive way of responding to criticism (as they do all the time, as I was reminded in a workshop a couple days ago), they make the embarrassing mistake of treating the very casual observations that empirical evidence is supposed to test as a remedy for defects in their empirical evidence!

@Dan - I think I understand and agree with everything you say, up to:

"if the hypothesis is that ideology "causes" disagreement, then the existence of agreement in cases in which ideology predicts disagreement *falsifies* the hypothesis."

Yes, but assuming that the justices are not influenced by extra-legal factors, (i.e., they are constrained to decide within legal bounds), we have to worry about whether "predicts disagreement" means a disagreement while remaining within those constraints or not.

It's not clear to me, when you describe the work of many many scholars, whether or not they are operating under this constrained definition. If they are not, then yes, assuming as you say, that the justices are so constrained, they will get it wrong.

You wrote:

"But the fact that the proponents of the "ideology thesis," in addition to making the sorts of mistakes I'm describing, *haven't* been able to predict outcomes in the S Ct -- the tribunal most favorable to their position -- better than chance w/ their "ideology" models is pretty darn telling."

I agree, and perhaps their mistake was not using the above constrained definition of disagreement. But to predict an outcome using this definition requires a detailed knowledge of the law. Perhaps some of their mistakes were that they used the constrained definition of disagreement, but did not understand the law as well as the judges. I don't know.

I accept your thesis that judges mostly operate under the constraint, and I don't see it to be in conflict with the idea that when there is an opportunity for constrained disagreement, the disagreement will happen and be such that two "camps" are identifiable in a statistically significant way.

I then jump to the unscientific but obvious conclusion that the camps are what are generally identified as liberal/conservative.

I think your data can go far in shedding some scientific light on that conclusion. You could look at the cases in which there is disagreement, and see if the decision of a judge then correlates to their position on the hierarchy/egalitarian-individualism/communitarian space. (I forgot the short word for that space.)

@FrankL:

I think we are in agreement, then!

Predicting outcomes is one project -- an interesting one that might help to figure out if machine learning can do better than human expert judgment.

Predicting outcomes based on "ideology" of judges is another project. The main problem with it as operationalized is that the main proponents have been unpardonably imprecise about whether "ideology" means extra-legal political considerations or jurisprudential cleavages internal to law that can generate systematic differences that might well correlate w/ ideology or even be labeled ideology if one finds that way of characterizing differences in how to animate incompletely specified legal norms ("equal protect," "free speech," "unreasonable restraint of trade," "fraud" etc.) helpful. The proponents of the ideology thesis present their results as if they were finding resort to extra-legal considerations -- when clearly they aren't.

On top of that, they do poorly in predicting case outcomes. That goes to whether the "ideology thesis" proponents are even doing well in the first project -- of coming up w/ an automated variant of what professional judgment does in predicting case outcomes. But some serious scholars are making progress -- scholars more serious & forthcoming about the difficulties they face than the "ideology thesis" proponents & less hung up on whether "ideology" is doing the work in their predictions.

The experiment we did was designed to help show that judges predictably resist extra-legal ideological considerations notwithstanding the judges' manifest ideological differences.

What you propose is worth doing. But I think it would be useful to do it only if we had a case-outcome measure that distinguished between the influence of extra-legal and legally intrinsic contributions that worldviews make in "disagreement" cases. The latter sort of contribution could help to improve "prediction" models but wouldn't support the "ideology thesis" claim that judges are not deciding on the basis of "law." If cultural congeniality of outcomes to extra-legal considerations explains variance, then that's a corroboration of the "ideology thesis" in the "non-law" sense that the scholars of that school claim w/o warrant to be finding.

Of course, it might be the case that judges just don't disagree that much in cases in which the grounds for doing so are all extralegal! In our experiment, the statutes were genuinely ambiguous; it would have been easy for judges not to converge. They did b/c they actually *share* the equivalent of an ideology or worldview. That's what Llewellyn's situation sense is.