## Reflections on "System 2 bias," part 1 of 2 (I think)

*Some thoughts about Miller & Sanjurjo, Part 1 of 2:*

Most of the controversy stirred up by M&S centers on whether they are *right* about the methodological defect they detected in Gilovich, Vallone, and Tversky (1985) (GVT) and other studies of the “hot hand fallacy.”

I’m fully persuaded by M&S’s proof. That is, I get (I think!) what the problem is with GVT’s specification of the null hypothesis in this setting.

Whether in fact GVT’s *conclusions* about basketball shooting hold up once one corrects this defect (i.e., substitutes the appropriate null) is something I feel less certain of, mainly because I haven’t invested as much time in understanding that part of M&S’s critique.

But **what interests me even more is what the response to M&S tells us about cognition**.

The question, essentially, is how could so many extremely smart people (GVT & other empirical investigators; the legions of teachers who used GVT to instruct 1,000’s of students, et al.) have been so wrong for so long?! Why, too, does it remain so difficult to make those intelligent people *get* the problem M&S have identified?

The answer that makes the most sense to me is that GVT and others were, ironically, betrayed by intuitions they had formed for sniffing out the general public’s intuitive mistakes about randomness.

The argument goes something like this:

*I. The quality of cognitive reflection depends on well calibrated non-conscious intuitions.*

There is no system 2 ex nihilo. Anything that makes it onto the screen of conscious reflection (System 2) was moments earlier residing in the realm of unconscious thought (System 1). Whatever yanked that thought out and projected it onto the screen, moreover, was, necessarily, an unconscious mental operation of some sort, too.

It follows that reasoners who are adept at System 2 (conscious, deliberate, analytical) thinking necessarily possess well behaved System 1 (unconscious, rapid, affect-laden) intuitions. These intuitions *recognize* when a decisionmaking task (say, the detection of covariance) merits the contribution that System 2 thinking can make, and *activate* the appropriate form of conscious, effortful information processing.

In anyone lucky enough to have reliable intuitions of this sort, what *trained* them was, most likely, the persistent exercise of reliable and valid System 2 information processing, as brought to bear over & over in the process of learning how to be a good thinker.

In sum, System 1 and System 2 are best thought of *not* as discrete and hierarchical modes of cognition but rather as integrated and reciprocal ones.

*II. Reflective thinkers possess intuitions calibrated to recognize and avoid the signature lapses in System 1 information processing.*

The fallibility of intuition is at the core of all the cognitive miscues (the availability effect; hindsight bias; denominator neglect; the conjunction fallacy, etc.) cataloged by Kahneman and Tversky and their scholarly descendants (K&T et al.). Indeed, *good* thinking, for K&T et al., consists in the use of conscious, effortful, System 2 reflection to “override” System 1 intuitions when reliance on the latter would generate mistaken inferences.

As discussed, however, System 2 thinking cannot plausibly be viewed as operating independently of its own stable of intuitions, ones finely calibrated to recognize System 1 mistakes and to activate the sort of conscious, effortful thinking necessary to override them.

*III. But like all intuitions, the ones reflective people rely on will be subject to characteristic forms of failure, ones that cause them to overestimate instances of overreliance on error-prone heuristic reasoning.*

It doesn’t follow, though, that good thinkers will never be misled by *their* intuitions. Like *all* forms of pattern recognition, the intuitions that good thinkers use will be vulnerable to recurring illusions and blind spots.

The sorts of failures in information processing that proficient thinkers experience will be predictably *different from* the ones that poor and mediocre thinkers must endure. Whereas the latter’s heuristic errors expose them to one or another form of overreliance on System 1 information processing, the former’s put them at risk of too readily perceiving that exactly that form of cognitive misadventure accounts for some pattern of public decisionmaking.

The occasions in which this form of “System 2 bias” will affect thinking are likely to be rare. But when they occur, the intuitions that are their source will cling to individuals’ perceptions with the same dogged determination that the ones responsible for heuristic System 1 biases do.

Something like this, I believe, explains how the “ ‘hot hand fallacy’ *fallacy*” took such firm root.

It’s a common, heuristic error to believe that independent events—like the outcome of two coin flips—are interdependent. Good reasoners are trained to detect this mistake and to fix it before making a judgment.

GVT spotted what they surmised was likely an instance of this mistake: the tendency of fans, players, and coaches to believe that positive performance, revealed by a short-term string of successful shots, indicated that a player was “hot.”

They tested for this mistake by checking whether the conditional probability of a successful basketball shot following a string of successes differed significantly from a player’s unconditional probability of making a successful shot.

It didn’t. Case closed.

What didn’t occur to them, though, was that where one uses the sampling method they used (drawing from a finite series without replacement), Pr(basket|success, success, success) – Pr(basket) should be < 0. How *much* below zero it should be has to be determined analytically or (better) by computer simulation.

So *if* in fact Pr(basket|success, success, success) – Pr(basket) = 0, the player in question *was* on an improbable hot streak.
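The "computer simulation" route can be sketched roughly as follows. This is a minimal Python illustration, not M&S's or GVT's actual procedure: the function names `streak_gap` and `mean_gap`, the sequence length of 100 shots, the 50% hit rate, and the 20,000 simulated sequences are all assumptions chosen for the example. For each simulated sequence of independent shots it computes the share of hits immediately following three straight hits, subtracts that sequence's overall hit share, and averages the gap over the sequences in which the first share is defined.

```python
import random

def streak_gap(n=100, p=0.5, k=3, rng=random):
    """One simulated sequence of n independent shots with hit probability p.

    Returns (share of hits right after k straight hits) minus (overall hit
    share), or None if no shot in the sequence follows k straight hits.
    """
    shots = [rng.random() < p for _ in range(n)]
    # Shots taken immediately after k consecutive hits.
    after_streak = [shots[i] for i in range(k, n) if all(shots[i - k:i])]
    if not after_streak:
        return None
    return sum(after_streak) / len(after_streak) - sum(shots) / n

def mean_gap(trials=20000, seed=1, **kwargs):
    """Average the gap over many simulated sequences (undefined ones dropped)."""
    rng = random.Random(seed)
    gaps = [g for g in (streak_gap(rng=rng, **kwargs) for _ in range(trials))
            if g is not None]
    return sum(gaps) / len(gaps)

# With no hot hand built into the shooter at all, the average gap is negative.
print(mean_gap())
```

Because every shot here is an independent coin flip, a negative average can only come from the way the shots are selected and averaged within each finite sequence, which is the point of the passage above.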

Sounds wrong, doesn’t it? Those are your finely tuned intuitions talking to you; yet they’re wrong. . . .

I’ll finish off this series “tomorrow.™” In the meantime, read this problem & answer the three questions that pertain to it.

*Reference*

Gilovich, T., Vallone, R. & Tversky, A. The hot hand in basketball: On the misperception of random sequences. *Cognitive Psychology* 17, 295-314 (1985).


## Reader Comments (12)

Dan,

That's what I was talking about with the skin-rash-gun-control test! Well, almost - I claimed that was I, II, and the first half of III - because the bias in III was identity based, so nothing elitist. I first claimed that all system 2 functioning must be triggered by some system 1 alarm heuristic, and then claimed that such an heuristic, being in system 1, was subject to biased malfunctions, such as failing to trigger when social feelings are very pleasant (and perhaps overtriggering when unpleasant).

However, note that overtriggering such an alarm heuristic should only result in extra work for system 2, not necessarily a biased outcome. Unless, of course, that system 2 work is itself error-prone, requiring yet another system 1 heuristic to watch over its shoulder and trigger even more system 2 work to check for and correct that error - and that second system 1 heuristic was suppressed. That seems to be needed in the hot-hand-fallacy-fallacy case, right?

Also, maybe this (I, II, and all of III) explains why you are resistant to doubting backfire following Wood&Porter (instead of backfiring on that yourself, hence not providing a proof of existence of backfire)?

On to the Margolis test:

a. 50

b. 50

c. 67

Although, now I am suspecting that the above test was too easy, hence there must be a trick in it somewhere... Or, maybe the trick is that there is no trick, but that I'd be on the lookout for a trick due to this discussion...

Just to show that I am capable of learning, I came up with this off the bat:

67

67

67

(You could have either side of the blue/blue chip, or red/red chip facing up.)

Took me longer to get to that the first time you asked:

http://www.culturalcognition.net/blog/2016/12/26/meta-probabilistic-thinking-quiz.html

Although thinking about it more, and like last time, I think that this is as much a language problem as it is one of logic and statistical reasoning.

It is interesting to me, however, that based on the previous exposure, although I didn't remember any of the details related to the discussion, I immediately had a different perspective on the syntax of the question than I did the first time.

That could just be a random effect (any given time being asked the question, I might interpret the language in a given way)...but I suspect that although my brain functioning hasn't been enhanced by my previous experience, at a subconscious level I've expanded my interpretive frame due to my previous experience. It's as if my vocabulary has been expanded - as often happens at a more subconscious level.

OK - am now seeing things as Joshua described: A and B both 67. I knew there was a trick! Because the event is the picking of the chip AND placing it down - so there are 6 (3 chips X 2 sides) events total. Then conditionalizing on blue (or red) being up removes 3 of them. Of the 3 remaining, 2 have same color down as up. Hence A and B are both 67%.
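The counting argument above can be checked by brute-force enumeration. The sketch below is only an illustration: the quiz itself is not reproduced in this thread, so the chip set (one blue/blue, one red/red, one blue/red) is an assumption inferred from the comments, and `prob_same_colour_down` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed chip set, inferred from the comments: blue/blue, red/red, blue/red.
chips = [("blue", "blue"), ("red", "red"), ("blue", "red")]

# Each outcome is (face up, face down); every chip can land either way up,
# giving 3 chips x 2 sides = 6 equally likely outcomes.
outcomes = [(chip[i], chip[1 - i]) for chip in chips for i in (0, 1)]

def prob_same_colour_down(up):
    """P(hidden face matches the visible face | visible face is `up`)."""
    showing = [o for o in outcomes if o[0] == up]
    matching = [o for o in showing if o[1] == up]
    return Fraction(len(matching), len(showing))

print(prob_same_colour_down("blue"))  # 2/3
print(prob_same_colour_down("red"))   # 2/3
```

Treating each side, rather than each chip, as the equally likely unit is what moves the answer from 1/2 to 2/3.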

Well, at least we know there's no partisan cognition going on, else we would have both given higher % to A (because blue) than B (because red).

Although, if NiV or Ecoute had given the 67% answer, would that have inspired me to change my mind?....

Hi Dan-

Nice post.

I think there is some truth in point #1---"There is no system 2 ex nihilo"---but I am not sure I fully agree. On the one hand, yes, with training humans can automate system 2 so that it becomes relatively intuitive/automatic/unconscious. On the other hand, isn't that system 1? I think of system 2 as deliberate processing, e.g. like using a tool, applying an algorithm, or appealing to a set of rules. In the Laplace book we talked about there is a nice quote:

To support this point, here are a couple of anecdotes that might help:

1. When I present the GVT analysis (or the simple Monty Hall version) to mathematicians and game theorists, the brilliant 5-sec response was not the typical pattern (It never happened actually). Instead I was impressed that they (often) did not voice an intuitive response. They seemed suspicious of intuition. They were not going to be fooled {experience was a dear teacher?}. Instead they (often) carefully and diligently attempted to fit it with the appropriate probability model or sampling process so they could crank out the implications using known rules of probability.

2. When I have presented GVT's analysis to statistician colleagues, their intuitive response was that GVT's analysis appeared to be under-powered. No one mentioned that it could be biased. Typically they voiced surprise when I mentioned that it is biased. Presumably the bias was obscured because the paired t-test seems so simple (each player has a pair of shooting percentages---one after streaks of hits and the other after streaks of misses). There are other

3. My initial explanation for why folks missed this for 30 years was that it was missed initially, and everyone else didn't scrutinize carefully because they assumed someone would have noticed an issue if it were there. That's why we didn't notice until 2 years after we began our empirically-oriented project. I revised that after speaking to statisticians who worked on the hot hand and looked for problems with GVT, but didn't see that problem. My current view is that there are many reasons why people missed it, but the main reason is that people didn't take the statistician equivalent of the cautious and methodical mathematician approach---model the data generating process and simulate. System 1 cannot be trusted.

P.s. Here is a minor correction:

You write:

1. They were interested in comparing two *conditional* probabilities: (i) Pr(success| recent success), (ii) Pr(success| recent failure)

2. In their analysis they compared two conditional probability *estimates*, not the probabilities themselves.

"Although, if NiV or Ecoute had given the 67% answer, would that have inspired me to change my mind?...."

Let's find out. My answer was 2/3, 2/3, 2/3. Do you still think the same? :-)

"Instead I was impressed that they (often) did not voice an intuitive response. They seemed suspicious of intuition. They were not going to be fooled {experience was a dear teacher?}."

Yes! Definitely! Mathematicians love puzzles where the intuitive answer is wrong, and present them to one another often. There are books full of such examples! We're always looking for the catch.

"No one mentioned that it could be biased. Typically they voiced surprise when I mentioned that it is biased."

When Dan first brought the subject up here, I had a look at the paper, and at the discussion of the flaw in it, and decided that the original paper was ambiguous about how they described one of the tests they did. They said they'd done a statistical test, but didn't give enough details about the calculation to tell if they had fallen into the trap or not. There was circumstantial evidence that they might have, but since they didn't show their working, just reported the test output, there was no way to tell. Thus a statistician reading the paper wouldn't have seen anything that stood out as actually wrong, because the erroneous reasoning was not actually presented in the paper.

See discussion here and here. The bit where I say:

"My current view is that there are many reasons why people missed it, but the main reason is that people didn't take the statistician equivalent of the cautious and methodical mathematician approach---model the data generating process and simulate."

If they don't specify what they did, how can you model it?

The problem is the compressed style of academic journal papers, where 'trivial' details of calculation are left out and only the bare bones of the method and the conclusions are reported. The reader is supposed to be able to fill in the details for themselves. So if the original author makes an error but doesn't show their working, readers reading it are quite likely to fill in the *correct* method in their head, and then say to themselves "that looks correct".

But if someone reading the paper fills in the *wrong* method in their head, and then says "Oh hang on, that's wrong!", it can highlight the fact that the method was never actually fully specified, nor was the data and working provided, so we have no idea whether the calculation was done correctly or not. It hasn't been checked.

The original scientific purpose of publishing results in journals was so that other scientists could check them. But there was a cultural shift sometime during the past century, and now people seem to expect the journal peer review to do the checking (which of course it doesn't) and journals are now seen as a record of achievement for the purposes of career progression. As such, it's now actually counterproductive for "career" academics to present enough data and details for someone else to be able to reproduce their results. Their attitude is: "Why should I make the data available to you, when your aim is to try and find something wrong with it?"

Hi NiV

Just saw your comment in my email inbox

I agree with much of what you say, pretty much all of it, save your one point about them not describing their test well enough to know what they were doing. While I agree that they could have been clearer, I didn't see any ambiguity when I read their description carefully.

They explicitly state that they perform a paired t-test. This means each shooter contributes 2 numbers to the test, no ambiguity there. So the only question is if there is any ambiguity in the numbers that they use. They write that they compare each player's shooting percentage after hitting the previous shot (or shots) to his/her shooting percentage after missing the previous shot (or shots). Those are the two numbers right there. Perhaps you intended to write that there is ambiguity in how to calculate these percentages? This is possible, but my sense is that the most natural calculation is the one that they used; the alternative non-biased calculations are lossy and a bit weird. I don't think my sense is unique. To add evidence to this, if it were ambiguous to reasonably trained readers then the two replications of their study -- Koehler & Conley (2003) and Avugos et al. (2013) -- wouldn't have fallen victim to the exact same bias, but they did. Further, the most critical paper of all, from the statistician Robert Wardrop @ U. Wisc. - Madison (1999), ran simulations of their test and concluded that it was under-powered for the purposes of identifying the hot hand at the individual level (which is true), but he did not note the bias.

"Just saw your comment in my email inbox"

No problem. Thanks for replying!

"They write that they compare each player's shooting percentage after hitting the previous shot (or shots) to his/her shooting percentage after missing the previous shot (or shots). Those are the two numbers right there."

It depends on whether they total up *all* the shots after hitting/missing previous shots, or if they do something like first taking a percentage per player, and then averaging/comparing players. From the way they mentioned z statistics, my assumption was that they first grouped by player, and then considered the percentage for each player as an approximately Normal distribution, and tested that individually, which is fine (but weaker than combining all the players). But then they combine all the player data by considering whether the spread of player percentages - above or below expected - was reasonable, which I'm not so sure about.

The effect (as Dan describes it) is the result of averaging averages with different sample sizes. If you count *all* the cases of P(H|H), P(M|H), etc. across all the data, you get the expected probabilities. But if you first divide the data into fixed-size blocks, calculate a percentage for each block, and then look at the mean percentage over all the blocks, it's biased. This is because different blocks have different numbers of 'hit' and 'miss' runs, and therefore are using different sample sizes.

In the coin toss example with blocks of 3, pooling the counts over all eight possible blocks gives #(HH) / (#(HH) + #(HT)) =

(0+0+0+0+1+0+1+2) / (0+0+1+1+1+1+2+2) = 50%, which is the correct way to calculate it.

But if you average each block first...

( [drop 0/0 + 0/0 + ] 0/1 + 0/1 + 1/1 + 0/1 + 1/2 + 2/2) / 6 = 41.7%
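The two calculations can be reproduced by enumerating all eight length-3 blocks. This is a small illustrative sketch (variable names are made up for the example); it assumes a fair coin, so every block is equally likely, and it drops the blocks in which no flip follows a head, exactly as in the arithmetic above.

```python
from fractions import Fraction
from itertools import product

hits_after_h = []   # per block: number of H's that immediately follow an H
flips_after_h = []  # per block: number of flips that immediately follow an H
for block in product("HT", repeat=3):
    followers = [block[i + 1] for i in range(2) if block[i] == "H"]
    hits_after_h.append(followers.count("H"))
    flips_after_h.append(len(followers))

# Pool the counts across all blocks first, then take one ratio.
pooled = Fraction(sum(hits_after_h), sum(flips_after_h))

# Take a ratio within each block first (skipping 0/0), then average the ratios.
ratios = [Fraction(h, f) for h, f in zip(hits_after_h, flips_after_h) if f > 0]
block_average = sum(ratios) / len(ratios)

print(pooled)         # 1/2
print(block_average)  # 5/12
```

5/12 is the 41.7% figure; the pooled ratio stays at exactly one half.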

(In general I think it *is* possible to combine averages with different sample sizes, but you have to do a *weighted* average, giving more weight to those contributors with a smaller variance. I've not followed that line of thought any further, though.)

It's evident that the answers given by the latter average-of-averages approach have got little to do with the objective hot-hand probabilities themselves, because you can collect all the data first, and only then decide what block length to use afterwards. A statistic that gives different answers for the exact same data set, depending on processing decisions that are *only* made after the experiment has been completed, is obviously not giving you an objective view of the data.

Some of the results in the GVT paper are OK. In particular, I think the test on each individual player is a valid test of whether that player exhibits a 'hot hand'. But in the bits where all the players together are considered collectively, I'm not sure how the test boundaries are calculated, or if it matters.

Although to be fair, I didn't read the GVT paper as carefully as I might. I could have missed something.

Hi NiV

You mention: "Although to be fair, I didn't read the GVT paper as carefully as I might. I could have missed something."

Thank you for the honesty. Yes, you did indeed miss something, please allow me to attempt to explain in a different way.

In a paired t-test, which is the test they use and mention that they use (not the z-test you mention), you compute two numbers per player, and then compute the difference for each player, d_i:=Prop_i(hit after hit)-Prop_i(hit after miss). There is no ambiguity in how to compute this number, it is not an average of averages, it is a difference in proportions from a single sequence. Now the assumption of a paired t-test is that for each player i, the difference d_i is drawn from a normal distribution with unknown variance, and the null hypothesis is that the mean of this distribution is zero. There are two mistakes here: for an i.i.d. process (and AR(k) processes): (1) the distribution is not normal from a single finite sequence, (2) the mean is not zero in a single finite sequence (except for the rare AR process). There is another equivalent way to frame the paired t-test in which each proportion is drawn from its own normal distribution with its own mean and the null hypothesis is that the means are the same. In any event that is not correct either.

You mention: "In particular, I think the test on each individual player is a valid test of whether that player exhibits a 'hot hand'."

Well, the individual-level comparisons in their conditional probability tests are biased and invalid. Again, it is not an issue of averaging averages; the bias is there when you test for a *single* sequence.

Further, three of their individual tests are entirely redundant: (i) runs, (ii) auto-correlation, (iii) prop(hit after 1 hit) vs. prop(hit after 1 miss).

To re-iterate: your concern does not apply to the two numbers *for each player*, there is only one way to compute it, which is again why I see no ambiguity in the computation and why the three other sets of authors who looked at their measures & tests in detail saw no ambiguity either.

"There are two mistakes here: for an i.i.d. process (and AR(k) processes): (1) the distribution is not normal from a single finite sequence, (2) the mean is not zero in a single finite sequence (except for the rare AR process)."

I'm still not sure if I understand - although at least now I can see there is an issue here.

The statistic in question is something like #(HH)/#(Hx) - #(MH)/#(Mx) where #(MH) is the number of miss-then-hit sequences, and #(Mx) is the number of miss-anything sequences. Each count has something similar to but not quite a Binomial distribution, and the ratio of two Binomials gives something that strictly has no mean or variance, because of the presence of the 0/0 possibility, but if we drop these does have a small bias.

I've tried playing with a few examples, though, and the bias in the mean comes out tiny compared to the spread in the observations, by an order of magnitude. The smallest sample size shown in table 1 is 8, which is presumably going to give the biggest distortion. The difference of proportions statistic has a standard deviation of about 0.28 and a mean of about 0.03. Treating this as zero is an approximation, but it seems like a good one.

I'm interested, now, and plan to explore the question further, but am I on the right track? That the problem is that both numerator *and denominator* in each proportion is a random variable, and a ratio of random variables can be heavily distorted and heavy-tailed if the denominator has non-negligible weight near to zero. That's a *completely* different issue to the one I thought that you and Dan were talking about!

Am I still misunderstanding?

OK, I've been tinkering with some R code to see if I can understand this.

The following code looks at the distribution of the statistic for coin-toss sequences of length 100 - it takes about 5 minutes to crunch the numbers on my laptop. It shows a slight bias in the mean of about -0.01, compared to the standard deviation of 0.1. The blue curve is a Normal distribution with the same mean and standard deviation, the green curve shows what happens if we assume the mean is zero.

It's an approximation, but it looks like a reasonably good one to me. Is this the effect you're talking about?

    # -------------------------------------------
    trial = function(n,p) {
      # Generate a length n sequence of Heads and Tails
      tossn = paste(ifelse(rbinom(n,1,p),"H","T"),collapse="")
      # Pull out all the length 2 substrings
      subs2 = sapply(1:(n-1),function(i)substring(tossn,i,i+1))
      # Count how many of each combination occur
      freqtable = table(factor(subs2,levels=c("HH","HT","TH","TT")))
      # Calculate difference of proportions as required and return it
      unname(freqtable["HH"]/(freqtable["HH"]+freqtable["HT"]) - freqtable["TH"]/(freqtable["TH"]+freqtable["TT"]))
    }

    dist = function(n,p) {
      # Repeat the trial a million times to get a distribution
      d = replicate(1000000,trial(n,p))
      # Remove invalid values
      d = d[!is.na(d)]
      # Show a histogram of the distribution
      hist(d,breaks=20,col="red",main="Distribution of Hot Hand Statistic",xlab="Hot Hand Statistic")
      # Superimpose mean, 2-sigma lines, and Normal distribution curves at actual and assumed means
      abline(v=c(mean(d),mean(d)-2*sd(d),mean(d)+2*sd(d)),lwd=2)
      lines(seq(-1,1,0.01),0.05*length(d)*dnorm(seq(-1,1,0.01),mean(d),sd(d)),lwd=2,col="blue")
      lines(seq(-1,1,0.01),0.05*length(d)*dnorm(seq(-1,1,0.01),0,sd(d)),lwd=2,col="green")
      # Return the mean and standard deviation of the distribution
      c(mean=mean(d),sd=sd(d))
    }

    dist(100,0.5)
    # ----------------------------------------------------------------

The resulting image looks like this:

http://tinypic.com/r/2cmkvtz/9

Hi NiV

Yes.

One version of the statistic is d1_i = #HH/(#Hx) - #MH/(#Mx), and we are talking about its distribution conditional on the sequence having at least one hit and one miss in the first n-1 trials.
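For very short sequences the mean of this statistic can be computed exactly instead of simulated. The sketch below is an illustration under stated assumptions: hits and misses are equally likely (so every sequence has the same weight), sequences where either proportion is 0/0 are dropped, which is the same as requiring at least one hit and one miss in the first n-1 trials, and the names `d1` and `exact_mean_d1` are invented for the example.

```python
from fractions import Fraction
from itertools import product

def d1(seq):
    """#HH/#Hx - #MH/#Mx for one sequence, or None when either share is 0/0."""
    pairs = list(zip(seq, seq[1:]))
    after_hit = [b for a, b in pairs if a == "H"]
    after_miss = [b for a, b in pairs if a == "M"]
    if not after_hit or not after_miss:
        return None
    return (Fraction(after_hit.count("H"), len(after_hit))
            - Fraction(after_miss.count("H"), len(after_miss)))

def exact_mean_d1(n):
    """Average d1 over all equally likely length-n sequences where it is defined."""
    values = [d for d in (d1(s) for s in product("HM", repeat=n)) if d is not None]
    return sum(values) / len(values)

# Exact means for a few short sequence lengths.
for n in (3, 4, 8):
    print(n, exact_mean_d1(n))
```

For n = 3 and n = 4 the mean works out to -1/2 and -1/3, so even a shooter with no serial dependence at all has an expected difference below zero in a finite sequence.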

In the original paper, and in papers that follow, they talk about the difference in shooting percentage when on a streak of hits vs. on a streak of misses, and it is common to define a streak as having hit three or more shots in a row, so we are interested in d3_i = #HHHH/(#HHHx) - #MMMH/(#MMMx). We have a figure representing the sampling distribution of this statistic for a sequence of 100 shots in our paper, analogous to the one you have an image of (actually ours is conditional on 50 hits/50 misses, but that is for the purposes of illustrating the permutation test). The bias is meaningfully large in this case, around 8pp for the mean, 6pp for the median.

There is a trade-off when deciding what length of streak to condition on. The shorter the streak, the larger the measurement error; the longer the streak, the smaller the sample size (and more bias).