## Making sense of the " 'hot hand fallacy' fallacy," part 1

*It never fails! My own best efforts (here & here) to explain the startling and increasingly notorious paper by Miller & Sanjurjo have prompted the authors to step forward and try to restore the usual state of perfect comprehension enjoyed by the 14.3 billion regular readers of this blog. They have determined, in fact, that it will take three separate guest posts to undo the confusion, so apparently I've carried out my plan to a [GV]T.*

*As cool as the result of the M&S paper is, I myself remain fascinated by what it tells us about cognition, particularly among those with exquisitely fine-tuned statistical intuitions. How did the analytical error they uncovered in the classic "hot hand fallacy" studies remain undetected for some thirty years, and why does it continue to provoke stubborn resistance on the part of very very smart people?? To Miller & Sanjurjo's credit, they have happily and persistently shouldered the immense burden of explication necessary to break the grip of the pesky intuition that their result "just can't be right!"*

**Joshua B. Miller & Adam Sanjurjo**

Thanks for the invitation to post here Dan!

Here’s our plan for the upcoming 3 posts:

- Today’s plan: a bit of the history of the hot hand fallacy; a clear statement of the bias we find; an explanation of why it invalidates the main conclusion of the original hot hand fallacy study (1985); and a demonstration that correcting for the bias flips the conclusion of the original data, which can now be used as evidence supporting the existence of meaningfully large hot hand shooting.
- Next post: Provide a deeper understanding of how the bias emerges.
- Final post: Go deeper into potential implications for research on the hot hand effect, hot hand beliefs, and the gambler’s fallacy.

**Part I**

In the seminal hot hand fallacy paper, Gilovich, Vallone and Tversky (1985; “GVT”; also see Tversky & Gilovich’s 1989 “Cold Facts” summary paper) set out to conduct a truly informative scientific test of hot hand shooting. After studying two types of in-game shooting data, they conducted a controlled shooting study (experiment) with the Cornell University men’s and women’s basketball teams. This was an effective "...method for eliminating the effects of shot selection and defensive pressure" that were present as confounds in their analysis of game data (we will return to the issue of game data in a follow-up post; for now, see the first page of Dixit & Nalebuff’s 1991 classic book “Thinking Strategically,” and this comment on Andrew Gelman’s blog). While common usage of the term “hot hand” is vague and complex, everybody agrees that it refers to a *temporary* elevation in a player’s ability, i.e. in the probability of a successful shot. Because the hot state is *unobservable* to the researcher (though perhaps not to the player, teammate, or coach!), we cannot simply measure a player’s probability of success in the hot state; we need an operational definition. A natural idea is to take a streak of sufficient length as a good signal of whether or not a player is in the hot state, and to define a player as having the hot hand if his/her probability of success is greater after a streak of successful shots (hits) than after a streak of unsuccessful shots (misses). GVT designed a test for this.

Suppose we wanted to test whether Stephen Curry has the hot hand; how would we apply GVT’s test to Curry? The answer is that we would have Curry attempt 100 shots at locations from which he is expected to have a 50% chance of success (like a coin). Next, we would calculate Curry’s field goal percentage on those shots that immediately follow a streak of successful shots (hits), and test whether it is bigger than his field goal percentage on those shots that immediately follow a streak of unsuccessful shots (misses); the larger the difference that we observe, the stronger the evidence of the hot hand. GVT performed this test on the Cornell players, and found that this difference in field goal percentages was statistically significant for only one of the 26 players (two sample t-test), which is consistent with the chance variation that the coin model predicts.
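The streak statistic just described is easy to state in code. Below is a minimal sketch (in Python; the function name, the 0/1 encoding, and the toy shot sequence are ours, not GVT's) that computes a shooter's field goal percentage on shots immediately following three (or more) straight hits, and on shots immediately following three (or more) straight misses:

```python
def streak_pcts(shots, k=3):
    """Field goal percentage on shots immediately following k (or more)
    consecutive hits, and on shots immediately following k (or more)
    consecutive misses. shots is a list of 1s (hits) and 0s (misses)."""
    after_hits, after_misses = [], []
    for i in range(k, len(shots)):
        prev = shots[i - k:i]           # the k shots just before shot i
        if all(s == 1 for s in prev):
            after_hits.append(shots[i])
        elif all(s == 0 for s in prev):
            after_misses.append(shots[i])
    pct = lambda g: sum(g) / len(g) if g else None
    return pct(after_hits), pct(after_misses)

# Toy 12-shot sequence: 1 = hit, 0 = miss.
print(streak_pcts([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1]))
# → (0.3333333333333333, 1.0)
```

Note that for streaks of "three or more" it is enough to check the three immediately preceding shots, which is what the function does; the larger the gap between the two returned percentages, the stronger the apparent evidence of the hot hand.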

Now, one can ask oneself: if Stephen Curry doesn’t get hot, that is, for each of his 100 shot attempts he has exactly a 50% chance of hitting his next shot, then what would I expect his field goal percentage to be when he is on a streak of three (or more) hits? Similarly, what would I expect his field goal percentage to be when he is on a streak of three (or more) misses?

Following GVT’s analysis, one can form two groups of shots:

Group “3hits”: all shots for which the previous three (or more) shots were hits;

Group “3misses”: all shots for which the previous three (or more) shots were misses.

From here, it is natural to reason as follows: if Stephen Curry always has the same chance of success, then he is like a coin, so we can consider each group of shots as independent; after all, each shot has been assigned at random to one of three groups: “3hits,” “3misses,” or neither. So far this reasoning is correct. Now, GVT (implicitly) took this intuitive reasoning one step further: because all shots, which are independent, have been assigned at random to each of the groups, we should expect the field goal percentages to be the same in each group. This is the part that is wrong.

Where does this seemingly fine thinking go wrong? The first clue that there is a problem is that the variable being used to assign shots to groups also shows up as the response variable in the computation of the field goal percentage, though this does not fully explain the problem. The key issue is that there is a bias in how shots are being selected for each group. Let’s see this by first focusing on the “3hits” group. Under the assumptions of GVT’s statistical test, Stephen Curry has a 50% chance of success on each shot, i.e. he is like a coin: heads for hit, and tails for miss. Suppose we plan on flipping a coin 100 times, then selecting at random among the flips that are immediately preceded by three consecutive heads, and finally checking to see whether the flip we selected is a heads or a tails. Before we flip, what is the probability that the flip we end up selecting is a heads? The answer is that this probability is not 0.50, but 0.46! Herein lies the selection bias. The flips that are being selected for analysis are precisely the flips that are immediately preceded by three consecutive heads. Returning to the world of basketball shots, this way of selecting shots for analysis implies that for the “3hits” group, there would be a 0.46 chance that the shot we select is a hit, and for the “3misses” group, there would be a 0.54 chance that the shot we select is a hit.
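The 0.46 figure can be checked by simulation. The sketch below (assuming the selection procedure just described: within each 100-flip sequence, compute the proportion of heads among flips immediately preceded by three consecutive heads, then average those proportions across sequences, discarding sequences with no qualifying flip) is ours, not code from the paper:

```python
import random

random.seed(1)

def avg_conditional_pct(n=100, k=3, trials=20_000):
    """Average, across simulated n-flip sequences, of the proportion of
    heads among flips immediately preceded by k consecutive heads.
    Sequences with no such flip are discarded."""
    props = []
    for _ in range(trials):
        flips = [random.randint(0, 1) for _ in range(n)]  # 1 = heads
        sel = [flips[i] for i in range(k, n) if all(flips[i - k:i])]
        if sel:
            props.append(sum(sel) / len(sel))
    return sum(props) / len(props)

est = avg_conditional_pct()
print(round(est, 3))  # close to 0.46, not 0.50
```

The 20,000-trial count is arbitrary; more trials simply tighten the estimate around the roughly 0.46 expectation.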

Therefore, if Stephen Curry does not get hot, i.e. if he always has a 50% chance of success for the 100 shots we study, we should expect him to shoot 46% after a streak of three or more hits, and 54% after a streak of three or more misses. This is the order of magnitude of the bias that was built into the original hot hand study, and it is the bias depicted in Figure 2 on page 13 of our new paper (a simpler version of the figure is below). The bias is large in basketball terms: a difference of more than 8 percentage points is nearly the difference between the median NBA three-point shooter and the very best. Another way to look at the bias is to imagine inviting 100 players to participate in GVT’s experiment, with each player shooting from positions at which the chance of success on each shot is 50%. For each player, check whether his/her field goal percentage after a streak of three or more hits is higher than his/her field goal percentage after a streak of three or more misses. For how many players should we expect this to be true? Correct answer: 40 out of 100 players.
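The 100-player thought experiment can likewise be simulated. In this sketch (our own construction; we count a player only when both streak groups are non-empty, and we use 20,000 simulated players rather than 100 purely for precision), well under half of the simulated 50% shooters end up looking "hot" by the streak comparison:

```python
import random

random.seed(7)

def frac_better_after_hits(n=100, k=3, players=20_000):
    """Fraction of simulated 50% shooters whose FG% after k straight hits
    exceeds their FG% after k straight misses. Players lacking either kind
    of streak are discarded."""
    wins = valid = 0
    for _ in range(players):
        s = [random.randint(0, 1) for _ in range(n)]   # 1 = hit
        ah = [s[i] for i in range(k, n) if all(s[i - k:i])]      # after k hits
        am = [s[i] for i in range(k, n) if not any(s[i - k:i])]  # after k misses
        if ah and am:
            valid += 1
            if sum(ah) / len(ah) > sum(am) / len(am):
                wins += 1
    return wins / valid

est = frac_better_after_hits()
print(round(est, 2))  # well below 0.5
```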

This selection bias is large enough to invalidate the main conclusion of GVT's original study, without having to analyze any data. However, beyond this “negative” message, there is also a way forward. Namely, we can re-analyze the original Cornell dataset in a way that is invulnerable to the bias. It turns out that when we do this, we find considerable evidence of the hot hand in the data. First, if we look at Table 4 in GVT (page 307), we see that, on average, players shot around 3.5 percentage points better when on a hit streak of three or more shots, and that 64% of the players shot better when on a hit streak than when on a miss streak. While GVT do not directly analyze these summary averages, given our knowledge of the bias, they are telling (in fact, you can do much more with Table 4; see Kenny LJ respond to his own question here). With the correct analysis (described in the next post), there is statistically significant evidence of the hot hand in the original data set, and, as can be seen in Table 2 on page 23 of our new paper, the point estimate of the average hot hand effect size is large (further details in our “Cold Shower” paper here). Adjusting for the bias, what one now finds is that: (1) hitting a streak of three or more shots in a row is associated with an expected 10 percentage point boost in a player’s field goal percentage; (2) 76% of players have a higher field goal percentage when on a hit vs. miss streak; and (3) 4 out of 26 players have a large enough effect to be individually significant by conventional statistical standards (p<.05), which is itself a statistically significant count of significant effects, by conventional standards.

In a later post, we will return to the details of GVT’s paper, and talk about the evidence for the hot hand found across other datasets. If you prefer not to wait, please take a look at our Cold Shower paper, and related comments on Gelman’s blog.

In the next installment, we will discuss the counter-intuitive probability problem that reveals the bias, and explain what is driving the selection bias there. We will then discuss some common misconceptions about the nature of the selection bias, and some very interesting connections with classic probability paradoxes.

## Reader Comments (12)

Cool! I look forward to the explanation of why my previous refutation was wrong!

@NiV--

you should restate it, I suspect. I promise I won't say, "we've discussed this before, no?"

Easier to link it. See the comments sections here and here.

In summary, it's due to averaging group percentages based on different sample sizes.

(a + b + c + ...) / (d + e + f + ...) is not generally equal to a/c + b/d + e/f + ... .

Ooops!

Should be:

(a + b + c + ...) / (d + e + f + ...) is not generally equal to a/d + b/e + c/f + ... .

Why is that? What is the equation that gives 0.46 as the result?

Also, how do you express the question more precisely?

Is it: P(?=H When ...?HHH...) = 0.46 ?

Using that notation your claim is P(?=H When ...?HHH...) > 0.5, yes?

@Cortlandt:

Don't want to deter M&S from answering, but if we sample from a finite sequence of 100 coin tosses, P(H) = 0.50 but P(H|HHH) = 0.46.

The math takes a lot of work. Easier just to simulate.

"Why is that? What is the equation that gives 0.46 as the result?"

The process is that you toss a coin 100 times, in which there will be a *variable* number of instances of three successive heads (HHH). For each set of 100 tosses you find the fraction of HHH instances that are followed by another H. You then find the average of these fractions. This is what gives the 0.46.

This is akin to conducting elections in districts each containing 100 voters, evenly split between Democrat and Republican supporters, in each of which a variable number of voters choose to vote. In the first district, 10 voters vote, and they all vote Republican. That's 10/10 Republican. In the next ten districts, one voter votes, and he or she votes Democrat. That's 0/1, 0/1, 0/1, ..., 0/1. Dan's question is equivalent to asking: what is the probability that a voter who chooses to vote votes Republican?

Dan says that if you look at the 11 fractions, we've got one 100% and ten 0% results, the average percentage is 9.1%. Only 9.1% of voting voters vote Republican when you survey 100-voter districts.

I say that if you look at the twenty voters who voted, you've got ten of them voting Republican. That's 50%.

The argument is over which is the correct method to calculate what the probability is of a voter who chooses to vote voting Republican. And in particular, whether a survey company surveying 100 people and finding 50% of likely voters planning to vote Republican was counter-intuitively evidence of a massive swing in favour of Republicanism, since we should only expect 9.1% to have said so when the population as a whole is evenly balanced.
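The two ways of averaging in the district example above can be checked in a few lines (a minimal sketch of the commenter's arithmetic, with the eleven districts hard-coded):

```python
# One district with 10 Republican votes out of 10 cast; ten districts
# with 0 Republican votes out of 1 cast.
districts = [(10, 10)] + [(0, 1)] * 10   # (Republican votes, votes cast)

# Average of the per-district percentages:
avg_of_pcts = sum(r / n for r, n in districts) / len(districts)

# Pooled percentage across all voters who actually voted:
pooled = sum(r for r, n in districts) / sum(n for r, n in districts)

print(round(avg_of_pcts, 3), pooled)  # → 0.091 0.5
```

The two answers differ because the per-district fractions have unequal denominators, which is exactly the point of the (a + b + ...)/(d + e + ...) identity quoted earlier in the thread.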

Likewise, simulating the coin tosses and counting instances of HHHH versus HHHT, the simulation I showed in one of my earlier posts found that P(H|HHH) was still 0.5, if you calculated it by my method.

"Also, how do you express the question more precisely? Is it: P(?=H When ...?HHH...) = 0.46 ?"

Try P( x(i+3) = H given ( x(i) = H and x(i+1) = H and x(i+2) = H ) ), where x(i) is the ith result in a sequence of statistically independent fair coin tosses.

@NiV:

M&S will, I'm sure, have something to say. But what I would say, have said, is that your critique doesn't recognize what the question is.

It's not whether Pr(H) in a coin toss is 0.50. It's not whether Pr(H|HHH) = 0.50 if one just starts flipping coins and observing the result.

It is what the null should be when testing for the hot hand in a sample of past performances, & whether a particular set of studies used the wrong one.

(also, the problem with such a sampling strategy has *nothing* to do with the 'variable' number of P(H|HHH) in sets of 100 tosses; you are fixated on the 4-coin toss illustration, which I believe is an attractive nuisance in the M&S paper.)

"M&S will, I'm sure, have something to say."

There's no hurry. They should take as much time as they need to make the explanation clear.

"But what I would say, have said, is that your critique doesn't recognize what the question is."

The question is: why is P(H | HHH) = 0.46 instead of 0.50? What's the formula that gives this? That was what was being asked.

Just saying "We don't understand why, the computer told us that answer" is unsatisfactory. We need to understand why our familiar intuition is 'wrong', or it will stay 'wrong' and we'll be liable to make the same mistake again elsewhere. In particular, if I'm making a mistake somewhere in my argument above, I want to know about it.

"It's not whether Pr(H) in a coin toss is 0.50. It's not whether Pr(H|HHH) = 0.50 if one just starts flipping coins and observing the result."

Good! Then please can everyone stop saying P(H | HHH) is not 0.50!

"It is what the null should be when testing for the hot hand in a sample of past performances, & whether a particular set of studies used the wrong one."

My reading of the paper was that they didn't say. They gave insufficient detail about how they tested significance to determine their method. Some more specific quotes or calculations to show that they actually did would be useful. Got any?

"(also, the problem with such a sampling strategy has *nothing* to do with the 'variable' number of P(H|HHH) in sets of 100 tosses; you are fixated on the 4-coin toss illustration, which I believe is an attractive nuisance in the M&S paper.)"

I hadn't even mentioned the 4-toss example up to now on this occasion! I was talking entirely about your 100-toss example. They work by the same principle though, as I showed with my R code the time before last. Correctly calculated, simulation of the 100-toss example still gives P(H | HHH) = 0.50.

I don't understand what your objection is to the 4-toss example. It seems to me to be easier to see what's going on with a smaller, simpler example. Do you think the maths works differently for 4 compared to 100?

@NiV:

I hope they will reply to you! The most important thing for them left to do is figure out a way to convey their result that overcomes persistent disbelief by intelligent & reflective people.

No one-- or no one who understands the M&S paper-- says "Pr(H|HHH) < 0.50" period; they say that "Pr(H|HHH) < 0.50 when sampling from a finite sequence without replacement." The same way that Pr(♥|♥♥♥) < 0.25 when one is dealing from a single deck of cards.

If someone did a study of whether poker players get "hot hands" & found that the mean Pr(♥|♥♥♥)=0.25 in a sample of 1,000 100-hand sessions, that would be very strong proof that "hot hands" exist in poker. (One can see the logic of this w/o doing the math or running the simulations--but feel free!)

GVT didn't get this. They purported to find that there's no "hot hand" in basketball b/c Pr(Hit|Hit Hit Hit) = Pr(Hit) when sampling from finite sequences of shots by various players. In fact, if that is what their data show, then their data *support* the inference of "hot hands."

The paper says nothing more than that. It really is that simple.

"Pr(H|HHH) < 0.50 when sampling from a finite sequence without replacement."

OK, let's consider that.

I first generate a single sequence of 100,000 coin tosses. I count up the instances of HHH, determine what proportion are followed by another H, and find the answer very close to 0.5.

Now I take the *same* sequence and I split it up into 1,000 blocks of 100 coin tosses each. I count the number of instances of HHH in each, and the proportion of these followed by another H, and average them, and we get 0.46. If this is the probability P(H|HHH) then the probability has changed retrospectively!

Let's split the same sequence up into 20,000 blocks of 5 coin tosses each. If we find the equally likely possibilities that don't come out 0/0 (why are we allowed to throw these away?) then we get:

HHHTT 0/1

HHHTH 0/1

HHHHT 1/2

HHHHH 2/2

THHHT 0/1

THHHH 1/1

Adding up the numerators and denominators separately, we've got 0+0+1+2+0+1 = 4 following heads out of 1+1+2+2+1+1 = 8 triple-head sequences, so the overall odds are as expected.

But looking at the average of the proportions, we get (0/1 + 0/1 + 1/2 + 2/2 + 0/1 + 1/1)/6 = 2.5/6 = 0.417.
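These hand counts can be verified by exhaustively enumerating all 32 equally likely five-toss sequences (a quick sketch, not code from the thread):

```python
from itertools import product

# For each 5-toss sequence, the fraction of HHH instances (that have a
# following toss) which are followed by H; sequences with no such
# instance (the 0/0 cases) are discarded.
fracs, num, den = [], 0, 0
for seq in product('HT', repeat=5):
    followers = [seq[i + 3] for i in range(2)
                 if seq[i:i + 3] == ('H', 'H', 'H')]
    if followers:
        fracs.append(followers.count('H') / len(followers))
        num += followers.count('H')
        den += len(followers)

# Six qualifying sequences; average of fractions vs pooled ratio.
print(len(fracs), round(sum(fracs) / len(fracs), 3), num / den)
# → 6 0.417 0.5
```

The enumeration recovers exactly the six sequences listed above, the 2.5/6 ≈ 0.417 average of fractions, and the 4/8 = 0.5 pooled ratio.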

The reason these expressions are not *algebraically* the same is that the denominators are not all the same. The denominators represent the 'sample size' for the count of HHH instances. (Like the number of voters who choose to vote in a district.)

But you're telling me that the *actual* reason they're not the same is that we're sampling from a finite sequence without replacement? That picking a block of 5, there are a fixed number of heads in the 'deck', and that after having observed some (the HHH) there are fewer of them 'left' to appear in the subsequent places? So P(H|HHH) depends on the size of block we decide to split the sequence into, which we only decide after we've tossed the coins? And which we can then change again?

Can you show me how that works?

Unlike card decks, we seem to have *different* numbers of H's in the sequences above, so presumably you mean something else. Is there a better analogy for this than cards?

This also seems to mean that we really ought to be writing P(H|HHH and B), where B is our (random?) decision about block size, since it affects the result. The decision about B can occur at a later time than the H occurring.

This isn't necessarily backwards-in-time causality - Like Sherlock Holmes, we can talk about the probability of earlier unobserved events conditionally on later observed ones. But this interpretation would raise some interesting questions too, since we wouldn't be keen on the idea that the sequence of H's influenced our later choice of block size, either, or that both are influenced by some common event in the past of both. Isn't our choice of B a matter of free will?

I'm finding this an interesting debate! Trying to come up with better explanations for somebody else helps develop a deeper intuition about it for myself. Many thanks!

@NiV:

You still aren't addressing the questions that M&S do: how to calculate the null when sampling from a finite sequence of performances to test the hot hand hypothesis, & did GVT do that correctly? If you were, you'd see that you are reproducing M&S's critique of GVT (more or less; you haven't filled out the entire sample space in your 5-toss coin illustration). Look at p. 21 of their paper, e.g.: they show there will be a "bias" toward observing "tails" after successive heads if one samples from a finite sequence; you are showing the same thing in your point about "different size denominators." They are trying to illustrate GVT's error; you keep thinking they are making the claim that Pr(H|HHH) < 0.50 in a binomial distribution.

On your thought experiment: try it w/ 1000 decks of cards. Shuffle them all together. Deal them all, recording the fraction of times in which you get a 4th ♥ after 3 consecutive ♥'s. You will indeed find that Pr(♥|♥♥♥) is very close to 0.25.

Now take a single deck. Deal out all 52 cards, recording the fraction of times in which you get a 4th ♥ after 3 consecutive ♥'s. Reassemble the deck, shuffle, & repeat, 1000 times. Figure out the mean Pr(♥|♥♥♥) for a single deal of 52. I'm sure you'll agree with me it is < 0.25.
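The single-deck experiment is straightforward to simulate. This sketch (ours; it runs 20,000 deals rather than 1,000, purely for precision) averages the per-deal fraction of "three hearts in a row" instances that are followed by a fourth heart:

```python
import random

random.seed(3)

DECK = ['♥'] * 13 + ['x'] * 39   # 13 hearts plus 39 other cards

def deal_fraction():
    """One shuffled deal of 52: fraction of 'three hearts in a row'
    instances followed by a fourth heart (None if no instance)."""
    d = DECK[:]
    random.shuffle(d)
    inst = [d[i + 3] for i in range(49) if d[i:i + 3] == ['♥'] * 3]
    return inst.count('♥') / len(inst) if inst else None

fracs = [f for f in (deal_fraction() for _ in range(20_000))
         if f is not None]
est = sum(fracs) / len(fracs)
print(round(est, 3))  # noticeably below 0.25
```

Note that two effects push the estimate down: dealing without replacement (after three hearts, only 10 of the remaining 49 cards are hearts), and the averaging of per-deal fractions with unequal denominators discussed above.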

*Exactly* the same principle applies here. The probability of a "success" when randomly sampling w/o replacement from a finite sample of dichotomous ("success"/"failure") outcomes can't be modeled as a binomial distribution -- the proportion of "successes" remaining in the sample (just like the proportion of ♥'s in a deck of cards) changes after every draw (it goes down, obviously, after every "success"--and all the more so after a string of successes).

As the number of events in the sample gets arbitrarily large, though, the difference between Pr(Success|Success) and Pr(Success) will get smaller and smaller. B/c as the sample gets arbitrarily large, the difference between sampling w/o replacement & sampling w/ replacement -- or simply observing the operation of a Bernoulli process, flipping coins -- disappears.

Accordingly, if someone finds that Pr(H) = Pr(H|HHH) rather than Pr(H|HHH) < Pr(H) when sampling w/o replacement from a finite sample, that's evidence that the outcomes were not generated by a Bernoulli process. How strong that evidence is depends on the size of the sample.

If you calculate or simulate the difference between Pr(H) and Pr(H|HHH) when sampling from a 100-toss sequence w/o replacement, you'll find that there's still a pretty big difference -- 4% or so.

I don't know if the GVT data support the "hot hand conjecture" for basketball players. But I'm positive, after reading M&S a few times & learning to ignore the trap of looking for a mistake in the 4-toss sequence illustration, that GVT used the wrong null -- Pr(Hit) = Pr(Hit|Hit Hit Hit) -- in analyzing their data.

Sorry that M&S are not responding. I'm sure they'd take a different tack; indeed, their exposition is very much like the one you are making (as I said, you are making the same point they do!)