## Weekend update: Still fooled by non-randomness? Some gadgets to help you *see* the " 'hot hand' fallacy" fallacy

Well, I'm still obsessed with the " 'hot hand fallacy' fallacy." Are you?

As discussed previously, the classic "'hot hand' fallacy" studies purported to show that people are deluded when they perceive that basketball players and other athletes enjoy temporary "hot streaks" during which they display an above-average level of proficiency.

The premise of the studies was that ordinary people are prone to detect patterns and thus to mistake chance sequences of events (e.g., a consecutive string of successful dice rolls in craps) for evidence of some non-random process (e.g., a "hot streak," in which a craps player can be expected to defy the odds for a specified period of time).

For sure, people are disposed to see signal in noise.

But the question is whether that cognitive bias truly accounts for the perception that athletes are on a "hot streak."

The answer, according to an amazing paper by Joshua Miller & Adam Sanjurjo, is *no*.

Or in any case, they show that the purported *proof* of the "hot hand fallacy" itself reflects an alluring but *false* intuition about the conditional independence of binary random events.

The "test" the "hot hand fallacy" researchers applied to determine whether a string of successes indicates a genuine "hot hand" -- as opposed to the illusion associated with our over-active pattern-detection imaginations -- was to examine whether basketball players were more likely to hit shots after some specified string of "hits" than they were to hit shots after an equivalent string of misses.

If the success rates for shots following strings of "hits" were not "significantly" different from the success rates for shots following strings of "misses," then one could infer that the probability of hitting a shot after either a string of hits or misses was not significantly different from the probability of hitting a shot *regardless* of the outcome of previous shots. Strings of successful shots being no longer than what we should expect by chance in a random binary process, the "hot hand" could be dismissed as a product of our vulnerability to see patterns where they ain't, the researchers famously concluded.

**Wrong!**

This analytic strategy itself reflects a cognitive bias-- an understanding about the relationship of independent events that is intuitively appealing but in fact *incorrect*.

Basically, the mistake -- which for sure should now be called the " 'hot hand fallacy' fallacy" -- is to treat the conditional probability of success following a string of successes *in a past sequence* of outcomes as if it were the same as the conditional probability of success following a string of successes in a *future or ongoing sequence*. In the latter situation, the occurrence of independent events generated by a random process is (by definition) unconstrained by the past. But in the former situation -- where one is examining a *past* sequence of such events -- that's not so.

In the completed past sequence, there is a fixed number of each outcome. If we are talking about successful shots by a basketball player, then in a season's worth of shots, he or she will have made a specifiable number of "hits" and "misses."

Accordingly, if we examine the sequence of shots after the fact, the probability the next shot in the sequence will be a "hit" will *be lower* immediately following a specified number of "hits" for the simple reason that *the proportion of "hits" in the remainder of the sequence will necessarily be lower* than it was before the previous successful shot or shots.

By the same token, if we observe a string of "misses," the proportion of "misses" in the remainder will be lower than it had been before the first shot in the string. As a result, following a string of "misses," we can deduce that the probability has now gone up that the next shot in the sequence will turn out to have been a "hit."

Thus, it is *wrong* to expect that, on average, when we examine a past sequence of *random* binary outcomes, P(success|specified string of successes) will be *equal to* P(success|specified string of failures). Instead, in that situation, we should expect P(success|specified string of successes) *to be less than* P(success|specified string of failures).
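To make that concrete, here is a minimal sketch (in Python -- my choice, not the author's; the post's own gadgets use Excel and Stata) that enumerates every equally likely 4-toss sequence and averages, across sequences, the within-sequence frequency of a heads immediately following a heads:

```python
from itertools import product

def seq_prob_h_after_h(seq):
    """Within one sequence, the fraction of tosses immediately
    following an H that are themselves H (None if no H is followed)."""
    follows = [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == "H"]
    if not follows:
        return None
    return follows.count("H") / len(follows)

# Enumerate all 16 equally likely 4-toss sequences.
vals = [seq_prob_h_after_h(s) for s in product("HT", repeat=4)]
vals = [v for v in vals if v is not None]  # drop TTTT and TTTH (no H is followed)

avg = sum(vals) / len(vals)
print(round(avg, 4))  # 0.4048 -- i.e., below 0.5
```

Even though every toss is a fair coin, the average of the per-sequence conditional frequencies comes out at 17/42 ≈ 0.40, not 0.50.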

That means the original finding of the "hot hand fallacy" researchers that P(success|specified string of successes) **=** P(success|specified string of failures) in their samples of basketball player performances **wasn't** evidence that the "hot hand" perception is an illusion. **If P(success|specified string of successes) = P(success|specified string of failures) within an adequate sample of sequences, then we are observing *a higher success rate following a string of successes than we would expect to see by chance*.**

In other words, the data reported by the original "hot hand fallacy" studies *supported* the inference that *there was* a hot-hand effect after all!

So goes M&S's extremely compelling proof, which I discussed in a previous blog. The M&S paper was featured in Andrew Gelman's *Statistical Modeling, Causal Inference* blog, where the comment thread quickly frayed and broke, resulting in a state of total mayhem and bedlam!

How did the "hot hand fallacy" researchers make this error? Why did it go undetected for 30 yrs, during which their studies have been celebrated as classics in the study of "bounded rationality"? Why do so many smart people find it so hard now to accept that those studies themselves rest on a mistaken understanding of the logical properties of random processes?

The answer I'd give for all of these questions is the *priority of affective perception to logical inference.*

Basically, we *see* valid inferences before we apprehend, through ratiocination, the logical cogency of the inference.

What makes people who are *good* at drawing valid inferences good at that is that they more quickly and reliably *perceive* or *feel* the right answer -- or feel the *wrongness* of a seemingly correct but wrong one -- than those less adept at such inferences.

This is an implication of a conception of dual process reasoning that, in contrast to the dominant "System 1/System 2" one, sees unconscious reasoning and conscious effortful reasoning as *integrated and reciprocal* rather than *discrete and hierarchical*.

The "discrete & hierarchical" position imagines that people immediately form a heuristic response ("System 1") and then, *if* they are good reasoners, use conscious, effortful processing ("System 2") to "check" and if necessary revise that judgment.

The "integrated and reciprocal" position, in contrast, says that good reasoners are more likely to experience an unconscious *feeling* of the incorrectness of a wrong answer, and of the need for effortful processing to determine the right answer, than are people who are poor reasoners.

The *reason* the former are more likely to *feel* that right answers are right and wrong answers wrong is that they have used their proficiency in conscious, effortful information processing to *train* their intuitions to alert them to the features of a problem that require the deployment of conscious, effortful processing.

Now what makes the *fallacy* inherent in the " 'hot hand fallacy' fallacy" so hard to detect, I surmise, is that those who've acquired reliable feelings about the wrongness of treating independent random events as dependent (the most conspicuous instance of this is the "gambler's fallacy") will in fact have trained their intuitions to recognize as *right* the corrective method of analyzing such events as genuinely independent.

If the "hot hand" perception is an illusion, then it definitely stems from mistaking an independent random process for one that is generating systematically interdependent results.

So fix it -- by applying a test that treats those same events as independent!

That's the intuition that the "hot hand fallacy" researchers had, and that 1000's & 1000's of other smart people have shared in celebrating their studies for three decades -- but it's **wrong wrong wrong wrong wrong!!!!!**

But because it *feels* **right right right right right** to those who've trained their intuitions to avoid heuristic biases involving the treatment of independent events as interdependent, it is super hard for them to accept that the method reflected in the "hot hand fallacy" studies is indeed incorrect.

So how does one fix that problem?

Well, no amount of *logical* argument will work! One must simply *see* that the right result is *right* first; only *then* will one be open to working out the logic that supports what one is seeing.

And at that point, one has initiated the process that will eventually (probably not in too long a time!) recalibrate one's reciprocal and integrated dual-process reasoning apparatus so as to purge it of the heuristic bias that concealed the " 'hot hand fallacy' fallacy" from view for so long!

BTW, this is an account that draws on the brilliant exposition of the "integrated and reciprocal" dual process reasoning offered by Howard Margolis.

For Margolis, *reason giving* is not what it appears: a recitation of the logical operations that make an inference valid.

Rather it is a process of engaging another reasoner's *affective perception*, so that he or she *sees* why a result is correct, at which point the "reason why" can be conjured through conscious processing. (The "Legal Realist" scholar Karl Llewellyn gave the same account of legal arguments, btw.)

To me, the way in which the " 'hot hand fallacy' fallacy" fits Margolis's account -- and also Ellen Peters's account of the sorts of heuristic biases that only those high in Numeracy are likely to be vulnerable to -- is what makes the M&S paper so darn compelling!

But now...

If you, like me and 10^6s of others, are *still* having trouble believing that **the analytic strategy of the original "hot hand" studies was wrong**, *here* are some gadgets that I hope will enable you, if you play with them, to *see* that **M&S are in fact right**. Because once you *see* that, you'll have vanquished the intuition that bars the path to your conscious, logical apprehension of why they are right. At which point, the rewiring of *your* brain to assimilate M&S's insight, and to avoid the "'hot hand fallacy' fallacy," can begin!

Indeed, in my last post, I offered an argument that was in the nature of helping you to imagine or see why the " 'hot hand fallacy' fallacy" is wrong.

But here--available *exclusively* to the 14 billion regular subscribers to this blog (don't share it w/ nonsubscribers; make them bear the cost of not being as smart as you are about how to use your spare time!)-- are a couple of cool *gadgets* that can help you *see* the point if you haven't already.

Gadget 1 is the "Miller-Sanjurjo Machine" (MSM). MSM is an Excel sheet that randomly generates a sequence of 100 coin tosses. It also keeps track of how each successive toss changes the probability that the next toss in the sequence will be a "heads." By examining how that probability goes up & down in relation to strings of "heads" and "tails," one can see *why* it is **wrong** to simply expect P(H|any specified string of Hs) - P(H|any specified string of Ts) to be zero.

MSM also keeps track of how many times "heads" occurs after three previous "heads" and how many times "heads" occurs after three previous "tails." If you keep doing tosses, you'll see that *most* of the time P(H|HHH) - P(H|TTT) < 0.
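The bookkeeping MSM does can be sketched in a few lines of Python (my own stand-in for the spreadsheet; `freq_after` is a hypothetical helper name, not anything in the Excel file), here applied to a small hand-made sequence so the counts are easy to check:

```python
def freq_after(seq, pattern):
    """Frequency of 'H' among tosses immediately following `pattern`
    (overlapping occurrences of the pattern are counted)."""
    nxt = [seq[i + len(pattern)] for i in range(len(seq) - len(pattern))
           if seq[i:i + len(pattern)] == pattern]
    return nxt.count("H") / len(nxt) if nxt else None

seq = "HHHHTTTTHHHT"  # a made-up 12-toss sequence, for illustration only
print(freq_after(seq, "HHH"))  # tosses after HHH: H, T, T -> 1/3
print(freq_after(seq, "TTT"))  # tosses after TTT: T, H -> 1/2
```

Swap in a randomly generated 100-toss string and you have the single-sequence tally the spreadsheet displays.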

Or you'll likely *think* you see that.

Because you have appropriately trained yourself to *feel *something isn't quite right about that way of proceeding, you'll very sensibly wonder if what you are seeing is real or just a reflection of the tendency of you as a human (assuming you are; apologies to our robot, animal, and space alien readers) to see pattern signals in noise.

Hence, Gadget 2: the "Miller-Sanjurjo Turing Machine" (MSTM)!

MSTM is not really a "Turing machine" (& I'm conflating "Turing machine" with "Turing test") -- but who cares? It's a cool name for what is actually just a simple *statistical simulation* that does 1,000 times what its baby sister MSM does only once -- that is, flip 100 coins and tabulate P(H|HHH) & P(H|TTT).

MSTM then reports the *average* difference between the two. That way you can see that it is in fact true that P(H|HHH) - P(H|TTT) should be expected to be < 0.

Indeed, you can see exactly how much *less* than 0 we should expect P(H|HHH) - P(H|TTT) to be: about 8%. That amount is the *bias* that was built into the original "hot hand" studies *against* finding a "hot hand."

(Actually, as M&S explain, the size of the bias could be more or less than that depending on the length of the sequences of shots one includes in the sample and the number of previous "hits" one treats as the threshold for a potential "hot streak".)

MSTM is written to operate in Stata. But if you don't have Stata, you can look at the code (opening the file as a .txt document) & likely get how it works & come up with an equivalent program to run on some other application.
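If you'd rather not port the Stata code by hand, here is one possible equivalent in Python -- a sketch of the same idea, not the MSTM file itself (the helper name `freq_after` and the seed are my own choices):

```python
import random

def freq_after(seq, pattern):
    """Frequency of 'H' among tosses immediately following `pattern`."""
    nxt = [seq[i + len(pattern)] for i in range(len(seq) - len(pattern))
           if seq[i:i + len(pattern)] == pattern]
    return nxt.count("H") / len(nxt) if nxt else None

random.seed(1)  # any seed will do; fixed here for reproducibility
diffs = []
for _ in range(1000):  # 1,000 trials of 100 fair coin flips
    seq = "".join(random.choice("HT") for _ in range(100))
    p_hhh = freq_after(seq, "HHH")
    p_ttt = freq_after(seq, "TTT")
    if p_hhh is not None and p_ttt is not None:  # skip rare trials with no streak
        diffs.append(p_hhh - p_ttt)

avg = sum(diffs) / len(diffs)
print(avg)  # negative -- on the order of -0.08
```

The per-trial differences bounce around a lot, but their average settles well below zero, which is the bias the post describes.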

Have fun seeing, ratiocinating, and rewiring [all in that order!] your *affective perception* of valid inferences!

## Reader Comments (13)

It's nothing to do with past samples versus future ones, or preselection, or any of those explanations. It's simply because you're averaging fractions of sets with different sample sizes. It's a bit like Simpson's paradox, in that way.

Take their 4-toss example.

The possible outcomes (with tosses following head runs capitalised) are:

hHHH 3/3 = 1

hHHT 2/3 = 0.667

hHTh 1/2 = 0.5

hHTt 1/2 = 0.5

hThH 1/2 = 0.5

hThT 0/2 = 0

hTth 0/1 = 0

hTtt 0/1 = 0

thHH 2/2 = 1

thHT 1/2 = 0.5

thTh 0/1 = 0

thTt 0/1 = 0

tthH 1/1 = 1

tthT 0/1 = 0

ttth 0/0 = Indeterminate

tttt 0/0 = Indeterminate

(Those last two ought to give you a big clue that just adding these numbers up and dividing is not a logically coherent thing to do.)

We see there are 3+2+1+1+1+0+0+0+2+1+0+0+1+0+0+0 = 12 heads following another head, out of

3+3+2+2+2+2+1+1+2+2+1+1+1+1+0+0 = 24 tosses following a head.

That is to say, the probability of a head following a run of heads is exactly 1/2.

However, if you add up the 14 finite fractions and divide by 14, you don't get 1/2. And if you count the fractions over half (4 of them) and those under half (6 of them), we see that when grouping the tosses into sequences of 4 we get more cases with more tails than heads following a run of heads. 3/3 is given the same weight as 1/1.

It's like holding an election in which one district votes 5/5 Republican, while 5 districts vote 0/1, 0/1, 0/1, 0/1, 0/1 Republican. Republicans lose in 5 districts and win in only one. The probability of a voter voting Republican is either 5/10 if you count individual voters, or 1/6 if you count them by districts.
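The district arithmetic above, as a quick check (Python; the list layout is my own):

```python
# (Republican votes, total voters) per district, per the example above:
# one 5-voter district votes 5/5 Republican; five 1-voter districts vote 0/1.
districts = [(5, 5), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

pooled = sum(r for r, n in districts) / sum(n for _, n in districts)
per_district = sum(r / n for r, n in districts) / len(districts)

print(pooled)        # 0.5 -- counting individual voters: 5/10
print(per_district)  # 1/6 -- averaging the district averages
```

Same data, two different "probabilities," depending on whether you pool the voters or average the district averages.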

The trick works by using wording that makes it look at first glance like it's talking about the 'hot hand' fallacy, when it's actually about an entirely different (and rather artificial-looking) problem.

Cute. But no cigar. :-)

@NiV:

So you'd be willing to lay 27:25 odds against heads on the next flip whenever someone has just flipped 3 heads in a row? I'm sure there are lots of people who will book flights out to see you if you are willing to do this for as long as they are willing to keep tossing. If you are willing to take 27:25 odds against heads being the next outcome after three consecutive heads when we examine past sequences of 100 coin tosses, they'll fly out to wager w/ you on that all day long too (or will if they understand the "'hot hand fallacy' fallacy").

However one explains it, though, the M&S result is not a "trick": it's a proof that the "hot hand fallacy" studies were wrong to treat P(Success|previous string of successes) - P(Success|previous string of failures) = 0 as the null when they analyzed records of NBA players' historical shooting performances to test for the existence of "hot hands." Just as the simulation averages P(Success|previous string of successes) - P(Success|previous string of failures) across 100-toss sequences, so the researchers who conducted those studies averaged P(Success|previous string of successes) - P(Success|previous string of failures) across the past shooting performances (i.e., sequences of shots) of individual NBA players.

If you think this is *obviously* "logically incoherent," you should have spoken up before now so that *you* would be getting the credit M&S have earned for showing that studies treated as classics in decision science for 30 yrs do indeed make exactly this logical mistake!

I'm beginning to think it might be easier to plot a distribution of streaks (TTT, TT, T, H, HH, HHH, etc) and compare the real-world distribution to a randomly generated one...

Take a basketball player's season shooting percentage and Monte Carlo a random distribution, and then plot the player's actual distribution.

@Scott:

That's basically 1/2 of what MSTM does: it simulates a coin toss by generating a random number w/ uniform distribution between 0 & 1, treating the outcome as "heads" if greater than 0.5 & "tails" if less; it then does this 100 times for one trial & computes P(H|HHH) - P(H|TTT); then it repeats that process 1,000 times & calculates the mean & SEM of P(H|HHH) - P(H|TTT) for all 1,000 trials of 100 tosses.

We could do the same for LeBron James: if his success rate is 0.4 or 0.8 or whatever, just adjust the values for designating a shot as "hit" or "miss" accordingly. We'd then know what we'd expect P(H|HHH) - P(H|TTT) to be for his past season by chance.

That's an MC simulation, unless you think the label "MC simulation" should be reserved for simulations that stochastically set each model parameter to a value randomly selected from within the entire probability density distribution associated with its standard error -- something that would be a weird waste of time for simulating a binomial outcome.

But the other 1/2 of what you propose is to *compare* the past season performance w/ the Monte Carlo simulation.

The question would be, what are we looking for?

Presumably P(H|HHH) - P(H|TTT) values that are significantly *higher* than the predicted mean value of P(H|HHH) - P(H|TTT) in the Monte Carlo simulation. If you see that, you'll know that the player enjoyed strings of successes the duration of which exceeded what one would expect to see by chance.

That's what M&S do, essentially.

We know from the Monte Carlo simulation that we should expect P(H|HHH) - P(H|TTT) to be significantly *less than zero* when we are examining a player's past performance (the equivalent of the sequence of 100 past coin tosses). Gilovich, Vallone & Tversky reported values for past performances that had means for P(H|HHH) - P(H|TTT) that were *not significantly different from zero*. Accordingly, they necessarily had P(H|HHH)s -- probabilities of success following immediate strings of successes -- that were significantly *greater* than what one would have expected if the players' performances were generated by a random process.

So M&S conclude that GVT's own data support the inference of "hot hands."

Sound good to you?

Oh, btw, we could do the simulation varying P(H), the number of outcomes in a given sequence, and the number of immediately preceding successes to model a player of any particular proficiency & hypothesized "streakiness." M&S present a set of simulations that vary in those ways. The key point is their conclusion (which they derive mathematically) that in any "finite sequence generated by repeated trials of a Bernoulli random variable the expected conditional relative frequency of successes, on those realizations that immediately follow a streak of successes, is strictly less than the fixed probability of success" (p. 22).
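A parameterized version of the simulation (again a Python sketch of my own, not M&S's code; the function names are hypothetical) makes it easy to vary the success probability, sequence length, and streak length and watch the expected post-streak frequency come in below p:

```python
import random

def freq_after_streak(seq, k):
    """Frequency of success ('H') on trials immediately following
    k consecutive successes within one sequence."""
    nxt = [seq[i + k] for i in range(len(seq) - k) if seq[i:i + k] == "H" * k]
    return nxt.count("H") / len(nxt) if nxt else None

def mean_post_streak_freq(p=0.5, n=100, k=3, trials=2000, seed=0):
    """Average, across simulated sequences, of the conditional relative
    frequency of success immediately after a streak of k successes."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        seq = "".join("H" if rng.random() < p else "T" for _ in range(n))
        f = freq_after_streak(seq, k)
        if f is not None:  # skip sequences containing no streak of length k
            vals.append(f)
    return sum(vals) / len(vals)

print(mean_post_streak_freq(p=0.5, n=100, k=3))  # below 0.5
print(mean_post_streak_freq(p=0.4, n=100, k=3))  # below 0.4
```

Whatever p, n, and k you pick, the average lands below p, in line with the M&S conclusion quoted above.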

"If you are willing to take 27:25 odds against heads being the next outcome after three consecutive heads when we examine past sequences of 100 coin tosses, they'll fly out to wager w/ you on that all day long"

They'd lose. The probability of the next outcome being heads after three consecutive heads is 0.5, even when examining past sequences of 100 coin tosses.

The problem is that in taking blocks of 100 coin tosses, you're getting a variable number of runs in each of them, so your sample sizes are different. To work out the probability of heads, you have to count *all* the instances of heads over all sub-samples and divide by the count of *all* the runs. You can't take them in uneven-sized groups, work out an average for each, and then average the averages. It's wrong.

I've implemented something a bit like your example in R, to show how to work out the probability correctly.

```r
# This function tosses a coin 100 times, and counts
# the number of instances of HHHH and HHHT in the sequence.
run = function() {
  toss100 = paste(ifelse(rbinom(100, 1, 0.5), "H", "T"), collapse = "")
  subs4 = sapply(1:97, function(i) substring(toss100, i, i + 3))  # all 97 length-4 windows
  freqtable = table(subs4)
  return(c(freqtable["HHHH"], freqtable["HHHT"]))
}

# This function does the above experiment 1000 times and amalgamates the results
krun = function() {
  runs = data.frame(t(replicate(1000, run())))
  runs$n = runs$HHHH + runs$HHHT  # Calculate sample size
  runs$p = runs$HHHH / runs$n     # Prob for individual run - with different sample sizes
  runH = sum(runs[, 1], na.rm = TRUE)  # Total instances of HHHH over all runs
  runT = sum(runs[, 2], na.rm = TRUE)  # Total instances of HHHT over all runs
  return(c(TotRunH = runH, TotRunT = runT,
           PHeads = runH / (runH + runT), MeanP = mean(runs$p, na.rm = TRUE)))
}

# Do the 1000-runs experiment
krun()  # You can repeat this line as often as you like
```

To get an idea of what it's doing, this shows the steps:

```r
# For full marks, show your working...
(toss100 = paste(ifelse(rbinom(100, 1, 0.5), "H", "T"), collapse = ""))
(subs4 = sapply(1:97, function(i) substring(toss100, i, i + 3)))  # Substrings of length 4
(freqtable = table(subs4))  # Count how many of each
# Each entry of the above table has the same expected count (about 100/16)
c(freqtable["HHHH"], freqtable["HHHT"])  # Pick out just the two numbers we want

runs = data.frame(t(replicate(1000, run())))
runs$n = runs$HHHH + runs$HHHT  # Calculate sample size
runs$p = runs$HHHH / runs$n     # Prob for individual run - with different sample sizes
head(runs, 20)  # Show first 20 lines out of the thousand
# The first two columns above show the number of HHHH and HHHT in each run.
# The third column shows the sample size; the fourth column shows the estimate of p
# for that single hundred-toss sample.

# Find totals for the first two columns
runH = sum(runs[, 1], na.rm = TRUE)  # Total instances of HHHH over all runs
runT = sum(runs[, 2], na.rm = TRUE)  # Total instances of HHHT over all runs

# Find the probability counting all instances equally (PHeads), and the average of
# the run averages (MeanP). Because the sample sizes differ, MeanP gives the wrong answer:
# (a + b + c + ...) / (d + e + f + ...) is *NOT* the same as the average of a/d, b/e, c/f, ...
c(TotRunH = runH, TotRunT = runT, PHeads = runH / (runH + runT), MeanP = mean(runs$p, na.rm = TRUE))
```

The value PHeads above shows that the probability of H following an instance of HHH is still 0.5, as expected. The 'hot hand' fallacy is still a fallacy.

I don't know whether any of the previous 'classic' studies made the same error of averaging averages, since I've not read them, but I'd advise treating this claim with caution. And be *very* careful about precise wording.

Minor correction - I just realised I'm not treating zero counts correctly.

You would have to insert the lines

```r
runs$HHHH[which(is.na(runs$HHHH))] = 0
runs$HHHT[which(is.na(runs$HHHT))] = 0
```

immediately after the lines starting "runs =..."

This makes no difference to the correct calculation, since NAs are zeroed there anyway, but it fills in a few more entries in the final 'p' column when HHHH is zero but HHHT is not.

It makes no material difference to the result, but I like to get things right. :-)

@NiV:

Well, humor me then.

A. Do you agree that whether the “hot hand fallacy” *is* a fallacy is an empirical issue? That is, it’s an empirical issue whether some or all players do in fact perform above their average levels of proficiency for periods the duration of which exceed what we would expect to see by chance?

If so, then nothing either of us (or anyone else) does with probability theory or simulations or logic etc will actually settle whether people’s perceptions of “hot hands” reflect their tendency to perceive patterns in noise; it won’t be possible to “prove” the “hot hand fallacy” or “disprove” it by those means.

Right?

I find it pretty unlikely you’ll disagree with this, but I don’t want to assume anything, so pls tell me.

B. If one wants to test the “hot hand” conjecture (let’s call it) empirically, then one has to figure out what sort of observations would support the inference that it is true or false.

In other words, we need to figure out what sort of continuous period of above-average performance we should expect to see by chance in any given player. That will be our null hypothesis. We can then collect actual data & determine, using the appropriate statistical test, whether we in fact observe any players whose continuous periods of above-(their)-average performance allow us to “reject the null” at whatever specified level of “significance” we choose.

What the null hypothesis actually is is the *only* issue here. The *only* point of the math in the M&S paper, & of my attempts to identify conceptually what their math is showing, and of your & my simulations etc., is to figure out what sorts of continuous periods of above-average performance are consistent with chance, so that we can set an appropriate value for the null hypothesis when we examine data.

I think you are likely in agreement with me here too, but am a bit less sure than I was on (A).

The reason I’m less sure is your suggestion in your first comment that M&S are performing a “trick.”

Likewise, the agitation reflected in your latest comment that averaging the averages of uneven-sized groups is simply “wrong.”

Taking a finite sequence of coin tosses is one way to explore the issue of what the “null” is here.

A coin toss is a model of the performance of an athlete whose average level of proficiency is 50%. In that model, a sequence of 100 coin tosses is a sample of observations akin, say, to a season’s worth of performances by a particular athlete.

If we can figure out what P(H|HHH) is for 100 coin tosses, then we can figure out what the “null” is for an athlete under those conditions. If we observe a player's P(success|string of 3 successes) is significantly greater than that, then we can reject the null—and treat the observations as supporting the inference that the player in fact displayed a “hot hand” or a level of proficiency that exceeded his or her average level of performance for a particular period of time.

Again, I suspect you probably agree with me here. But if you don’t, then you actually aren’t addressing the issue that M&S are addressing & that I am commenting on. Because all we are interested in is figuring out when we can conclude that we have observed a string of successes that exceeds the number of consecutive successes we'd expect to see by chance within a particular sample.

C. The “hot hand” studies assumed that P(H|HHH) = P(H|TTT). Accordingly, P(H|HHH) - P(H|TTT) = 0 was *their* null.

The only issue is whether they were correct about that—and if not, how that affects the inferences to be drawn from their data.

M&S say they weren’t right.

Their math, the logic of which I’m trying to formulate in terms that are faithful to it and that enable people to “get” M&S’s proof conceptually, shows that

P(H|HHH) - P(H|TTT) < 0

in a model that consists of flipping coins in 100-toss sequences.

Are M&S right?

I think you actually both agree and disagree with them—but not in a way akin to the Kentucky Farmer, who both believes & disbelieves in climate change. He isn’t really contradicting himself, but I think you are.

Tell me which of these statements, if any, you disagree with:

1. If we examine a sequence of 100 coin tosses, for any of the tosses we expect P(H) = 0.50.

2. If we examine the tosses immediately after 3 consecutive heads & 3 consecutive tails within the sequence, we *won’t find* P(H|HHH) + P(H|TTT) > 1.0.

3. If we confine our attention to only those tosses in a 100-flip sequence that occur immediately after three consecutive Hs and three consecutive Ts, we’ll find P(H|HHH) - P(H|TTT) < 0. That is, P(H|TTT) > P(H|HHH).

If you agree with all three statements, you are contradicting yourself when you say that the probability of a head following three consecutive heads in such a sequence is still 0.50: if [1] P(H|HHH) = 0.50, & [3] P(H|TTT) > P(H|HHH), then P(H|HHH) + P(H|TTT) > 1.0, which would contradict [2].

So I’m guessing you still don’t fully believe [3]—that P(H|HHH)-P(H|TTT) < 0.

Yet you actually do *explain why* P(H|HHH) < 0.50 in completed sequences of coin tosses! Multiple times even.

You do it in your first comment when you explain why we expect P(H|H) < P(H|T) within completed 4-toss sequences.

Your second comment shows that you have in fact discovered the same thing happens when one examines “blocks of 100 coin tosses.” That’s what prompts you to complain that one “can't take them in uneven-sized groups, work out an average for each, and then average the averages.”

It’s not “wrong” (?!) to do this – it’s just a sampling strategy that guarantees P(H|TTT) > P(H|HHH), and P(H|HHH) < 0.50, in a 100-flip sequence of coin tosses.

What you say is “wrong” & what you are “correcting” by constructing a program that *doesn’t* simulate the mean value of P(H|HHH) - P(H|TTT) in sequences of 100-flip coin tosses *is the analytical strategy of the “hot hand” researchers!*

M&S in fact articulate the conceptual basis of their proof in terms similar to yours. They point out that what the “hot hand” researchers’ sampling strategy “ignores” is that “conditioning on a streak of hits within a sequence of finite length . . . creates a selection bias towards observing shots that are misses”—because in fact one is then calculating the probability of success in a sample that has a smaller proportion of “hits” left than one that conditions on a streak of failures.

You are, as far as I can tell, reproducing M&S’s argument & then accusing *them* of being idiots—proclaiming that they are engaged in a “trick”; that they are “using wording that makes it look at first glance like it's talking about the 'hot hand' fallacy, when it's actually about an entirely different (and rather artificial-looking) problem”; that they are doing something plainly “wrong” by simply calculating the mean P(H|HHH) & P(H|TTT) in a large sample of 100-flip sequences, etc.—for having made it!

That makes me suspect you haven’t even taken the time to figure out what they are doing: constructing a proof intended to show that methods used in iconic decision science studies are statistically “biased.”

Here’s a last point from their paper:

So . . . Send me your address, & I’ll send you a first-class plane ticket to Las Vegas & put you up at the Venetian on the condition that you agree to bet $100 a pop on “heads” being the next outcome after any three consecutive heads—with me laying you juicy odds of 27:25—when we examine as many past sequences of 100-coin tosses as I care to generate.

Because of the “sampling bias” that you yourself note, your EV in that situation will be -$4.32 per bet (the probability of the next flip being “heads” in that situation being only 0.46).
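The expected-value arithmetic, spelled out (a trivial Python check; the variable names are mine):

```python
p_heads = 0.46  # probability of H after HHH under this sampling scheme, per the post
stake = 100     # the $100 bet on heads
payout = 108    # 27:25 odds on a $100 stake pays $108 if heads comes up

ev = p_heads * payout - (1 - p_heads) * stake
print(round(ev, 2))  # -4.32 per bet
```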

If you get tired of the game, we can take short breaks to play poker, on condition that you play as if the odds against hitting your four-card flush on the river were only 3:1.

"Well, humor me then."

Sure. Don't I always? :-)

"Do you agree that whether the “hot hand fallacy” *is* a fallacy is an empirical issue?"

Yes. There are two separate aspects to it: whether successive shots made by basketball players are statistically independent, and why people expect what they expect when it comes to runs.

The first is an empirical question of contingent fact. There's no logical or physical reason why successive shots necessarily *have* to be independent, and in fact my initial assumption would probably be that they weren't. Players have good days and bad days, confidence and stress likely have an effect on concentration and accuracy, and so on. If you suspect such reasons, then without examining the statistics you couldn't tell otherwise, and the belief is not unreasonable. It turns out as a matter of empirical observation that the shots actually *are* nearly independent, but they might not have been. The belief may fairly be called a myth, but it's not actually a fallacy.

The second issue revolves around the theory some people have that runs affect subsequent probabilities for purely *statistical* reasons - generally some misunderstood version of the law of large numbers. People often think "the law of averages" means that if the outcomes swing one way for a while, then they're more likely to subsequently swing the other way to move it back to the long-term average. They figure that the only way a random process can always approach an average is if it's steered there, so the probability on each toss is variable, and depends on past behaviour. The gambler's fallacy figures that if it's drifted away, then the odds will have shifted to steer it back. The hot hand fallacy is that the odds must have shifted to steer it away in the first place, and it may be some time yet before the law of averages acts to steer it back.

This belief that the law of averages comes about by means of the odds changing to steer the outcome towards the long-term average is the fallacy - a belief or argument that psychologically appears sound but is in fact incorrect.
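To make the "no shifting odds" point concrete in code (my sketch, not from the thread): if you pool *every* occurrence of three consecutive heads across one long i.i.d. stream, rather than averaging per-sequence proportions, the frequency of heads on the next flip sits at 0.5. The coin has no memory to steer it anywhere.

```python
import random

# Pool every flip that immediately follows an HHH run in one long fair-coin
# stream. Pooled counting (no per-sequence averaging) is an unbiased estimate
# of the conditional frequency, and it comes out near 0.5.
rng = random.Random(1)
flips = [rng.random() < 0.5 for _ in range(1_000_000)]

after_hhh = [flips[i + 3] for i in range(len(flips) - 3)
             if flips[i] and flips[i + 1] and flips[i + 2]]
pooled = sum(after_hhh) / len(after_hhh)
print(round(pooled, 3))  # close to 0.5: the odds never "shift"
```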

"In other words, we need to figure what sort of continous period of above-avergage performance we should expect to see by chance in any given player. That will be our null hypothesis.""By chance" is imprecise. Even a biased random process is still random. The question is whether the outcomes are consistent with statistically independent trials with constant probability of success on each trial. There are plenty of statistical tests for checking that.

"The *only* point of the math in the M&S paper, & of my attempts to identify conceptually what their math is showing, and of your & my simulations etc., is to figure out what sorts of continuous periods of above-average performance are consistent with chance, so that we can set an appropriate value for the null hypothesis when we examine data."Yes.

"Taking a finite sequence of coin tosses is one way to explore the issue of what the “null” is here."I've got no problem with that.

"A coin toss is a model of the performance of an athlete whose average level of proficiency is 50%. In that model, a sequence of 100 coin tosses is a sample of observations akin, say, to a season’s worth of performances by a particular athlete."OK.

"If we can figure out what P(H|HHH) is for 100 coin tosses, then we can figure out what the “null” is for an athlete under those conditions. If we observe a player's P(success|string of 3 successes) is significantly greater than that, then we can reject the null"Agreed.

"The “hot hand” studies assumed that P(H|HHH) = P(H|TTT). Accordingly, P(H|HHH)-P(H|TTT) = 0 was *their* null."What they ought to have done is observe that if the 'statistically independent, constant probability' was true, then P(H|HHH) would equal P(H|TTT). So yes, that's their null hypothesis.

"Are M&S right?"No. Or at least, not about that.

They *are* correct on some of their 'alternative formulations' of the problem, like betting on a ticket that pays out if the number of run-following heads in a sample exceeds the corresponding number of tails, but those formulations aren't actually equivalent to the original question. On the main question, no, I don't think they're correct.

"I think you actualy both agree and disagree with them—but not in a way akin to the Kentucky Farmer’s belief & disbelief in climate change. He isn’t really contradicting himself, but I think you are."Hurrah! Very pleased to see you say that! (That the Kentucky Farmer isn't contradicting himself, that is.)

"Tell me which of these statements, if any, you disagree with:"Number 3.

P(H) = P(H|HHH) = P(H|TTT) = 0.5

"Yet you actually do *explain why* P(H|TTT) < 0.50 in completed sequences of coin tosses!"Yes. I also explained why in the example election the probability of a voter voting Republican was simultaneously 5/10 and 1/6! If you calculate the probability by averaging averages [(5/5 + 0/1 + 0/1 + 0/1 + 0/1 + 0/1)/6] then you get one answer. If you total the *individual* successes and *individual* voters over all samples first, before dividing [(5+0+0+0+0+0) / (5+1+1+1+1+1)] you get the other answer.

It's a numerical fact that (5/5 + 0/1 + 0/1 + 0/1 + 0/1 + 0/1)/6 = 1/6. The probability of a *district* voting Republican is unequivocally 1/6. I'm not disagreeing with the arithmetic. I'm disagreeing with the claim that this is how you ought to calculate the probability of a *voter* voting Republican - or of P(H|HHH) when tossing coins. The 'correct' probability (for the given question) is unequivocally 5/10.

The problem is that you're combining samples of different sizes while giving them equal weight, when the value being observed is correlated with the sample size (and drawn from a highly skewed distribution). If you average them weighting them correctly, in proportion to the sample size, you get the right probability.
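The two averaging strategies from the election example can be shown side by side. A minimal sketch (my numbers follow the example above: one district votes 5-0 Republican, five districts vote 1-0 Democrat):

```python
# Each district is (Republican votes, total voters).
districts = [(5, 5), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

# Unweighted mean of district-level proportions: P(a *district* goes Republican).
mean_of_means = sum(r / n for r, n in districts) / len(districts)

# Pooled (size-weighted) proportion: P(a *voter* votes Republican).
pooled = sum(r for r, n in districts) / sum(n for r, n in districts)

print(mean_of_means, pooled)  # 1/6 vs 1/2
```

Same data, two numbers: which one is "the" probability depends entirely on whether the question is about districts or about voters.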

"It’s not “wrong” (?!) to do this – it’s just a sampling strategy that guarantees P(H|TTT) > P(H|HHH) < 0.50 in a 100-flip sequence of coin tosses."

In the same sense that it's not "wrong" to assess the probability of a voter voting Republican by counting districts. It's just a sampling strategy that guarantees P(Dem) > P(Rep) < 0.5! :-)

"You are, as far as I can tell, reproducing M&S’s argument & then accusing *them* of being idiots -proclaiming that they are engaged in a “trick”"Climate scientists tell us that "trick" just means "a clever way to do things". No offence was intended. ;-)

"You are, as far as I can tell, reproducing M&S’s argument & then accusing *them* of [...] doing something plainly “wrong” by simply calcaulting the mean P(H|HHH) & P(H|TTT) in a large sample of 100-flip sequences etc."That's right. Like I'm claiming the vote-counters are doing something "plainly wrong" by simply calculating the mean district election outcome in a large sample of districts.

"That makes me suspect you haven’t even taken the time to figure out what they are doing..."I think you might fairly suspect me of having *misunderstood* what they're doing, but given the amount of code I've just written I don't think it's fair to suspect me of not having taken the time.

Anyway, I'm only doing this for fun, because I found it a genuinely interesting problem to figure out what was going on and where either they or I had gone wrong. Normally, I demand payment before I write code!

"Here’s a last point from their paper:"

As noted earlier, they're correct on that one. If you want the payout to depend on the 'hot hand' probability, you would have to pay $1 for each HH and subtract $1 for each HT. So HHHH pays $3 because there are three runs to check, while TTHH only pays $1. If you just look at which wins in each block of 4, it's like asking who wins in each district election. Democrats win 5/6 elections, but if the one election the Republicans win pays me $5 while each of the others costs me $1, I come out even.
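That +$1/HH, -$1/HT payout rule is easy to mechanize. A quick sketch (the function name is mine, purely for illustration):

```python
# Net dollars for one sequence under the rule described above:
# +$1 for every HH pair, -$1 for every HT pair (pairs overlap).
def streak_payout(seq: str) -> int:
    pairs = zip(seq, seq[1:])
    return sum(1 if b == "H" else -1 for a, b in pairs if a == "H")

print(streak_payout("HHHH"), streak_payout("TTHH"))  # 3 1
```

This makes the asymmetry visible: HHHH contains three head-followed-by-something pairs to score, TTHH only one, so "who wins the block" and "what the block pays" are different questions.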

But anyway, thanks for the interesting problem. It's been entertaining.

@NiV--

So are we going to meet in Vegas?

Read GVT -- & see if you think *they* are doing what you say shouldn't be done (I think indeed they are, or in any case are doing what shouldn't be done when they treat P(H|HHH)-P(H|TTT) = 0 as the null given the observations they are making).

Then consider what for me is the most interesting problem associated with this problem: how could this have happened? How could no one have noticed for 30 yrs, during which time the studies achieved a canonical status within the decision science corpus?

@Dan

I guess I was thinking of avoiding the conditional P(x|y), simply as a way to do it *differently*. Just plot a histogram/distribution of streaks, and don't worry about what the next shot in each streak sequence was.

I'd be interested in seeing the differences between players (each compared to their randomized distribution). Are the real distributions pretty much identical to the random ones? Are they wider and flatter, suggesting both an excess of "hot hands" and "cold hands"? Are they asymmetric, with an excess of "hot hands" but NOT "cold hands"? Do the best players look like one of those options, and average players look like another?

Given the huge leaps in analytics technology in basketball over the last few years (i.e., http://grantland.com/features/the-toronto-raptors-sportvu-cameras-nba-analytical-revolution/), it shouldn't even be difficult for someone to do for a large number of players. (And a guy like LeBron gets up well over 1,000 shots in a season, so you've got decent samples.) Calling FiveThirtyEight!
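Scott's "skip the conditional, just histogram the streaks" idea is straightforward to set up. A minimal sketch (my function name; a real analysis would compare a player's actual record against shuffled versions of that same record):

```python
from collections import Counter
from itertools import groupby

# Tabulate how many streaks of each length appear in a shot record,
# where 'H' = hit and 'T' = miss. Comparing this distribution against
# the distributions from many shuffles of the same record sidesteps
# P(hit | streak) entirely.
def hit_streaks(shots: str) -> Counter:
    return Counter(len(list(g)) for k, g in groupby(shots) if k == "H")

print(hit_streaks("HHTHHHTH"))  # Counter({2: 1, 3: 1, 1: 1})
```

Shuffling preserves the player's overall hit rate while destroying any sequential structure, so it gives exactly the per-player null distribution Scott describes.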

@Scott:

You should be able to simulate distributions that have the properties you are interested in & plot them in a form that lets you see how they differ from each other.

I'm not sure how to do it; maybe @NiV would know.

But what you want to do is model a binary process in which M, the mean number of n or more consecutive successes given a particular probability of success & specified number of trials, exceeds the value of M that would be generated by a random binary process with the same probability of success & number of trials. Necessarily in the former process, successes & failures can't be independent...

Presumably this is all very basic stuff for statisticians!
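One standard way to get such a process is a two-state Markov chain whose hit probability rises after a hit and falls after a miss: the marginal hit rate stays 0.5, but long streaks become more common than in the i.i.d. case. This is my sketch of that comparison, not anything from the thread:

```python
import random

# Count runs of >= n consecutive successes in a boolean sequence,
# counting each run once (at the moment it reaches length n).
def count_long_runs(seq, n=3):
    runs, length = 0, 0
    for hit in seq:
        length = length + 1 if hit else 0
        if length == n:
            runs += 1
    return runs

# Markov model: P(hit) depends on the previous outcome.
# p_after_hit = p_after_miss = 0.5 recovers the i.i.d. fair coin.
def simulate(p_after_hit, p_after_miss, trials=100, rng=None):
    rng = rng or random.Random()
    seq, prev = [], rng.random() < 0.5
    for _ in range(trials):
        prev = rng.random() < (p_after_hit if prev else p_after_miss)
        seq.append(prev)
    return seq

rng = random.Random(7)
iid_runs = [count_long_runs(simulate(0.5, 0.5, rng=rng)) for _ in range(2000)]
hot_runs = [count_long_runs(simulate(0.7, 0.3, rng=rng)) for _ in range(2000)]
print(sum(iid_runs) / 2000, sum(hot_runs) / 2000)  # "streaky" mean is larger
```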

"Read GVT -- & see if you think *they* are doing what you say shouldn't be done"Hmm. It's hard to say. They apply several different tests, most of which should be immune to the effect. But at one point they do indeed compare percentages taken from samples with different sample sizes. However, all they say is that the differences were not significant without detailing how they calculated the test boundaries, so I can't tell if they took the differing sample size into account.

On the other hand, they do speak several times of a "binomial" model and later on in the paper they appear to do a similar calculation applying a Z test, which strongly suggests they're using the observed sample size to determine the binomial (approximated as Normal) distribution to test. It's not obvious to me that when taken from the same sequence the distributions are independent, as the Z test requires, so I'd have to say I'm unsure.

It's unfortunate that the exigencies of writing papers for space-constrained journals tends to lead authors into omitting the routine details of such calculations. It's this sort of issue that makes me argue for publishing code and raw data alongside the papers. It would be much easier to figure out what was actually done, and so for readers to check the result (which is - at least theoretically - the primary purpose of journal publication, after all), if this became routine.

Of course, if the primary purpose of publishing is to get grants and tenure, then the current approach is much easier to explain!

"Then consider what for me is the most interesting problem associated with this problem: how could this have happened? How could not one have noticed for 30 yrs, during which time the studies achieved a canonical status with decision science corpus?"Heh! You'll get no argument from me there!

There is now a large pile of papers, reports, and databases that we know are full of far less subtle bugs and errors, but which passed top-level peer review, scrutiny by the scientific community for several years, followed by multiple further layers of review by international panels of experts without the problems with them being detected. It was only when outsiders became *motivated* to take a closer look - this is of course climate science we're talking about here - that the parlous state of affairs was discovered.

Given what I know of the review process, it doesn't actually surprise me all that much. It's been reported in the peer-reviewed literature that more than half of all published results are subsequently found to be incorrect. What I find more disturbing and inexplicable is when a substantial part of the scientific community doggedly continues to support results even *after* the flaws in the work have been exposed. They don't object to the flawed results continuing to be cited and used to support policy, and they don't object to other scientists not objecting! That's seriously weird!

(It would be cool if social scientists and historians of science were interested in researching how and why that happened, but they appear to be affected by the phenomenon too. Maybe they will in the more distant future, in hindsight...)

I don't actually have a problem with the idea of a statistics paper making an error that remained undetected for a long time - statistics is full of subtleties, and as I say, journal papers are more 'work in progress' than 'settled science'. I've got no particular attachment to this particular result. But I can't say I'm convinced yet. The counter-argument needs sharpening up, to make it clearer exactly where and how it is applicable to the methods used in the original papers.

I think it *is* fair to say that the original papers omit a lot of details justifying their methods, which ought to be checked. Right or wrong, I applaud the authors of the present paper for making the attempt.

@NiV:

For sure they were applying null tests that assume binomial distributions. But they didn't understand either (a) that events aren't genuinely independent when you sample w/o replacement from a fixed number of random binary outcomes (& so don't follow a binomial distribution) or, more likely, (b) that they were in fact sampling w/o replacement.

What is P(H) after three consecutive Hs, and P(H) following three consecutive Ts, if we examine a sequence of 100 coin flips? Another coin flip isn't the right model to use to answer that question, b/c P(H) in a past sequence depends on the proportion of Hs & Ts remaining.

The math is pretty hard, I think, for this problem.

But a Monte Carlo simulation suggests the mean for P(H|HHH) is 0.45 +/- .01 at the 0.95 confidence level.

The Monte Carlo simulation just *is* averaging, or finding the mean P(H|HHH), over a specified number of trials -- 1,000 is plenty to populate the entire probability density distribution associated with whatever the mean and standard error are for P(H|HHH) in a 100-trial coin toss.
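That Monte Carlo exercise is a few lines of code. A sketch of one reasonable version (mine; it computes each sequence's within-sequence proportion of heads following an HHH run, then averages those proportions across sequences, discarding the rare sequences with no HHH run):

```python
import random

# Within-sequence proportion of flips that follow three consecutive heads
# and are themselves heads; None if the sequence contains no HHH run.
def p_h_after_hhh(flips):
    follows = [flips[i + 3] for i in range(len(flips) - 3)
               if flips[i] and flips[i + 1] and flips[i + 2]]
    return sum(follows) / len(follows) if follows else None

rng = random.Random(42)
props = []
while len(props) < 1000:
    p = p_h_after_hhh([rng.random() < 0.5 for _ in range(100)])
    if p is not None:
        props.append(p)
print(round(sum(props) / len(props), 2))  # well below 0.50, roughly 0.46
```

The per-sequence averaging is the crucial step: it is exactly the "mean of means" that produces the selection bias, which is why the answer is not 0.50.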

That's all that M&S are doing when they "average" to find P(H|H) in a four-toss sequence. For a four-toss sequence, one can easily interrogate the entire sample space; b/c each 4-toss combination is as likely as every other, we know we'll get roughly 0.4 as the *mean* P(H|H) for four tosses if we simulate P(H|H) in a 4-toss sequence by performing that trial 1,000 times.

That won't be the answer to any question in which P(H) is independent of the previous toss. But P("hit") is not independent of the previous 3 shots if we are examining a sample consisting of a fixed number of shots -- any more than P(♣) is independent of how many ♣'s & non-♣'s have already been dealt from a deck of cards.
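For the four-toss case no simulation is needed; the sample space of 16 equally likely sequences can be enumerated exactly. A sketch:

```python
from itertools import product

# For each 4-flip sequence (1 = heads), compute the proportion of heads
# among the flips that immediately follow a head; average those proportions
# over the 14 sequences where at least one of the first three flips is a head.
props = []
for seq in product([0, 1], repeat=4):
    follows = [seq[i + 1] for i in range(3) if seq[i] == 1]
    if follows:                     # skip TTTT and TTTH
        props.append(sum(follows) / len(follows))

mean_p = sum(props) / len(props)
print(mean_p)  # 0.404761... = 17/42, the "roughly 0.4" figure
```

The exact value is 17/42 ≈ 0.405, not 0.5, which is the whole point: averaging within-sequence proportions builds in the selection bias.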

Yes, mistakes get made in papers all the time.

But *this* is not a randomly selected paper. It is a study that was conducted by decision scientists & celebrated by members of that very field for 30 yrs as a wonderful empirical proof of the biases that interfere w/ the capacity of people to process information about independent binary outcomes; and the proof of the error is one that is fiercely resisted by people whose intuitions about probability have been fine-tuned by training & experience ...

The sort of bias that is at work here is itself very much worthy of investigation by decision scientists.