
The earth is (still) round, even at P < 0.005

In a paper forthcoming in Nature Human Behaviour (I think it is still “in press”), a large & distinguished group of social scientists proposes nudging (shoving?) the traditional NHST threshold from p ≤ 0.05 to p ≤ 0.005. A response to the so-called “replication crisis,” this “simple step would immediately improve the reproducibility of scientific research in many fields,” the authors (all 72 of them!) write.

To disagree with a panel of experts this distinguished & this large is a daunting task.  Nevertheless, I do disagree.  Here’s why:

1. There is no reason to think a p-value of 0.005 would reduce the ratio of valid to invalid studies; it would just make all studies—good as well as bad—cost a hell of a lot more.

The only difference between a bad study at p ≤ 0.05 and a bad study at p ≤ 0.005 is sample size.  The same for a good study in which p ≤ 0.005 rather than p ≤ 0.05.
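To make that concrete, here is a minimal sketch (the one-sample z-test framing and the d = 0.1 effect size are illustrative assumptions, not anything from the paper): holding a fixed, unimpressive effect constant, the p-value clears whichever threshold you like once the sample is big enough.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(effect_d: float, n: int) -> float:
    """Two-sided p-value for a one-sample z-test of a standardized
    mean difference effect_d with n observations (sd assumed = 1)."""
    z = effect_d * sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small effect (d = 0.1) crosses either bar --
# the only thing that changes is the size of the sample you buy:
print(round(two_sided_p(0.1, 400), 4))   # ~0.0455 -> "significant" at .05
print(round(two_sided_p(0.1, 800), 4))   # ~0.0047 -> "significant" at .005
```

Nothing about the inference gets better between the two lines; only the recruiting bill does.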

What makes an empirical study “good” or “bad” is the quality of the inference strategy—i.e., the practical logic that connects measured observables to the not-directly observables of interest.

If a researcher can persuade reviewers to accept a goofy theory for a bad study (say, one on the impact of “himmicanes” on storm-evacuation advisories, the effect of ovulation on women’s voting behavior, or the influence of egalitarian sensibilities on the rate of altercations between economy-class and business-class airline passengers) at p ≤ 0.05, then the only thing that researcher has to do to get the study published at p ≤ 0.005 is collect more observations.

Of course, because sample recruitment is costly, forcing researchers to recruit massive samples will make it harder for researchers to generate bad studies.

But for the same reason, a p ≤ 0.005 standard will make it much harder for researchers doing good studies (ones that rest on plausible mechanisms) to generate publishable papers, too.

Accordingly, to believe that p ≤ 0.005 will improve the ratio of good studies to bad, one has to believe that scholars doing good studies will be more likely to get their hands on the necessary research funding than will scholars doing bad studies.

That’s not particularly plausible: if it were, then funders would be favoring good over bad research already—at p ≤ 0.05.

At the end of the day, a p ≤ 0.005 standard will simply reduce the stock of papers deemed publishable—period—with no meaningful impact on the overall quality of research.

2. It’s not the case that a p ≤ 0.005 standard will “dramatically reduce the reporting of false-positive results—studies that claim to find an effect when there is none—and so make more studies reproducible.”

The mistake here is to think that there will be fewer borderline studies at p ≤ 0.005 than at p ≤ 0.05.

P is a random variable. Thus, if one starts with a p ≤ 0.05 standard for publication, a study finding that comes in at exactly p = 0.05 has a 50% chance of coming in “nonsignificant” (p > 0.05) on the next trial, even assuming both studies were conducted identically & flawlessly. (That so many replicators don’t seem to get this boggles one’s mind.)

If the industry norm is adjusted to p ≤ 0.005, we’ll simply see another random distribution of p-values, now centered on 0.005. So again, if a paper reports a finding at p = 0.005, there will be a 50% chance that the next, replication trial will produce a result that’s not significant at p ≤ 0.005. . . .
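The 50% figure can be checked by simulation, under the stylized assumption that a replication’s z-statistic is drawn around the original just-significant z with unit standard error:

```python
import random
from statistics import NormalDist

random.seed(1)
z_crit = NormalDist().inv_cdf(0.975)   # 1.96: the two-sided p = .05 bar

# Suppose the true effect is exactly the one that put the original
# study at p = .05 (its observed z equals the critical z).  An
# identical replication draws a fresh z around that same mean:
reps = [random.gauss(z_crit, 1) for _ in range(100_000)]
hit_rate = sum(z > z_crit for z in reps) / len(reps)
print(round(hit_rate, 2))   # ~0.50: a coin flip, not a "failed replication"
```

The same simulation with the bar moved to 0.005 gives the same coin flip, just relocated.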

Certifying reproducibility won’t be any “easier” or any more certain. And for the reasons stated above, there will be no more reason to assume that studies that either clear or just fall short of clearing the bar at p ≤ 0.005 are any more valid than ones that occupy the same position in relation to p ≤ 0.05.

3. The problem of NHST cannot be fixed with more NHST.

Finally and most importantly, the p ≤ 0.005 standard misdiagnoses the problem behind the replication crisis: the malignant craft norm of NHST.

Part of the malignancy is that mechanical rules like p ≤ 0.005 create a thought-free, “which button do I push” mentality: researchers expect publication for findings that meet this standard whether or not the study is internally valid (i.e., even if it is goofy). They don’t think about how much more probable a particular hypothesis is than the null—or even whether the null is uniquely associated with some competing theory of the observed effect.

A practice that would tell us exactly those things is better not only substantively but also culturally, because it forces the researcher to think about exactly those things.

Ironically, it is clear that a substantial fraction of the “Gang of 72” believes that p-value-driven NHST should be abandoned in favor of some type of “weight of the evidence” measure, such as the Bayes factor. They signed on to the article, apparently, because they believed, in effect, that ratcheting up (down?) the p-value norm would generate even more evidence of the defects of any sort of threshold for NHST, and thus contribute to more widespread appreciation of the advantages of a “weight of the evidence” alternative.
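For a sense of what a weight-of-the-evidence measure says about these thresholds, one illustrative benchmark (an assumption of this sketch, not anything endorsed in the paper) is Goodman’s minimum Bayes factor, exp(-z²/2): an upper bound on how strongly a two-sided p-value can favor even the best-supported alternative over the null.

```python
from math import exp
from statistics import NormalDist

def min_bayes_factor(p: float) -> float:
    """Goodman's minimum Bayes factor exp(-z^2/2) for a two-sided
    p-value: the *most* the data can favor any alternative over
    the null, no matter the prior."""
    z = NormalDist().inv_cdf(1 - p / 2)
    return exp(-z * z / 2)

print(round(1 / min_bayes_factor(0.05), 1))   # ~6.8  -- not 19:1 odds
print(round(1 / min_bayes_factor(0.005), 1))  # ~51.4
```

Read as generously as possible, p = 0.05 is at best about 7:1 evidence against the null, which is part of why many of the 72 would rather report the evidence itself than police a threshold.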

All I can say about that is that researchers have for decades understood the inferential barrenness of p-values and advocated for one or another Bayesian alternative instead.

Their advocacy has gotten nowhere: we’ve lived through decades of defective null hypothesis testing, and the response has always been “more of the same.”

What is the theory of disciplinary history that predicts a sudden radicalization of the “which button do I push” proletariat of social science?

As intriguing and well-intentioned as the p ≤ 0.005 proposal is, arguments about standards aren’t going to break the NHST norm.

“It must get worse in order to get better” is no longer the right attitude.

Only demonstrating the superiority of a “weight of the evidence” alternative by doing it—and even more importantly teaching it to the next generation of social science researchers—can really be expected to initiate the revolution that the social sciences need.   




Reader Comments (37)

Dan -

A couple of questions. If possible, could you answer in relatively simple terms....?

There is no reason to think a p-value of 0.005 would reduce the ratio of valid to invalid studies;

Really? You don't think it would change the ratio at all?

The only difference between a bad study at p ≤ 0.05 and a bad study at p ≤ 0.005 is sample size.

OK, I think maybe I get that.

Not to pretend that I understand the technical issues at play, but just for clarification: does that then imply that you are arguing that effectively increasing the mean sample size in studies that are published would not result in a reduction in the % of published studies that report a "false positive?"

Leaving aside the statistical debate, does it mean that you rule out the possibility that simply requiring larger sample sizes (assuming that the only potential change in response to a more stringent p value requirement would be increases in sample size) would make it more likely that methodological flaws would be discovered?

August 23, 2017 | Unregistered CommenterJoshua

Joshua-- p-values get increasingly small as the sample enlarges, but w/o adding anything of inferential value as they do. One has to use some sort of weight-of-the-evidence measure--like the Bayes factor--as a result. See Gateway Illusion for more discussion

August 24, 2017 | Registered CommenterDan Kahan

"A couple of questions. If possible, could you answer in relatively simple terms....?"

We are searching the haystack of false hypotheses for the needle of truth. We pick up a handful of haystack, put it under our detector, and see if it beeps. Unfortunately, the metal detector beeps one in twenty times even if there is nothing there, and the haystack is contaminated with metallic grit.

So there are people who are picking up random bits of haystack and testing them, and every time it beeps (which is at least one time in every twenty) they publish a paper. The odds of them picking the needle versus a bit of hay is not generally one in twenty - it depends simply on how big the haystack is, and how small the needle. If the haystack is big enough, 99.9% of the papers published may be false alarms. They're one in twenty of the attempts, but there are an unlimited number of attempts.

Now suppose you turn up the sensitivity on the metal detector. Now only one in two hundred attempts with nothing in them beeps. But with that metallic grit in there too, the detector is more apt to be sensitive enough to detect it and go off the more you turn it up. There's no change in the odds of picking the needle at random when you grab your handful of hay. There's a ten-fold decrease in the rate at which the detector beeps with nothing in the scoop, but tests are a lot quicker to do than papers published, so there's probably not that much of a reduction in the rate of false alarm papers, and you're a lot more likely to be detecting grit. Also, there's also still the same percentage of incompetent and dishonest researchers who make errors operating the detector, which increasing the sensitivity also doesn't change.

Eventually, the sensitivity is so high it's picking up the grit contamination in every handful, and the test becomes useless. It's beeping every time, whether there's a needle of truth there or not. If you've optimised your testing sensitivity to give the best results, and then turn the dial up higher, there's nowhere to go but worse. If you're off the optimum, then one way it gets better and the other way it gets worse and there's no reason to think 'better' is necessarily in the direction of higher sensitivity rather than lower.
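The base-rate arithmetic driving the haystack analogy can be sketched directly (the 1-in-1,000 base rate of true hypotheses and the 80% detection power are illustrative assumptions):

```python
def false_alarm_share(base_rate: float, alpha: float, power: float = 0.8) -> float:
    """Share of 'beeps' (significant results) that are false alarms,
    given the base rate of true hypotheses among those tested.
    The 80% power default is an illustrative assumption."""
    true_hits = base_rate * power
    false_hits = (1 - base_rate) * alpha
    return false_hits / (true_hits + false_hits)

# One needle per 1,000 handfuls of hay:
print(round(false_alarm_share(0.001, 0.05), 3))    # ~0.984
print(round(false_alarm_share(0.001, 0.005), 3))   # ~0.862 -- still mostly hay
```

Turning the dial from .05 to .005 trims the false-alarm share but leaves it dominated by the size of the haystack, which is the analogy's point.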

August 24, 2017 | Unregistered CommenterNiV

link drop:

Our observations of the 2016 election are inconsistent with a symmetric polarization hypothesis. Instead, we see a distinctly asymmetric pattern with an inflection point in the center-right—the least populated and least influential portion of the media spectrum. In effect, we have seen a radicalization of the right wing of American politics: a hollowing out of the center-right and its displacement by a new, more extreme form of right-wing politics. During this election cycle, media sources that attracted attention on the center-right, center, center-left, and left followed a more or less normal distribution of attention from the center-right to the left, when attention is measured by either links or tweets, and a somewhat more left-tilted distribution when measured by Facebook shares. By contrast, the distribution of attention on the right was skewed to the far right. The number of media outlets that appeared in the center-right was relatively small; their influence was generally low, whether measured by inlinks or social media shares; and they tended to link out to the traditional media—such as the New York Times and the Washington Post—to the same extent as did outlets in the center, center-left, and left, and significantly more than did outlets on the right. The number of farther-right media outlets is very large, and the preponderance of attention to these sources, which include Fox News and Breitbart, came from media outlets and readers within the right. This asymmetry between the left and the right appears in the link ecosystem, and is even more pronounced when measured by social media sharing.

August 24, 2017 | Unregistered CommenterJonathan

"In effect, we have seen a radicalization of the right wing of American politics: a hollowing out of the center-right and its displacement by a new, more extreme form of right-wing politics."

That fits. From what I've seen, the American political elites occupied the centre-left and left on one side, and the centre on the other, with no real representation for the centre-right or right (what they call the "RINO" phenomenon). At the same time, the mainstream media (what the report laughably describes as "long-standing media organizations steeped in the traditions and practices of objective journalism") were partisan left and centre-left, with a couple of outliers in the centre-right like Fox.

The change triggered by the Trump campaign was that the mainstream media became more openly partisan for Hillary and against Trump, legitimising a sort of frenzied, paranoid 'Emmanuel Goldstein-esque' hate campaign in the alt-left, and where even the centre-right political elites in the Republican party joined in too. The voters on the right lost patience with both the Republican nominally-centre-right elites and the "lying, partisan media", and turned instead to alternative news sources. The far right, who had long been excluded and ignored from the political debate, exploded very visibly onto the scene. And their visibility was enhanced even further when the partisan left-wing mainstream media picked this up as yet another stick to beat Donald Trump with, talking up the dangers.

I found figure 3 in the report fascinating. Virtually all the top coverage in the campaign consisted of frenzied attacks on Donald Trump! As they say, he and what he said dominated the news! I think it was partly the ferocity and openness of the media campaign that so alarmed the disenfranchised voters of the centre-right, and led them to shift further towards the anti-elite, anti-establishment, revolutionary stance. The vitriolic anti-Trump campaign made the alt-right paranoia seem more credible.

I presume you brought it up because it mentioned a symmetry hypothesis in partisanship. I don't think it's the same as the symmetry thesis we usually discuss here, which is that the cognitive capabilities and bias mechanisms on left and right are the same, not that similar capabilities always give rise to identical outcomes. Clearly, the left and right cultural groups have their own cultural dynamics. Left-wingers and right-wingers in the Stalinist Soviet Union had the same basic cognitive capabilities, but existed in radically different cultural and social contexts. The Stalinists did not rule society because their brains were any different.

August 25, 2017 | Unregistered CommenterNiV

Dan -

So you say this...

p values get increasing small as the sample enlarges but w/o adding anything of inferential value as they do. One has to use some sort of weight of the evidence measure--like Bayes factor-- as a result.

And indeed, the paper says this:

The proposal does not address multiple hypothesis testing, P-hacking, publication bias, low power, or other biases (e.g., confounding, selective reporting, measurement error), which are arguably the bigger problems. We agree. Reducing the P-value threshold complements—but does not substitute for—solutions to these other problems, which include good study design, ex ante power calculations, pre-registration of planned analyses, replications, and transparent reporting of procedures and all statistical analyses conducted.

and presents this as a counterargument....

Changing the significance threshold is a distraction from the real solution, which is to replace null hypothesis significance testing (and bright-line thresholds) with more focus on effect sizes and confidence intervals, treating the P-value as a continuous measure, and/or a Bayesian method.

And I sorta, kinda, maybe get all of that to at least some extent...

And I get that the marginal improvement of a higher p value bar brings ever vanishing returns in any practical sense, And I get that more stringent p values come at a cost...

But what I still can't wrap my head around is the idea that reducing the MOE and narrowing confidence intervals won't be accompanied by reduced false positives, to any extent (as you argue).

And I just can't imagine how increasing sample sizes won't bring any benefit wrt false positives - if only because it might help researchers to see confounding effects that they overlooked previously (because of small sample size!)

Guess it's just one of those things I'll have to accept that I don't understand. There is a long list, and it grows unrelentingly.

August 25, 2017 | Unregistered CommenterJoshua

link drop:
( is paywalled, will look for a draft version...).

August 25, 2017 | Unregistered CommenterJonathan

So while it certainly seems that people are inclined towards forming beliefs in line with group affiliation, it also seems to me that the process of affiliation is, more or less, arbitrary. Not being convinced by "asymmetry," I'm dubious that there's some genetic predisposition for certain people to align with certain ideologies because of some innate compatibility with those ideas. And for the same reason, I don't think that people affiliate because of moral or value distinctions (and at any rate, IMO, there are far more commonalities along those axes than there are distinctions).

And then there's another issue of cultural influence; for example, the Japanese are considered to be relatively inclined towards collectivism or communalism or conformity or social harmony or social cohesion, in a way that seems to me to be in rather stark contrast to the intra-societal partisanship that we see in the U.S. Fijians are big into social harmony and communalism in ways that seemed very foreign (and very refreshing) to me when I traveled there. But the Japanese aren't exactly anti-tribal when you ask them about Koreans or Chinese, and Fijians have a long history of animosity towards Indo-Fijians.

So what if there were a strong initiative towards a new "tribe" in the U.S.?

Ohio Gov. John Kasich and Colorado Gov. John Hickenlooper have entertained the idea of forming a unity presidential ticket to run for the White House in 2020, a source involved in the discussions tells CNN.

Under this scenario, Kasich, a Republican, and Hickenlooper, a Democrat, would run as independents with Kasich at the top of the ticket, said the source, who cautioned it has only been casually talked about.

How popular might such a tribe be? Would the existence of such a tribe complicate the calculus of how people align their beliefs with ideology? What the hell would they tell pollsters about climate change?

August 25, 2017 | Unregistered CommenterJoshua

@Joshua RE: your point about p-values and CIs. The problem with using CIs for this I think is that, yes, while focusing on a narrower CI may possibly be more useful than a p-value in some cases, A) p-values and CIs are intimately related, B) very, very few people interpret CIs correctly (including statisticians!) and C) related to A&B, CIs (and p-values) are based on an IMO pretty untenable assumption about repeated sampling (the assumption itself isn't flawed, it is just that it doesn't match the reality of how applied researchers use them). This excellent paper from Rich Morey is a great start on some of this:

So, while CIs can definitely be good heuristics for precision (especially if calculating CIs for effect sizes), they give a false sense of 'confidence' in most cases, especially when interpreted as 'doesn't cross zero, then effect is significant'. What (I think) you'd really want here is a highest posterior density interval, which can actually be interpreted in probabilistic terms and I think conveys much more information about the precision of effects in a model. I just don't see mathematically how regular CIs can help much with false positives in a hypothesis testing framework, though I'm no expert on this and would be willing to be persuaded.
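On the "intimately related" point, the duality is exact in the textbook z-case: a 95% interval excludes zero precisely when the two-sided p-value against zero falls below .05. A minimal sketch (the mean, sd, and n are made-up numbers for illustration):

```python
from math import sqrt
from statistics import NormalDist

def z_ci_and_p(mean: float, sd: float, n: int, level: float = 0.95):
    """95% z-interval for a mean and the matching two-sided p-value
    against zero -- the same computation viewed from two angles."""
    se = sd / sqrt(n)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (mean - z_crit * se, mean + z_crit * se)
    p = 2 * (1 - NormalDist().cdf(abs(mean) / se))
    return ci, p

ci, p = z_ci_and_p(mean=0.21, sd=1.0, n=100)
# The interval excludes zero exactly when p < .05:
print((ci[0] > 0) == (p < 0.05))   # True
```

So a "doesn't cross zero" reading of a CI is just NHST wearing different clothes, which is Dan C's point B in miniature.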

August 25, 2017 | Unregistered CommenterDan C

"What (I think) you'd really want here is a highest posterior density interval, which can actually be interpreted in probabilistic terms and I think conveys much more information about the precision of effects in a model."

What we really need here is for people to understand the limitations of the tools they're using.

The basic problem is that the statistical tools are based on certain models and assumptions about the situation being studied, and they work fine when those assumptions are sufficiently accurate, but become meaningless when they're not.

A CI is just the interval of hypotheses that a significance test wouldn't reject, and therefore shares all the same problems as significance testing. Bayesian methods are probably the best we've got, but still have many problems; like there being no way to justify the priors you use prior to seeing any evidence, the fact that their calculation depends on having an accurate statistical model of the likelihoods, and the problem of figuring out what alternative hypotheses you haven't considered. Bayesian methods can talk about whether one specific hypothesis is more likely than another, but they don't deal well with "unknown unknowns".

Most textbook techniques rely on simplifying assumptions like "independent identically distributed Gaussian" random variables to make the calculations tractable, but this is only an approximation. Real-world measurements are rarely if ever precisely independent, stationary, or Gaussian, and such residual confounders can give false positives far more often than expected with sufficiently huge sample sizes. A combination of textbooks that never point out their own limitations when it comes to the real world, and scientists who have only a "push-the-buttons" follow-the-textbook understanding of the statistics, conspires to paper over a morass of unfounded assumptions with the illusion of mathematical rigour.
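A quick simulation illustrates the non-independence point (the AR(1) error process with ρ = 0.5 is an arbitrary choice for illustration): a z-test that assumes independent observations rejects a true null far more often than its nominal 5%.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(2)
Z_CRIT = NormalDist().inv_cdf(0.975)    # 1.96: nominal two-sided .05 test

def naive_test(n: int = 300, rho: float = 0.5) -> bool:
    """Generate one AR(1) series with true mean 0, then run a z-test
    that (wrongly) treats the n observations as independent."""
    x, xs = 0.0, []
    for _ in range(n):
        x = rho * x + random.gauss(0, 1)
        xs.append(x)
    z = mean(xs) / (stdev(xs) / sqrt(n))
    return abs(z) > Z_CRIT              # a false positive?

trials = 1000
fp_rate = sum(naive_test() for _ in range(trials)) / trials
print(round(fp_rate, 2))   # roughly 0.2-0.3, far above the nominal 0.05
```

No p-value threshold fixes this; the error sits in the statistical model, not in the cutoff.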

The "replication crisis" arises from fundamentally misunderstanding the function of the peer-reviewed journals and the scientific debate. The original intention was that journal papers are incomplete work in progress - published more widely to be checked, challenged, extended, debunked, and debated. They are not and never were the "gold standard" of "settled science", and the checks applied to them by the journals are nowhere near sufficient to achieve that. They're the bit at the start of the meeting where the boss calls for ideas, people put suggestions up, and then they each get shot down. It's not in the least bit surprising that half of them are wrong. What is so utterly astonishing about the situation is that so many scientists have come to believe/expect that they weren't!

The idea behind the 95% confidence standard is not to say that 95% confidence that something is true is sufficient for 'science'. That's laughable! The idea is that 95% is sufficient evidence to make this a hypothesis worth taking seriously and spending time on to check further. It's a filter to allocate the scarce resource of researchers' attention productively.

Even if 95% confidence translated to a 95% probability of truth (which obviously it doesn't), that's not good enough for much science. Any chain of reasoning depends on every link in the chain being true. If each link has a 95% chance of holding, then you can link together 13 logical steps before the probability of correctness drops below 50%, and the conclusion is more likely invalid than valid. That's not a lot, especially when a lot of science depends on concatenated chains of reasoning hundreds or even thousands of steps long!
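The arithmetic behind the 13-step figure checks out:

```python
# Probability that a chain of reasoning survives k links,
# each holding independently with probability 0.95:
for k in (1, 13, 14, 100):
    print(k, round(0.95 ** k, 3))
# 13 links: ~0.513 (still better than even)
# 14 links: ~0.488 (the conclusion is now probably false)
# 100 links: ~0.006
```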

Science requires (and regularly achieves) confidence at the 99.9999% level, or higher. That's why it's so successful. But that depends on people routinely checking and challenging peer-reviewed science, and everyone knowing that those checks are in place and effective. The fundamental problem is that scientists are not checking stuff properly any more, they're just "taking their colleagues' word for it".

There's one case I know of where a researcher found himself struggling with obviously corrupted data in previously published results, and rather than report and document the issues, he chose instead to make up false data to cover up the problem. He knew very well what that would do to the body of scientific knowledge - as he said: "It will allow bad databases to pass unnoticed, and good databases to become bad, but I really don't think people care enough to fix 'em, and it's the main reason the project is nearly a year late." The data he was working to update had been published years earlier, with no public indication of any problems with it, had been cited as evidence at the highest levels of policy-making, and was widely considered to be solid science. Supposedly it had been thoroughly reviewed multiple times. But what checks could they possibly have done when nobody had even noticed that even the institution that had produced it couldn't replicate the results, and was forced to make stuff up to get the thing to work?! It's not been officially admitted or corrected, even now.

Not only do modern-day scientists not check, they're **outraged** by the idea that anyone else might. They say: “It would be odious requirement to have scientists document every line of code so outsiders could then just apply them instantly.” “p.s. I know I probably don’t need to mention this, but just to insure absolutely clarify on this, I’m providing these for your own personal use, since you’re a trusted colleague. So please don’t pass this along to others without checking w/ me first. This is the sort of “dirty laundry” one doesn’t want to fall into the hands of those who might potentially try to distort things…” “The two MMs have been after the CRU station data for years. If they ever hear there is a Freedom of Information Act now in the UK, I think I’ll delete the file rather than send to anyone.” "We have 25 or so years invested in the work. Why should I make the data available to you, when your aim is to try and find something wrong with it. There is IPR to consider."

"No scientist who wishes to maintain respect in the community should ever endorse any statement unless they have examined the issue fully themselves." But thousands do. They "prostitute themselves" by offering support to scientific claims that nobody, let alone they themselves, have ever critically examined or checked. They trust others to have done it for them. Their belief is based on blind faith. And then the rest of us trust them.

It's all a hideous mess, that will take decades to sort out. But the problem isn't anything as simple as p-values versus CIs versus Bayesian likelihoods or Jeffrey's priors versus uniform priors. It's about trust and systematic scepticism, and an understanding that everything you calculate is founded on your statistical models being correct. It takes people with a wide range of different (often opposing) perspectives to see the holes, flaws, and implicit assumptions you can't. That's the scientific method.

August 25, 2017 | Unregistered CommenterNiV

another link drop:

no rhyme or reason to these link drops - just bright & shiny & recent & somewhat relevant to things we've discussed before.

Joshua - note that I've said I think SDO probably isn't innate (unlike RWA) - and that link seems to indicate similar findings as well: that it is very context specific. However, interesting that RWA & SDO most often found together.

As for a bipartisan Kasich/Hickenlooper ticket - interestingly stumbled on this today:
I had never heard of the Readjusters.

August 25, 2017 | Unregistered CommenterJonathan

Jonathan -

Interesting article. I hadn't heard of the readjusters, either. Although I will never forget hearing this stunning interview, on a very much related topic.

August 25, 2017 | Unregistered CommenterJoshua

@Dan C-- basically agree, although I think we'd all be better off if we stopped thinking about the "null" all the time. Pit 2 plausible hypotheses against each other and report the likelihood ratio associated with a study finding in relation to them. People of diverse priors could then make what they will of the study if you give them the necessary information to do so
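A minimal sketch of that practice (the two point hypotheses and the observed z are invented for illustration): compute the likelihood of the data under each competing hypothesis, report the ratio, and leave the priors to the reader.

```python
from statistics import NormalDist

def likelihood_ratio(z_obs: float, d1: float, d0: float = 0.0) -> float:
    """Likelihood ratio for an observed z-score under two point
    hypotheses about the true effect (in z units): H1 (centered
    on d1) versus H0 (centered on d0), each with unit sd."""
    return NormalDist(d1, 1).pdf(z_obs) / NormalDist(d0, 1).pdf(z_obs)

# An observed z of 2.0 favors a predicted effect of 2 over the
# null by about 7:1 -- evidence readers can weigh for themselves:
print(round(likelihood_ratio(2.0, 2.0), 1))
```

Someone with even prior odds would move to roughly 7:1 in favor of H1; someone starting at 1:100 would stay skeptical, and both would be reasoning from the same reported number.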

August 26, 2017 | Registered CommenterDan Kahan

@NiV-- same thing I just said to @Dan C

August 26, 2017 | Registered CommenterDan Kahan


And how, then, do you calculate the likelihood ratio?

August 26, 2017 | Unregistered CommenterNiV

yet another link drop - maybe this is the real reason Dan is taking the fall at Haahved - he lost a thumb-wrestling match to other Yalers:
"Contrary to the popular motivated reasoning account of political cognition, our evidence indicates that people fall for fake news because they fail to think; not because they think in a motivated or identity-protective way."
and, asymmetrical:
"The link between analytic thinking and media truth discernment was driven both by a negative correlation between CRT and perceptions of fake news accuracy (particularly among Hillary Clinton supporters), and a positive correlation between CRT and perceptions of real news accuracy (particularly among Donald Trump supporters)."
also, personal:
"Finally, analytic thinking was associated with an unwillingness to share both fake and real news on social media."
Does this blog count as social media, and do recent science articles count as news?

August 26, 2017 | Unregistered CommenterJonathan

Jonathan -

At first I thought this might be a Poe:

We also found consistent evidence that pseudo-profound bullshit receptivity negatively correlates with perceptions of fake news accuracy; a correlation that is mediated by analytic thinking.

It would be funny if the research was testing readers' pseudo-profound bullshit receptivity ☺️

I wonder about how they generalize based on examination of fake news from the 2016 election. I would think that we might find more symmetrical patterns if we looked at reactions to fake news on a variety of issues and circumstances. My guess is that the 2016 election, and related news, might not be a representative sampling, particularly given the potential influence of Russian fake news campaigns towards the goal of advancing Trump's candidacy. For example, Trump supporters might be more circumspect and skeptical on other issues and electoral contexts?

August 26, 2017 | Unregistered CommenterJoshua

"I wonder about how they generalize based on examination of fake news from the 2016 election. [...] My guess is that the 2016 election, and related news, might not be a representative sampling, particularly given the potential influence of Russian fake news campaigns towards the goal of advancing Trump's candidacy."

Ha! Ha! Very funny!

Oh, sorry, hang on a second. You wasn't serious, was you?

"At first I thought this might be a Poe:"


"Does this blog count as social media, and do recent science articles count as news?"

Yes. :-)

August 26, 2017 | Unregistered CommenterNiV

"Since we took an ecological approach and selected actual fake news stories, it is possible that differences in perceptions of fake news accuracy between Clinton and Trump supporters could be the result of the items that we happened to select (e.g., the Democrat items may be less convincing than the Republican items despite being equally partisan)."


"Thus, as an additional test of baseline differences in fake news susceptibility between the liberals and conservatives in our sample, we also included a set of neutral news stories that did not contain political content (e.g., “Because Of The Lack Of Men, Iceland Gives $5,000 Per Month To Immigrants Who Marry Icelandic Women!”)."

And how does that help?

August 26, 2017 | Unregistered CommenterNiV

NiV: "And how does that help?" - would $10,000 be more enticing?

August 26, 2017 | Unregistered CommenterJonathan

"NiV: "And how does that help?" - would $10,000 be more enticing?"

What?
They quite correctly say "it is possible that differences in perceptions of fake news accuracy between Clinton and Trump supporters could be the result of the items that we happened to select", which seems to me like a serious flaw in their methodology, but the following sentence starts "Thus, ..." as if this was some sort of correction or mitigation of the problem. However, all they're doing is offering another handful of fake news stories, which surely suffers from exactly the same problem. It's a small sample of very specific examples, with no guarantee that the distribution of responses would be the same even if the susceptibility of subjects to fake news was identical. And the results for a new sample of politically neutral stories tell us nothing about the validity/biases or otherwise of the original set of politically partisan stories. It's a non sequitur.

Consider for a moment how a reader is supposed to distinguish fake news from real. One possibility is "plausibility", meaning how well it fits into people's models of how the world works and what sorts of things happen in it. But given some of the crazy-sounding events that turn out to be real, and given that the sensationalist media is drawn to dramatic stories about extremes and outliers, that's rather difficult. Hillary Clinton set up her own private email server in a friends' bathroom and had her flunkies email classified documents to it to get round the laws on federal records keeping and disclosure?! Who would have believed that was plausible/conceivable? Truth is sometimes stranger than fiction.

So the only other possibility is whether people happen to have seen that particular story before, and seen it either debunked or confirmed. News of the debunking/confirmation spreads through social networks unevenly, so you might have had the news about some stories get out to a far wider proportion of your sample subjects than others. In particular, the spread through social networks is quite likely to be correlated with political tribe, making it a classic confounder. When you're only offering 15 stories, it wouldn't be hard to have made an unfortunate choice.

This is the same problem we were discussing above: it applies to those CIs they published on the bar charts - they're calculated based on assumptions about the sample being representative and the metrics giving the same response to the underlying 'fake news susceptibility' variable being measured. But it's very difficult to take methodological uncertainties into account - the problem of "unknown unknowns". Larger sample sizes wouldn't help. To calculate CIs, you would need to know how variable the response was to different selections of fake/real stories, and they've got no data on that.

Setting stricter standards on p-values for publication does nothing to mitigate methodological flaws, and so long as a large proportion of papers continue to have such flaws, (which I think is pretty much inevitable,) stricter standards on p-values does nothing for the 'replication crisis'. What they need is motivated critics with a *different* set of biases and cognitive blind spots to apply systematic scepticism, and try to poke holes in the studies. That's supposed to be the function of post-publication peer review. Only when lots of people like me have had a go and failed to find anything wrong with it can you consider it (even tentatively) 'settled science'.

August 26, 2017 | Unregistered CommenterNiV


"What?" - it was a joke. I pretended to think that by "And how does that help?" you were referring to the content of that particular fake news item instead of the use of neutral fake news in the test, and responded (in)appropriately. Get it now?

August 26, 2017 | Unregistered CommenterJonathan

Ah! I see!

Yes. Good joke.

August 26, 2017 | Unregistered CommenterNiV


Here's a peace offering:

August 26, 2017 | Unregistered CommenterJonathan

Jonathan -

Thanks for that link...

Just as there is almost certainly greater protection against government prohibition of speech than ever before, there also seems to be greater social intolerance toward many types of speech than there has ever been in the past.


"...seems to be..."

A rather remarkable statement. But while there is a notable lack of evidence presented to quantify the argument, there is at least some nuanced discussion.

While there is much greater tolerance of speech that would have been subject to social as well as legal censure in the not too distant past including seditious and sexually oriented speech, there is far less tolerance of speech that might be deemed offensive or insensitive with respect to race, gender, sexual orientation or speech that might be considered offensive to or by various cultural or social groups.

First, it would be interesting to see some quantification of the decrease in the type of speech being described, over time. I would imagine that it would be easy to see more public condemnation of such speech, but that doesn't necessarily mean that there is less of such speech. More condemnation does not necessarily imply a particular diminishment in response. There are many avenues for such speech that are readily available that never existed before. Consider the comment sections at Breitbart. Consider the massive media outlets where such speech is readily available - Fox News, Hannity's radio show, Limbaugh, Ingraham, Drudge.... Despite all the hand-wringing from snowflakes about their inability to express their racist, homophobic, etc., views, in fact they may have a much wider range of vehicles to express those opinions.

And, of course, if we were able to track and quantify racist, homophobic, etc., speech over time, and were able to detect a trend of diminishment, how exactly would we attribute causality? Should we assume that the snowflakes who hand-wring and pearl-clutch (from their fainting couches) about their loss of free speech are correct in painting themselves as victims of people who object to their racist, homophobic, etc., language? How would we know that the putative diminishment wasn't merely because a smaller % of people embrace racist, homophobic, etc. views, rather than because people are so deeply concerned about pushback that they are intimidated and won't express their true views? Maybe if there is a smaller % of such people, it is because more people have determined that such views are detestable?

And lastly, how would such changes in expression track with other changes in how people express themselves?

August 26, 2017 | Unregistered CommenterJoshua

, there is far less tolerance of.... speech that might be considered offensive to or by various cultural or social groups.

I wonder if there is more or less tolerance today for criticism, say, of heterosexuals, or WASPS, or white people in general, or our military leaders, or politicians, or scientists, or priests ornithology religious leaders or the catholic church hierarchy, than there was decades or centuries ago?

Of course things ain't like they used to be. But then again, they never were.

August 26, 2017 | Unregistered CommenterJoshua

Ornithology? ☺️

August 26, 2017 | Unregistered CommenterJoshua

Evidently, the spell checker thinks your arguments are for the birds.

August 26, 2017 | Unregistered CommenterJonathan


A peace offering for you, too:

August 26, 2017 | Unregistered CommenterJonathan

Well, many people have said my views are fowl foul and cuckoo.

August 26, 2017 | Unregistered CommenterJoshua

"Evidently, the spell checker thinks your arguments are for the birds."

Realized this means that NiV and Ecoute can be replaced by a bot!

Hmmm - guess that means I should give a peace offering to Ecoute as well, to be fair:

August 26, 2017 | Unregistered CommenterJonathan

@Joshua-- is They Saw a Protest a good model of the tolerance for dissident speech? Or is the intolerance based on something else?

August 27, 2017 | Registered CommenterDan Kahan

"NiV, Here's a peace offering:"

Thanks. But no peace offering is necessary. So far as I'm concerned we're not in conflict. I'm actually enjoying the discussion, and just because I disagree with something doesn't mean I'm unhappy to have seen the point of view being put forward. I see debate as a way of exercising and testing my justifications for my beliefs. My confidence in them relies on the skills of those arguing against them, and therefore I seek out people who will argue with me with the greatest motivation, knowledge, and skill, without getting nasty about it or upsetting anybody. That's why I come here.

The loss of so important an aid to the intelligent and living apprehension of a truth, as is afforded by the necessity of explaining it to, or defending it against, opponents, though not sufficient to outweigh, is no trifling drawback from, the benefit of its universal recognition. Where this advantage can no longer be had, I confess I should like to see the teachers of mankind endeavouring to provide a substitute for it; some contrivance for making the difficulties of the question as present to the learner's consciousness, as if they were pressed upon him by a dissentient champion, eager for his conversion.

But it was an interesting essay, that explains some troublesome issues well. "The realization of freedom of speech is dependent on a culture that values it." Freedom of speech requires that people be free to criticise free speech and its exercise in aggressive and intimidating terms that could easily put people off, which itself undermines the exercise of free speech. It's necessary that harsh criticism be allowed for debate to serve its truth-seeking function. But our only legitimate defence against its overuse is that we *choose* not to misuse it, which depends on us jointly choosing to live in a society where people will make the effort to maintain it, where they understand its value. It's arguable that a society that doesn't value free speech doesn't deserve it. But then how should you proceed when living in a society that clearly doesn't value it?

That's provoking some thought on my part, for which I thank you.


I was a bit less impressed with your peace offering to Joshua (understandable, I suppose :-)), but it was a thoughtful attempt. I think the bit I disagree with most was this part:

"So when people claim that aiding and abetting gay marriage would infringe on their religious liberty, in most cases what they must mean is that this would violate their particular conception of positive liberty – their particular conception of how we each should live, a conception that is based on their religious views."

There is on the one hand the religious person's duty to encourage others to live a moral life (as they define it) which I think is what the author is describing. But I disagree that that's the issue here. Christians in the US have long since relinquished any expectation of being able to make others live moral lives, except by setting an example and by verbal persuasion, which are not at issue here.

No, the issue is a person's freedom of *action*, and the principle that free trade means that any commercial transaction is entered into voluntarily by *both* parties. These rules are denying people the opportunity to choose who they do business with and what business they do. Moreover, it's selective and targeted. You can discriminate on some grounds and not others. Some people can discriminate and not others. Suppose it was discovered that a large group of customers had organised a "boycott" of a certain shop chain, because they had political or social views they disagreed with. Could the shopkeeper sue them, and insist that those customers trade with them, to insist that they choose who to shop with based on price and quality alone? Can you refuse to deal with someone because they are, say, a white supremacist? An ex-convict? A critic of your diversity policies? A Trump supporter?

The basis of individual liberty is that nobody can stop you doing what you want to, but that nobody else is required to cooperate to make it possible. That would be an abridgement of *their* liberty. Cooperative enterprises (like trade) require consent from *all* parties to be liberal. For some parties to be able to compel the others, arguing that without their cooperation they would be constrained from doing what they wanted, would legitimise slavery.

"Indeed, for those who have any doubt about this, simply imagine what it was like to experience life as a black person under Jim Crow."

Or imagine living as a homophobe in a 'politically correct' society. There's no difference in the effect. Just in whether you think the cause justifies the treatment. And given the frequency with which such social fashions change, it cannot be anything but arbitrary.

The distinction is that Jim Crow laws were *laws*, not people's individual choices. If you wanted to set up some businesses that offered segregated services and others that didn't, and let the marketplace demonstrate which was more profitable, that would be one thing. But to make it a *law* abridges the freedom of people to offer the services they want, just as anti-segregation laws would.

In any case, we have segregation on other grounds - there are no-smoking premises with their groups of poor smokers stood huddled outside in the rain, we have private clubs that offer members-only services, we have sex-segregated or gender-segregated toilets and changing rooms. The principle is the same.

I'm personally not in favour of refusing services to LGBT people - I *am* one, so I'm well aware of the problems it causes. But I'm a lot more tolerant of other people's right to have unpopular opinions and choices as a result, even when they cause me problems.


"How would we know that the putative diminishment wasn't merely because a smaller % of people embrace racist, homophobic, etc. views, rather than because people are so deeply concerned about pushback that they are intimidated and won't express their true views? Maybe if there is a smaller % of such people, it is because more people have determined that such views are detestable?"

I find a useful exercise in assessing such statements is to switch the sides, and see if the resulting mirror-statement makes logical sense and is morally acceptable. For example:

"How would we know that the putative diminishment wasn't merely because a smaller % of people embrace perverted, homosexual, sexually deviant etc. views, rather than because people are so deeply concerned about pushback that they are intimidated and won't express their true views? Maybe if there is a smaller % of such people, it is because more people have determined that such views are detestable?"

Up until the past few decades, homosexuality was indeed widely regarded as morally detestable. (It still is in many countries.) And people with pro-homosexuality views kept very quiet about it because they feared social pushback. There were probably genuinely fewer people who supported liberalisation as a result. Was that good?

August 27, 2017 | Unregistered CommenterNiV

Dan -

I couldn't follow your question. Could you elaborate?

August 27, 2017 | Unregistered CommenterJoshua

I criticized this post back when it went up on Twitter, and Mr. Kahan (or at least whoever runs @cult_cognition) asked me to comment here, so here goes.

First, the to-be-clears. No one loves p-values and thinks they’re the end-all, be-all to separating good results from bad. Reducing the p-value threshold wouldn’t address *all* of the problems that led to the replication crisis. (For example, if bad study design skews the results in one direction, adding more observations will continue to find an effect because of that skew and will drive the p-value down, making the bad result look stronger.) I have no idea if .005 is the exact right tradeoff to strike in regard to how expensive it would make studies. And I’m a lowly journalism major who took a few free online courses in statistics!

But I do think it would address one cause of the crisis, and a big one.

Here’s my core point: If you’re looking for an effect that does not exist, and thus can get a significant result only by chance (assuming your experiment is designed properly), you’ll get a significant result 5% of the time with the threshold at .05. When labs all over the country are running silly experiments on small groups of sophomores, a 1/20 chance starts to add up to a lot of scintillating but bad results.
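
That 1-in-20 base rate is easy to check by simulation. Here's a rough sketch in stdlib Python -- the z-test approximation, sample size, and repetition count are all illustrative assumptions, not anything from the studies under discussion:

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)
nd = NormalDist()

def null_experiment(n=200):
    """One 'study' of an effect that does not exist: test whether the
    mean of n standard-normal draws differs from zero (two-sided z-test)."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = mean(xs) / (stdev(xs) / sqrt(n))
    return 2 * (1 - nd.cdf(abs(z)))  # two-sided p-value

pvals = [null_experiment() for _ in range(10_000)]
print(sum(p < 0.05 for p in pvals) / len(pvals))   # ~0.05: roughly 1 in 20 nulls "significant"
print(sum(p < 0.005 for p in pvals) / len(pvals))  # ~0.005: roughly 1 in 200
```

Every "significant" result here is a false positive by construction, which is the point: run enough null experiments and the .05 threshold hands you a steady stream of them.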

Reduce the threshold to .005, and your chances fall all the way to 1/200. And because experiments would have to be bigger – even if an effect is real and pretty big, you probably won’t hit .005 without a lot of observations -- I think people would do fewer silly ones.

(The issue is quite different where sample size is out of your control – when you’re analyzing a data set from the Census, or death tolls of hurricanes since 1950, for example. In these cases the lower threshold would make it impossible to find effects that aren’t big enough to hit .005 within the constraints of the unchangeable data set, which is one huge entry in the "con" column.)

Some more concrete numbers: Say researchers test 1,000 hypotheses, 25 percent of which are actually true. If they correctly identify all the true hypotheses (which of course they almost certainly won’t), they’ll get 250 true positives, plus 38 false positives at .05 (5 percent of 750). 13 percent of their positives will be wrong.

By contrast, at .005, they’ll get only 4 false positives, and only 2 percent of their positives will be wrong. If we relax our assumption that experiments correctly identify all true positives, the improvement will be even greater, because bigger sample sizes will be able to detect smaller effects. And if people will test fewer outlandish hypotheses if they have to put together bigger samples, it’ll be even bigger still. Bigger samples could further require professors to work with colleagues at other schools to recruit participants; these others could help identify study-design flaws, and including samples from multiple schools could lessen the chance of a result that’s true but limited to a certain geographic/cultural area. Journals, of course, would still be biased in favor of more surprising results – but they’d have far fewer to choose from.
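
The arithmetic above can be reproduced in a few lines (a sketch of the same hypothetical 1,000-hypotheses scenario, with the stated assumption of perfect power built in as a parameter):

```python
# The scenario from the comment: 1,000 hypotheses, 25% of them true,
# perfect power assumed, and a significance threshold alpha.
def false_positive_share(n_hypotheses=1000, frac_true=0.25, alpha=0.05, power=1.0):
    n_true = n_hypotheses * frac_true
    n_false = n_hypotheses - n_true
    true_pos = n_true * power      # real effects correctly detected
    false_pos = n_false * alpha    # nulls that clear the threshold by chance
    return false_pos / (true_pos + false_pos)

print(round(false_positive_share(alpha=0.05), 3))   # 0.13  (~38 of ~288 positives wrong)
print(round(false_positive_share(alpha=0.005), 3))  # 0.015 (~4 of ~254 positives wrong)
```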

And to link this to two more specific criticisms of the post:

* Kahan is correct that whatever p a single study finds, a precise replication will have a 50% chance of falling on either side of it. But I don’t think that’s the relevant yardstick here. The real question is, if a study hits p<X, what’s the chance that there’s actually nothing there? The p-value doesn’t directly measure that question (p<.05 is not a “5 percent chance of being wrong”), but it does correspond to it. In the above example, we reduced that chance from 13 percent to 2 percent just by tweaking the p-value.

* I don’t think it’s true that, to turn a p<.05 study into a p<.005 study, “the only thing that researcher has to collect more observations.” This is true only if the finding replicates – if the additional data suggest an effect in the same direction that the original data did. In a certain sense, then, requiring a bigger sample size is kind of like requiring researchers to replicate the finding themselves (with the above caveat about bad design), and therefore has an obvious connection to whether the finding is, well, replicable.

Again, I’m not saying this solves the problem of bad results entirely. But I do think a low p-value threshold would make them less likely.

September 3, 2017 | Unregistered CommenterRobert VerBruggen

Agh, I found an error in my previous comment! This sentence is wrong:

If we relax our assumption that experiments correctly identify all true positives, the improvement will be even greater, because bigger sample sizes will be able to detect smaller effects.

This goes in the opposite direction, actually. I'm talking about changing the p-threshold, not the sample size. So requiring a lower threshold will make it *harder* to find true results, not easier. The exact proportion will depend on how big the effects are.

But if people *did* increase the sample sizes enough to compensate, such that the same size effect has the same chance of returning a significant result, the effect should be to weed out the bad results while still finding the good ones.
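
To put rough numbers on that tradeoff: under a simple normal-approximation power calculation (the effect size d = 0.3 and the sample sizes are illustrative assumptions, not from any study discussed here), lowering alpha cuts power, and a larger n restores it:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def power(n, d, alpha):
    """Approximate power of a two-sided one-sample z-test with effect
    size d and n observations (ignoring the negligible far tail)."""
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z_crit - d * sqrt(n))

# A modest hypothetical effect, d = 0.3:
print(round(power(100, 0.3, 0.05), 2))   # ~0.85 at p < .05
print(round(power(100, 0.3, 0.005), 2))  # ~0.58 at p < .005 -- power drops
print(round(power(165, 0.3, 0.005), 2))  # ~0.85 again once n rises by ~65%
```

So the stricter threshold only weeds out bad results without losing good ones if researchers actually pay the sample-size cost.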

September 3, 2017 | Unregistered CommenterRobert VerBruggen


I think we have different understandings of what the root of the "crisis" is. You suspect people are teasing false significance values out of their data in order to support hypotheses formed before collecting data. I think they are forming "hypotheses" after the fact based on random bits of data that are correlated at p < 0.05. I'm sure they are doing both, actually, but it is the latter that I think is most consequential & disreputable. If people are fishing for ex post hypotheses in ponds of p < 0.05, their jobs will get easier, not harder, as they increase the sample size to meet the p < 0.005 threshold. As the sample size gets arbitrarily large, the differences in more & more variables become "significant," *increasing* the risk of Type I error and making p-values even less valid for identifying supportable inferences from the data.

Want examples? Check out the studies I adverted to. The ovulatory->women's voting preferences study reported multiple outcome variables at p <= 0.005. The fighting-passenger study reported multiple p < 0.0001. Himmicanes? Same thing -- p's < 0.005, indeed, less than 0.0001.

Those are bad studies b/c they reflected invalid theories, not b/c the correlations observed in their own data were not "real."

It will be costlier to play this sort of game if "P < 0.005" is made into a mechanical threshold for publication, but it'll be the same mindless game. The way to stop the game is to make researchers use "weight of the evidence" statistics (e.g., Bayes Factors) instead of "significance" ones. In a scholarly culture that learns to think about evidentiary inferences in terms of weight rather than significance, the boorish practice of manufacturing "WTF! p < 0.05" or "p < 0.001" findings will die out too.
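
One commonly cited way to translate a p-value into weight-of-the-evidence terms is the Sellke-Berger bound, -1/(e * p * ln p), which gives the *most* evidence against the null that a given p-value can represent. A minimal sketch (the function name is mine, and this is a bound, not an actual Bayes factor for any particular study):

```python
from math import e, log

def max_evidence_odds(p):
    """Sellke-Berger bound: the largest Bayes factor (alternative vs.
    null) a p-value of p can support, -1 / (e * p * ln p), for p < 1/e."""
    return -1 / (e * p * log(p))

print(round(max_evidence_odds(0.05), 1))   # ~2.5 to 1 -- weak evidence at best
print(round(max_evidence_odds(0.005), 1))  # ~13.9 to 1
```

Framing thresholds this way makes the point vivid: even in the best case, "p < 0.05" never amounts to more than roughly 2.5-to-1 odds against the null.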

September 4, 2017 | Registered CommenterDan Kahan
