## *Now* do you see an effect? CRT & political outlooks

Okay, the use of ggplot's density function was unhelpful.

But *this* is pretty darn good: a stacked area graph computed along the continuous "Left-right" outlook measure (which is coded so that conservatism increases as outlook score goes up). One can easily see how every score on CRT (a 3-item assessment) relates to Left-right values of interest in these four large datasets.


I was exchanging views with someone who specializes in the relationship between political ideology and reasoning style. When I described the relationships here as "trivially different from zero," he objected, stating that "p < .01 means that it IS non-trivially different from zero . . . ."

Anyone disagree?...

## Reader Comments (19)

Dan, I think you know my feelings on this already! Putting aside all of the other problematic aspects of p-values and their misinterpretations (also, were there specific null and alternative hypotheses here that would actually warrant inspecting p-values?), this is nearly a 2,000-person sample. Who cares what the p-value is? I want to see the estimate of the relationship (pearson r, assuming that's appropriate for the psychometric properties of the data) and the precision around that estimate (error estimates, CIs)...because that is the *actual* evidence. Or even better, why not examine credible intervals of a full posterior distribution!

The evidence we *should* care about (in my opinion) is the estimate and its precision, not whether a p-value crosses some arbitrary threshold. For example, if the lower bound of a CI for the correlation is .01 and the upper bound is .09, then I'd argue that that is trivial at best. Admittedly, this starts to become a bit of a judgement call, but not one that p-values can answer for us. (i.e., should we care about a correlation that may be ~statistically significant~ [ugh...] but is only estimated to be just barely above zero with a margin of error? Maybe, but that starts to come down to subject matter interpretation and not the statistics). Time to use better tools to evaluate evidence!

Say we observe a correlation of .02 (with CIs from .0001 to .04), but our sample size is 10,000+. Good chance that's a significant p-value, but who cares? This reflects too much academic focus on dichotomous thresholds and not on evaluating evidence. I've reviewed too many papers with folks making incorrect claims about things like this.
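The arithmetic behind that point is easy to sketch with the Fisher z-transform (a standard large-sample approximation; the specific r and n values below are illustrative, not taken from Dan's datasets):

```python
from math import atanh, erf, sqrt, tanh

def pearson_p(r, n):
    """Two-sided p-value for H0: rho = 0, via the Fisher z-transform."""
    z = atanh(r) * sqrt(n - 3)               # approximately N(0, 1) under H0
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation, again via the z-transform."""
    half = z_crit / sqrt(n - 3)
    return tanh(atanh(r) - half), tanh(atanh(r) + half)

# The same "trivial" r = .02 at growing sample sizes:
for n in (100, 2_000, 50_000):
    lo, hi = fisher_ci(0.02, n)
    print(f"n={n:>6}: r=.02, p={pearson_p(0.02, n):.4f}, "
          f"95% CI=({lo:+.3f}, {hi:+.3f})")
```

The estimate never moves off .02; only the p-value and the interval width change with n, which is exactly why the p-value alone says nothing about triviality.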

"When I described the relationships here as 'trivially different from zero,' he objected, stating that 'p < .01 means that it IS non-trivially different from zero . . . .' Anyone disagree?..."

Yes.

All p < 0.01 means is that your null hypothesis of absolutely no relationship is probably false, but in any epidemiological trial there is always a wide range of residual confounders that cannot be controlled for, and in most cases all p < 0.01 means is that your experiment is sensitive enough to pick these up. That's why a lot of more experienced researchers insist on an odds ratio greater than 2 or 3 in non-randomised trials before they'll take any notice.

From here.

From here. There are thousands more where those came from.

A lot of researchers facing the "publish or perish" requirement to publish lots of papers to keep their jobs tend to see 'significant' p < 0.05 findings as an easy way to generate publishable results, and then an equally easy source of work subsequently refuting them all. It's a way for young, inexperienced and still naive researchers to get started in their scientific careers, before they get old and cynical and understand how little p-values really mean. Hence this result: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/

But really, it depends in what sense you mean the word "trivial", exactly. I assume you meant "a small effect", which the p-value clearly has nothing to do with. Arbitrarily small effects can always be detected at any p-value with a sufficiently large sample size.

If the point of this data is to show how much a CRT score correlates with political self-sorting, then it would make more sense to weight political self-sorting responses somehow. In other words, data at the moderate midpoint is pretty uninformative, and data removed only slightly from the moderate midpoint is less informative than data at the extreme ends (and is subject to error about which side of the midpoint it is really on). This problem is most obvious in the 7/5 update Dan made to the end of http://www.culturalcognition.net/blog/2017/6/28/do-you-see-an-effect-here-some-data-on-correlation-of-cognit.html - which shows Conserv-Repub vs. Lib-Dem dichotomously with equal weights at all points on the political scale, so that near-moderates almost completely obscure the differences at the extremes. I note that Dan's comment on this last addition is "much better". Much better how?

@Jonathan-- "much better" than the goofy ggplot "density" outputs.

I gather you think the median splits were misleading. So you should prefer the graphic in this post: here you can see the effect over the entire range of Left_right modeled as a continuous measure.

@NiV & @DanC-- I agree 100% (+/- 5%). I was astonished by his comment.

@DanC-- so you'd like to see Bayes factors here? I think that testing how much more consistent the data are w/ the null than w/ values in a range around 0 under a Cauchy distribution would still be uninformative. The question should be how much more or less consistent the data are w/ competing hypotheses that reflect predicted effects of ideology on CRT. The predicted effects should embody some sensible or at least defensible account of what size of correlation is practically relevant.
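For what it's worth, the sort of "default" Bayes factor under discussion can be sketched numerically. This is a rough illustration only: it assumes a Cauchy(0, 0.707) prior on the correlation, a Fisher-z likelihood, crude trapezoid integration over a truncated (and not renormalized) prior, and made-up illustrative values (r = .05, n = 2,000), not the actual datasets:

```python
from math import atanh, exp, pi, sqrt

def bf01_cauchy(r_hat, n, scale=0.707, grid=40_000):
    """BF01 (point null over alternative) for a correlation, using a
    Fisher-z likelihood and a Cauchy(0, scale) prior on rho, integrated
    by trapezoid rule over rho in (-0.999, 0.999)."""
    se = 1.0 / sqrt(n - 3)
    z_hat = atanh(r_hat)

    def norm_pdf(x, mu, sd):
        return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

    def cauchy_pdf(x):
        return 1.0 / (pi * scale * (1 + (x / scale) ** 2))

    lo, hi = -0.999, 0.999
    h = (hi - lo) / grid
    marginal = 0.0
    for i in range(grid + 1):
        rho = lo + i * h
        w = 0.5 if i in (0, grid) else 1.0
        marginal += w * norm_pdf(z_hat, atanh(rho), se) * cauchy_pdf(rho)
    marginal *= h
    return norm_pdf(z_hat, 0.0, se) / marginal

# r = .05 at n = 2,000 is "significant" (p < .05), yet:
print(f"BF01 = {bf01_cauchy(0.05, 2_000):.2f}")
```

For these illustrative numbers BF01 comes out above 1 -- the data slightly favor the point null over the wide default alternative even though p < .05 -- which is more or less why a default prior centered on zero tells us so little here.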

@DanK Inclined to agree with you in this case when it comes to BFs. A BF would be great if there were more informative competing hypotheses, for sure. But in the absence of those, the goal in this case doesn't really seem to be selecting the strongest model, but rather understanding the precision of an estimate of the relationship between ideology and CRT.

I would think just examining the most plausible values of the posterior distribution derived under multiple different priors would be sufficient (e.g., more 'conservative' regularizing priors versus others). In this case, I don't think you'd get anything drastically different from the frequentist analysis if p-values are dropped in favor of CIs (i.e., the amount of data here will easily override the priors), except that the bayesian estimation approach would explicitly demand a focus on precision of estimates and can be interpreted in probabilistic terms (whereas frequentist CIs cannot). At least that's my understanding, as far as my limited bayesian knowledge goes.
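The intuition that this much data swamps any reasonable prior is easy to check in a toy conjugate model -- a normal prior on the Fisher-z scale of the correlation. The r and n values below are illustrative, not from the actual datasets:

```python
from math import atanh, sqrt, tanh

def posterior_z(r_hat, n, prior_sd):
    """Conjugate-normal posterior for atanh(rho), given an observed r_hat.
    Prior: atanh(rho) ~ N(0, prior_sd^2); likelihood sd is 1/sqrt(n - 3)."""
    like_var = 1.0 / (n - 3)
    post_var = 1.0 / (1.0 / prior_sd ** 2 + 1.0 / like_var)
    post_mean = post_var * atanh(r_hat) / like_var   # prior mean is 0
    return post_mean, sqrt(post_var)

# An observed r = .05 in a sample of 2,000, under two very different priors:
for label, sd in [("skeptical (sd = 0.05)", 0.05), ("diffuse   (sd = 1.00)", 1.0)]:
    m, s = posterior_z(0.05, 2_000, sd)
    print(f"{label}: posterior r ~= {tanh(m):.4f} (posterior sd on z-scale: {s:.4f})")
```

Even a prior that treats correlations beyond about .05 as implausible barely shrinks the estimate at this sample size -- the data dominate, as suggested above.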

Dan, I think your counterpart is right. I think it justifiable to proceed, after observing a low-R^2, low-p result, by operating under the assumption that you are seeing a true effect despite noise and confounders. I was taught in grad school not to discount low p-values because of low R^2, and that low p-values with low R^2 suggested, among other possibilities, the appropriateness of subpopulation analysis. How justified that philosophy of science is relates to the eternal issue of how much your observed variance reflects real intra-population variation as opposed to error.

Kudos on those graphs, which to me are definitely more useful than we've seen in the past, so hurrah. They do suggest that one reason the effect sizes typically posted on this blog look small with CRT is because of the imbalance in the available data towards the CRT-0 end of the spectrum...

As to what effects I do see, well, those pictures certainly do bear out my stereotype that capable Democrats are radical and capable Republicans are moderate.

@dypoon-- proceed to what exactly?

@NiV et al.--

agree, too, that ORs supply most of the info one would be interested in--& better than R^2 given that the outcome variable is categorical rather than continuous. Here, of course, ORs are very close to 1 even though the "p value" is < 0.01 (a consequence, as has been noted, of humongous N's; the N's are humongous, btw, b/c respondent pools were being assembled for multiple experiments).

One could also compute probability that CRT would be > 0 rather than 0 answers correct (or any other value or set of values vs. any other) as one moves from "liberal democrat" to "conservative republican."

E.g. in upper left panel data set, Lib Democrat is 5%, +/- 3% (at LC = 0.95) more likely than Conserv Repub to get score above zero. The Lib Dem is 1.7%, +/- 1.2% more likely than Conserv Repub to score perfect 3.

Practically relevant? You tell me.
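For readers who want to reproduce this sort of contrast, here is a minimal sketch of the calculation with a Wald-style margin of error. The group counts are hypothetical stand-ins chosen to land near the quoted figures, not Dan's actual data:

```python
from math import sqrt

def prop_diff_ci(k1, n1, k2, n2, z=1.96):
    """Difference in proportions (p1 - p2) with a Wald 95% margin of error.
    k = number scoring above zero on CRT, n = group size."""
    p1, p2 = k1 / n1, k2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2, z * se

# Hypothetical counts (NOT Dan's data): CRT > 0 among Lib Dems vs. Conserv Repubs
diff, moe = prop_diff_ci(520, 1_000, 470, 1_000)
print(f"Lib Dem - Conserv Repub: {diff:+.1%} +/- {moe:.1%}")
```

Whether a difference of a few percentage points is practically relevant is, as the comment says, a judgement call the arithmetic can't make for you.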

"I agree 100% (+/- 5%). I was astonished by his comment."

I wasn't. It's a common belief, even in the science community, and quite understandable given how we commonly teach statistics (i.e. with clean "textbook" examples with perfect assumptions and exact model distributions). Hence the issues with the scientific literature I noted.

"One could also compute probability that CRT would be > 0 rather than 0 answers correct (or any other value or set of values vs. any other) as one moves from 'liberal democrat' to 'conservative republican.'"

Up to a point, yes. But the issue is that the null hypothesis has been selected to be easy to calculate with, rather than because it's believed to be accurate. The null assumed is usually that the 'dependent' variable has *exactly zero* correlation with the 'independent' variable. A more plausible null would be that the sample has been selected to exclude major confounders and only a residuum of trivial confounders is left, inducing only small correlations. It's good enough to pick up big effects, but your instruments have a "noise floor" or "resolution limit" below which they cannot see, no matter how big your sample size.

Apposite here is the tale of the Emperor of China's Nose. The Emperor lives in the Forbidden City and nobody is allowed to gaze on his visage, so how can you find out the length of his nose? Well, the Imperial Surveyors went round all the people in China and asked each of them how long they thought the Emperor's nose was. Then they averaged them to get an accurate answer. How accurate? Well, if you suppose each individual guess had a standard deviation of 2 cm (human noses are not all that different) and you asked a hundred million people, then the standard deviation of the average is 2 cm divided by the square root of 100,000,000, which is 10,000. So the accuracy of the estimate is about 2 micrometers. Pretty impressive, eh?!

If you asked sufficiently many people (how many?), you could work out the length of the Emperor's nose to an accuracy *less than the width of an atom*! It should even be possible then, in theory, to use such surveys as some sort of electron microscope, watching the atoms bounce off and evaporate from the Imperial Schnozz!

(For a more serious scientific example, consider the problem of estimating temperature by examining the length of a thread of mercury in a thermometer by eye. If each individual observation is rounded to the nearest degree Celsius, how many do you have to average to obtain the mean temperature of the world to an accuracy of a hundredth of a degree? Can you see any possible issues with that?!)

Resolving the atoms on the Emperor's nose is a ridiculous notion, obviously, but can you explain precisely what the flaw in the mathematics is? Which assumption is invalid? When you understand, you'll be more likely to see why low p-values for large sample sizes don't always mean what naive "textbook" statistical assumptions might suggest.
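The flaw is easy to demonstrate by simulation: if every guess shares even a small common bias (a confounder that doesn't average away, because the guesses aren't independent), the error of the mean stops shrinking at that bias no matter how large the sample. A toy sketch, with all numbers invented:

```python
import random

random.seed(42)
TRUE_NOSE_CM = 5.0       # invented "true" value
SHARED_BIAS_CM = 1.0     # every guess is anchored on the same folk image

def survey_error(n):
    """Average n guesses that share a common bias; return |error| of the mean."""
    guesses = [TRUE_NOSE_CM + SHARED_BIAS_CM + random.gauss(0, 2.0)
               for _ in range(n)]
    return abs(sum(guesses) / n - TRUE_NOSE_CM)

# The textbook formula promises error ~ 2/sqrt(n); the shared bias says otherwise.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9}: error of the mean = {survey_error(n):.3f} cm")
```

The independent noise does shrink as 1/sqrt(n); the error simply converges to the 1 cm bias floor rather than to zero. That is the invalid assumption: the guesses are treated as independent draws from a distribution centered on the truth.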

Self-reporting is possibly adequate for political affiliation.

However, I very much doubt any significant number of subjects (and people, in general) have a worthy comprehension of what any ideology actually entails, including the one they might claim to profess.

I think that the problem is a little different than as stated by Flanigan.

Both American political parties are an amalgam of subgroups. The reasons that an African American resident of the Mississippi Delta votes Democratic are not the same as those of a gay resident of the Castro District of San Francisco. Even though both may express their lead interest as “civil rights” their tribal affiliations are quite different.

The political parties nurture “base” groups that can be counted on for votes, but also are driven by a need for campaign funding. In the US this has led to the growth of a Neoliberal elite which I believe does not represent the center of a political spectrum of the American people. I don't think that it is likely at all that the political tapestry of our society can be well approximated as a line.

Both political parties in the US are deeply entrenched and even though close to as many Americans are now registered in neither party as in either one, it is very difficult for a candidate who is not running under the umbrella of one of the two major parties to win. In the primaries however, it is possible, in some cases, to come in from extremes, using strong special interest base voters to win the nomination as the party's candidate.

France, on the other hand, demonstrated in its most recent election that its voters were willing to abandon the previously dominant political parties. Thus Macron has the potential to carve out a new definition of the political middle. We face global issues of technologically driven job displacement and climate-change-driven elimination of fossil fuels as the economic energy base. I think that the fact that something analogous to the French election did not happen in the US is more an artifact of differences in the political structure within which people can express their opinions than a difference in underlying concerns. As is the fact that in the UK this discontent popped out as an anti-Brexit vote.

I believe that the problems noted in the post above with over-reliance on p-values are a symptom of a much bigger problem that academic social science has with data access. We live in an era of Big Data. Powerful and well-moneyed forces are able to use their access to Big Data and their knowledge of cultural cognition to drive and divert public opinion in nefarious ways:

https://scout.ai/story/the-rise-of-the-weaponized-ai-propaganda-machine.

In contrast, from the vantage point of an outside the field observer, it seems to me that the data sampling abilities of academic social science have actually constricted. The sorts of polling done with all the best intentions by organizations such as Pew must be becoming less and less effective. Telephone polling? I know that I, and those I associate with, cannot be reached even by a randomized number dialing method that claims to reach cell phones. Such calls are sorted out.

Somehow academia needs to come together to find enough of the tools necessary to protect its own institutional survival. One of the outcomes of the "weaponized propaganda" effort is the recent election of Trump, which seems to be leading to reductions in funding for both the social and the natural sciences. The tools of big data are being used to launch an attack on academic research, eliminating much of the possibility of generating data that is inconvenient to the retention of power by the current elite. This is being accompanied by a regulation rollback.

As this AI article notes, the issue is much larger than that: the future of democracy hangs in the balance. The start of mounting an effective defense, in my opinion, lies in first acknowledging the extent of the problem.

@ Dan:

Proceeding to subpopulation analysis, machine-learning/rule-extraction, and other speculative/exploratory approaches based on the hypothesis that you're uncovering real and important structure in the variance, but that your initial model of it was not quite right and that it needs re-framing.

On a related note, could you explain to me why you feel something is wrong with significance only at large N? If you get the same result after replication (as you effectively have in this study; the earliest sample is non-significant, and something strange happened in 2015-2016 *sar*I wonder what that could have been, hmmm*casm*), then it's not just a statistical fluke. It's an effect size that in this model is only visible at large N. Such a result says more about the model than about the population - a different model might well show the difference at much smaller N.

"On related note, could you explain to me why you feel something wrong with significance only at large N? [...] It's an effect size that in this model is only visible at large N."

The problem is that any non-randomised experiment makes assumptions about the independence of sampling that can only be approximate. For simplicity of calculation, we generally assume these sources of error are *exactly* zero, and so long as they are much smaller than the effects we're trying to observe, it does no harm to do so. But using a large N doesn't just detect smaller effects, it also picks up smaller levels of 'measurement noise'. The p-value only tells you that the "exactly zero" null is false. It doesn't distinguish an effect from the residual noise of a billion potential unknowable confounders. Of course, "large" is a relative term.

The other issue, and the one I think Dan was talking about, is that while strong evidence that the output is increased 0.000000000001% by this particular factor may be easily obtainable from a sufficiently sensitive experiment, and is therefore "statistically significant", the effect is not "practically significant" in a phenomenon that can vary 10-fold due to other factors.

Like, if you paint a square metre of your house roof white, you'll increase the albedo of the Earth and reduce global warming. A sufficiently sensitive experiment can in principle detect it with any level of *statistical* significance, but it's not *practically* significant. Nobody would be saved by the trillionth of a trillionth of a degree difference in temperature.

NiV pretty much nails it there. From my perspective, I think these sample sizes are great, and I would vastly prefer sampling at this level to any smaller study. But what large samples highlight is the inferential inadequacy of p-values or notions of 'statistical significance' (whatever that actually means) to make use of the data at hand, while opening the door for much better inferences focusing on other information. Having a nice large sample is precisely why we want to focus on measurement precision. Why care about dispelling an arbitrary point-null hypothesis when we can just look at the estimates? OK, so the effect is not exactly 0, but maybe it is an r of .05 with a credible interval from .03 to .06. As far as measurement goes, that might be a nice finding! But it is also important to keep in mind that such seemingly small effects might make it very difficult to maintain high degrees of precision in subsequent studies, or to further unpack other aspects of the relationship.

I'd also just like to note that R^2 itself is a flawed metric and doesn't really indicate much about model quality/predictive capacity (at least in its usual interpretation), so I'd say that the p-value-to-R^2 discussion above by dypoon also isn't necessarily what we want to base our decision criteria on when deciding about practical/statistical significance (though I don't care for that distinction in general anyways). Statistics advances rapidly by the day :)

Link drop - a counter to the asymmetry thesis:

https://www.cjr.org/analysis/breitbart-media-trump-harvard-study.php

Oops - I double-negated myself - I meant a counter to the *symmetry* thesis. Or, a counter to the non-asymmetry thesis, if you prefer.

@Dypoon -- I agree w/ @NiV & @DanC. But there is also relevant discussion at p. 13 of this paper.