
Tuesday, April 30, 2013

Deja voodoo: the puzzling reemergence of invalid neuroscience methods in the study of "Democrat" & "Republican Brains"

I promised to answer someone who asked me what I think of Schreiber, D., Fonzo, G., Simmons, A.N., Dawes, C.T., Flagan, T., Fowler, J.H. & Paulus, M.P. Red Brain, Blue Brain: Evaluative Processes Differ in Democrats and Republicans, PLoS ONE 8, e52970 (2013).

The paper reports the results of an fMRI—“functional magnetic resonance imaging”—study that the authors describe as showing that “liberals and conservatives use different regions of the brain when they think about risk.” 

They claim this finding is interesting, first, because it “supports recent evidence that conservatives show greater sensitivity to threatening stimuli,” and, second, because it furnishes a predictive model of partisan self-identification that “significantly out-performs the longstanding parental model”—i.e., use of the partisan identification of individuals’ parents.

So what do I think?  Not much, frankly.

Actually, I think less than that: the paper supplies zero reason to adjust any view I have—or anyone else does, in my opinion—on any matter relating to individual differences in cognition & ideology.

To explain why, some background is necessary.

About 4 years ago the burgeoning field of neuroimaging experienced a major crisis. Put bluntly, scores of researchers employing fMRI for psychological research were using patently invalid methods—methods whose defects had nothing to do with the fMRI technology itself but rather with really simple, basic errors relating to causal inference.

The difficulties were exposed—and shown to have been present in literally dozens of published studies—in two high profile papers: 

1.   Vul, E., Harris, C., Winkielman, P. & Pashler, H. Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition, Perspectives on Psychological Science 4, 274-290 (2009); and

2.   Kriegeskorte, N., Simmons, W.K., Bellgowan, P.S.F. & Baker, C.I. Circular analysis in systems neuroscience: the dangers of double dipping, Nature Neuroscience 12, 535-540 (2009).

The invalidity of the studies that used the offending procedures (ones identified by these authors through painstaking detective work, actually; the errors were hidden by the uninformative and opaque language then typically used to describe fMRI research methods) is at this point beyond any dispute.

Not all fMRI studies produced up to that time displayed these errors. For great ones, see any done (before and after the crisis) by Joshua Greene and his collaborators.

Today, moreover, authors of “neuroimaging” papers typically take pains to explain—very clearly—how the procedures they’ve used avoid the problems that were exposed by the Vul et al. and Kriegeskorte et al. critiques. 

And again, to be super clear about this: these problems are not intrinsic to the use of fMRI imaging as a technique for testing hypotheses about mechanisms of cognition. They are a consequence of basic mistakes about when valid inferences can be drawn from empirical observation.

So it’s really downright weird to see these flaws in a manifestly uncorrected form in Schreiber et al.

I’ll go through the problems that Vul et al. & Kriegeskorte et al. (Vul & Kriegeskorte team up here) describe, each of which is present in Schreiber et al.

1.  Opportunistic observation. In an fMRI, brain activation (in the form of blood flow) is measured within brain regions identified by little three-dimensional cubes known as “voxels.” There are literally hundreds of thousands of voxels in a fully imaged brain.

That means there are literally hundreds of thousands of potential “observations” in the brain of each study subject. Because activation levels are constantly varying throughout the brain at all times, one can always find “statistically significant” correlations between stimuli and brain activation by chance. 

This was amusingly illustrated by one researcher who, using then-existing fMRI methodological protocols, found the region that a salmon cleverly uses for interpreting human emotions.  The salmon was dead. And the region it was using wasn’t even in its brain.

Accordingly, if one is going to use an fMRI to test hypotheses about the “region” of the brain involved in some cognitive function, one has to specify in advance the “region of interest” (ROI) in the brain that is relevant to the study hypotheses. What’s more, one has to carefully constrain one’s collection of observations even from within that region—brain regions like the “amygdala” and “anterior cingulate cortex” themselves contain lots of voxels that will vary in activation level—and refrain from “fishing around” within ROIs for “significant effects.”
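To see just how easy it is to be fooled, here is a minimal simulation, a toy sketch with invented data (every number below is made up purely for illustration and has nothing to do with Schreiber et al.'s actual dataset): correlate 100,000 pure-noise "voxels" with a pure-noise "party" score for 40 "subjects" and count the "significant" hits, first across the whole brain and then within a small region fixed in advance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_voxels = 40, 100_000

party = rng.normal(size=n_subjects)               # stand-in "party/ideology" score
voxels = rng.normal(size=(n_subjects, n_voxels))  # pure-noise "activations" -- no true effect anywhere

# Pearson correlation of every voxel with the party score (vectorized)
pz = (party - party.mean()) / party.std()
vz = (voxels - voxels.mean(axis=0)) / voxels.std(axis=0)
r = vz.T @ pz / n_subjects
t = r * np.sqrt((n_subjects - 2) / (1 - r ** 2))
pvals = 2 * stats.t.sf(np.abs(t), df=n_subjects - 2)

# "Fishing" across the whole brain: thousands of hits from chance alone
print("whole-brain voxels with p < .05:", int(np.sum(pvals < .05)))             # ~5,000

# A small region fixed in advance (here, arbitrarily, the first 200 voxels)
roi = np.arange(200)
print("pre-specified-ROI voxels with p < .05:", int(np.sum(pvals[roi] < .05)))  # ~10

Roughly 5% of the noise voxels, thousands of them, come out "significant" at p < .05, which is why an unconstrained search will always "find" something. A pre-specified ROI doesn't make chance go away (the tests run inside it still need correcting), but it keeps the number of opportunities to fool oneself small and known in advance.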

Schreiber et al. didn’t discipline their evidence-gathering in this way.

They did initially offer hypotheses based on four precisely defined brain ROIs in "the right amygdala, left insula, right entorhinal cortex, and anterior cingulate."

They picked these, they said, based on a 2011 paper (Kanai, R., Feilden, T., Firth, C. & Rees, G. Political Orientations Are Correlated with Brain Structure in Young Adults. Current Biology 21, 677-680 (2011)) that reported structural differences—ones, basically, in the size and shape, as opposed to activation—in these regions of the brains of Republicans and Democrats.

Schreiber et al. predicted that when Democrats and Republicans were exposed to risky stimuli, these regions of the brain would display varying functional levels of activation consistent with the inference that Republicans respond with greater emotional resistance, Democrats with greater reflection. Such differences, moreover, could also then be used, Schreiber et al. wrote, to "dependably differentiate liberals and conservatives" with fMRI scans.

But contrary to their hypotheses, Schreiber et al. didn’t find any significant differences in the activation levels within the portions of either the amygdala or the anterior cingulate cortex singled out in the 2011 Kanai et al. paper. Nor did Schreiber et al. find any such differences in a host of other precisely defined areas (the "entorhinal cortex," "left insula," or "Right Entorhinal") that Kanai et al. identified as differing structurally between Democrats and Republicans in ways that could suggest the hypothesized differences in cognition.

In response, Schreiber et al. simply widened the lens, as it were, of their observational camera to take in a wider expanse of the brain. “The analysis of the specific spheres [from Kanai et al.] did not appear statistically significant,” they explain, “so larger ROIs based on the anatomy were used next.”

Using this technique (which involves creating an “anatomical mask” of larger regions of the brain) to compensate for not finding significant results within more constrained ROI regions specified in advance amounts to a straightforward “fishing” expedition for “activated” voxels.

This is clearly, indisputably, undeniably not valid.  Commenting on the inappropriateness of this technique, one commentator recently wrote that “this sounds like a remedial lesson in basic statistics but unfortunately it seems to be regularly forgotten by researchers in the field.”

Even after resorting to this device, Schreiber et al. found “no significant differences . . .  in the anterior cingulate cortex,” but they did manage to find some "significant" differences among Democrats' and Republicans' brain activation levels in portions of the “right amygdala” and "insula."

2.  “Double dipping.” Compounding the error of opportunistic observation, fMRI researchers—prior to 2009 at least—routinely engaged in a practice known as “double dipping.” After searching for & zeroing in on a set of “activated” voxels, the researchers would then use those voxels, and only those, to perform the statistical tests reported in their analyses.

This is obviously, manifestly unsound.  It is akin to running an experiment, identifying the subjects who respond most intensely to the manipulation, and then reporting the effect of the manipulation only for them—ignoring subjects who didn’t respond or didn’t respond intensely. 

Obviously, this approach grossly overstates the observed effect.
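A toy simulation (invented numbers again, nothing from the actual study) shows how large the resulting distortion is: two groups of 20 "subjects," 50,000 voxels, no true difference anywhere, and then an "effect size" computed only from the voxels selected for responding most strongly.

import numpy as np

rng = np.random.default_rng(1)
n_per_group, n_voxels = 20, 50_000

dems = rng.normal(size=(n_per_group, n_voxels))  # no true group difference at any voxel
reps = rng.normal(size=(n_per_group, n_voxels))

# Group difference at every voxel in standard-deviation units (Cohen's d)
pooled_sd = np.sqrt((dems.var(axis=0, ddof=1) + reps.var(axis=0, ddof=1)) / 2)
d = (reps.mean(axis=0) - dems.mean(axis=0)) / pooled_sd

# "Double dipping": select the most responsive voxels, then report the
# effect computed from those very same voxels
selected = np.argsort(np.abs(d))[-100:]
print("mean |d| across all voxels: ", round(float(np.abs(d).mean()), 2))           # noise level, ~0.25
print("mean |d| in selected voxels:", round(float(np.abs(d[selected]).mean()), 2))  # ~1.0, a pure selection artifact

The "selected" effect looks enormous even though the true effect is exactly zero. The only honest options are to pick the voxels a priori, or to select them on one portion of the data and estimate the effect on another portion that played no role in the selection.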

Despite this being understood since at least 2009 as unacceptable (actually, I have no idea why something this patently invalid appeared okay to fMRI researchers before then), Schreiber et al. did it. The “[o]nly activations within the areas of interest”—i.e., the expanded brain regions selected precisely because they contained voxel activations differing among Democrats and Republicans—that were “extracted and used for further analysis,” Schreiber et al. write, were the ones that “also satisfied the volume and voxel connection criteria” used to confirm the significance of those differences.

Vul called this technique “voodoo correlations” in a working paper version of his paper that got (deservedly) huge play in the press. He changed the title—but none of the analysis or conclusions in the final published version, which, as I said, now is understood to be 100% correct.

3.  Retrodictive “predictive” models. Another abuse of statistics—one that clearly results in invalid inferences—is to deliberately fit a regression model to voxels selected for observation because they display the hypothesized relationship to some stimulus and then describe the model as a “predictive” one without in fact validating the model by using it to predict results on a different set of observations.

Vul et al. furnish a really great hypothetical illustration of this point, in which a stock market analyst correlates changes in the daily reported morning temperature of a specified weather station with daily changes in value for all the stocks listed on the NYSE, identifies the set of stocks whose daily price changes are highly correlated with the station's daily temperature changes, and then sells this “predictive model” to investors. 

This is, of course, bogus: there will be some set of stocks from the vast number listed on the exchange that highly (and "significantly," of course) correlate with temperature changes through sheer chance. There’s no reason to expect the correlations to hold going forward—unless (at a minimum!) the analyst, after deriving the correlations in this completely ad hoc way, validates the model by showing that it continued to successfully predict stock performance thereafter.
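The parable is easy to act out in a few lines of code (the "stocks" and "temperatures" below are pure noise, invented solely for illustration):

import numpy as np

rng = np.random.default_rng(2)
n_days, n_stocks = 60, 5_000

temp = rng.normal(size=2 * n_days)                # daily temperature changes at the weather station
stocks = rng.normal(size=(2 * n_days, n_stocks))  # pure-noise daily price changes

fit, test = slice(0, n_days), slice(n_days, 2 * n_days)

def corr_with(y, X):
    """Pearson correlation of y with every column of X."""
    yz = (y - y.mean()) / y.std()
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    return Xz.T @ yz / len(y)

# "Fit": find the stocks most correlated with temperature over the first 60 days
r_fit = corr_with(temp[fit], stocks[fit])
best = np.argsort(np.abs(r_fit))[-20:]
print("in-sample mean |r| of the selected stocks:", round(float(np.abs(r_fit[best]).mean()), 2))   # looks impressive

# "Validate": the very same stocks on 60 days never used for the selection
r_test = corr_with(temp[test], stocks[test])
print("out-of-sample mean |r| of the same stocks:", round(float(np.abs(r_test[best]).mean()), 2))  # back to noise

Only the out-of-sample number tells you anything about predictive power; the in-sample number is guaranteed to look impressive no matter what.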

Before 2009, many fMRI researchers engaged in analyses equivalent to what Vul describes. That is, they searched around within unconstrained regions of the brain for correlations with their outcome measures, formed tight “fitting” regressions to the observations, and then sold the results as proof of the mind-blowingly high “predictive” power of their models—without ever testing the models to see if they could in fact predict anything.

Schreiber et al. did this, too.  As explained, they selected observations of activating “voxels” in the amygdala of Republican subjects precisely because those voxels—as opposed to others that Schreiber et al. then ignored in “further analysis”—were “activating” in the manner that they were searching for in a large expanse of the brain.  They then reported the resulting high correlation between these observed voxel activations and Republican party self-identification as a test for “predicting” subjects’ party affiliations—one that “significantly out-performs the longstanding parental model, correctly predicting 82.9% of the observed choices of party.”

This is bogus.  Unless one “use[s] an independent dataset” to validate the predictive power of “the selected . . .voxels” detected in this way, Kriegeskorte et al. explain in their Nature Neuroscience paper, no valid inferences can be drawn. None.

BTW, this isn’t a simple “multiple comparisons problem,” as some fMRI researchers seem to think.  Pushing a button in one’s computer program to tighten one’s “alpha” (the p-value threshold, essentially, used to avoid “type 1” errors) only means one has to search a bit harder; it still doesn’t make it any more valid to base inferences on “significant correlations” found only after deliberately searching for them within a collection of hundreds of thousands of observations.
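To make that concrete, here is one more toy sketch (again, invented data, numbers chosen only for illustration): 200,000 pure-noise voxels, 40 subjects, and a progressively more "stringent" alpha.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_subjects, n_voxels = 40, 200_000

ideology = rng.normal(size=n_subjects)            # stand-in ideology measure
voxels = rng.normal(size=(n_subjects, n_voxels))  # pure noise -- no real effect anywhere

iz = (ideology - ideology.mean()) / ideology.std()
vz = (voxels - voxels.mean(axis=0)) / voxels.std(axis=0)
r = vz.T @ iz / n_subjects
t = r * np.sqrt((n_subjects - 2) / (1 - r ** 2))
p = 2 * stats.t.sf(np.abs(t), df=n_subjects - 2)

for alpha in (0.05, 0.001, 0.0001):               # ever more "stringent" thresholds
    hits = p < alpha
    print(f"alpha={alpha}: {int(hits.sum())} 'significant' voxels, "
          f"mean |r| among them = {np.abs(r[hits]).mean():.2f}")
# The hits get scarcer as alpha shrinks, but they never disappear -- and the
# surviving correlations only look *more* impressive, because they were
# selected for being extreme.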

The 2011 Kanai et al. structural imaging paper that Schreiber et al. claim to be furnishing “support” for didn’t make this elementary error. I’d say “to their credit,” except that such a comment would imply that researchers who use valid methods deserve “special” recognition. Of course, using valid methods isn’t something that makes a paper worthy of some special commendation—it’s normal, and indeed essential.

* * *

One more thing:

I did happen to notice that the Schreiber et al. paper seems pretty similar to a 2009 working paper they put out.  The most obvious difference appears to be an increase in the sample size from 54 to 82 subjects. 

There are also some differences in the reported findings: in their 2009 working paper, Schreiber et al. report greater “bilateral amygdala” activation in Republicans, not “right amygdala” only.  The 2011 Kanai paper that Schreiber et al. describe their study as “supporting,” which of course was published after Schreiber et al. collected the data reported in their 2009 working paper, found no significant anatomical differences in the “left amygdala” of Democrats and Republicans.

So, like I said, I really don’t think much of the paper.

What do others think?

 



Reader Comments (13)

As a non-scientist and someone not knowledgeable or smart enough to make statistical arguments, I will comment (in a typically off-topic manner) on the basic subject being analyzed rather than the statistical methodology employed.

I find the notion that differing brain physiology and/or functioning has a causal relationship (in either direction), or even a non-causal association, with differing political ideology to be highly implausible. It's one of those far, far greater diversity within a group than between two distinct groups kind of situations. The topic of this paper strikes me as being a strikingly fertile one for motivated reasoning to have significant impact.

Of course, I don't have an argument to make there that is based on analysis of validated data - so my own perspective is, no doubt, the product of motivated reasoning.

But for me, in such areas that are so fertile for motivated reasoning the bar needs to be raised. At least in such areas, I remain unconvinced of arguments of causality unless they are accompanied by a sophisticated and quantified theory of mechanism. In this case, the why and how of the causal relationship need to be explicated.

Further, I think that causality should be considered highly speculative unless it reaches well-articulated levels of robustness - along the lines of exploration of "my thesis would be invalid if X,Y, or Z were true" kinds of thinking.

I remember reading a while back - I think it was in the articles related to the paper I am linking below - a simple list of criteria for testing causal explanations. I really wish I had bookmarked that list. I'm wondering if you might take a stab at providing a simple one - not in econometric terms - but with respect to metrics such as testing the robustness of the causality by testing what happens if you reverse the direction of the interaction being measured. (Sorry for being so vague - I'll understand if you can't make heads nor tails of my request.)

Finally - a bit even further off-topic and not exactly consistent with the criteria I listed above - I was struck by the high level of rigor in this study. I am curious to know what you think of it as a model.

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0057873

http://epianalysis.wordpress.com/2013/02/27/sugardiabetes/

April 30, 2013 | Unregistered CommenterJoshua

@Joshua: Any thoughtful person can evaluate the soundness of causal inferences. The issues here aren't even about "statistics," really. "Statistics" presuppose that what one is measuring will support an inference conditional on observation. Assessment of whether the presupposition is correct involves logic and practical judgment. Instilling the ability to engage in that sort of reasoning is certainly essential to generating the ordinary science intelligence that citizens of a democratic society need. Or at least so Dewey thought, and I find his argument persuasive (and his vision inspiring).

I know you have this ability, in any case. I have seen it in your previous comments.

April 30, 2013 | Registered CommenterDan Kahan

Dan:

Read between the lines. The paper originally was released in 2009 and was published in 2013 in PLOS-One, which is one step above appearing on Arxiv. PLOS-One publishes some good things (so does Arxiv) but it's the place people place papers that can't be placed. We can deduce that the paper was rejected by Science, Nature, various other biology journals, and maybe some political science journals as well.

I'm not saying you shouldn't criticize the paper in question, but you can't really demand better from a paper published in a bottom-feeder journal.

Again, just because something's in a crap journal, doesn't mean it's crap; I've published lots of papers in unselective, low-prestige outlets. But it's certainly no surprise if a paper published in a low-grade journal happens to be crap. They publish the things nobody else will touch.

April 30, 2013 | Unregistered CommenterAndrew Gelman

@Andrew:

Well, I'll let others do the between-the-lines readings! The researchers are good ones; I hope they wouldn't -- doubt they would -- "stick" something in PLOS-One that they couldn't place elsewhere. Also, I'm not sure what to think of PLOS-One; I have seen some bad papers in it, certainly, but I started w/ "uninformative priors" about the quality of it (!) & haven't iterated enough to make a dent in them

Actually, one thing I could have addressed & maybe should have is what to say about science journalists who (predictably) gave "big play" to the paper. I think they should be able to make the sort of critical evaluation I made of the methods used in the paper. See what I said to @Joshua.

If the science journalists think they can't do this, I would say two things: (1) Sure you can! You guys make sense of complicated things all the time & then explain them to me! You are a model for other professions (including even Drs) about how to *think* critically about evidence. (2) But if you really feel you should passively defer to "peer reviewers" etc, you shouldn't be doing science journalism; I & many others are depending on you to help me figure out what is known to science; you can be relied on for that only if you can think scientifically in Dewey's sense!

April 30, 2013 | Registered CommenterDan Kahan

"This is obviously, manifestly unsound. It is akin to running an experiment, identifying the subjects who respond most intensely to the manipulation, and then reporting the effect of the manipulation only for them—ignoring subjects who didn’t respond or didn’t respond intensely."

Excellent article! I find myself vastly entertained by noting all the parallels with techniques commonly used in a certain other area of science!

But I don't want to derail the discussion too much, so I'll stick to just the one:

this does not mean that one could not improve a chronology by reducing the number of series used if the purpose of removing samples is to enhance a desired signal. The ability to pick and choose which samples to use is an advantage unique to dendroclimatology.

Esper et al 2003

Good one, eh?

Sorry, I couldn't resist.

April 30, 2013 | Unregistered CommenterNiV

Andrew Gelman wrote: "PLOS-One, which is one step above appearing on Arxiv"

Ah, journal rank. Not to derail the comment thread with this only partially related topic, but the available empirical evidence suggests that journal rank would better reflect scientific soundness if it were inverted:
http://arxiv.org/abs/1301.3748
It is a review article of the available, peer-reviewed literature. And yes, it was rejected from Nature, Science and PLoS Biology - they felt the finding that high-ranking journals published the least reliable science was not novel enough:
https://docs.google.com/document/d/1VF_jAcDyxdxqH9QHMJX9g4JH5L4R-9r6VSjc7Gwb8ig/edit
Scroll to the bottom for PLoS Biol reviewer comments.

In other words: the data say you should rely on research published in high-ranking journals only after it has been replicated in lower-ranking journals. It is important to keep the overall data in mind when looking at individual anecdotes.

May 2, 2013 | Unregistered CommenterBjörn Brembs

@Bjorn: That's as good a direction for comments to go as any. I've seen bad papers in PLOS-One, but given that the topic of this post is the validity of inference on observation, I have to consider also how many bad ones I've seen in other journals, including "top ranked" ones. A good number certainly. I don't have enough data from PLOS-One, I'd say, to determine if the ratio of good to bad is higher or lower than in other journals.

But here's some evidence I would consider probative: was *this* paper rejected by multiple journals between 2009 & Nov. 2012, when it was finally submitted to PLOS-One, which then accepted it? That would have an LR > 1 for the hypothesis, "PLOS-One publishes papers of lower quality than other journals."

Of course, how much greater than 1 is open for debate. And it would be only one piece of evidence.

May 2, 2013 | Registered CommenterDan Kahan

"That would have an LR > 1 for the hypothesis, "PLOS-One publishes papers of lower quality than other journals.""

It might just as well be that PLOS-One publishes longer papers than other journals. If all the other journals rejected it for reasons of length, and PLOS-One accepted it, that just goes to show they're less tight for space.

Arguing that the highest impact journals should have the best quality science is sort of like arguing that democratic elections should mean the politicians voted in are the best that humanity has to offer.

A high impact journal attracts authors who think journal impact is more important than the other stuff. It attracts editors who think working on a high-impact journal is an important career move. Access is a valuable asset, which a select few can control, and then it's a question of who you know rather than what you know, of whether you 'fit', of favours and tit-for-tat, of drama and celebrity. People on the right committees.

It's the way people are. Journals are select, but not necessarily in the way they're supposed to be. Given the weight it is given in academics' careers, people are desperate to get published. What do such pressures result in? What, as a student of human nature, did you expect?

A paper is only as good as the arguments and data it contains, and a journal's imprimatur is only as good as its readers. A dedicated highly technical niche journal has to attract an audience with the quality of content. A big journal can rely on its reputation. Which do you want?

May 2, 2013 | Unregistered CommenterNiV

@NiV:

You certainly make your hypotheses before peeking at the data. I admire you for that.

Take a look at the article & see if you think its length could possibly have deterred any journal (other than Bazooka Joe Bulletin) from accepting it

May 2, 2013 | Registered CommenterDan Kahan

Oh, I don't know. Have you ever read "How to Publish a Scientific Comment in 123 Easy Steps" by Prof. Rick Trebino?

But it was another famous case I was thinking of. Nature published the MBH98 Hockeystick paper. When McIntyre and McKitrick tried to publish a reply there listing all the errors in it (several of them the same as those you have listed for the neuroscience paper above) it went through an extended 8 month review process before finally being rejected - Nature said - on the grounds of length. It wouldn't fit within the 500 word limit they allowed for such responses.

The cynical might suspect that length is just the most convenient excuse available for rejecting a paper when they can't find anything technically wrong with it - an impression that might be heightened if you knew that M+M were initially asked to fit it into 800 words, or if you've noticed that other longer comments have previously been published. But length is the official reason.

I picked length as my example because of the history, but my point was that there are all sorts of reasons why a journal might include or reject papers besides quality. The assumption that higher prestige journals are any better is a common myth. I'm not surprised at bad papers being published in any journal - top or bottom.

May 3, 2013 | Unregistered CommenterNiV

@NiV:

I should say, I truly have no reason to think PLOS-One publishes psychology papers that are lower in quality than many other perfectly respectable peer-reviewed journals. It would only be a good thing, too, if the model it reflects succeeded and became much more common.

May 3, 2013 | Registered CommenterDan Kahan

Dan Kahan has deleted the authors' reply to his little blog on here several times now. Suspicious, but then again, Kahan isn't a scientist and he isn't talking to scientists, so he doesn't really expect his viewers to care.

February 19, 2014 | Unregistered CommenterSteveK

@SteveK:

The authors have not submitted a reply via comments or otherwise. Nor have I ever deleted any. They are welcome to reply via comment or guest blog post anytime-- as I've indicated to them. So are you (at least via comments; contact me via email if you'd like to do a guest blog post)

February 20, 2014 | Registered CommenterDan Kahan
