Wednesday, July 10, 2013

Fooled twice, shame on who? Problems with Mechanical Turk study samples, part 2

From Mueller, Chandler, & Paolacci, Soc'y for P&SP, 1/28/12

This is the second post in a two-part series on what I see as the invalidity of studies that use samples of Mechanical Turk workers to test hypotheses about cognition and political conflict over societal risks and other policy-relevant facts.

In the first, I discussed the concept of a “valid sample” generally.  Basically, I argued that it’s a mistake to equate sample “validity” with any uniform standard or any single, invariant set of recruitment or stratification procedures.

Rather, the validity of the sample depends on one thing only: whether it supports valid and reliable inferences about the nature of the psychological processes under investigation.

College student samples are fine, e.g., if the dynamic being studied is reasonably understood to be uniform for all people.

A nonstratified general population sample will be perfectly okay for studying processes that vary among people of different characteristics so long as (1) there are enough individuals from subpopulations whose members differ in the relevant respect and (2) the recruitment procedure didn’t involve methods that might have either discouraged participation by typical members of those groups or unduly encouraged participation by atypical ones.

Indeed, a sample constructed by methods of recruitment and stratification designed to assure “national representativeness” might not be valid (or at least not support valid inferences) if the dynamic being studied varies across subgroups whose members aren’t represented in sufficient number to enable testing of hypotheses relating specifically to them.

Etc.

Now I will explain why, on the basis of this pragmatic understanding of what sample validity consists in, MT samples aren’t valid for the study of culturally or ideologically grounded forms of “motivated reasoning” and like dynamics that it is reasonable to believe account for polarization over climate change, gun control, nuclear power, and other facts that admit of empirical study.

I don’t want to keep anybody in suspense (or make it necessary for busy people to deal with more background than they think they need or might already know), so I’ll just start by listing what I see as the three decisive “sample validity” problems here. I’ll then supply a bit more background—including a discussion of what Mechanical Turk is all about, and a review of how this service has been used by social scientists—before returning to the three validity issues, which I’ll then spell out in greater detail.

Ready? Here are the three problems:

1.  Selection bias.  Given the types of tasks performed by MT workers, there’s good reason to suspect subjects recruited via MT differ in material ways from the people in the world whose dispositions we are interested in measuring, particularly conservative males.

2.  Prior, repeated exposure to study measures.  Many MT workers have participated multiple times in studies that use performance-based measures of cognition and have discussed among themselves what the answers are. Their scores are thus not valid.

3.  MT subjects misrepresent their nationality.  Some fraction of the MT work force participating in studies that are limited to “U.S. residents only” aren't in fact U.S. residents, thereby defeating inferences about how psychological dynamics distinctive of U.S. citizens of diverse ideologies operate. 

That’s the short answer. Now some more detail.

A. What is MT? To start, let’s briefly review what Mechanical Turk is—and thus who the subjects in studies that use MT samples are.

Operated by Amazon.com, MT is essentially an on-line labor market.  Employers, who are known as “requesters,” post solicitations for paid work, which can be accepted by “workers,” using their own computers.

Pay is very modest: it is estimated that MT workers make about $1.50/hr.

The tasks they perform are varied: transcription, data entry, research, etc.

But MT is also a well-known instrument for engaging in on-line fraud.

MT workers get paid for writing fake product or service reviews—sometimes positive, sometimes negative, as the requester directs.

They can also garner a tiny wage for simply “clicking” on specified links in order to generate bogus web traffic at the behest of “requesters” who themselves have contracted to direct visitors to legitimate websites, who are in this case the victims of the scam.

These kinds of activities are contrary to the Amazon.com “terms of use” for MT, but that doesn’t restrain either “requesters” from soliciting “workers” or “workers” from agreeing to engage in them.

Another common MT labor assignment—one not contrary to MT rules—is the indexing of sex acts performed in internet pornography.

MT Requester solicitation for porn indexing, July 10, 2013

B. The advent of MT “study samples.” A lot of MT workers take part in social science studies.  Indeed, many workers take part in many, many such studies.

The appeal of using MT workers in one’s study is pretty obvious. They offer a researcher a cheap, abundant supply of eager subjects.  In addition, for studies that examine dynamics that are likely to vary across different subpopulations, the workers offer the prospect of the sort of diversity of characteristics one won’t find, say, in a sample of college students.

A while back, researchers from a variety of social science disciplines published studies aimed at “validating” MT samples for research that requires use of diverse subjects drawn from the general population of the U.S. Encouragingly, these studies reported that MT samples appeared reasonably “representative” of the general population and performed comparably to how one would expect members of the general public to perform.

On this basis, the floodgates opened, and journals of all types—including elite ones—began to publish studies based on MT samples.

To be honest, I find the rapidity with which these journals embraced MT samples mystifying.

Even taking the initial studies purporting to find MT samples “representative” at face value, the fact remains that Amazon is not in the business of supplying valid social science research samples.  It is in the business (in this setting) of brokering on-line labor contracts. To satisfy the booming demand for such services, it is constantly enrolling new “workers.”  As it enlarges its MT workforce, Amazon does nothing—zip—to assure that the characteristics of its “workers” won’t change in ways that make them unsuited for social science research.

In any case, the original papers—which reflect data that are now several years old—certainly can’t be viewed as conferring a “lifetime” certification of validity on MT samples.  If journals care about sample validity, they need to insist on up-to-date evidence that MT samples support valid inferences relating to the matters under investigation.

The most recently collected evidence—in particular Chandler, Mueller, Paolacci (in press) [actually, now published!] & Shapiro, Chandler & Mueller (2013)—doesn’t justify that conclusion.  On the contrary, it shows very convincingly that MT samples are invalid, at least for studies of individual differences in cognition and their effect on political conflict in the U.S.

C.  Three major defects of MT samples for the study of culturally/ideologically motivated reasoning

1.  Selection bias

Whatever might have been true in 2010,  it is clear that the MT workforce today is not a picture of America.

MT workers are “diverse,” but MT samples variously over- and under-represent lots of groups.

Like men: researchers can end up with a sample that is 62% female.

African Americans are also substantially under-represented: 5% rather than the 12% they make up in the general population.

There are other differences too, but the one that is of most concern to me—because the question I’m trying to answer is whether MT samples are valid for study of cultural cognition and like forms of ideologically motivated reasoning—is that MT grossly underrepresents individuals who identify themselves as “conservatives.”

This is clear in the frequencies that researchers relying on MT samples report. In Pennycook et al. (2012),  e.g., 53% of the subjects in their sample self-identified as liberal and 25% identified as conservative.  Stratified national surveys (from the same time as this study) suggest that approximately 20% of the general population self-identifies as liberal and 40% as conservative.
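To get a feel for how stark that gap is, here is a minimal sketch of a chi-square goodness-of-fit test pitting the reported sample percentages against the population benchmarks. Only the 53/25 sample and 20/40 population figures come from the text; the sample size (n = 300) and the lumping of everyone else into a “moderate/other” residual category are assumptions for illustration.

```python
# Hedged illustration: how far an MT sample's ideological mix sits from
# population benchmarks, via a chi-square goodness-of-fit statistic.
# ASSUMPTIONS: n = 300 and the "moderate/other" residual category are
# made up; the 53%/25% and 20%/40% figures come from the post.

def chi_square_gof(observed_counts, expected_props):
    """Sum over categories of (observed - expected)^2 / expected."""
    n = sum(observed_counts)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed_counts, expected_props))

n = 300
# Categories: liberal, conservative, moderate/other
observed = [int(n * 0.53), int(n * 0.25), n - int(n * 0.53) - int(n * 0.25)]
expected_props = [0.20, 0.40, 0.40]

stat = chi_square_gof(observed, expected_props)
# With 2 degrees of freedom, the p = .001 critical value is about 13.8;
# a statistic an order of magnitude larger leaves no doubt about the skew.
print(observed, round(stat, 1))
```

Of course, mere unrepresentativeness of this sort could in principle be patched by oversampling; the deeper problem, discussed below, is what the skew implies about who selects into the workforce.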

In addition to how they “identify” themselves, MT worker samples don’t behave like ones consisting of ordinary U.S. conservatives (a point that will take on more significance when I return to their falsification of their nationality).  In a 2012 Election Day survey, Richey & Taylor (2012) report that “73% of these MTurk workers voted for Obama, 15% for Romney, and 12% for ‘Other’ ” (this assumes we can believe they were eligible to vote in the U.S. & did; I’ll get to this).

But the reason to worry about the underrepresentation of conservatives in MT samples is not simply that the samples are ideologically “unrepresentative” of the general population.  If that were the only issue, one could simply oversample conservatives when doing MT studies (as I’ve seen at least some authors do).

The problem is what the underrepresentation of conservatives implies about the selection of individuals into the MT worker “sample.” There’s  something about being part of the MT workforce, obviously, that is making it less appealing to conservatives.

Maybe conservatives are more affluent and don’t want to work for $1.50/hr.

Or maybe they are more likely to have qualms about writing fake product reviews or watching hours of porn and indexing various sex acts. After all, Jonathan Haidt & others have found that conservatives have more acute disgust sensitivities than liberals.

But in any case, since we know that conservatives by and large are reluctant to join the MT workforce, we can also infer that there is something different about the conservatives who do sign up from the ones who don’t.

What's different about them, moreover, might well be causing them to respond differently in studies from how ordinary conservatives in the U.S. population would.  There must be something different, if we consider how many of them claim to have voted for Obama or a third-party candidate in the 2012 election!

If they are less partisan, then, they might not demonstrate as strong a motivated reasoning effect as ordinary conservatives would.

Alternatively, their decision to join the MT workforce might mean they are less reflective than ordinary conservatives and are thus failing to ponder the incongruity between indexing porn, say, and their political values.

For all these reasons, if one is interested in learning about how dispositions to engage in systematic information  processing are affected by ideology, one just can’t be sure that what we see in “MT conservatives” will generalize to the real-world population of conservatives.

I’ve seen one study based on an MT sample that reports a negative correlation between “conservatism” and scores on the Cognitive Reflection Test, the premier measure of the disposition to engage in conscious, effortful assessment of evidence—slow “System 2” reasoning, in Kahneman’s terms—as opposed to the rapid, heuristic-driven, error-prone, evidence-neglectful sort (“System 1”).

That was the study based on the particular MT sample I mentioned as grossly overrepresenting liberals and underrepresenting conservatives.

I’ve collected data on CRT and ideology in multiple general population surveys—ones that were designed to and did generate nationally representative panels by using recruitment and stratification methods validated by the accuracy of surveys using them to predict national election results. I consistently find no correlation between ideology and CRT.

In short, the nature of the MT workforce—what it does, how it is assembled, and what it ends up generating—makes me worry that the underrepresentation of conservatives reflects a form of selection bias relative to the sort of individual differences in cognition that I’m trying to measure.

That risk is too big for me to accept in my own research, and even if it weren't, I'd expect it to be too big for many consumers of my work to accept were they made aware of the problem I'm identifying. 

BTW, the only other study I’ve ever seen that reports a negative correlation between conservatism and CRT also had serious selection bias issues.  That study used subjects enticed to participate in an experiment at an internet site that is targeted to members of the public interested in moral psychology. As an incentive to participate in the study, researchers promised to tell the subjects what their study results indicated about their cognitive style. One might think that such a site, and such an incentive, would appeal only to highly reflective people, and indeed the mean CRT scores reported for study participants (liberals, conservatives, and libertarians) rivaled or exceeded the ones attained by students at elite universities and were (for all ideological groups) much higher than those typically attained by members of the general public.   As a colleague put it, purporting to infer how different subgroups will score on the CRT from such a sample is the equivalent of a researcher reporting that “women like football as much as men” based on a sample of visitors to ESPN.com!

2. Pre- & multiple-exposure to cognitive performance measures

Again, Amazon.com isn’t in the business of furnishing valid study samples.  One of the things that firms that are in that business do is keep track of which studies the subjects they recruit have participated in, so that researchers won’t be testing people repeatedly with measures that don’t generate reliable results in subjects who’ve already been exposed to them.

The Cognitive Reflection Test fits that description.  It involves three questions, each of which seems to have an obvious answer that is in fact wrong; people disposed to search for and reflect on evidence that contradicts their intuitions are more likely to get those answers right.
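For readers who haven't seen the test, the best-known CRT item (from Frederick's 2005 scale) is the "bat and ball" problem; a tiny sketch shows why the intuitive answer fails the problem's own constraints:

```python
# The classic CRT item: a bat and a ball cost $1.10 in total; the bat
# costs $1.00 more than the ball.  How much does the ball cost?
# Intuition says 10 cents; solving the two constraints shows why not:
#
#   bat + ball = 1.10
#   bat = ball + 1.00
#   =>  2 * ball + 1.00 = 1.10  =>  ball = 0.05

intuitive_ball = 0.10
ball = (1.10 - 1.00) / 2       # reflective answer: 5 cents
bat = ball + 1.00

assert abs(ball + bat - 1.10) < 1e-9     # total checks out
assert abs(bat - ball - 1.00) < 1e-9     # difference checks out
# The intuitive answer violates the total: 0.10 + 1.10 = 1.20, not 1.10
assert intuitive_ball + (intuitive_ball + 1.00) != 1.10
print(f"ball = ${ball:.2f}")   # ball = $0.05
```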

But even the most unreflective, visceral thinker is likely to figure out the answers eventually, if he or she sees the questions over & over. 

That’s what happens on M Turk.  Subjects are repeatedly recruited to participate in studies on cognition that use the CRT and similar tests of cognitive style.

What’s more, they talk about the answers to such tests with each other.  MT workers have on-line “hangouts” where they share tips and experiences.  One of the things they like to talk about is the answers to the CRT.  Another is why researchers keep administering an “intelligence test” (that’s how they interpret the CRT, not unreasonably) whose answers they clearly already know.

These facts have been documented by Chandler, Mueller, and Paolacci in an article in press [now out--hurry & get yours before news stand sells out!] in Behavior Research Methods.

Not surprisingly, MT workers achieve highly unrealistic scores on the CRT, ones comparable to those recorded among students at elite universities and far above those typically reported for general population samples.

Other standard measures relating to moral reasoning style—like the famous "trolley problem"—also get administered to and answered by the same MT subjects over & over, and discussed by them in chat forums.  I'm guessing that's none too good for the reliability/validity of responses to those measures either.

As Chandler, Mueller, Paolacci note, 

There exists a sub-population of extremely productive workers which is disproportionately likely to appear in research studies. As a result, knowledge of some popular experimental designs has saturated the population of those who quickly respond to research HITs; further, workers who read discussion blogs pay attention to requester reputation and follow the HITs of favored requesters, leading individual researchers to collect fans who will undoubtedly become familiar with their specific research topics.

There’s nothing that an individual researcher can effectively do to counteract this problem.  He or she can’t ask Amazon for help: again, it isn’t a survey firm and doesn’t give a shit whether its workforce is fit for participation in social science studies.

The researcher can, of course, ask prospective MT “subjects” to certify that they haven’t seen the CRT questions previously.  But there is a high probability that the workers—who know that their eligibility to participate as a paid study subject requires such certification—will lie.

MT workers have unique id numbers.  Researchers have told me that they have seen plenty of MT workers who say they haven’t taken the CRT before but who in fact have—in those researchers’ own studies.  In such cases, they simply remove the untruthful subject from their dataset.
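A minimal sketch of the screen just described, with made-up worker IDs: check each self-certifying subject's unique MT ID against one's own logs of prior CRT administrations, and drop the demonstrable liars.

```python
# Sketch of the screen described above: flag workers whose unique MT ID
# already appears in one's own prior-study logs despite their claiming
# never to have seen the CRT.  All IDs here are invented for illustration.

prior_crt_takers = {"A1X9", "B7Q2", "C3M8"}   # IDs logged in past studies

new_study = [
    {"worker_id": "A1X9", "claims_first_crt": True},  # provably lying
    {"worker_id": "D5K1", "claims_first_crt": True},  # unverifiable
]

def flag_repeat_takers(records, seen_ids):
    """Return records whose worker already took the CRT in our own studies."""
    return [r for r in records
            if r["claims_first_crt"] and r["worker_id"] in seen_ids]

caught = flag_repeat_takers(new_study, prior_crt_takers)
print([r["worker_id"] for r in caught])   # ['A1X9'] -- D5K1 may be lying too
```

The catch, of course, is that the screen reaches only one's own logs, which is exactly the limitation the next paragraph describes.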

But these and other researchers have no way to know how many of the workers they’ve never themselves tested before are lying too when they claim to be one of the shrinking number of MT workers who have never been exposed to the CRT. 

So researchers who collect data on performance-based cognition measures from MT workers really have no way to be sure  that these very high-scoring subjects are genuinely super reflective or just super dishonest.

I sure wouldn’t take a risk like this in my own research.  And I’m also not inclined to take the risk of being misled by relying on studies by researchers who have disregarded it in reporting how scores on the CRT or other cognitive performance measures relate to ideology (or religion or any other individual difference of interest).

3. Misrepresentation of nationality (I know who these guys are; but who are MT workers? I mean—really?)

Last but by no means least: Studies based on MT samples don’t support valid inferences about the interaction of ideology and cognition in polarizing U.S. policy debates because it’s clear that some fraction of the MT subjects who claim to be from the U.S. when they contract to participate in a study aren’t really from the United States.

This is a finding from Shapiro, Chandler and Mueller (2013), who in a survey determined that a “substantial” proportion of the MT workers who are “hired” for studies with “US only” eligibility are in fact participating in them via foreign internet-service providers.

I also know of cases in which researchers have detected MT subjects using Indian IP addresses participating in their "US only" studies. 

Amazon requires MT workers to register their nationality when joining the MT labor force. But because MT workers recognize that some “requesters” attach “US worker only” eligibility criteria to their labor requests, MT workers from other countries—primarily India, the second-largest source of MT labor after the U.S.—have an incentive to misrepresent their nationality.

I'm not sure how easy this is to pull off since Amazon now requires US citizens to supply Social Security numbers and non-US citizens who reside in the US to supply comparable information relevant to tax collection.

But it clearly isn't impossible for determined, internet-savvy and less-than-honest people to do. 

Part of pulling off the impersonation of a US resident involves signing up for MT through an account at a firm that uses a VPN to issue US IP addresses to internet users outside the U.S.  Indeed, aspiring non-US MT workers have an even bigger incentive to do that now because Amazon, in response to fraudulent use of its services, no longer enrolls new non-US workers into the MT labor force.

Shapiro, Chandler & Mueller recommend checking the IP addresses of subjects in “US only” studies and removing from the sample those whose IP addresses show they participated from India or another country.

But this is not a very satisfying suggestion.  Just as MT workers can use a VPN to misrepresent themselves as U.S.-residents when they initially enroll in MT, so they can use a VPN to disguise the location from which they are participating in U.S.-only studies. 
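To make the recommendation (and its weakness) concrete, here is a sketch of such an IP screen. The `geolocate` function is a stand-in stub: a real screen would query an IP-to-country database, the addresses below come from reserved documentation ranges, and the country labels are invented.

```python
# Sketch of the Shapiro, Chandler & Mueller-style screen: drop subjects
# whose recorded IP geolocates outside the US.  ASSUMPTIONS: `geolocate`
# is a stub with an invented lookup table; real screens would consult an
# IP-to-country database.  As noted in the text, a VPN that issues US IP
# addresses defeats this screen entirely.

def geolocate(ip: str) -> str:
    # Hypothetical table standing in for a real IP->country database.
    fake_db = {
        "203.0.113.7": "IN",
        "198.51.100.2": "US",
        "192.0.2.44": "US",
    }
    return fake_db.get(ip, "UNKNOWN")

def screen_us_only(records):
    """Split subject records into (kept, flagged) by IP country."""
    kept, flagged = [], []
    for rec in records:
        (kept if geolocate(rec["ip"]) == "US" else flagged).append(rec)
    return kept, flagged

subjects = [
    {"id": "w1", "ip": "198.51.100.2"},
    {"id": "w2", "ip": "203.0.113.7"},
    {"id": "w3", "ip": "192.0.2.44"},
]
kept, flagged = screen_us_only(subjects)
print([r["id"] for r in kept], [r["id"] for r in flagged])
# kept: w1, w3; flagged: w2 -- but a VPN user sails straight through
```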

Why wouldn’t they? If they didn’t lie, they might not be eligible to “work” as study subjects—or to work, period, if they signed up after Amazon stopped enrolling non-US workers.

True, lying is dishonest.  But so are a great many of the things that MT workers routinely do for paying MT requesters.

Charmingly, Shapiro, Chandler and Mueller (2013) also found that MT subjects, who are notorious for performing MT tasks at the office when they are supposed to be working, score high on a standard measure of the disposition to engage in “malingering.”

That’s a finding I have complete confidence in. Remember, samples that are not “valid” for studying certain types of dynamics can still be perfectly valid for studying others.

* * * *

The name for Amazon’s “Mechanical Turk” service comes from a historical episode in the late 18th century in which a con artist duped amazed members of the public into paying him a small fee for the chance to play chess against “the Turk”—a large, turban-wearing, pipe-smoking manikin who appeared to be spontaneously moving his own pieces with his mechanized arm and hand.

The profitable ruse went on for decades, until finally, in the 1820s, it was discovered that the “Turk” was being operated by a human chess player hidden underneath its boxy chassis.

Today social scientists are lining up to pay a small fee—precisely because it is so much smaller than what it costs to recruit a valid general population sample—to collect data on Amazon’s “Mechanical Turk.”

But if the prying open of the box reveals that the subjects performing the truly astonishing feats of cognition being observed in these researchers’ studies are “malingering” college students in Mumbai posing as  “U.S. Democrats” and “Republicans” in between jobs writing bogus product reviews and cataloging sex acts in on-line porn clips, I suspect these researchers will feel more foolish than anyone who paid to play chess with the original “Turk.”

Some references

Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2011). Using Mechanical Turk as a subject recruitment tool for experimental research. Political Analysis, 20(3), 351-368. 

Chandler, J., Mueller, P., & Paolacci, G. (in press). Methodological concerns and advanced uses of crowdsourcing in psychological research. Behavior Research Methods.

Experimental Turk: a blog on social science experiments on Amazon Mechanical Turk

Mueller, P., Chandler, J., & Paolacci, G. (2012, January 28). Advanced uses of Mechanical Turk in psychological research. Presentation at the Society for Personality & Social Psychology.

Pennycook, G., Cheyne, J. A., Seli, P., Koehler, D. J., & Fugelsang, J. A. (2012). Analytic cognitive style predicts religious and paranormal belief. Cognition, 123(3), 335-346. doi:10.1016/j.cognition.2012.03.003

Richey, S., & Taylor, B. (2012). How representative are Amazon Mechanical Turk workers? The Monkey Cage.

Shapiro, D. N., Chandler, J., & Mueller, P. A. (2013). Using Mechanical Turk to Study Clinical Populations. Clinical Psychological Science. doi: 10.1177/2167702612469015

Reader Comments (24)

Dude, you need to take a writing course. Two paragraphs of this overheated jargony meandering pompous prose style and I just couldn't take any more. And I am open-minded and interested in the question you are talking about--the validity of MT samples! If you get someone else to publish an Executive Summary, maybe I'll come back and read it. In the meanwhile, bye bye...

July 10, 2013 | Unregistered CommenterSorry to Say It

@Sorry:

I actually hired an M Turk worker to style edit the post as an experiment. Oh well!

July 10, 2013 | Registered CommenterDan Kahan

Dan--

I'm frightened to confess that I thought this was really nicely done. Apparently I like pompous, meandering and "jargony" prose. And I'm totally close minded.

I had no idea anyone considered MT subjects constituted a broadly valid and reliable sample. They seem the poster children for anomalous outliers. Thanks for the education.

Isabel

July 10, 2013 | Unregistered CommenterIsabel Penraeth

@Isabel:

Now you know what's going to happen! Someone is going to accuse me of having "requested" M Turk workers to post nice comments on my posts!

Yes, it's clear the MT workers are very unusual; few people are willing to do the jobs they are doing period, much less for so little money.

They might still be suitable subjects for certain kinds of studies, though, I suppose. One could see if they notice a gorilla stroll by dribbling a basketball while they are cataloging the sex acts occurring in a wild orgy, e.g.

July 10, 2013 | Registered CommenterDan Kahan

I suppose your points might have some support, but since you fail to actually present data to justify most of them I will assume that you have self-selected to respond. Until you have data to back your points, you really are just an arm chair academic pretending to understand sampling based on what you got from wikipedia.

July 11, 2013 | Unregistered CommenterAtta

I really like the way you mixed in just the right amount of valid information with bullshit to make a delicious piece of crap article.

July 11, 2013 | Unregistered CommenterSammy

Hi Dan

Thanks for the thoughtful posts - and I always appreciate folks who take the time to question and ponder methodologies and their underlying assumptions. A comment and a question for you -

a. I have just recently employed Mturk to recruit subjects to pilot test some manipulations and examine scale reliability - I have noticed some significant variances compared to other samples myself in those few studies that has made me question the usefulness of Mturk. So I am hesitant about this sampling resource as well, even for purely experiment designs - and I have noticed a serious "pro-liberal" bias in the orientation of respondents to our scale tests when we ask about ideology (which we always do, which I bet most psych folk do when using it for scale development)

b. BUT - my main question to you is whether you believe sampling for experiment designs from large online panels like those provided by Survey Sampling International or Qualtrics are equally problematic. I have done several studies now using those panels and have achieved a great deal of heterogeneity of respondents when I use quota sampling by gender, race, education, and age - which gives me great ideological diversity and for the most part the relationships and parameters have approximately matched general population RDD studies that I have conducted via telephone or others have conducted.

I wanted to get your thoughts on that as I am also looking to poke holes in my own approaches if I can.

Thanks for all your blogging - I often forward your posts to other colleagues!

Best

Erik
Ohio State

July 11, 2013 | Unregistered Commentererik

@Erik:

Great questions.

As I've tried to stress, my own view is that "sample validity" has to be assessed in relation to the phenomena being investigated: given what we know about *this* sample -- who it comprises, how it was assembled, etc. -- can we draw valid inferences from it about how the dynamics of interest operate in populations of interest "outside the lab?"

I think MT samples don't support valid inferences about individual differences in cognition that (a) are measured with performance measures the validity of which will be compromised by repeat exposure; or that (b) occur in members of the general population who vary in "ideology" or political party affiliation or cultural worldviews or like measures.

But M Turk samples might be fine for other things -- say, perceptual dynamics that one believes are invariant across people & are presented in experimental stimuli the "workers" could not have seen before.

I'd think about the validity of Qualtrics-assembled panels the same way.

I have had discussions w/ the technical staff at Qualtrics, who I found to be super knowledgeable & very straightforward in addressing issues like this. Not surprising, since Qualtrics is a professional survey firm & recognizes that the demand for its product will depend on it being in a position to control the quality of the samples it assembles.

Based on those discussions, my understanding is that Qualtrics is essentially a broker in the sampling recruitment industry. It will arrange to supply customers w/ samples collected from whatever independent sampling firms are in a position to supply the sort of sample that the customer needs -- i.e., the kind that will enable the customer to draw the sorts of inferences the customer wants to draw from collecting data.

Qualtrics, on my understanding, doesn't represent that they can assemble a "nationally representative" general population sample. If you need that, then they'll tell you to go talk to YouGov or Knowledge Networks & see if you are persuaded that the recruitment and stratification procedures they use satisfy you that the samples they are able to assemble satisfy your needs. I'm persuaded that both of those firms, which charge a lot more than Qualtrics, will deliver a sample that is valid in that regard. Indeed, they are likely to get you something that the experience-rating that Nate Silver does suggests is better than many of the "established," blue-chip professional survey firms (e.g., Gallup, for sure) that use "random digit dial," which, it's well known, is a mode of sampling experiencing a validity "crisis" right now due to the wasting death spiral of "land-line" phones & the degrading response rates for that mode of surveying!

But unless you are in the business of forecasting election results or reporting "x% of American public believes y" types of findings, then it is unlikely validity for you will require a sample that is genuinely "representative" of the general public.

Say you are studying some phenomenon in which individual differences are critical. In that case, you will be interested in assuring only (1) that the sample has enough of the kind of subjects who differ in the way that is relevant to your study hypotheses and (2) that the recruitment methods were ones that didn't discourage typical representatives of people like that or unduly encourage atypical ones.

For that, my sense is that Qualtrics might well be able to deliver. They certainly get that that is what they need to supply you assurance of. Again, they broker sampling services. So they will have to arrange to get sample from firms that can supply subjects of the type you want, and recruit them in ways that would satisfy you that there is no selection bias issue, and likely use some post-collection stratification procedures too that they can convince you are valid (in talking to them, I could see they understood all of this very well & they didn't at all try to discount the complexities involved).

BTW, they don't use MT to construct samples! They laughed out loud when I posed this question to them. That deepened my confidence that they both know what they are doing and care about doing it right.

I myself haven't used Qualtrics nor have I seen any data that could help to demonstrate that in fact a sample they recruited for, say, a study of ideologically motivated reasoning on some disputed issue of risk (maybe climate change) is satisfactory. But I would be open to dealing with them myself & seeing what they came up with, and open to taking very seriously the study results of any scholar who relied on them & who could simply tell me what he or she did to attain a reasonable degree of assurance in that regard. (I do think Qualtrics would make its services more useful and attractive to scholars if it issued a "white paper" or equivalent that is super clear about the methods it uses, the standards it employs to assess sample quality, any data it has collected to validate its methods, and the identity of published studies etc that have used samples they've collected--if one submits a paper to a peer-reviewed journal, one would like to be able to supply such info to assure reviewers that the sample is valid!)

But another thing about Qualtrics: if you need a specialized nonrepresentative sample, you might well conclude that they are the most likely to be able to get you what you need. Firms that specialize in constructing large national panels suitable for assembling "nationally representative" samples are unlikely to be able to give you an N = 1,000 sample of "doctors" or "actuaries" etc! But if you want that, Qualtrics will see what it can do; or at least that's what they told me, and I found that very impressive, a fact I tucked away for future reference, since indeed I could see myself being interested in doing a study like that.

So those are my thoughts. Am eager to hear more of yours & more of others who are wrestling w/ these questions.

July 11, 2013 | Registered CommenterDan Kahan

Long time reader, first time commenter. I appreciate your goal of improving science, Dan, but I will have to respectfully disagree on each of your points.

Point 1: First, I can’t think of many studies online or in a laboratory that will not suffer from some selection bias. In psychological studies with college students, these are students who have self-selected into psychology (disproportionately liberal women), and have selected to participate in a particular study. If a study is advertised as “A Study of Prejudice” (which is a good possibility), what are the chances a young conservative male college student would sign up, given both psychology major base rates and the topics of study? This criticism of MTurk would invalidate any study collected on a college campus or non-representative online sample. I hope you’re not going that far!

Of course, you’re also concerned that the type of conservative who signs up for MTurk studies differs from the typical US conservative. Again, I think this would be true of a conservative psychology major (certainly an outlier in his/her major). But your criticism is really about how conservatives in MTurk samples are different from, or will behave differently than, conservatives in the larger population. From my experience using MTurk to conduct studies in political and social psychology, this is not the case: in roughly equal measure, liberals and conservatives see their political opponents as more threatening and express more political intolerance against them, relative to sympathetic political groups (Crawford & Pilanski, 2013; Crawford & Pilanski, in press) and deny their political opponents aspects of humanness (Crawford, Modri, & Motyl, under review); they express biases in favor of their political ingroups over their outgroups in predictable ways (Crawford, 2012; Crawford & Xhambazi, in press); based on very subtle cues, they choose the “correct” political candidate according to their political preferences (Crawford, Brady, Pilanski, & Erny, in press); they even see political protestors they disagree with as more disruptive than those they agree with, even when those protestors aren’t actually behaving disruptively (Crawford, in preparation)! All of these data, showing equal amounts of prejudice, intolerance, hostility, and motivated reasoning among liberals and conservatives, have been collected via MTurk, from 2010 to the present.

I also think you’re prematurely conflating the more “shady” types of requests (e.g., fake reviews) with requests from social scientists. Do we know if workers do both types of requests with equal frequency, or if those who volunteer for social science studies are less likely to do the shady ones? My guess is that it’s the latter.
It’s also somewhat ironic that you’ve mentioned Haidt’s work, because one of his primary means of data collection (the website yourmorals.org) relies on non-representative samples, with disproportionately few conservatives. Who are these conservatives who would sign up for studies run by “liberal academics” anyways?!

Point 2: Repeated exposure to stimuli and sharing of information among participants are of course threats to validity, but I don’t know if those practices are any more rampant on MTurk than in university psychology departments. The only difference here is that we can observe MTurk workers when they share information in forums, but we can’t observe when students share their experiences in their dorm rooms, cafeterias, or libraries. As for repeat users, there are ways to flag MTurk IDs from previous studies so as to remove them from subsequent studies.
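That ID-flagging step is simple to implement. A minimal sketch (the worker IDs and the idea of keeping a file of prior participants are purely illustrative, not from any particular study):

```python
def split_naive(current_ids, previous_ids):
    """Partition a batch of MTurk worker IDs into first-time and repeat workers."""
    previous = set(previous_ids)
    naive = [w for w in current_ids if w not in previous]
    repeats = [w for w in current_ids if w in previous]
    return naive, repeats

# Invented IDs for illustration: one worker already took an earlier study.
naive, repeats = split_naive(
    current_ids=["A1X", "B2Y", "C3Z"],
    previous_ids=["B2Y"],
)
print(naive, repeats)  # → ['A1X', 'C3Z'] ['B2Y']
```

In practice the "previous" list would be built from the worker IDs recorded across a lab's earlier HITs, which is exactly why the cross-lab registry question discussed later in this thread matters.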

That said, I think you’re right that when your outcome variable is based on problem solving or “facts” rather than attitudes, you run the risk of MTurk participants looking up answers to those problems while completing the online study. Of course, this is the case with any online study, not just MTurk. It does suggest researchers need to exercise greater caution, perhaps seeking replication in “offline” samples.

Point 3: I agree that non-US users would provide invalid data, but I have to imagine the incidence is quite small, as your post already implies. Plus, we’ve known for ages that people are motivated to misrepresent themselves in social science research (e.g., socially desirable responding—which you could argue the anonymity of MTurk significantly reduces). What makes MTurk so unique a platform?

July 11, 2013 | Unregistered CommenterJarret

@Jarrett--

Points are well taken.

1 in particular. I think studies that are based on college student samples are also unlikely to support valid inferences about ideology & cognition. I said as much in part 1 (pointing out how strange it is to test "communication framing" for climate skeptics on undergraduates at an East Coast university whose students aren't at all like typical climate skeptics).

But I think one can have more or less confidence that samples have been selected in a manner free from this sort of bias -- that sort being underrepresentation of typical members or overrepresentation of atypical members of the groups whose differences one is studying.

I have very little confidence in M Turk when it comes to testing ideology & cognition.

I have plenty in samples collected in lots & lots & lots of other ways!

July 11, 2013 | Registered CommenterDan Kahan

Most of these concerns can be addressed/mitigated to an extent with the right technology, as Chandler, Mueller & Paolacci point out. Mturk enables researchers to study complex experimental manipulations with a relatively large and diverse sample. With a survey shop you're basically limited to experimental manipulations of question wording, maybe pictures---no interactivity. Realism/ecological validity and external validity are substantial concerns. Student samples allow interactivity/complexity but also have problems. They are small (usually < 100) and so have low power and high false discovery rates---especially problematic when researchers exploit “undisclosed flexibility” in their design and analysis---see http://pss.sagepub.com/content/22/11/1359.full. Of course, students generally also skew liberal, high SES, high-education, and students are still in their politically formative years, so quite different from typical voters.

And just because it is hard to recruit conservatives does not mean that those we do recruit behave differently. Taken on its face this objection would invalidate all survey research. All respondents (or groups of respondents who share characteristics) have different propensities to respond to surveys. And because some groups have low response propensities they will be underrepresented in every survey. Should all those responses be discarded? That seems unwise. And because this applies to all surveys, it is unclear how one would validate the true behavior.

It's also far from clear that conservatives on MTurk behave differently compared to conservatives participating in face-to-face surveys, which is as close to a gold-standard as we currently have. In fact, recent work demonstrates that turkers behave like ANES respondents wrt political identity, ideology and feelings toward political candidates. Paper here: http://stanford.edu/~jgrimmer/cc.pdf, validation here: http://stanford.edu/~jgrimmer/ccsup.pdf.

The point about CRT and conservatives is interesting but there's a problem---MTurk and online surveys have very different incentive structures, which would explain why the CRT-conservative relationship is noisier in online-survey data than in MTurk studies. Studies like Pennycook et al provide explicit attention checks that incentivize turkers to carefully read instructions and take studies seriously. With online survey shops, participants generally have no such incentive. These companies usually provide a small number of "points" that participants can use toward iPods and the like after they complete a survey, without satisficing- or attention checks. Thus, rather than incentivizing careful consideration and attention, this incentivizes participants to complete many surveys quickly in order to actually get some material reward. Sometimes this is fine, sometimes even preferable to Mturk, but it should mean more noise on cognitively challenging measures---which is consistent with the noisier CRT-conservative relationship in online survey samples. I don't know much about the CRT, but the vast majority of people in the online survey sample described in your paper got 0 CRT questions correct, which strikes me as a problem.

Data showing exactly how non-representativeness (or anything else) affects our ability to conduct quality social science -- accompanied by some proposed solution -- would be a great contribution. This is an important question and such problems may very well exist. But the data you've cited so far are confounded with differences in the incentive structures between (carefully executed) mturk studies and online surveys. And of course most of the existing research indicates that people from MTurk samples generally behave like everyone else, despite non-representativeness.

The out of country responses are indeed a problem. Technical solutions include inviting respondents to complete the surveys on a server you host, collecting IP addresses, and conducting additional validation checks. Of course, this is a problem for all online survey firms, not just for mturk.
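One hedged sketch of what that IP screening might look like (the CIDR blocks and response records below are purely illustrative; real screening would consult a geolocation database rather than a hand-made allowlist):

```python
import ipaddress

# Invented "trusted" ranges for illustration only -- a real study would
# derive these from a geolocation service, not hard-code them.
ALLOWED = [ipaddress.ip_network(c) for c in ("8.8.0.0/16", "100.0.0.0/8")]

def in_allowed(ip):
    """Return True if the respondent's IP falls inside a trusted range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED)

# Hypothetical responses: (respondent id, recorded IP address).
responses = [("r1", "8.8.4.4"), ("r2", "203.0.113.7")]
kept = [(rid, ip) for rid, ip in responses if in_allowed(ip)]
print(kept)  # → [('r1', '8.8.4.4')]
```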

July 11, 2013 | Unregistered CommenterSolomon

@Solomon:

These are good points. I think I might be less optimistic than you about how tractable the problems are. But so long as people see that there are problems that need to be addressed -- and that issues of sample validity can't be brushed aside as "procedurally settled" by publication of this or that paper that surveyed the demographics of M Turk users in 2010 or 2011 -- then the right way to settle questions about "whether" & "how easily" isn't to debate fixes but to carry them out & see what happens.

As your comments recognize, the "repeat exposure" & "sampling bias" problems are distinct.

Chandler, Mueller & Paolacci are focusing on the 1st of these. To be effective, their solutions, as I read them, would require an industry-wide "registry" of M Turk study subjects. Unless all researchers share the id numbers of *all* the M Turk subjects they test & what tests they've administered to them, no individual researcher will be able to know whether the subjects he or she is testing are "naive" or "experienced." No one who is relying on the research as a source of knowledge will know.

As CMP put it:


These findings highlight that although large, the pool of workers available is not limitless. The ease and low effort of data collection enabled by crowdsourcing Web sites such as MTurk may make it tempting for researchers to quickly collect data, with little thought about the underlying quality of the methods used. It is beyond the scope of this article to discuss the deontological aspects connected to crowdsourcing. Instead, we merely note that there are practical reasons why the research community should avoid overusing shared participant pools such as MTurk. For more commonly used methods and measures, the pool of MTurk workers presents a “commons dilemma” for researchers: It should not be assumed that respondents are naïve, and groups of researchers would be better off if they could coordinate their recruitment efforts.

Do you think it is realistic to imagine all psychologists, political scientists & others who do scholarly research supplying such info? Who will set up the registry? Who will be sure that everyone is complying? A "commons dilemma"-- indeed a "prisoners' dilemma" at that.

And what to do in the meantime, since we are a long way from having such a system now.

On the "selection bias" issue, I think it is in fact *easy* to "recruit" conservatives via M Turk. But their underrepresentation means that some other influence is guiding conservatives to select in -- one that makes the readily available supply different from conservatives in the world at large.

Is it a difference that affects validity? There's only one way to tell: by looking at how the M Turk conservatives behave compared to non-M Turk ones. Even there, which non-M Turk ones is not a straightforward issue; I think random-digit dial is either no longer the gold standard or else gold has lost a lot of its value in the market of opinion research. But the new currency, I think, is "convergent validity": compare modes that we have reason to believe are good & see how close the results are; accept methods that all perform tolerably well.

I don't think M Turk would pass that standard; this is based mainly on what I've seen about correlations between conservativism and the CRT among M Turk subjects, though, & maybe the "exposure" problem is what's causing my sense that the M Turk conservatives are not "normal." The evidence you cite is more on point in that regard & so I'll take a close look (many thanks for the references!)

But as I say: what we disagree about in terms of the feasibility of the "fixes" is not so important so long as we agree that there's a *problem* here & the current practice of treating M Turk samples as if they were valid for studies of the interaction of individual differences in cognition & ideology must be critically re-examined.

July 11, 2013 | Registered CommenterDan Kahan

Our paper that I mentioned in the post above does provide at least some preliminary convergent validity wrt political ideology between mturk and face-to-face ANES samples.

These issues need to be properly contextualized in terms of the problems facing social science today. Conventional surveys do not afford much in terms of realistic intervention/manipulation, which is key to establishing scientific notions of causality (and has been the cornerstone of scientific discovery since the time of Galileo). Student samples do enable realistic manipulation, but are subject to the rather severe problems I mentioned above---actual false discovery rates rise far above 5% when researchers resort to flexible data analysis---i.e., running additional subjects to reach power, trying every specification possible on really small samples, etc., not to mention the generalizability issues.

In many cases, mturk samples allow for a realistic intervention with a sample size large enough to assure the scientific community that the researcher did not find signal in noise (make a false discovery). Yes, there are issues with mturk that researchers need to think about and take precautions against. These must be examined with the actual study design in mind---if your study requires that participants don't understand experimental research, mturk may present problems. Much of the time this is not a relevant concern, in which case you can simply take the same precautions that survey firms do. Still, what's the alternative? I'm unconvinced that turkers are any worse than professional survey takers, a problem that's compounded when there's only trivial compensation.

July 12, 2013 | Unregistered CommenterSolomon

@Solomon:

What's your CRT score? 4? You are disqualified from any on-line study of cognitive reflection.

I agree w/ you about the contextualization.

And about the need for those committed to making sense of psychological dynamics by disciplined empirical observation & inference (what other valid way is there to make sense of anything?) to be mindful of how the empirically demonstrated vulnerability of human beings to self-deception can tempt them to accept lapses of discipline in their methods.

As I said, it is less important in the end whether we agree on particulars about mturk than that we agree on general principles about how to think about issues of validity. And it seems we do agree that the way is to think, & discuss, & not to reduce matters of judgment to thought-impervious rules & protocols.

July 12, 2013 | Registered CommenterDan Kahan

Very interesting. I did not know about M Turk samples before. In biological research this same problem comes up constantly. In this research, the question is stated as 'What model organism will give valid inferences toward a solution of the problem to be solved and is cheap enough and well characterized enough to be an acceptable model organism?' For some studies, E. coli is a good model. For others, yeast or zebrafish are good model organisms. For some key studies, such as drug development, the only really good model is a broad population of humans, which is unaffordable both monetarily and ethically. So, some other organism, such as mice, rats, or cultured cells, is used but with the small print caveat that the inferences will only be right about 20% of the time. Without the model system, even a limited one, the studies can't be done at all.
I am eagerly awaiting further discussion.

July 12, 2013 | Unregistered CommenterEric Fairfield

I should mention that the CRT<->conservatism correlation found in Study 1 of the Pennycook et al. (2012) paper was fully mediated by religious belief (a pattern that holds in university samples as well). This explains the inconsistency of the relation.

July 16, 2013 | Unregistered CommenterGordon Pennycook

Dan - off topic:

I thought you might find this interesting:

http://oss.sagepub.com/content/33/11/1477.full

This paper examines the framings and identity work associated with professionals’ discursive construction of climate change science, their legitimation of themselves as experts on ‘the truth’, and their attitudes towards regulatory measures. Drawing from survey responses of 1077 professional engineers and geoscientists, we reconstruct their framings of the issue and knowledge claims to position themselves within their organizational and their professional institutions. In understanding the struggle over what constitutes and legitimizes expertise, we make apparent the heterogeneity of claims, legitimation strategies, and use of emotionality and metaphor. By linking notions of the science or science fiction of climate change to the assessment of the adequacy of global and local policies and of potential organizational responses, we contribute to the understanding of ‘defensive institutional work’ by professionals within petroleum companies, related industries, government regulators, and their professional association

Judith Curry posted about this on her blog. Many "skeptics" are not thrilled with the article. Shocker, I know.

July 16, 2013 | Unregistered CommenterJoshua

@Joshua,
Thanks for the great link.

July 16, 2013 | Unregistered CommenterEric Fairfield

@Gordon:

That's really interesting. Am I right that the paper doesn't report that analysis?

I recall that it reports zero-order correlations between the various predictors used to explain variance in religious (& paranormal) beliefs -- which is where the negative correlation between conservativism & CRT is reported.

But I didn't see analyses purporting to explain variance in CRT in which religion was treated as mediating the influence of conservativism or other influences thought to explain variance in religiosity. Is that in the supplementary materials, maybe?

I myself don't get even a zero-order correlation between a composite ideology & party self-identification measure, on one hand, and CRT, on the other, when I test on a general population sample. So there's no issue about an effect being mediated by some other variable like religiosity.

But I am guessing, too, that if one starts to plug in lots of covariates, any ideology -> cognitive style measure relationship is going to disappear. This is more a theory/modeling issue, but I think the zero-order correlation is what someone who believes that "ideologies" cohere with cognitive style should be looking at. It would be odd, e.g., to say, "sure ideology x is negatively correlated with cognitive-style measure y, but once you partial out the negative correlation between ideology x & education that disappears." If low education people are, say, less reflective but are also attracted to ideology x rather than ideology y, well, that's something that the proponent of the cognitive style/ideology connection has to deal with too. (Actually, I've found in some samples a small positive correlation between self-identification w/ the Republican Party & CRT; it gets obliterated if one regresses CRT on party id, gender, education, race etc.)

I should also point out that I also consistently find a negative correlation between CRT & religiosity in general population samples -- something consistent w/ both of the studies I am familiar w/ that you have done on religion & cognition (which, as you advert to, have used a variety of samples).

For everyone else -- I highly recommend people reading these great studies!

Pennycook, G., Cheyne, J., Barr, N., Koehler, D. & Fugelsang, J. Cognitive style and religiosity: The role of conflict detection. Memory & Cognition, 1-10 (2013).

Pennycook, G., Cheyne, J.A., Seli, P., Koehler, D.J. & Fugelsang, J.A. Analytic cognitive style predicts religious and paranormal belief. Cognition 123, 335-346 (2012).

July 16, 2013 | Registered CommenterDan Kahan

@Dan:

That is correct - the analysis wasn't reported in the paper. We (perhaps unfortunately) didn't bother to further discuss any of the "additional" zero-order correlations that were reported in Tables 1 or 3. We focused completely on religiosity and paranormal beliefs.

With respect to your failure to find correlations between the CRT and ideology measures, it perhaps has something to do with the types of items you're using. I've often found a modest correlation between the CRT and social conservatism but never (if memory serves) with fiscal conservatism*. Obviously, social conservatism is much more strongly related to religious belief than fiscal conservatism (r = .48 v .25, for example, in one unpublished MTurk sample). Moreover, Libertarians tend to be more dispositionally analytical (as per the Iyer, Koleva, Graham, Ditto, & Haidt paper recently published in PLOS ONE**), which further complicates the matter.

The religious belief mediation of CRT<->social conservatism hasn't required any other covariates in any of the studies that I've run. The beta drops to < -.05 once religious belief is entered in the regression.
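The logic of that mediation check can be illustrated on synthetic data -- everything below is invented, with the effect sizes chosen only so that the mediator carries the whole conservatism-CRT relationship, as in the pattern Gordon describes:

```python
import numpy as np

# Hypothetical data-generating process: conservatism -> religiosity -> CRT.
# CRT depends on religiosity only, so the conservatism effect should vanish
# once the mediator enters the regression.
rng = np.random.default_rng(0)
n = 5000
conserv = rng.normal(size=n)
relig = 0.5 * conserv + rng.normal(size=n)   # mediator
crt = -0.4 * relig + rng.normal(size=n)      # outcome driven by mediator only

def ols_beta(y, *xs):
    """OLS coefficient on the FIRST predictor, with an intercept."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_total = ols_beta(crt, conserv)          # zero-order: clearly negative
b_direct = ols_beta(crt, conserv, relig)  # shrinks toward zero w/ mediator in
print(round(b_total, 2), round(b_direct, 2))
```

This is of course only the regression-comparison step; a full mediation analysis would also test the indirect effect (e.g., via bootstrapped confidence intervals).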

I definitely believe that it is worthwhile to look at the zero-order correlation between cognitive style and conservatism. Conservatives do seem to be generally less analytic than liberals (with the exception of Libertarians). Although this may not tell us much about the cognitive underpinnings of conservatism per se (assuming that this relation is in fact mediated by religious belief), it could potentially help us further understand some of the large differences between conservatives and liberals at the societal level.


*We used a single item for each. Participants simply rated their degree of "social" and "fiscal" conservatism on two 5 point scales from Strongly Liberal to Strongly Conservative.
**Here is the link to the Iyer et al. (2012) paper: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0042366

July 16, 2013 | Unregistered CommenterGordon Pennycook

@Gordon:

Many thanks!

I'm familiar w/ Iyer et al.; I actually advert to it in the post. As you can see, I think the sampling bias there is much more obvious than the one w/ MTurk.

I find *very* small correlations between CRT and the "culture measures" we use in our studies. Hierarchy is essentially social conservativism and individualism more of a pro-market form (the interactions are more interesting, though, than anything they do on their own).

But the real point for me is that there's little reason to think that cognitive reflection measures vulnerability to the sort of ideologically motivated reasoning that seems to generate polarization over climate change & other issues that turn on contested empirical evidence.

On the contrary, CRT seems to magnify motivated reasoning!

July 16, 2013 | Registered CommenterDan Kahan

@Joshua:

1. Interesting link. I'll read closely -- and with effort to stifle my instinct to dismiss the results based on the method, which looks (on quick inspection) to be of the sort that makes the classification of observations far too vulnerable to the researchers' desire to find confirmation of their hypotheses.

2. On the topic of "off topic": you might want to tune in to the discussion of "strong/weak proof" of motivated reasoning. It has merged w/ one we were having about how "visual continuity" and "motivated reasoning" might interact--or how one might miss a gorilla dribbling a basketball at an abortion clinic protest if in fact it was the gorilla & not the protestors blocking access to the facility.

July 17, 2013 | Registered CommenterDan Kahan

Thank you so much. This is incredibly useful and important for social scientists. Really appreciated.

January 23, 2014 | Unregistered CommenterEkaterina Damer

@Ekaterina:
Welcome! Glad it was useful to you.
--Dan

January 24, 2014 | Registered CommenterDan Kahan
