Fooled twice, shame on who? Problems with Mechanical Turk study samples, part 2

This is the second post in a two-part series on what I see as the invalidity of studies that use samples of Mechanical Turk workers to test hypotheses about cognition and political conflict over societal risks and other policy-relevant facts.

In the first, I discussed the concept of a “valid sample” generally. Basically, I argued that it’s a mistake to equate sample “validity” with any uniform standard or any single, invariant set of recruitment or stratification procedures.

Rather, the validity of the sample depends on one thing only: whether it supports valid and reliable inferences about the nature of the psychological processes under investigation.

College student samples are fine, e.g., if the dynamic being studied is reasonably understood to be uniform for all people.

A nonstratified general population sample will be perfectly okay for studying processes that vary among people of different characteristics so long as (1) there are enough individuals from subpopulations whose members differ in the relevant respect and (2) the recruitment procedure didn’t involve methods that might have either discouraged participation by typical members of those groups or unduly encouraged participation by atypical ones.

Indeed, a sample constructed by methods of recruitment and stratification designed to assure “national representativeness” might not be valid (or at least not support valid inferences) if the dynamic being studied varies across subgroups whose members aren’t represented in sufficient number to enable testing of hypotheses relating specifically to them.

Etc.

Now I will explain why, on the basis of this pragmatic understanding of what sample validity consists in, MT samples aren’t valid for the study of culturally or ideologically grounded forms of “motivated reasoning” and like dynamics that it is reasonable to believe account for polarization over climate change, gun control, nuclear power, and other facts that admit of empirical study.

I don’t want to keep anybody in suspense (or make it necessary for busy people to deal with more background than they think they need or might already know), so I’ll just start by listing what I see as the three decisive “sample validity” problems here. I’ll then supply a bit more background—including a discussion of what Mechanical Turk is all about, and a review of how this service has been used by social scientists—before returning to the three validity issues, which I’ll then spell out in greater detail

Ready? Here are the three problems:

1. Selection bias. Given the types of tasks performed by MT workers, there’s good reason to suspect subjects recruited via MT differ in material ways from the people in the world whose dispositions we are interested in measuring, particularly conservative males.

2. Prior, repeated exposure to study measures. Many MT workers have participated multiple times in studies that use performance-based measures of cognition and have discussed among themselves what the answers are. Their scores are thus not valid.

3. MT subjects misrepresent their nationality. Some fraction of the MT work force participating in studies that are limited to “U.S. residents only” aren’t in fact U.S. residents, thereby defeating inferences about how psychological dynamics distinctive of U.S. citizens of diverse ideologies operate.

That’s the short answer. Now some more detail.

A. What is MT? To start, let’s briefly review what Mechanical Turk is—and thus who the subjects in studies that use MT samples are.

Operated by Amazon.com, MT is essentially an on-line labor market. Employers, who are known as “requesters,” post solicitations for paid work, which can be accepted by “workers,” using their own computers.

Pay is very modest: it is estimated that MT workers make about $1.50/hr.

The tasks they perform are varied: transcription, data entry, research, etc.

But MT is also a well-known instrument for engaging in on-line fraud.

MT workers get paid for writing fake product or service reviews—sometimes positive, sometimes negative, as the requester directs.

They can also garner a tiny wage for simply “clicking” on specified links in order to generate bogus web traffic at the behest of “requesters” who themselves have contracted to direct visitors to legitimate websites, who are in this case the victims of the scam.

These kinds of activities are contrary to the Amazon.com “terms of use” for MT, but that doesn’t restrain either “requesters” from soliciting “workers” or “workers” form agreeing to engage in them.

Another common MT labor assignment—one not contrary to MT rules—is the indexing of sex acts performed in internet pornography.

B. The advent of MT “study samples.” A lot of MT workers take part in social science studies. Indeed, many workers take part in many, many such studies.

The appeal of using MT workers in one’s study is pretty obvious. They offer a reasearcher a cheap, abundant supply of eager subjects. In addition, for studies that examine dynamics that are likely to vary across different subpopulations, the workers offer the prospect of the sort of diversity of characteristics one won’t find, say, in a sample of college students.

A while back researchers from a variety of social science disciplines published studies aimed at “validating” MT samples for research that requires use of diverse subjects drawn from the general population of the U.S. Encouragingly, these studies reported that MT samples appeared reasonably “representative” of the general population and performed in manners comparable to how one would expect members of the general public generally to perform.

On this basis, the floodgates opened, and journals of all types—including ones in elite journals—began to publish studies based on MT samples.

To be honest, I find the rapidity of the decision of these journals to embrace MT samples mystifying.

Even taking the initial studies purporting to find MT samples “representative” at face value, the fact remains that Amazon is not in the business of supplying valid social science research samples. It is in the business (in this setting) of brokering on-line labor contracts. To satisfy the booming demand for such services, it is constantly enrolling new “workers.” As it enlarges its MT workforce, Amazon does nothing—zip—to assure that the characteristics of its “workers” won’t change in ways that make them unsuited for social science research.

In any case, the original papers—which reflect data that are now several years old—certainly can’t be viewed as conferring a “life time” certification of validity on MT samples. If journals care about sample validity, they need to insist on up-to-date evidence that MT samples support valid inferences relating to the matters under investigation.

The most recently collected evidence—in particular Chandler, Mueller, Paolacci (in press) [actually, now published!] & Shapiro, Chandler & Mueller (2013)—doesn’t justify that conclusion. On the contrary, it shows very convincingly that MT samples are invalid, at least for studies of individual differences in cognition and their effect on political conflict in the U.S.

C. Three major defects MT samples for the study of culturally/ideological motivated reasoning

1. Selection bias

Whatever might have been true in 2010, it is clear that the MT workforce today is not a picture of America.

MT workers are “diverse,” but are variously over- and under-representative of lots of groups.

Like men: researchers can end up with a sample that is 62% female.

African Americans are also substantially under-represented: 5% rather than the 12% they make up in the general population.

There are other differences too but the one that is of most concern to me—because the question I’m trying to answer is whether MT samples are valid for study of cultural cognition and like forms of ideologically motivated reasoning—is that MT grossly underrepresents individuals who identify themselves as “conservatives.”

This is clear in the frequencies that researchers relying on MT samples report. In Pennycook et al. (2012), e.g., 53% of the subjects in their sample self-identified as liberal and 25% identified as conservative. Stratified national surveys (from the same time as this study) suggest that approximately 20% of the general population self-identifies as liberal and 40% as conservative.

In addition to how they “identify” themselves, MT worker samples don’t behave like ones that consisted of ordinary U.S. conservatives (a point that will take on more significance when I return to their falsification of their nationality). In an 2012 Election Day survey, Richey & Taylor (2012) report that “73% of these MTurk workers voted for Obama, 15% for Romney, and 12% for ‘Other’ ” (this assumes we can believe they were eligible to vote in the U.S. & did; I’ll get to this).

But the reason to worry about the underrepresentation of conservatives in MT samples is not simply that the samples are ideologically “unrepresentative” of the general population. If that were the only issue, one could simply oversample conservatives when doing MT studies (as I’ve seen at least some authors do).

The problem is what the underrepresentation of conservatives implies about the selection of individuals into the MT worker “sample.” There’s something about being part of the MT workforce, obviously, that is making it less appealing to conservatives.

Maybe conservatives are more affluent and don’t want to work for $1.50/hr.

Or maybe they are more likely to have qualms about writing fake product reviews or watching hours of porn and indexing various sex acts. After all, Jonathan Haidt & others have found that conservatives have more acute disgust sensibilities than liberals.

But in any case, since we know that conservatives by and large are reticent to join the MT workforce, we also can infer there is something different about the conservatives who do sign up from the ones who don’t.

What’s different about them, moreover, might well be causing them to respond differently in studies from how ordinary conservatives in the U.S. population would. There must be if we consider how many of them claim to have voted for Obama or a third-party candidate in the 2012 election!

If they are less partisan, then, they might not demonstrate as strong a motivated reasoning effect as ordinary conservatives would.

Alternatively, their decision to join the MT workforce might mean they are less reflective than ordinary conservatives and are thus failing to ponder the incongruity between indexing porn, say, and their political values.

For all these reasons, if one is interested in learning about how dispositions to engage in systematic information processing are affected by ideology, one just can’t be sure that what we see in “MT conservatives” will generalize to the real-world population of conservatives.

I’ve seen one study based on an MT sample that reports a negative correlation between “conservativism” and scores on the Cognitive Reflection Test, the premier measure of the disposition to engage in conscious, effortful assessment of evidence—slow, “System 2” in Kahneman’s terms—as opposed the rapid, heuristic-driven, error-prone evidence neglectful sort (“System 1”).

That was the study based on the particular MT sample I mentioned as grossly overrepresenting liberals and underrepresenting conservatives.

I’ve collected data on CRT and ideology in multiple general population surveys—ones that were designed to and did generate nationally representative panels by using recruitment and stratification methods validated by the accuracy of surveys using them to predict national election results. I consistently find no correlation between ideology and CRT.

In short, the nature of the MT workforce—what it does, how it is assembled, and what it ends up generating—makes me worry that the underrepresentation of conservatives reflects a form of selection bias relative to the sort of individual differences in cognition that I’m trying to measure.

That risk is too big for me to accept in my own research, and even if it weren’t, I’d expect it to be too big for many consumers of my work to accept were they made aware of the problem I’m identifying

BTW, the only other study I’ve ever seen that reports a negative correlation between conservativism and CRT also had serious selection bias issues. That study used subjects enticed to participate in an experiment at an internet site that is targeted to members of the public interested in moral psychology. As an incentive to participate in the study, researchers promised to tell the subjects what their study results indicated about their cognitive style. One might think that such a site, and such an incentive, would appeal only to highly reflective people, and indeed the mean CRT scores reported for study participants (liberals, conservatives, and libertarians) rivaled or exceeded the ones attained by students at elite universities and were (for all ideological groups) much higher than those typically attained by members of the general public. As a colleague put it, purporting to infer how different subgroups will score on the CRT from such a sample is the equivalent of a researcher reporting that “women like football as much as men” based on a sample of visitors to ESPN.com!

2. Pre- & multiple-exposure to cognitive performance measures

Again, Amazon.com isn’t in the business of furnishing valid study samples. One of the things that firms that are in that business do is keep track of what studies subjects they recruit have participated in so that researchers won’t be testing people repeatedly with measures that don’t generate reliable results in subjects who’ve already been exposed to them.

The Cognitive Reflection Test fits that description. It involves three questions, each of which seems to have an obvious answer that is in fact wrong; people disposed to search for and reflect on evidence that contradicts their intuitions are more likely to get those answers right.

But even the most unreflective, visceral thinker is likely to figure out the answers eventually, if he or she sees the questions over & over.

That’s what happens on M Turk. Subjects are repeatedly recruited to participate in studies on cognition that use the CRT and similar test of cognitive style.

What’s more they talk about the answers to such tests with each other. MT workers have on-line “hangouts” where they share tips and experiences. One of things they like to talk about are the answers to the CRT. Another is why researchers keep administering an “intelligence test” (that’s how they interpret the CRT, not unreasonably) that we clearly know the answers to?

These facts have been documented by Chandler, Mueller, and Paolacci in an article in press [now out–hurry & get yours before news stand sells out!] in Behavior Research Methods.

Not surprisingly, MT workers achieve highly unrealistic scores on the CRT, ones comparable to those recorded among students at elite universities and far above those typically reported for general population samples.

Other standard measures relating to moral reasoning style–like the famous “trolley problem”–also get administered to and answered by the same MT subjects over & over, and discussed by them in chat forums. I’m guessing that’s none to good for the reliablility/validity of responses to those measures either.

As Chandler, Mueller, Paolacci note,

There exists a sub-population of extremely productive workers which is disproportionately likely to appear in research studies. As a result, knowledge of some popular experimental designs has saturated the population of those who quickly respond to research HITs; further, workers who read discussion blogs pay attention to requester reputation and follow the HITs of favored requesters, leading individual researchers to collect fans who will undoubtedly become familiar with their specific research topics.

There’s nothing that an individual researcher can effectively do to counteract this problem. He or she can’t ask Amazon for help: again, it isn’t a survey firm and doesn’t give a shit whether its workforce is fit for participation in social science studies.

The researcher can, of course, ask prospective MT “subjects” to certify that they haven’t seen the CRT questions previously. But there is a high probability that the workers—who know that their eligibility to participate as a paid study subject requires such certification—will lie.

MT workers have unique id numbers. Researchers have told me that they have seen plenty of MT workers who say they haven’t taken the CRT before but who in fact have—in those researchers’ own studies. In such cases, they simply remove the untruthful subject from their dataset.

But these and other researchers have no way to know how many of the workers they’ve never themselves tested before are lying too when they claim to be one of the shrinking number of MT workers who have never been exposed to the CRT.

So researchers who collect data on performance-based cognition measures from MT workers really have no way to be sure that these very high-scoring subjects are genuinely super reflective or just super dishonest.

I sure wouldn’t use take a risk like this in my own research. And I’m also not inclined to take the risk of being misled by relying on studies of searchers who have disregarded it in reporting how scores on CRT or other cognitive performance measures relate to ideology (or religion or any other individual difference of interest).

3. Misrepresentation of nationality (I know who these guys are; but who are MT workers? I mean—really?)

Last but by no means least: Studies based on MT samples don’t support valid inferences about the interaction of ideology and cognition in polarizing U.S. policy debates because it’s clear that some fraction of the MT subjects who claim to be from the U.S. when they contract to participate in a study aren’t really from the United States.

This is a finding from Shapiro, Chandler and Muller (2013), who in a survey determined that a “substantial” proportion of the MT workers who are “hired” for studies with “US only” eligibility are in fact participating in them via foreign internet-service providers.

I also know of cases in which researchers have detected MT subjects using Indian IP addresses participating in their “US only” studies.

Amazon requires MT workers to register their nationality when joining the MT labor force. But because MT workers recognize that some “requesters” attach “US worker only” eligibility criteria to their labor requests, MT workers from other countries—primarily India, the second largest source of MT labor outside the U.S.—have an incentive to misrepresent their nationality.

I’m not sure how easy this is to pull off since Amazon now requires US citizens to supply Social Security numbers and non-US citizens who reside in the US to supply comparable information relevant to tax collection.

But it clearly isn’t impossible for determined, internet-savvy and less-than-honest people to do.

Part of pulling off the impersonation of a US resident involves signing up for MT through an account at a firm that uses a VPN to issue US IP addresses to internet users outside the U.S. Indeed, aspiring non-US MT workers have an even bigger incentive to do that now because Amazon, in response to fraudulent use of its services, no longer enrolls new non-US workers into the MT labor force.

Shapiro, Chandler & Muller recommend checking the IP addresses of subjects in “US only” studies and removing from the sample those whose IP addresses showed they participated from India or another country.

But this is not a very satisfying suggestion. Just as MT workers can use a VPN to misrepresent themselves as U.S.-residents when they initially enroll in MT, so they can use a VPN to disguise the location from which they are participating in U.S.-only studies.

Why wouldn’t they? If they didn’t lie, they might not be eligible to “work” as a study subjects–or work period if they signed up after the period in which Amazon stopped enrolling non-US workers.

True, lying is dishonest. But so are a great many of the things that MT workers routinely do for paying MT requesters.

Charmingly, Shapiro, Chandler and Muller (2013) also found that MT subjects, who are notorious for performing MT tasks at the office when they are supposed to be working, score high on a standard measure of the disposition to engage in “malingering.”

That’s a finding I have complete confidence in. Remember, samples that are not “valid” for studying certain types of dynamics can still be perfectly valid for studying others.

The name for Amazon’s “Mechanical Turk” service comes from a historical episode in the late 18^th century in which a con artist duped amazed members of the public into paying him a small fee for the chance to play chess against “the Turk”—a large, turban-wearing, pipe-smoking manikin who appeared to be spontaneously moving his own pieces with his mechanized arm and hand.

The profitable ruse went on for decades, until finally, in the 1820s, it was discovered that the “Turk” was being operated by a human chess player hidden underneath its boxy chassis.

Today social scientists are lining up to pay a small fee—precisely because it is so much smaller than what it costs to recruit valid general population sample—to collect data on Amazon’s “Mechanical Turk.”

But if the prying open of the box reveals that the subjects performing the truly astonishing feats of cognition being observed in these researchers’ studies are “malingering” college students in Mumbai posing as “U.S. Democrats” and “Republicans” in between jobs writing bogus product reviews and cataloging sex acts in on-line porn clips, I suspect these researchers will feel more foolish than anyone who paid to play chess with the original “Turk.”

Some references

Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2011). Using Mechanical Turk as a subject recruitment tool for experimental research. Political Analysis, 20(3), 351-368.

Chandler, J., Mueller, P., & Paolacci, G. Methodological Concerns and Advanced Uses of Crowdsourcing in Psychological Research (in press) Behavior Research Methods.

Experimental Turk: a blog on social science experiments on Amazon Mechanical Turk

Mueller, Chandler & Paolacci, Advanced uses of Mechanical Turk in psychological research, presentation at Society for Personality & Social Psychology, Jan. 28, 2012.

Pennycook, G., Cheyne, J. A., Seli, P., Koehler, D. J., & Fugelsang, J. A. (2012). Analytic cognitive style predicts religious and paranormal belief. [doi: 10.1016/j.cognition.2012.03.003]. Cognition, 123(3), 335-346.

Richey, S,., & Taylor, B. How Representatives Are Amazon Mechanical Turk Workers? The Monkey Cage,(2012).

Shapiro, D. N., Chandler, J., & Mueller, P. A. (2013). Using Mechanical Turk to Study Clinical Populations. Clinical Psychological Science. doi: 10.1177/2167702612469015

Update on Wednesday, July 10, 2013 at 11:00AM

Whoa– hold the presses!

The Chandler, Mueller & Paolacci “in press” paper is no longer merely “in press.” I am advised that the “advance on line” version “just went up” — & sure enough:

I’ll read closely & if there’s anything in the published version that merits amplification/qualification/revision/repudiation/immolation etc. of anything in the post, I will be sure to note that.