As the nation continues to be convulsed by polarized debate and street demonstrations following last week's publication of Chandler, Mueller & Paolacci, Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers, Behavior Research Methods (advance online 2013), CCP Blog is proud to present ... an EXCLUSIVE guest post from Jesse Chandler, lead author of this important article! Jesse offers his views on the critique I posted on the validity of MT samples for studying the interaction of culture, cognition, and perceptions of risk and other policy-relevant facts.
I wanted to elaborate a bit on some of the issues that Dan raised, amplifying some of his concerns about generalizability, but also mounting something of a defense of Mechanical Turk workers, and their use in research. In brief, I wanted to reinforce the point that he made about understanding the sample that is used and consciously deciding what inferences to make from it. However, I also wanted to push back a bit on his claim that MTurk is demonstrably worse than other methods of collecting data. Along the way, I also have to dispute his characterization of MTurk workers as liars and frauds.
MTurk is not representative, but it is more representative than many samples researchers currently use
As Dan notes, Mechanical Turk is, in principle, fine as a sample for any research study for which college students are currently deemed “representative enough” (which is a larger proportion of the social sciences than the site’s readers may appreciate). If anything, MTurk samples are more representative than other convenience samples, and discovering that a finding observed among college students is robust in a different and more heterogeneous population is useful.
Moreover, in the social sciences it should generally be assumed that any observed process generalizes unless there is a reason to think that it does not (nothing insidious here, just Occam’s razor). If a researcher believes that a finding would not replicate on another population, then they should try to replicate it across both samples and compare results. Ideally, they have a reason why they expect the populations to differ that they can articulate, operationalize and use in mediational analysis. In other words, concerns about the validity of findings on MTurk represent an opportunity to advance theory, not a reason to dismiss findings out of hand.
Know thy sample
Perhaps more importantly, I think Dan is spot on in emphasizing the importance of understanding the sample one is using and the question being asked. “Representative enough” is clearly not suitable for some research questions, and some inferences do not logically follow from non-representative samples. Likewise, for researchers interested in specific populations, MTurk results may vary. Some populations (like conservatives) may be missing or underrepresented in this sample, which is bad for Dan. Other populations, like the unemployed, underemployed and socially anxious, may be over-represented, which is great for someone else. For researchers with limited budgets who work at homogeneous colleges, some populations, like people from other cultures or who speak other languages, may only be available on MTurk.
Another closely related point Dan alludes to that I also want to reemphasize is that the constituents of a particular MTurk sample cannot be taken for granted. Workers are not randomly selected from the pool of available workers and assigned to studies. They choose what they want to participate in. While there are ways to convert selection bias based on study content into attrition (e.g. by placing the consent form after workers accept the HIT), other procedural factors may influence who completes a HIT. We show, for example, that if a HIT makes it onto Reddit, the sample can end up much younger and disproportionately male. It is likely that sample characteristics may also depend on other variables including the requester’s reputation, the sample size, payment and the minimum reputation of the recruited workers (none of which have been thoroughly studied).
It is important to collect relevant variables from participants directly, rather than only appealing to the demographic characteristics collected by other researchers. Very simple demographic differences can fundamentally change point estimates on survey responses. As Dan notes, MTurk is overwhelmingly pro-Obama. There might be a complicated reason for this, but it may also reflect the fact that American MTurk workers are more likely to be young, lower income, and female - all of these demographic characteristics predict more support for Obama.
Dan thinks the Internet is full of weirdos and frauds.
Despite agreeing with the spirit of Dan's comments, I have to take issue with his argument that Mechanical Turk workers are more likely to engage in immoral behavior than other samples, and thus MTurk samples are inferior to other kinds of panel data.
I take particular issue with these claims because the take-home implication from them is that data provided by MTurk workers is less credible, not because the workers are a non-representative population, but because the data are more likely to be fabricated than data obtained from other sources. If this were true, this issue of internal validity would be a far more serious threat to the usefulness of findings on MTurk and would call into question all data collected on it. However, there is little evidence to suggest these concerns are true. These are comparative arguments for which comparative data does not exist, and often even the data for MTurk itself is missing or misleading.
Yes, adult content exists on MTurk, but workers must opt in to view HITs that may contain adult content (including flagging it for removal from non-adult sites). Around 80,000 workers have opted to do so. We don’t know how many workers actually view this content, let alone how this proportion compares to the population of Internet users who watch adult content.
Yes, some workers probably intend to engage in fraudulent behavior on MTurk. Again, we don’t know how many workers do this. Dan notes that a large proportion of posted HITs involve fraud, in the sense that they ask workers to “like” social media posts contrary to Amazon’s ToS. Taking this as evidence for worker fraud relies on the assumptions that i.) these HITs are actually completed, ii.) by workers in general and not just a subsample of super productive fraudsters (analogous to our research superturkers), that iii.) there is overlap between the sample that completes spam HITs and research HITs and iv.) that workers even understand that this is a fraudulent activity (Dan read Amazon’s terms of service, but hey, he is a lawyer).
Another variation of the argument that workers are somehow fundamentally strange comes from the question “who would work for $1.50 an hour?” If I had to guess who works for these low wages, I would say that it is the large number of long term unemployed and other people living at the margin in a country muddling through an economic catastrophe. Although MTurk pays little, the money it does pay is at the margin. Moreover, there may be good reasons why workers accept low wages: MTurk work is flexible enough to be completed in slack time, and accommodate other life commitments (for a discussion see here). Also, we live in a country where people pay to click mice repeatedly. Knowing this, it is not so surprising that people will do the same to earn money. I would not be surprised though if different workers had different reserve wages, and if sample characteristics changed as a function of wages, or in response to external economic conditions.
Workers are people. Don’t be surprised if they act like… people
Problems with worker data quality do not need to be explained by pathologizing workers. Many of the issues that vex researchers could arise from workers acting basically like ordinary people.
Workers will lie or distort the truth if incentivized to do so. Indeed, research shows that MTurk workers lie for money (see here), but a close reading of the paper will show that they may lie less than “real world” participants who participated in similar studies on participant honesty. This may explain why workers misreport US residency: US workers are paid in cash, while those in many other countries are paid in Amazon credit.
Workers, like other people, are forgetful, and workers who “refuse” to opt out of studies they have already completed should surprise nobody. Large proportions of people forget things like spending the night in a hospital or being the victim of a violent crime (see here), all of which are more important to their lives than Study 3 of your dissertation. Researchers who want to avoid duplicate workers (and they should) should make life easy for both workers and themselves by preventing duplicates automatically.
It is true that you cannot know what work workers have completed for other researchers, but these concerns can be greatly reduced if researchers take the time to create their own stimuli. I am sometimes surprised at the laziness of researchers. Gabriele Paolacci and I used a simple attention check (“have you ever had a fatal heart attack”) once, three years ago. We mentioned this in a paper and it shows up verbatim all the time on MTurk. The Oppenheimer “Instructional Manipulation Check” is also frequently copied verbatim. Seriously. Stop. It. Now.
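For what it is worth, here is a minimal sketch of what "preventing duplicates automatically" could look like on the researcher's side. It assumes you keep your own list of worker IDs from earlier studies and that each new submission arrives with a worker ID attached; the function name, the ID format and the data layout are all made up for illustration and are not part of any Amazon interface.

```python
from typing import Iterable, List, Tuple

Submission = Tuple[str, str]  # (worker_id, response); a hypothetical layout


def screen_workers(
    submissions: Iterable[Submission],
    prior_worker_ids: Iterable[str],
) -> Tuple[List[Submission], List[Submission]]:
    """Split submissions into first-time and repeat participants.

    A worker counts as a repeat if their ID appears in prior_worker_ids
    or earlier in the same batch. IDs are compared after trimming
    whitespace and upper-casing (an assumption about how IDs get mangled
    in exported files, not a documented property of worker IDs).
    """
    seen = {wid.strip().upper() for wid in prior_worker_ids}
    fresh: List[Submission] = []
    repeats: List[Submission] = []
    for worker_id, response in submissions:
        key = worker_id.strip().upper()
        if key in seen:
            repeats.append((worker_id, response))
        else:
            seen.add(key)
            fresh.append((worker_id, response))
    return fresh, repeats
```

Running the same check at the very start of a study, before any stimuli are shown, spares the worker from having to remember whether they have already done Study 3 of your dissertation.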
If there is one thing that workers hate, it is negative feedback. This means they will generally bend over backwards to accommodate requesters. They generally understand that researchers do not like people talking about the contents of their HITs and try to avoid this. When they do communicate information, they seem to assume that the details they reveal will not matter, and methodologically problematic slips (e.g. discussing information in one condition but not another) are inadvertent. However, they also hate it when requesters reject their work because they failed an “attention check.” From a worker’s perspective, this probably feels unfair in the way that only an elite private school refusing to give out nickels can. Oh, and this problem is not unique to MTurk: sharing information for mutual benefit happens in college samples too.
Are Panels any Good?
All of these concerns about the quality of data collected on MTurk assume that workers are somehow different from respondents in other sample pools, and that these issues will simply go away if only data were collected somewhere else. This may be true, but how much do we really know about panel respondents and panel data quality? It is unfair to compare observed data in MTurk against a Platonic ideal of a survey panel. If MTurk workers lie to be eligible for studies (like our malingerers), why wouldn’t panel members lie for yet larger incentives? Likewise, if we are going to worry that MTurk samples are not representative because workers look at naked people on the internet, then perhaps we should worry about whether panels built using random digit dialing are representative, given that almost every normal person screens their calls.
Researchers who use other pay panels should be as critical toward these samples as Dan would like us to all be toward Mechanical Turk. Paid sources vary a lot in methodology and it is likely that beyond differences in how they are supposed to be constructed, there are yet larger differences in how the panel design is executed. Research always seems cleaner when you don’t know how the sausage gets made. Dig deep. Get worried. While data quality, representativeness and honesty may be issues that are particularly salient for MTurk samples, we (as in social scientists who are not survey research methodologists) may simply know more about their issues because the sample is relatively transparent and somebody happened to look.
The Take Home Message
In sum, Dan notes issues with Mechanical Turk that I agree are potential problems. However, I think the most important lessons that can be drawn from this discussion are what questions to ask about our hypotheses and our sample, and how to collect data from them, rather than who to collect data from. Further, the solutions to the problems he identifies lie ultimately in better research design, with or without finding a better sample population.