Key Insight

I often get asked to review papers that use M Turk samples.

This is a problem because I think M Turk samples, while not invalid for all forms of study, are invalid for studies of how individual differences in political predispositions and cognitive-reasoning proficiencies influence the processing of empirical information relevant to risk and other policy issues.

I’ve discussed this point at length.

And lots of serious scholars have now engaged this issue seriously.

“Seriously” not in the sense of merely collecting some data on the demographics of M Turk samples at one point in time and declaring them “okay” for all manner of studies once & for all. Anyone who produces a study like that, or relies on it to assure readers his or her own use of an M Turk sample is “okay,” either doesn’t get the underlying problem or doesn’t care about it.

I mean really seriously in the sense of trying to carefully document the features of the M Turk work force that bear on its validity as a sample for various sorts of research, and in the sense of engaging in meaningful discussion of the technical and craft issues involved.

I myself think the work and reflections of these serious scholars reinforce the conclusion that it is highly problematic to rely on M Turk samples for the study of information processing relating to risk and other facts relevant to public policy.

The usual reply is, “but M Turk samples are inexpensive! They make it possible for lots & lots of scholars to do and publish empirical research!”

Well, thought experiments are even cheaper. But they are not valid.

If M Turk samples are not valid, it doesn’t matter that they are cheap. Validity is a non-negotiable threshold requirement for use of a particular sampling method. It’s not an asset or currency that can be spent down to buy “more” research, for the research that such a “trade off” subsidizes in fact has no value.

Another argument is, “But they are better than university student samples!” If student samples are not valid for a particular kind of research, then journals shouldn’t accept studies that use them either. But in any case, it’s now clear that M Turk workers don’t behave the way U.S. university students do when responding to survey items that assess whether subjects are displaying the sorts of reactions one would expect in people who claim that they are members of the U.S. public with particular political outlooks (Krupnikov & Levine 2014).

I think serious journals should adopt policies announcing that they won’t accept studies that use M Turk samples for types of studies they are not suited for.

But in any case, they ought at least to adopt policies one way or the other–rather than put authors in the position of not knowing before they collect the data whether journals will accept their studies, and authors and reviewers in the position of having a debate about the appropriateness of using such a sample over & over.  Case-by-case assessment is not a fair way to handle this issue, nor one that will generate a satisfactory overall outcome.

Pending a journal’s adoption of a uniform policy on M Turk samples, the journal should oblige authors who do use M Turk samples to give a full account, in their paper, of why the authors believe it is appropriate to use M Turk workers to model the reasoning process of ordinary members of the U.S. public. The explanation should consist of a full accounting of the authors’ own assessment of why they are not themselves troubled by the objections that have been raised to the use of such samples; they shouldn’t be allowed to dodge the issue by boilerplate citations to studies that purport to “validate” such samples for all purposes, forever & ever. Such an account helps readers to adjust the weight that they afford study findings that use M Turk samples in two distinct ways: by flagging the relevant issues for their own critical attention; and by furnishing them with information about the depth and genuineness of the authors’ own commitment to reporting research findings worthy of being credited by people eager to figure out the truth about complex matters.

There are a variety of key points that authors should be obliged to address.

First, M Turk workers recruited to participate in “US resident only” studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of the US general public cannot validly be made on the basis of samples that contain a “substantial” proportion of individuals from other societies (Shapiro, Chandler & Mueller 2013). Some scholars have recommended that researchers remove from their “US only” M Turk samples those subjects who have non-US IP addresses. However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US IP addresses for use on “US worker” only projects. If authors are purporting to empirically test hypotheses about how members of the U.S. general public reason on politically contested matters, why don’t they see the incentive of M Turk workers to misrepresent their nationality as a decisive objection to using them as their study sample?

Second, M Turk workers have demonstrated by their behavior that they are not representative of the sorts of individuals that studies of political information-processing are supposed to be modeling. Conservatives are grossly under-represented among M Turk workers who represent themselves as being from the U.S. (Richey 2012). One can easily “oversample” conservatives to generate adequate statistical power for analysis. But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that only a small minority of real U.S. conservatives are willing to do. It’s easy to imagine that the M Turk US conservatives (if really from the US) lack sensibilities that ordinary US conservatives normally have, such as the sort of disgust sensitivities that are integral to their political outlooks (Haidt & Hersch 2001) and that would likely deter them from participating in a “work force” one of whose major business activities is “tagging” the content of on-line porn. These unrepresentative US conservatives might well not react as strongly or dismissively toward partisan arguments on a variety of issues. So why is this not a concern for the authors? It is for me, and I’m sure would be for many readers trying to assess what to make of a study that nevertheless uses an M Turk sample.

Third, there are in fact studies that have investigated this question and concluded that M Turk workers do not behave the way that US general population or even US student samples do when participating in political information-processing experiments (Krupnikov & Levine 2014). Readers will care about this—and about whether the authors care.

Fourth, Amazon M Turk worker recruitment methods are not fixed and are neither designed nor warranted to generate samples suitable for scholarly research. No serious person who cares about getting at the truth would accept the idea that a particular study done at a particular time could “validate” M Turk, for the obvious reason that Amazon doesn’t publicly disclose its recruitment procedures, can change them anytime and has on multiple occasions, and is completely oblivious to what researchers care about.  A scholar who decides it’s “okay” to use M Turk anyway should tell readers why this does not trouble him or her.

Fifth, M Turk workers share information about studies and how to respond to them (Chandler, Mueller & Paolacci 2014). This makes them completely unsuitable for studies that use performance-based reasoning proficiency measures, which M Turk workers have been massively exposed to. But it also suggests that the M Turk workforce is simply not an appropriate place to recruit subjects from for any sort of study in which subject communication can contaminate the sample. Imagine you discovered that the firm you had retained to recruit your sample had a lounge in which subjects about to take the study could discuss it with those who had just completed it; would you use the sample, and would you keep coming back to that firm to supply you with study subjects in the future? If this does not bother the authors, they should say so; that’s information that many critical readers will find helpful in evaluating their work.