A Pigovian tax solution (for now) for review/publication of studies that use M Turk samples

I often get asked to review papers that use M Turk samples.

This is a problem because I think M Turk samples, while not invalid for all forms of study, are invalid for studies of how individual differences in political predispositions and cognitive-reasoning proficiencies influence the processing of empirical information relevant to risk and other policy issues.

I’ve discussed this point at length.

And lots of serious scholars now have engaged this isssue seriously.

“Seriously” not in the sense of merely collecting some data on the demographics of M Turk samples at one point in time and declaring them “okay” for all manner of studies once & for all. Anyone who produces a study like that, or relies on it to assure readers his or her own use of an M Turk sample is “okay,” either doesn’t get the underlying problem or doesn’t care about it.

I mean really seriously in the sense of trying to carefully document the features of the M Turk work force that bear on the validity of it as a sample for various sorts of research, and in the sense of engaging in meaningful discussion of the technical and craft issues involved.

I myself think the work and reflections of these serious scholars reinforce the conclusion that it is highly problematic to rely on M Turk samples for the study of information processing relating to risk and other facts relevant to public policy.

The usual reply is, “but M Turk samples are inexpensive! They make it possible for lots & lots of scholars to do and publish empirical research!”

Well, thought experiments are even cheaper. But they are not valid.

If M Turk samples are not valid, it doesn’t matter that they are cheap. Validity is a non-negotiable threshold requirement for use of a particular sampling method. It’s not an asset or currency that can be spent down to buy “more” research– for the research that such a “trade off” subsidizes in fact has no value.

Another argument is, “But they are better than university student samples!” If student samples are not valid for a particular kind of research, then journals shouldn’t accept studies that use them either. But in any case, it’s now clear that M Turk workers don’t behave the way U.S. university students do when responding to survey items that assess whether subjects are displaying the sorts of reactions one would expect in people who claim that they are members of the U.S. public with particular political outlooks (Krupnikov & Levine 2014).

I think serious journals should adopt policies announcing that they won’t accept studies that use M Turk samples for types of studies they are not suited for.

But in any case, they ought at least to adopt policies one way or the other–rather than put authors in the position of not knowing before they collect the data whether journals will accept their studies, and authors and reviewers in the position of having a debate about the appropriateness of using such a sample over & over. Case-by-case assessment is not a fair way to handle this issue, nor one that will generate a satisfactory overall outcome.

So … here is my proposal:

Pending a journal’s adoption of a uniform policy on M Turk samples, the journal should oblige authors who do use M Turk samples to give a full account–in their paper– of why the authors believe it is appropriate to use M Turk workers to model the reasoning process of ordinary members of the U.S. public. The explanation should consist of a full accounting of the authors’ own assessment of why they are not themselves troubled by the objections that have been raised to the use of such samples; they shouldn’t be allowed to dodge the issue by boilerplate citations to studies that purport to “validate” such samples for all purposes, forever & ever. Such an account helps readers to adjust the weight that they afford study findings that use M Turk samples in two distinct ways: by flagging the relevant issues for their own critical attention; and by furnishing them with information about the depth and genuineness of the authors’ own commitment to reporting research findings worthy of being credited by people eager to figure out the truth about complex matters.

There are a variety of key points that authors should be obliged to address.

First, M Turk workers recruited to participate in “US resident only” studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of the US general public cannot validly be made on the basis of samples that contain a “substantial” proportion of individuals from other societies (Shapiro, Chandler and Muller 2013) Some scholars have recommended that researchers remove from their “US only” M Turk samples those subjects who have non-US IP addresses. However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US-IP addresses for use on “US worker” only projects. If authors are purporting to empirically test hypotheses about about how members of the U.S. general public reason on politically contested matters, why don’t they see the incentive of M Turk workers to misrepresent their nationality as a decisive objection to using them as their study sample?

Second, M Turk workers have demonstrated by their behavior that they are not representative of the sorts of individuals that studies of political information-processing are supposed to be modeling. Conservatives are grossly under-represented among M Turk workers who represent themselves as being from the U.S. (Richey 2012). One can easily “oversample” conservatives to generate adequate statistical power for analysis. But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that such a small minority of real U.S. conservatives are willing to do. It’s easy to imagine that the M Turk US conservatives (if really from the US) lack sensibilities that ordinary US conservatives normally have—such as the sort of disgust sensitivities that are integral to their political outlooks (Haidt & Hersch 2001), and that would likely deter them from participating in a “work force” a major business activity of which is “tagging” the content of on-line porn. These unrepresentative US conservatives might well not react as strongly or dismissively toward partisan arguments on a variety of issues. So why is this not a concern for the authors? It is for me, and I’m sure would be for many readers trying to assess what to make of a study that nevertheless uses an M Turk sample.

Third, there are in fact studies that have investigated this question and concluded that M Turk workers do not behave the way that US general population or even US student samples do when participating in political information-processing experiments (Krupnikov & Levine 2014). Readers will care about this—and about whether the authors care.

Fourth, Amazon M Turk worker recruitment methods are not fixed and are neither designed nor warranted to generate samples suitable for scholarly research. No serious person who cares about getting at the truth would accept the idea that a particular study done at a particular time could “validate” M Turk, for the obvious reason that Amazon doesn’t publicly disclose its recruitment procedures, can change them anytime and has on multiple occasions, and is completely oblivious to what researchers care about. A scholar who decides it’s “okay” to use M Turk anyway should tell readers why this does not trouble him or her.

Fifth, M Turk workers share information about studies and how to respond to them (Chandler, Mueller & Paolacci 2014). This makes them completely unsuitable for studies that use performance-based reasoning proficiency measures, which M Turk workers have been massively exposed to. But it also suggests that the M Turk workforce is simply not an appropriate place to recruit subjects from for any sort of study in which subject communication can will contaminate the sample. Imagine you discovered that the firm you had retained to recruit your sample had a lounge in which subjects about to take the study could discuss it w/ those who just had completed it; would you use the sample, and would you keep coming back to that firm to supply you with study subjects in the future? If this does not bother the authors, they should say so; that’s information that many critical readers will find helpful in evaluating their work.

I feel pretty confident M Turk samples are not long for this world for studies that examine individual differences in reasoning relating to politically contested risks and other policy-relevant facts (again, there are no doubt other research questions for which M Turk samples are not nearly so problematic).

Researchers in this area will not give much weight to studies that rely on M Turk samples as scholarly discussion progresses.

In addition, there is a very good likelihood that an on-line sampling resource that is comparably inexpensive but informed by genuine attention to validity issues will emerge in the not too distant future.

E.g., Google Consumer Surveys now enables researchers to field a limited number of questions for between $1.10 & $3.50 per complete– a fraction of the cost charged by on-line firms that use valid & validated recruitment and stratification methods.

Google Consumer Surveys has proven its validity in the only way that a survey mode–random-digit dial, face-to-face, on-line –can: by predicting how individuals will actually evince their opinions or attitudes in real-world settings of consequence, such as elections. Moreover, if Google Surveys goes into the business of supplying high-quality scholarly samples, they will be obliged to be transparent about their sampling and stratification methods and to maintain them (or update them for the purposes of making them even more suited for research) over time.

As I said, Amazon couldn’t care less whether the recruitment methods it uses for M Turk workers now or in the future make them suited for scholarly research.

The problem right now w/ Google Consumer Surveys is that the number of questions is limited and so, as far as I can tell, is the complexity of the instrument that one is able to use to collect the data, making experiments infeasible.

But I predict that will change.

We’ll see.

But in the meantime, obliging researchers who think it is “okay” to use M Turk samples to explain why they apparently are untroubled by the serious issues being raised about the validity of these samples would be an appropriate way, it seems to me, to make those who use such samples to internalize the cost that polluting the research environment with M Turk studies is imposing on social science research on cognition and political conflict.

Refs

Chandler, J., Mueller, P. & Paolacci, G. Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior research methods 46, 112-130 (2014).

Haidt, J. & Hersh, M.A. Sexual morality: The cultures and emotions of conservatives and liberals. J Appl Soc Psychol 31, 191-221 (2001).

Kahan, D. Fooled Twice, Shame on Who? Problems with Mechanical Turk Study Samples. Cultural Cognition Project (2013a), https://www.culturalcognition.net/blog/2013/7/10/fooled-twice-shame-on-who-problems-with-mechanical-turk-stud.html

Krupnikov, Y. & Levine, A.S. Cross-Sample Comparisons and External Validity. Journal of Experimental Political Science 1, 59-80 (2014).

Richey, S,., & Taylor, B. How Representatives Are Amazon Mechanical Turk Workers? The Monkey Cage,(2012).

Shapiro, D.N., Chandler, J. & Mueller, P.A. Using Mechanical Turk to Study Clinical Populations. Clinical Psychological Science 1, 213-220 (2013).