
Scientists often fail when they try to replicate studies. This psychologist explains why.


For a landmark collaborative study, published today in the journal Science, researchers tried to replicate 100 recent psychology studies from top journals to see if they'd get the same results as the original authors.

Overwhelmingly, they failed.

Replication — the attempt to validate previous findings by reproducing experiments — is one of the foundational ideas behind the scientific method. It tells researchers, and those who use their studies (policymakers, patients, doctors), whether research results are actually reliable or whether they might be wrong.

In this case, about 36 percent of the replications showed an effect that was consistent with the original study. So the failure rate was more than 60 percent.

And it's not the first time a high-profile replication effort returned concerning results — a dismal state of affairs that has led some prominent thinkers to estimate that most published scientific research findings are wrong.

But this latest study should not be read as more bad news in a distressing conversation about science's failures, says Brian Nosek, the University of Virginia psychologist who led the effort. The study is part of the Reproducibility Project: Psychology, one of many high-profile collaborative efforts to retest important research results across a range of scientific fields. The goal: to strengthen the foundation of the house of science. We talked to Nosek about the study; what follows is our conversation, lightly edited for length and clarity.

Julia Belluz: We talk a lot about the need to replicate and reproduce studies. But I think there's little appreciation of what that actually means. Can you describe what it took to replicate 100 studies?

Brian Nosek: It is labor-intensive work. It’s easier than the original research — you don't also have to generate the materials from nothing. But at the same time, there are challenges in understanding how another group did a study. The areas where it is a lot of work are in reviewing the methodology [the description of how the study was done] from the materials that are available, then trying to ascertain how they actually did the study. What is it that really happened?

The most interesting parts of developing these replications involved requesting original materials from the authors and comparing that against the described methodology, writing out a new methodology, and then sending that back to the original authors for their review, comments, and revisions. A lot of times in that process, researchers would say, "We actually did this thing or that thing." It isn’t because they did something wrong, but because the norms of science are to be brief in describing the methodology.

JB: Does this mean scientists aren't always doing a good job of writing detailed enough methodologies?


Each dot represents a study, plotted by its original effect size against its replication effect size. The diagonal line marks where the replication effect size equals the original effect size. The dotted line marks a replication effect size of zero, so points below it were effects in the opposite direction of the original. (Science)

BN: It would be great to have stronger norms about being more detailed with the methods in the paper. But even more than that, it would be great if the norm were to post procedural details as supplements in the paper. For a lot of papers, I don't need to know those details if I’m not trying to replicate it. I'm just reading the paper, trying to learn about the outcomes. But for stuff that’s in my area — I need access to those details so I can really understand what they did. If I can rapidly get up to speed, I have a much better chance of approximating the results.

JB: Right now, there's a tendency to think failed replications mean the original research was wrong. (We saw this with the recent discussion around the high-profile "worm wars" replication.) But as your work here shows, that's really not necessarily the case. Can you explain that?

BN: That's a really important point, and it applies to all research. If you have motivations or stakes in the outcome, if you have a lot of flexibility in how you analyze your data, what choices you make, political ideologies — all those things can have a subtle influence, maybe without even the intention [to game the results of the replication].

So pre-registration [putting the study design on an open database before running the study, so you can't change the methods if you get results you don't like] is an important feature of doing confirmatory analysis in research. That can apply to replication efforts, as well. If you’re going to reanalyze the data, or, in our case, where you’re doing a study with brand new data collection, the pre-registration process is a way to put your chips down.

JB: After helping run this massive experiment, do you have any advice for others?

BN: My main observation here is that reproducibility is hard. That's for many reasons. Scientists are working on hard problems. They're investigating things where we don't know the answer. So the fact that things go wrong in the research process, meaning we don't get to the right answer right away, is no surprise. That should be expected.

There are three reasons that a replication might get a negative result when the original got a positive result. One, the original is a false positive — the study falsely detected evidence for an effect by chance. Two, the replication is a false negative — the study falsely failed to detect evidence for an effect by chance. Three, there is a critical difference in the original and replication methodology that accounts for the difference.

JB: Can you give me an example?

BN: Imagine an original study that found a relationship between exercise and health. Researchers conclude that people who exercise more are healthier than people who do not. A replication team runs a very similar study and finds no relationship.

One and two [described above] are the possibilities that one team's evidence is incorrect and the other team's evidence is more credible.

Three [described above] is the possibility that when the teams look closely, they realize that the original team did their study on only women and the replication team did their study on only men. Neither team realized that this might matter — the claim was that the exercise-health relationship was about people. Now that they see the results, they wonder if gender matters.

The key is that we don't know for sure that gender matters. It could still be one or two. But we have a new hypothesis to test in a third study. And if confirmed, it would improve our understanding of the phenomenon. Was it the changes in the sample? The procedure? Being able to dig into the differences where you observe that is a way to get a better handle on the phenomenon. That’s just science doing science.

JB: We're hearing a lot about replication efforts these days. Is it more talk than action? Or if not, which country is leading the effort?

BN: I have no sense of the place that’s leading in funding. But the US is among the places where there’s the most progress. The NIH and NSF [National Institutes of Health and National Science Foundation] have been looking into supporting replication research. And the Netherlands has had a lot of conversations about this.

But it's definitely [more popular now]. For me, it's a question of research efficiency. If we only value innovation, we’re going to get a lot of great ideas and very little grounding in the stability of those ideas. The effort on improving reproducibility, while paying attention to the fact that innovation is the primary driver of science, will help us be better stewards of public funding in science and help science fulfill its promise. There aren't better alternatives. We really need to get this right.