Scientists replicated 100 recent psychology experiments. More than half of them failed.


Replication is one of the foundational ideas behind science. It's when researchers take older studies and reproduce them to see if the findings hold up.

Testing, validating, retesting: It's all part of the slow and grinding process to arrive at some semblance of scientific truth.  

Yet all too often, when researchers try to replicate studies, the attempts simply flop or flounder. Some have even called this a "crisis of irreproducibility."

Consider the newest evidence: a landmark study published today in the journal Science. More than 270 researchers from around the world came together to replicate 100 recent findings from top psychology journals. By one measure, only 36 percent showed results consistent with the original findings. In other words, nearly two-thirds of the replications failed.

The results of this study may actually be too generous

"The results are more or less consistent with what we've seen in other fields," said Ivan Oransky, one of the founders of the blog Retraction Watch, which tracks scientific retractions. Still, he applauded the effort: "Because the authors worked with the original researchers and repeated the experiments, the paper is an example of the gold standard of replication."

But Stanford's John Ioannidis, who famously penned a paper arguing that most published research findings are false, explained that precisely because this is the gold standard, the results may be a little too generous; in reality, the replication failure rate might be even higher.

"I say this because the 100 assessed studies were all published in the best journals, so one would expect the quality of the research and the false rates to be higher if studies from all journals were assessed," he said.

The 100 studies that were replicated were chosen after excluding some 50 others whose replication was thought to be too difficult. "Among those that did get attempted, difficult, challenging replication was a strong predictor of replication failure, so the failure rates might have been even higher in the 50 or so papers that no one dared to replicate," Ioannidis said.

Again, the scientists worked closely with the researchers of the original papers to get their data and talk over the details of their methods. This is why the effort is considered top quality — they tried hard to understand the original research and duplicate it — but that collaboration may also have biased the results, increasing the chances of a successful replication.

"In a few cases [the original authors] affected the choice of which exact experiment among many should be attempted to replicate," said Ioannidis.

Just listen to how difficult it was to repeat just one experiment

Even with all this buy-in and support, running a replication is an extremely difficult task, explained one member of the team, University of Virginia PhD candidate David Reinhard. In fact, after talking to Reinhard, I've come to view the chance of reproducing a study and arriving at the same result — especially in a field like psychology, where local culture and context are so important — as next to nil.

Reinhard had been hearing a lot recently about the problem of irreproducibility in science and wanted firsthand experience with replication. He had no idea what he was in for — and his journey says a lot about how arduous science can be.

To begin with, the original study he wanted to replicate failed during the pretesting stage. That's the first, little-appreciated step of any replication (or any study, for that matter), when researchers run preliminary tests to make sure their experiment is viable.

The study he finally settled on was originally run in Germany. It looked at how "global versus local processing influenced the way participants used priming information in their judgment of others."

In English, that means the researchers were studying how people use concepts they are currently thinking about (in this case, aggression) to make judgments about other people's ambiguous behavior when they are in one of two mindsets: a big-picture (global) mindset versus a more detail-oriented (local) mindset. The original study had found that participants were more suggestible when thinking big.

"Fortunately for me, the authors of the study were helpful in terms of getting the materials and communication," Reinhard said. He spent hours on the phone with them — talking over the data, getting information about details that were missing or unclear in the methods section of the paper (where researchers spell out how an experiment was conducted). He also had to translate some of the data from German to English, which took more time and resources.

This cooperation was essential, he said, and it's not always present. Even so, he added, "There were a lot of difficulties that arose."

Reinhard had to figure out how to translate the social context, bringing a study that ran in Germany to students at the University of Virginia. For example, the original research used maps from Germany. "We decided to use maps of one of the states in the US, so it would be less weird for people in Virginia," he said.

Another factor: Americans' perceptions of aggressive behavior are different from Germans', and the study hinged on participants scoring their perceptions of aggression. The German researchers who ran the original study based it on some previous research that was done in America, but they changed the ratings scale because the Germans' threshold for aggressive behavior was much higher.

Now Reinhard had to change them back — just one of a number of variables that had to be manipulated.

In the end, he couldn't reproduce their findings, and he doesn't know why his experiment failed. "When you change the materials, a lot of things can become part of the equation," he said. Maybe the cultural context mattered, or using different stimuli (like the new maps) made a difference.

Or it could just be that the original finding was wrong.

"I still think replication is an extremely important part of science, and I think that’s one of the really great things about this project," Reinhard said. But he's also come to a more nuanced view of replication, that sometimes the replications themselves can be wrong, too, for any number of reasons.

"The replication is just another sort of data point that there is when it comes to the effect but it’s not the definitive answer," he added. "We need a better understanding of what a replication does and doesn’t say."

Here's how to make replication science easier

After reading the study and talking to Reinhard, I had a much better sense of how replication works. But I also came away feeling pretty discouraged about the state of replication science.

It seemed a little too random, unsystematic, and patchwork — not at all the panacea many have made it out to be.

I asked Brian Nosek, the University of Virginia psychologist who led the Science effort, what he learned in the process. He came to a conclusion very similar to Reinhard's:

My main observation here is that reproducibility is hard. That's for many reasons. Scientists are working on hard problems. They're investigating things where we don't know the answer. So the fact that things go wrong in the research process, meaning we don't get to the right answer right away, is no surprise. That should be expected.

To make it easier, he suggested some fixes. For one thing, he said, scientists need to get better at sharing the details — and all the assumptions they may have made — in the methods sections of their papers.

"It would be great to have stronger norms about being more detailed with the methods," he said. He also suggested added supplements at the end of
papers that get into the procedural nitty-gritty, to help anyone wanting to repeat an experiment.

"If I can rapidly get up to speed, I have a much better chance of approximating the results," he said.
(Nosek has detailed other potential fixes in these guidelines for publishing scientific studies, which I wrote about here — all part of his work at the Center for Open Science.)

Ioannidis agreed and added that more transparency and better data sharing are also key. "It is better to do this in an organized fashion with buy-in from all leading investigators in a scientific discipline rather than have to try to find the investigator in each case and ask him or her in detective-work fashion about details, data, and methods that are otherwise unavailable," he said. "Investigators move, quit science, die, lose their data, have their hard drives with all their files destroyed, and so forth."

What both Ioannidis and Nosek are saying is that we need to have a better infrastructure for replication in place. 

For now, science is slowly lurching along in this direction. And that's good news, because trying to do a replication — even with all the infrastructure of a world-famous experiment behind you, as Reinhard had — is challenging. Trying to do it alone is probably impossible.
