A bot crawled thousands of studies looking for simple math errors. The results are concerning.

“Statcheck” is a program that automatically detects errors in psychology papers.

There are a lot of problems with science. Rounding errors (yes, rounding errors) are one of them.
Max Griboedov / Shutterstock
Brian Resnick is Vox’s science and health editor, and is the co-creator of Unexplainable, Vox's podcast about unanswered questions in science. Previously, Brian was a reporter at Vox and at National Journal.

On August 25, Jennifer Tackett, a psychology professor at Northwestern University, got an odd email in her inbox. It was from PubPeer, an online forum where people share and discuss scientific articles. And it made her a little anxious.

“PubPeer is typically used to point out errors in papers, and no scientist wants to find errors in their published work,” Tackett tells me in an email.

When she clicked through to the PubPeer comment, she found a message from a program called “Statcheck,” informing her that it had been double-checking the math in her paper.

Tackett wasn’t the only person to get such an email. Statcheck had scanned 50,000 papers and uploaded the results to the PubPeer comments. And while Tackett’s paper checked out, others had a more unpleasant surprise.

Statcheck is a simple program built to catch simple errors. But it provokes bigger questions about how the field of psychology should go about correcting errors in past work. Because psychologists are starting to find a great many.

Why scientists need robots to check their math

In science, even the tiniest details matter.

There are a lot of problems in science right now: publication bias, p-hacking, incentives that reward the quantity of papers over their quality. Those are big, systemic problems, as we’ve outlined extensively here at Vox.

But Michèle Nuijten, a PhD student at Tilburg University in the Netherlands who co-created Statcheck, has her sights on fixing a much smaller but surprisingly impactful problem in science: rounding errors.

“When starting this project, I wouldn’t say [this was a big problem],” Nuijten tells me. “We’re detecting when people are making rounding errors, who cares?”

But she and some colleagues in the Netherlands were curious enough to check. They built a computer program that could quickly scan published psychological papers and check the math on the statistics. They called their program “Statcheck” and ran it on 30,717 papers.

Rounding errors, and other small potential mistakes in calculating the statistics, were rampant. “We found that half of all published psychology papers ... contained at least one p-value that was inconsistent with its test,” Nuijten and her co-authors reported in 2015 in the journal Behavior Research Methods.

Most striking was that the errors weren’t entirely random. Most of the errors tipped the results in favor of statistical significance. And around 13 percent of the papers contained an error that could potentially change the paper’s conclusions.
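To make the idea concrete, here is a minimal sketch of that kind of consistency check in Python. It is not Statcheck’s actual code. It assumes a simple z-test (chosen only because its p-value can be computed with Python’s standard library) and a result reported as an exact “p = ...” value. The function names, the rounding tolerance, and the .05 threshold are all illustrative assumptions.

```python
import math


def two_tailed_p_from_z(z):
    """Two-tailed p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_reported_p(z, reported_p, z_decimals=2, p_decimals=2, alpha=0.05):
    """Compare a reported p-value with the one implied by a reported z.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    # The reported z was itself rounded, so the true value lies in a small
    # interval around it. That interval implies a range of possible p-values.
    half_step = 0.5 * 10 ** (-z_decimals)
    p_max = two_tailed_p_from_z(max(abs(z) - half_step, 0.0))
    p_min = two_tailed_p_from_z(abs(z) + half_step)

    # The report is consistent if its p-value falls inside that range once
    # everything is rounded to the number of decimals the paper used.
    consistent = (
        round(p_min, p_decimals)
        <= round(reported_p, p_decimals)
        <= round(p_max, p_decimals)
    )

    # An inconsistency matters most when it flips the significance verdict.
    recomputed_p = two_tailed_p_from_z(z)
    changes_conclusion = (not consistent) and (
        (reported_p < alpha) != (recomputed_p < alpha)
    )

    return {
        "recomputed_p": recomputed_p,
        "consistent": consistent,
        "changes_conclusion": changes_conclusion,
    }


# A report whose p-value matches its test statistic.
print(check_reported_p(2.20, 0.03))
# A report that claims significance the test statistic does not support.
print(check_reported_p(1.90, 0.04))
```

In the second call, the reported p-value sits below .05 while the recomputed one sits above it. That is the kind of mismatch that could change a paper’s conclusions, as opposed to a harmless slip in the last decimal.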

Were study authors purposely fudging their arithmetic? Not necessarily. These types of mistakes are easy to make: they’re “typos,” errors in transferring data from one program to another, she says. But because academic journals are more likely to publish significant results, these erroneous findings may be more likely to end up in the published literature.

Statcheck made the results of 50,000 papers public. Was that wrong?

Nuijten’s co-author Chris Hartgerink, also a PhD student at Tilburg, wanted to make more people aware of potential problems in their work. So he published Statcheck’s results on 50,000 papers on PubPeer (with PubPeer’s consent).

The point was not to accuse the authors but to perform something like a spell check, Hartgerink tells me in an email. And not everyone was pleased. “Some indeed objected due to unrequested feedback, and some found the emails annoying from PubPeer [since PubPeer notifies authors of comments] if Statcheck found no potential errors,” he says.

Psychology is currently going through a period of painful introspection, as researchers realize that their methods and institutions have produced results that do not replicate. There’s a big, uncomfortable question of how to criticize past work, and whether online critiques of past work constitute “bullying” or shaming. The PubPeer comments are just a tiny part of that debate. But they can hit a nerve.

Susan Fiske, a former president of the Association for Psychological Science, alluded to Statcheck in an interview with Business Insider, calling it a “gotcha algorithm.”

Recently, Fiske unleashed a firestorm of debate after a draft of an op-ed column leaked in which she criticized the broader trend of researchers critiquing papers outside the peer review system. The draft of the article said these online critics were engaging in “methodological terrorism.” The gist of her article was that she feared too much of the criticism of past work had been ceded to social media, and that the criticism there is taking on an uncivil tone. (Fiske’s critics responded that she was being uncivil in using terms like “methodological terrorism.”)

Dorothy Bishop, an Oxford psychologist, appreciated the intention of Statcheck checking errors, she writes on her blog, but worries the tone of the automatically generated comments will “alienate many people.” She continues:

I would like to know if errors were found in my stats, and it is entirely possible that there are some, since none of us is perfect. So I don't want to over-react, but I think that if I, as someone basically sympathetic to this agenda, was irritated by the style of the communication, then the odds are this will stoke real hostility for those who are already dubious about what has been termed 'bullying' and so on by people interested in reproducibility.

Michael Kane (whose tweet I’ve embedded at the top of this story) tells me he was glad for the heads up, even if it caught him by surprise.

“I know I make mistakes (I've had to request corrections to two of my own papers for minor transcription mix-ups that were my fault, not the journal's),” he writes me in an email, “so I don't mind my work being checked up on. I think it's generally a good thing for the field.”

The goal isn’t to shame researchers or to replace human reviewers, Hartgerink and Nuijten both stress, but to help researchers avoid making these mistakes in the future. That, in turn, would help clean up one tiny but impactful problem among the many in the psychological literature.

“The literature is growing faster and faster, peer review is overstrained, and we need technology to help us out,” Hartgerink says. “These tools are scalable with almost no marginal costs, whereas human elements are not.”

Hartgerink and Nuijten also underscore that Statcheck is not perfect.

“Most of the time, it works,” Nuijten says. (The program works because there are strict rules for formatting psychological findings. If a paper deviates from the standard style, or if it uses trickier statistics, Statcheck might get confused.) Statcheck is still a work in progress.
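The reliance on strict formatting is easy to picture with a small example. The sketch below is not Statcheck’s real parser. It is a hypothetical regular expression, written in Python for illustration, that picks out a few results written in the familiar “t(28) = 2.20, p = .03” style. The pattern, the function name, and the sample sentence are all assumptions.

```python
import re

# Hypothetical pattern for results such as "t(28) = 2.20, p = .03",
# "F(2, 45) = 4.10, p < .05" or "z = 1.90, p = .04".
APA_RESULT = re.compile(
    r"\b(?P<test>[tFzZ])\s*"                       # which test was run
    r"(?:\(\s*(?P<df1>\d+(?:\.\d+)?)\s*"           # first degrees of freedom
    r"(?:,\s*(?P<df2>\d+(?:\.\d+)?)\s*)?\))?\s*"   # optional second df
    r"=\s*(?P<value>-?\d*\.?\d+)\s*,\s*"           # the test statistic
    r"p\s*(?P<cmp>[<>=])\s*(?P<p>\d?\.\d+)"        # the reported p-value
)


def extract_results(text):
    """Return one dict per reported result found in the text."""
    # Values are kept as strings so the number of reported decimals survives.
    return [match.groupdict() for match in APA_RESULT.finditer(text)]


sample = (
    "The effect was significant, t(28) = 2.20, p = .03. "
    "Groups differed, F(2, 45) = 4.10, p < .05. "
    "A follow-up test was marginal, z = 1.90, p = .04."
)

for result in extract_results(sample):
    print(result)
```

A sentence that spells the same numbers out differently, say “the t statistic came to 2.20 with 28 degrees of freedom,” would slip past a pattern like this one. That is the sort of confusion nonstandard formatting can cause.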

Ultimately, if Statcheck flags an “inconsistency” (it won’t explicitly say the paper is wrong), researchers ought to confirm it by hand.

Now anyone can use Statcheck

Rather than performing analyses after the fact, Nuijten and Hartgerink want Statcheck to stop errors from being published in the first place. (Hartgerink says he currently doesn’t have plans to upload more results onto PubPeer.)

That’s why this week they launched a web application where anyone can upload a psychological research paper (it has to be formatted in American Psychological Association style) and get a quick check of their math.

In the future, Nuijten wants to expand Statcheck’s reach and give it the ability to scan papers in biomedical sciences (the problems with reproducibility are prevalent in that field as well).

And overall, Statcheck makes an interesting case for the use of automated tools for fixing science. So many of the problems in science today stem from human error or human biases. Maybe the solutions, then, shouldn’t be human.