An unhealthy obsession with p-values is ruining science

Over the past couple of years, Stanford meta-researcher John Ioannidis and several colleagues have been working on a paper that should make any nerd think twice about p-values, those tests of statistical significance that are now commonly perceived as a signal of a study's worth.

Their paper, published today in JAMA, examines p-values across 25 years of biomedical research. That involved doing some seriously impressive data crunching: The researchers analyzed more than 1.6 million study abstracts and more than 385,000 full-text papers, all of which included p-values.

What they found was "an epidemic" of statistical significance: 96 percent of the papers that included a p-value in their abstract boasted statistically significant results (on a scale from 0 to 1, a p-value that's statistically significant measures 0.05 or lower).

What’s more, Ioannidis told Vox, "the proportion of papers that use p-values is going up over time, and the most significant results have become even more significant over time." Only about 10 percent of the papers he looked at mentioned effect sizes in their abstracts, for example, and even fewer mentioned measures of uncertainty, such as confidence intervals. So very rarely were researchers giving any context about the real importance of their p-value findings.

All this means that as p-values have become more popular, they've also become more meaningless.

"If you’re a pessimist," Ioannidis added, "this may be called p-value trash."

But even if you’re an optimist, the new study suggests the entire biomedical world has been furiously chasing statistical significance, potentially giving dubious results the appearance of validity by churning them through this increasingly popular statistical method, or simply suppressing important results that don't look significant enough.

In the biomedical context, this finding is worrying. It means drugs and medical devices that don't work so well may be sold using p-values that suggest they do.

"So the big picture," Ioannidis concluded, "is that there are these millions and millions of papers with millions and millions of p-values floating around, and many are misleading."

Good luck trying to find a really clear definition of a p-value

If you're struggling to wrap your head around the definition of a p-value, you're not alone.

In the broadest sense, it's simply one of many ways researchers can test a hypothesis using statistics.

A more detailed and still comprehensible definition is actually shockingly hard to come by.

Here's a recent stab from the American Statistical Association:

"Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."

I called Rebecca Goldin, the director for Stats.org and a professor at George Mason University, for help parsing that still perplexing definition. She walked me through an example using drug studies, the kind Ioannidis and his colleagues examined.

Say a researcher has run a study testing the effect of a drug on an outcome like cholesterol, and she's trying to see whether the people on the drug (group A) improved their cholesterol levels more than the people who did not take the drug (group B). Let's say she finds that patients in group A (who got the medicine) also happened to lower their cholesterol more than those in group B (who didn't get the medicine).

The researcher has no way of knowing whether that difference in cholesterol levels is because of the medicine or some other difference between the two groups. "She cannot 'see' with her data alone whether, behind the scenes, God was rolling dice or whether the medicine was influencing cholesterol levels," said Goldin. In other words, the difference in cholesterol levels between the two groups may have occurred because of chance or because of the medicine — but that's a question the researcher can't answer using the data she has.

But there is something she can answer: If it were randomness alone ("God rolling the dice"), then how likely would it be that people's cholesterol levels came out as they did in this study? This is where the p-value comes in.

She can use a statistical method (in this case, resulting in a p-value) to check the probability that she would see the difference in cholesterol between the groups (or more extreme differences) under the assumption that the medicine had nothing to do with the difference. This assumption is called the "null hypothesis," and generating a p-value always starts with a null hypothesis.
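One way to picture the "God rolling the dice" question is a shuffle test. The sketch below is only an illustration, not the method used in any study mentioned here: the cholesterol numbers are invented, and `permutation_p_value` is a hypothetical helper name. It mixes both groups together, deals people back into two groups at random many times, and counts how often chance alone produces a gap at least as big as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """One-sided shuffle test: the share of random relabelings whose
    difference in means is at least as large as the observed one."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are arbitrary,
        # so any random split of the pooled data is equally plausible.
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:n_a], pooled[n_a:]
        diff = sum(shuffled_a) / len(shuffled_a) - sum(shuffled_b) / len(shuffled_b)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Invented cholesterol reductions (mg/dL), for illustration only.
drug_group = [22, 30, 18, 25, 27, 35, 20, 28]
no_drug_group = [10, 15, 12, 20, 8, 18, 14, 16]

p = permutation_p_value(drug_group, no_drug_group)
print(f"Share of shuffles at least as extreme as the real data: {p:.4f}")
```

The number printed is a p-value: the probability of a difference this large or larger, assuming the medicine had nothing to do with it.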

To actually calculate the p-value, the researcher would plug a bunch of numbers about her data (the number of people in the study, the average change in cholesterol for both groups, the standard deviation for each group, etc.) into a calculator. Again, the p-value that the calculator spits out will be the probability of seeing this data (the difference in cholesterol levels between the two groups) or more extreme data, given the null hypothesis (the medicine didn't work). A p-value of less than 0.05 is considered "statistically significant" by many in the medical community: an indicator that the data are unlikely, though still possible, if the medicine weren't working.
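For a rough sense of what such a calculator does, here is a minimal sketch that takes exactly those inputs. It assumes the groups are large enough for a normal (z) approximation, which is one common choice among several; real analyses often use a t-test or something more tailored. The function name and all the numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def p_value_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sided p-value for a difference in group means,
    using a normal approximation (an assumption of this sketch)."""
    # How much the difference in means would wobble from study to study.
    standard_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / standard_error
    # Probability of a gap at least this large, in either direction,
    # if the true difference between the groups were zero.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Invented summary numbers: average cholesterol drop (mg/dL),
# standard deviation, and number of people in each group.
p_example = p_value_from_summary(
    mean_a=25.0, sd_a=20.0, n_a=200,
    mean_b=20.0, sd_b=20.0, n_b=200,
)
print(f"p-value: {p_example:.4f}")  # roughly 0.012 for these inputs
```

With these invented inputs the result falls below 0.05, so it would count as "statistically significant" under the usual convention.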

To be clear: The p-value will not tell the researcher how likely it is that the medicine is working (or not working). So it won't tell her whether her original hypothesis (about whether the medicine works) is true or false. Instead, the p-value tells her the probability of seeing her data (the difference between group A and B), or more extreme data, given the null hypothesis. And, again, if the p-value is low (less than 0.05), the probability that this data would arise if the medicine did nothing is low, providing some evidence that the medicine is having an impact.

Why the p-value crisis matters

Ioannidis's paper, which raises questions about the trustworthiness of p-values, doesn't come in isolation.

Though statisticians have long been pointing out problems with "significance doping" and "P-dolatory" (the "worship of false significance"), journals have increasingly relied on p-values to determine whether a study should be published.

"We fear the p-value is used as a gatekeeper for determining what’s publishable research," said Ron Wasserstein, the executive director of the American Statistical Association. This means that good research with higher p-values is being turned away, that authors may turn themselves away from submitting to journals when they get a high p-value, or, even worse, that authors game their p-values or selectively report only low p-values (dubbed "p-value hacking") in order to make their results appear statistically significant and therefore publishable.
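To see why selectively reporting only low p-values is a problem, consider a back-of-the-envelope sketch. It assumes a researcher runs several independent tests on effects that don't actually exist, and it relies on the fact that a well-behaved p-value is equally likely to land anywhere between 0 and 1 when the null hypothesis is true. The function names and the choice of 20 tests are illustrative.

```python
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests of a
    true null hypothesis comes out below the alpha threshold."""
    return 1 - (1 - alpha) ** n_tests

def simulate_selective_reporting(n_tests, n_studies=20_000, alpha=0.05, seed=1):
    """Share of simulated studies that could report a 'significant'
    result by keeping only their smallest p-value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each test of a nonexistent effect yields a uniform p-value.
        smallest_p = min(rng.random() for _ in range(n_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / n_studies

exact = chance_of_false_positive(20)
simulated = simulate_selective_reporting(20)
print(f"Exact: {exact:.3f}, simulated: {simulated:.3f}")
```

With 20 tests, the exact figure is about 0.64: nearly two in three such studies would have at least one "significant" finding to report, even though nothing real is going on.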

"I'm concerned that work that’s important doesn't see the light of day because p-values didn't come out to be below 0.05," said Wasserstein. "I'm concerned that work that is published is published and considered successfully evidentiary based on low p-values."

When I asked Wasserstein how we arrived at this moment, he had a couple of guesses. First, software makes churning p-values out easier than ever. And second, a p-value is a temptingly easy figure to rely on when deciding whether research is valuable. "It's this number that looks like you could use it to make a decision that might otherwise be difficult to make or require a whole lot more effort to make," he said. Unfortunately, that's not true.

It doesn't have to be this way

Most ironic about this state of affairs is that the p-value had much more modest origins, as statistician Regina Nuzzo reported in Nature: When p-values were introduced by UK statistician Ronald Fisher in the 1920s, he intended them to be "one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions." They weren't the be-all and end-all of significance; again, they were intended as just one tool in the statistical toolbox.

But even with all of this controversy, few are suggesting abandoning the p-value altogether. Instead, the American Statistical Association just released guidance on p-value principles in an effort to use the method more conservatively and more accurately:

1. P-values can indicate how incompatible the data are with a specified statistical model.

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
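Principle 5 can be made concrete with a small sketch. It compares two imaginary studies that find the same tiny difference between groups and differ only in size. The numbers are invented, the normal approximation is an assumption of the sketch, and `summarize` is a hypothetical helper that reports an effect size and a 95 percent confidence interval alongside the p-value.

```python
from math import sqrt
from statistics import NormalDist

def summarize(diff, sd, n_per_group):
    """Effect size, two-sided p-value, and 95% confidence interval for a
    difference in means between two equal-sized groups with the same
    standard deviation (normal approximation)."""
    se = sqrt(2 * sd**2 / n_per_group)
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    margin = NormalDist().inv_cdf(0.975) * se
    return {
        "effect_size": diff / sd,  # standardized difference
        "p_value": p,
        "ci_95": (diff - margin, diff + margin),
    }

# Same invented 1 mg/dL difference, very different study sizes.
small_study = summarize(diff=1.0, sd=20.0, n_per_group=100)
large_study = summarize(diff=1.0, sd=20.0, n_per_group=10_000)

for name, result in [("small", small_study), ("large", large_study)]:
    low, high = result["ci_95"]
    print(f"{name}: effect size {result['effect_size']:.2f}, "
          f"p = {result['p_value']:.4f}, 95% CI ({low:.2f}, {high:.2f})")
```

The effect size is identical in both studies, and equally small. Only the large study crosses the 0.05 threshold, which shows that a low p-value can reflect a big sample rather than a big or important effect.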

Even Ioannidis doesn't think the p-value should be thrown out. Instead, he said, journals should crack down on their use of p-values. "They should insist on more [information] about what is the effect size, the uncertainty around effect size, and how likely [the results are] to be true."