Over the past couple of years, Stanford meta-researcher John Ioannidis and several colleagues have been working on a paper that should make any nerd think twice about p-values, those tests of statistical significance that are now commonly perceived as a signal of a study's worth.
Their paper, published today in JAMA, examines p-values across 25 years of biomedical research. That involved doing some seriously impressive data crunching: The researchers analyzed more than 1.6 million study abstracts and more than 385,000 full-text papers, all of which included p-values.
What they found was "an epidemic" of statistical significance: 96 percent of the papers that included a p-value in their abstract boasted statistically significant results (p-values run from 0 to 1, and by convention a result is called statistically significant when its p-value is 0.05 or lower).
What’s more, Ioannidis told Vox, "the proportion of papers that use p-values is going up over time, and the most significant results have become even more significant over time." Only about 10 percent of the papers he looked at mentioned effect sizes in their abstracts, for example, and even fewer mentioned measures of uncertainty, such as confidence intervals. So researchers were very rarely giving any context about the real importance of their p-value findings.
All this means that as p-values have become more popular, they've also become less meaningful.
"If you’re a pessimist," Ioannidis added, "this may be called p-value trash."
In the biomedical context, this finding is worrying. It means drugs and medical devices that don't work so well may be sold using p-values that suggest they do.
"So the big picture," Ioannidis concluded, "is that there are these millions and millions of papers with millions and millions of p-values floating around, and many are misleading."
Good luck trying to find a really clear definition of a p-value
If you're struggling to wrap your head around the definition of a p-value, you're not alone.
In the broadest sense, it's simply one of many ways researchers can test a hypothesis using statistics.
A more detailed and still comprehensible definition is actually shockingly hard to come by.
Here's a recent stab from the American Statistical Association:
"Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."
I called Rebecca Goldin, the director for Stats.org and a professor at George Mason University, for help parsing that still perplexing definition. She walked me through an example using drug studies, the kind Ioannidis and his colleagues examined.
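To make the ASA's wording concrete, here is a minimal Python sketch of one way to compute a p-value: a permutation test on a made-up drug trial. This is not Goldin's example, and the scores are hypothetical numbers invented for illustration. The "specified statistical model" here is the assumption that the drug does nothing, so the drug and placebo labels are interchangeable. The "statistical summary" is the difference between the two group means.

```python
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control):
    """Share of all possible relabelings whose mean difference is
    equal to or more extreme than the one actually observed."""
    observed = abs(mean(treated) - mean(control))
    pooled = treated + control
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Under the "drug does nothing" model, every way of splitting the
    # pooled scores into two groups of these sizes is equally likely.
    for chosen in combinations(indices, len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        if abs(mean(group_a) - mean(group_b)) >= observed:
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical improvement scores, invented for this sketch.
drug_scores = [51, 63, 70, 68, 59]
placebo_scores = [42, 50, 48, 55, 49]

p_value = permutation_p_value(drug_scores, placebo_scores)
print(f"p-value: {p_value:.4f}")
```

With these invented scores, only 4 of the 252 possible relabelings produce a gap at least as large as the observed one, for a p-value of about 0.016. Note what that number is: a statement about how surprising the data would be if the drug did nothing, not the probability that the drug works.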
Why the p-value crisis matters
Ioannidis's paper, which raises questions about the trustworthiness of p-values, doesn't come in isolation.
Though statisticians have long been pointing out problems with "significance doping" and "P-dolatory" (the "worship of false significance"), journals have increasingly relied on p-values to determine whether a study should be published.
"We fear the p-value is used as a gatekeeper for determining what’s publishable research," said Ron Wasserstein, the executive director of the American Statistical Association. This means that good research with higher p-values is being turned away, that authors may turn themselves away from submitting to journals when they get a high p-value, or, even worse, that authors game their p-values or selectively report only low p-values (dubbed "p-value hacking") in order to make them appear statistically significant and therefore publishable.
"I'm concerned that work that’s important doesn't see the light of day because p-values didn't come out to be below 0.05," said Wasserstein. "I'm concerned that work that is published is published and considered successfully evidentiary based on low p-values."
When I asked Wasserstein how we arrived at this moment, he had a couple of guesses. First, software makes churning out p-values easier than ever. And second, a p-value is a temptingly easy figure to rely on when deciding whether research is valuable. "It's this number that looks like you could use it to make a decision that might otherwise be difficult to make or require a whole lot more effort to make," he said. Unfortunately, a p-value alone can't make that decision for you.
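The selective reporting Wasserstein worries about is easy to quantify in an idealized case. The Python sketch below is an illustration under stated assumptions, not an analysis from the JAMA paper: it assumes a study that tests several outcomes, none of which has any real effect, and it assumes the tests are independent (real outcomes are often correlated). When there is no real effect and the test statistic is continuous, a p-value is equally likely to land anywhere between 0 and 1, so the simulation simply draws uniform random numbers. The function names are hypothetical.

```python
import random

ALPHA = 0.05  # the conventional significance threshold

def chance_of_false_positive(num_tests, alpha=ALPHA):
    """Probability that at least one of num_tests independent tests
    comes out 'significant' when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests

def simulate_selective_reporting(num_tests, trials=20_000, alpha=ALPHA, seed=0):
    """Simulate studies that run num_tests tests on pure noise and
    report only the lowest p-value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # With no real effect, each p-value is uniform on [0, 1].
        p_values = [rng.random() for _ in range(num_tests)]
        if min(p_values) <= alpha:
            hits += 1
    return hits / trials

for k in (1, 5, 20):
    exact = chance_of_false_positive(k)
    simulated = simulate_selective_reporting(k)
    print(f"{k:>2} tests: exact {exact:.3f}, simulated {simulated:.3f}")
```

Under these assumptions, a study that measures 20 unrelated outcomes and reports only the best one has roughly a 64 percent chance of producing a "significant" finding even when nothing real is going on.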
It doesn't have to be this way
Most ironic about this state of affairs is that the p-value had much more modest origins, as statistician Regina Nuzzo reported in Nature: When p-values were introduced by UK statistician Ronald Fisher in the 1920s, he intended them to be "one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions." They weren't the be-all and end-all of significance; again, they were intended as just one tool in the statistical toolbox.
But even with all of this controversy, few are suggesting abandoning the p-value altogether. Instead, the American Statistical Association just released guidance on p-value principles in an effort to get researchers to use the method more conservatively and more accurately:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
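Principle 5 can be seen with a little arithmetic. The Python sketch below is a simplified illustration, not part of the ASA guidance. It assumes two equal-sized groups, a known standard deviation of 1, and an observed difference between the groups that stays fixed at a tiny 0.05 standard deviations while the number of participants grows. The function names and numbers are hypothetical.

```python
import math

def two_sided_p_value(mean_difference, n_per_group, sd=1.0):
    """Two-sided p-value from a z-test comparing two group means,
    assuming a known, shared standard deviation."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

def confidence_interval_95(mean_difference, n_per_group, sd=1.0):
    """Approximate 95 percent confidence interval for the difference."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    margin = 1.96 * standard_error
    return (mean_difference - margin, mean_difference + margin)

TINY_EFFECT = 0.05  # same effect size in every row below

for n in (100, 1_000, 10_000):
    p = two_sided_p_value(TINY_EFFECT, n)
    low, high = confidence_interval_95(TINY_EFFECT, n)
    print(f"n per group = {n:>6}: p = {p:.4f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The effect is equally small in every row; only the sample size changes. With 100 people per group the p-value is nowhere near 0.05, and with 10,000 per group it is far below it. The effect size and its confidence interval are what tell a reader how big, and how uncertain, the result actually is.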
Even Ioannidis doesn't think the p-value should be thrown out. Instead, he said, journals should crack down on their use of p-values. "They should insist on more [information] about what is the effect size, the uncertainty around effect size, and how likely [the results are] to be true."