Panel explores ongoing controversy of p-values in scientific research


Steven Goodman, MD, MHS, PhD

The appropriate use of p-values and the extent to which they should be used, if at all, in scientific research is a topic that has been discussed and debated by scientists and statisticians for decades.

An American Association for Cancer Research 2021 special forum, Biostatistics Debate: Should Science Be Guided by P-Values?, featured a panel of experts who discussed current schools of thought regarding the use of p-values. Registrants can watch a replay of this forum anytime until June 21, 2021.

Steven Goodman, MD, MHS, PhD, Stanford University, began with a review of why p-values were first introduced into science, why their use provoked early controversy, and what can be done today to reclaim the integral aspects of scientific reasoning.

“In the early 1920s, RA Fisher, a statistician and geneticist and probably one of the greatest scientists of the 20th century, was trying to develop a system of inference that facilitated inductive reasoning and avoided Bayes theorem,” Goodman said. “But he didn’t want to have a system of inference that required judging the plausibility of a hypothesis at the beginning, so he developed a whole system of inference, and part of this system was something that he dubbed the p-value.”

The p-value in Fisher’s framework, Goodman explained, was a statistical measure of the “distance,” or compatibility, between an observed data summary (which could be a mean, a risk difference, or an odds ratio) and its hypothesized null value.
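As a rough illustration of that “distance” idea, the sketch below computes a two-sided p-value for a sample mean. It is a minimal example under assumptions that are not from the forum: a normal approximation with a known standard deviation (a z-test), and made-up numbers. The function name `two_sided_p_value` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(observed_mean, null_mean, sd, n):
    """Two-sided p-value for a sample mean under a normal approximation (z-test)."""
    # Standard error: how much the sample mean varies from study to study.
    standard_error = sd / sqrt(n)
    # "Distance" of the observed summary from the null value, in standard errors.
    z = (observed_mean - null_mean) / standard_error
    # Probability, if the null were true, of a summary at least this far away.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Illustrative numbers only (assumed, not from any study).
z, p = two_sided_p_value(observed_mean=1.0, null_mean=0.0, sd=5.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0455
```

The result says nothing about how plausible the hypothesis was before the data were collected, which is the gap Goodman goes on to describe.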

That was the beginning of a controversy that continues today.

So, why are p-values so hard to abandon?

“It’s because NHST (null hypothesis significance testing) and p-values are accepted epistemic currency; they’re traded for knowledge claims, publication, and reputational benefits,” Goodman said. “The naïve use of p-values, with rigid cutoffs, have had an insidious effect. They’ve robbed the scientific community of a language for, and experience with, epistemic uncertainty about hypotheses. We don’t know how to collectively discuss prior probability, or how to combine this with data to justify degrees of confidence in our conclusions.”
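One standard way to combine a prior probability with data is the odds form of Bayes’ theorem, in which prior odds are multiplied by a Bayes factor that summarizes the evidence. The sketch below is an illustration only: the function name `posterior_probability` and the Bayes factor of 5 are assumptions, and nothing here is presented as the method Goodman described.

```python
def posterior_probability(prior_probability, bayes_factor):
    """Update a prior probability with a Bayes factor (odds form of Bayes' theorem)."""
    prior_odds = prior_probability / (1 - prior_probability)
    # The Bayes factor is how much more likely the data are under the
    # hypothesis than under its alternative.
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


# The same evidence (assumed Bayes factor of 5) applied to three different priors.
for prior in (0.1, 0.5, 0.9):
    print(prior, round(posterior_probability(prior, bayes_factor=5.0), 3))
# 0.1 0.357
# 0.5 0.833
# 0.9 0.978
```

Identical data justify very different degrees of confidence depending on the starting plausibility of the hypothesis, which a p-value alone does not express.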

The answer to the forum’s title question, whether science should be guided by p-values, is complicated, Goodman said.

“Guided, possibly, but obviously not dictated,” he said. “They (p-values) come with far too much conceptual and professional baggage to be used properly. We can’t solve this problem just by focusing on statistics and p-values, so we need to reform this complex, interconnected system of incentives and rewards for ‘positive’ studies before we can have the room to change the statistical foundations of that system. That is slowly beginning to happen now.”

Mi-Ok Kim, PhD

In her presentation, Mi-Ok Kim, PhD, UCSF Helen Diller Family Comprehensive Cancer Center, argued that science should not be guided by p-values, but that the problems attributed to p-values go beyond p-values themselves. They are problems of inference, and proper inference requires full reporting and transparency.

“It is not the p-value that is at fault; it is human psychology,” Kim said. “The misinterpretation and misuse of p-values is because our expectation does not match our reality, and human instinct kicks in to fool ourselves.”

Inference is difficult, Kim said, noting that the inferential content of an experiment is not simple and that a single summary statistic cannot capture all of its nuances.

“Additionally, our human psychology plays out, as well,” Kim said. “We are not patient with the necessary nuances of expression that good statistical reporting requires; we also tend to oversimplify by dichotomizing the results as significant or non-significant, that is, p-values less than .05 or not.”

Motomi Mori, PhD

In the forum’s final presentation, Motomi Mori, PhD, St. Jude Children’s Research Hospital, talked about the use of p-values in basic and translational research experiments, which are designed and regulated differently from clinical trials.

Mori noted that, in contrast to the hypothesis testing done in clinical trials, analyses of translational studies are often conducted with a focus on hypothesis generation.

“There is a pervasive use of p-value less than .05 as statistical significance in basic and translational research publications,” Mori said. “Statisticians are often forced to try different analytical approaches until they find something interesting or statistically significant, which is called p-hacking or data dredging, leading to selective reporting and, eventually, publication bias.”
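The effect of trying several analyses and reporting the best one can be sketched with a small simulation. This is a simplified illustration under stated assumptions, not a description of any real study: every analysis looks at pure noise (the null is true), the analyses are independent, and a z-test with known unit variance is used. In practice the analyses are usually correlated because they reuse the same data, so the inflation would differ. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def z_test_p(sample):
    """Two-sided p-value for mean = 0, assuming a known standard deviation of 1."""
    z = mean(sample) * sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_analyses, n_studies=2000, n_per_analysis=20, seed=0):
    """Share of null studies in which the smallest of n_analyses p-values is < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each analysis sees pure noise, so any "significant" result is false.
        p_values = [
            z_test_p([rng.gauss(0, 1) for _ in range(n_per_analysis)])
            for _ in range(n_analyses)
        ]
        if min(p_values) < 0.05:
            hits += 1
    return hits / n_studies


for k in (1, 5, 10):
    # Simulated rate next to the theoretical value 1 - 0.95**k.
    print(k, false_positive_rate(k), round(1 - 0.95**k, 3))
```

With independent analyses the theoretical chance of at least one p-value below .05 is 1 - 0.95^k, about 0.226 for five analyses and 0.401 for ten, and the simulated rates should land close to those values.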

Changing the culture of using p-values as a binary indicator and the basis for scientific decisions takes the efforts of the whole scientific community, Mori said, not just the statistical community.