Statisticians want to abandon science’s standard measure of ‘significance’

Author: Bethany Brookshire / Source: Science News

statistics — The concept of “statistical significance” has become scientific shorthand for a finding’s worth. What might science look like without it?

In science, the success of an experiment is often determined by a measure called “statistical significance.” A result is considered to be “significant” if the difference observed in the experiment between groups (of people, plants, animals and so on) would be very unlikely if no difference actually exists.

The common cutoff for “very unlikely” is that you’d see a difference as big or bigger only 5 percent of the time if it wasn’t really there — a cutoff that might seem, at first blush, very strict.

It sounds esoteric, but statistical significance has been used to draw a bright line between experimental success and failure. Achieving an experimental result with statistical significance often determines if a scientist’s paper gets published or if further research gets funded. That makes the measure far too important in deciding research priorities, statisticians say, and so it’s time to throw it in the trash.

More than 800 statisticians and scientists are calling for an end to judging studies by statistical significance in a March 20 comment published in Nature. An accompanying March 20 special issue of the American Statistician makes the manifesto crystal clear in its introduction: “‘statistically significant’ — don’t say it and don’t use it.”

There is good reason to want to scrap statistical significance. But with so much research now built around the concept, it’s unclear how — or with what other measures — the scientific community could replace it. The American Statistician offers a full 43 articles exploring what scientific life might look like without this measure in the mix.

This isn’t the first call for an end to statistical significance, and it probably won’t be the last. “This is not easy,” says Nicole Lazar, a statistician at the University of Georgia in Athens and a guest editor of the American Statistician special issue. “If it were easy, we’d be there already.”

What does statistical significance offer?

Many scientific studies today are designed around a framework of “null hypothesis significance testing.” In this type of test, a scientist runs an experiment asking, say, whether a drug reduces depression in a treated group compared with a control group. The scientist then compares the results against the hypothesis that no difference really exists between the groups. The goal is not to prove that the drug fights depression. Instead, the idea is to gather enough data (eventually) to reject the hypothesis that it doesn’t.

The scientist will compare the groups using a statistical analysis that results in a P value, a number between 0 and 1, with the “P” standing for probability. The value signifies the likelihood that repeating the experiment would yield a difference as big as (or bigger than) the one the scientist got if the drug doesn’t actually reduce depression. Smaller P values mean that the scientist is less likely to see a difference that large if no difference really exists. In scientific parlance, the value is “statistically significant” if P is less than or equal to 0.05.
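To make that definition concrete, here is a minimal Python sketch of one way to estimate a P value, a permutation test. It is an illustration under stated assumptions, not the analysis any particular study used: the depression scores are invented, and the function name permutation_p_value and its parameters are hypothetical. The idea is to shuffle the group labels many times, as if the drug made no difference, and count how often the shuffled difference is as big as or bigger than the observed one.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Estimate a one-sided P value for 'treated scores are lower than control'.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Observed difference between groups (positive if the treated group scores lower).
    observed = mean(control) - mean(treated)

    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_big = 0

    for _ in range(n_permutations):
        # If no difference really exists, group labels are interchangeable,
        # so reassign them at random and recompute the difference.
        rng.shuffle(pooled)
        shuffled_diff = mean(pooled[n_treated:]) - mean(pooled[:n_treated])
        if shuffled_diff >= observed:
            at_least_as_big += 1

    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_big + 1) / (n_permutations + 1)


# Invented depression scores (lower is better); not real data.
treated_scores = [12, 9, 14, 8, 11, 10, 7, 13]
control_scores = [15, 13, 17, 12, 16, 14, 11, 18]

p_value = permutation_p_value(treated_scores, control_scores)
print(f"Estimated P value: {p_value:.4f}")
print("Statistically significant at 0.05?", p_value <= 0.05)
```

With these made-up numbers the treated group averages 4 points lower than the control group, and very few random relabelings produce a gap that large, so the estimated P value falls below the conventional 0.05 cutoff.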

When scientists interpret P values correctly, they can be useful for finding out how compatible experimental results are with the scientists’ expectations, Lazar says. Because a P value is a probability, it “has variability attached to it,” she explains. “If I repeated my procedure over and over, I’d get a whole range of P values. Some would be significant, some wouldn’t.”
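Lazar’s point about variability can be illustrated with a small simulation. The Python sketch below is a simplified model built on assumptions that are not in the article: scores are drawn from normal distributions with a known spread, each experiment is analyzed with a two-sided z-test, and the names simulate_p_values and fraction_significant are hypothetical. It repeats the same experiment many times and records the P value each time.

```python
import random
from statistics import NormalDist, mean


def simulate_p_values(true_difference, n_per_group=20, n_experiments=2000,
                      sigma=1.0, seed=1):
    """Repeat a two-group experiment and return the P value from each run.

    Hypothetical helper: assumes normally distributed scores with known
    standard deviation sigma, analyzed with a two-sided z-test.
    """
    rng = random.Random(seed)
    standard_normal = NormalDist()
    # Standard error of the difference between two group means.
    standard_error = sigma * (2 / n_per_group) ** 0.5

    p_values = []
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(-true_difference, sigma) for _ in range(n_per_group)]
        z = (mean(control) - mean(treated)) / standard_error
        # Two-sided P value: chance of a |z| at least this large under no difference.
        p_values.append(2 * (1 - standard_normal.cdf(abs(z))))
    return p_values


def fraction_significant(p_values, cutoff=0.05):
    """Share of experiments whose P value is at or below the cutoff."""
    return sum(p <= cutoff for p in p_values) / len(p_values)


# Case 1: the drug does nothing. Case 2: a modest real effect (half a standard deviation).
null_p_values = simulate_p_values(true_difference=0.0)
effect_p_values = simulate_p_values(true_difference=0.5)

print("No real difference, share with P <= 0.05:", fraction_significant(null_p_values))
print("Real difference, share with P <= 0.05:", fraction_significant(effect_p_values))
print("Range of P values with a real difference:",
      min(effect_p_values), "to", max(effect_p_values))
```

In this toy setup, roughly 1 in 20 experiments comes out “significant” even when the drug does nothing, and identical experiments on a drug with a real effect land on both sides of the 0.05 line, which is the spread of P values Lazar describes.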

Because of this variability, P equal to 0.05 was never meant to be an end result. Instead, it was more of a beginning, “something that would cause you to raise your eyebrows and investigate further,” Lazar says.

Where did the idea for statistical significance come from?

Many scientists now interpret P equal to 0.05 as a cutoff between an experiment that “worked” and one that didn’t. That cutoff can be attributed to one man: famed 20th century statistician Ronald Fisher. In a 1925 monograph, Fisher offered a simple test that research scientists could use to produce a P value. And he offered the cutoff of P equals 0.05, saying “it is convenient to take this point as a limit in judging whether a deviation [a difference between groups] is to be considered significant or not.”

That “convenient” suggestion has reverberated far beyond what Fisher probably intended. In 2015, more than 96 percent of papers in the PubMed database of biomedical and life science papers boasted results with P less than or equal to 0.05.