What decision are we trying to make, anyway?
Everyone acknowledges, or even agrees with, the issues and limitations of statistical significance (which I previously wrote about here): the effect size is ignored, you’ll get it with a really large sample, it doesn’t account for practical importance, and so on…
Yet it is still used all the time, and life goes on. People debate these nuances back and forth as to when and where it is or isn’t valid, etc. But just recently in a conversation on LinkedIn, I had a realization of the fundamental reason on why it is an issue, and why there are many calls to ban p-values and statistical significance entirely.
“Except experimental field, like randomized clinical trials, where both statistical and practical significance are necessary and when combined, answer the question of interest.”
It’s not that a p-value, or statistical significance itself even, is inherently bad or wrong. It all has to do exclusively with how they are used, interpreted, implemented, and implicated.
Take this point from Adrian Olszewski’s post (which was cited in the conversation above) talking about using statistical significance in combination with practical importance, but in a mathematically formal manner.
The key quote here is:
“We all make BINARY decisions every day, based on some criteria.”
And from this supporting image:
“Thresholds are essential in binary decision making in experimental studies…”
All of that is well and good, and sure, makes sense if statistical significance helps you solve a problem in that way. But the main thing that stuck out to me was the emphasis on decisions.
THIS is exactly the problem with statistical significance.
We use it to declare there is or isn’t an effect of some sort, therefore implicating decision X, Y, or Z. But in many (or most) cases, it is only a guise of a decision actually being made. Think about it: in a typical academic publication, they collect data, run some statistical modeling and/or tests, and get a p-value. The “decision” being made amounts to writing the words “there was a significant difference” on a page in some boilerplate fashion, not an actual action being taken. Where’s the decision?
Sure, you might argue that a published statistical test result could lead to many decisions being made by those consuming the analysis. That’s exactly the point. We have no idea what these decisions are. It is intractable. Those subsequent decisions are not actually related to or reflected in the statistical setup and testing done in the original analysis. We don’t know who is using the results, how they are using them, or what they are using them for. It’s just words on a page.
Contrast this with, say, William Sealy Gosset (the inventor of the t-test, and who I wrote about previously), whose primary goal was to brew better beer. His statistical testing methodology, and the use of statistical significance, was directly tailored to the problem at hand and the decisions he was trying to make in order to produce higher quality product. It was tangible. He could see the impact with his own two eyes of using significance testing for decisions he was actually making. And in his case, it provided value. Kind of like the way a tape measure is as useful as the context it is applied and the tolerance needed for making a decision. It’s not just the fact that a tape measure was used that creates its importance, it’s that the information happens to be useful in that situation so subsequent action can be taken.
So when we say statistical significance or p-values should be banned, what we really mean is that they shouldn’t be used the way they are. If we could just find better uses for them, then sure, they can stay. The problem is that incentive structures, education, etc. are poorly designed around these concepts, and they are so deeply ingrained that it’s difficult to dig our way out.