Statistical significance is…insignificant

Categories: History, Philosophy, Research

“A cheap way to get marketable results” – William Kruskal

Author: Alex Zajichek
Published: December 22, 2023

The longer I’ve been practicing as a statistician, maybe paradoxically, the more skeptical I’ve become of statistical significance (#1). It manifests as a feeling of dissatisfaction, as if, even though you’ve stated what you “found”, you don’t actually believe it to be true. I recently finished reading The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives–it instantly became one of my favorite books (here are my favorite quotes and passages). It affirms a lot of what I’ve come to suspect, with deep articulation about the vastness of the issue, backed by a thorough historical foundation. I can’t help but wonder about the broader scientific, political, and societal implications this has had over the years (and continues to have). It really lit a fire in me to continue learning about and unraveling statistical history to connect those dots.

My take

The significance of a statistical result cannot be mechanically, mathematically, or systematically determined. It must be driven by practical relevance, or importance, which is inherently subjective: a product of the values, beliefs, interests, and/or goals of the individual(s) interpreting the data. That result is always subject to dispute, critique, replication, and refinement, whether for reasons of process error (#2) or the value of the information itself. The chance occurrence of a sampling probability crossing an arbitrary threshold is actually irrelevant.

What is statistical significance?

“It’s embedded like a tax code in the bureaucracy of science.”1

Technically speaking, it is when the probability of observing data at least as extreme as ours, if a hypothesized state of the world were true (known as the p-value), is so small (notoriously, and most often, less than 5%) that the hypothesized state of the world must be false, and therefore we have “significant” statistical evidence to say so. It positions itself as an objective tool to decide (on the basis of this probability threshold) whether a statistical relationship is “real” and, often, subsequently whether it “matters”.
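
To make the mechanic concrete, here is a minimal sketch in Python. The coin-flip numbers are hypothetical, purely for illustration: we hypothesize a state of the world, collect data, and let the resulting probability render the verdict.

```python
# A minimal sketch of the p-value mechanic: hypothesize that a coin is fair
# (the assumed state of the world), flip it 100 times, observe 61 heads, and
# ask how likely data at least this extreme would be if the hypothesis held.
from scipy import stats

n, observed_heads = 100, 61                                 # hypothetical data
p_value = stats.binomtest(observed_heads, n, p=0.5).pvalue  # two-sided test
print(f"p-value = {p_value:.3f}")                           # ~0.035
print("statistically significant" if p_value < 0.05 else "not significant")
```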

Take this recent study, for example, which is meant to characterize physician-propagated misinformation about the COVID-19 pandemic. The authors outline a set of basic premises that are used as a basis for classifying contrarian statements as misinformation (#3). At least some of this is built upon the attainment of statistical significance (or lack thereof).

As an example, in the category of Promoting Unapproved Medications for Prevention or Treatment (in the Results section), the authors state:

“The 2 most prominent medications promoted were ivermectin and hydroxychloroquine, which have been found to not be effective at treating COVID-19 infections in randomized clinical trials.”2

This premise drove them to classify social media posts like this:

“Two of my toughest COVID patients–showed up with oxygen stats of 68% and 84% and would not go to the hospital. We treated them with IVM, steroids, and breathing treatments and here they are now.”2

as misinformation (see all supportive quotes). This is an actual doctor saying the drug helped their patients, but the authors have deemed it ineffective. What justifies them making such a universal claim?

If you look at one of their references, the meta-analysis shows the relative risk for all-cause mortality was estimated to be 37%. That is, the risk of death was 63% lower in patients who received ivermectin versus placebo or standard of care. However, because the 95% confidence interval ranged from 12% to 113% (i.e., it was plausible that ivermectin produced anywhere from an 88% reduction in mortality risk to a 13% increase), it was deemed not statistically significant, and as the authors state:

“IVM [ivermectin], compared with control treatment, did not have an effect on the all-cause mortality rate.”3

and ultimately,

“Ivermectin is not a viable treatment option for COVID-19.”3

In other words, because the probability of observing this data under the assumption of no difference in mortality risk (our p-value definition above) was not less than 5% (it was 31%), the authors took that as grounds to conclude no difference at all (#4). Furthermore, if the confidence interval crossed 100% by any amount, no matter how small, the p-value would have remained above 5% and not reached the threshold for statistical significance.
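
To see how knife-edged that is, here is a back-of-the-envelope sketch with hypothetical numbers (not the study’s), assuming the usual normal approximation on the log scale: two nearly indistinguishable results land on opposite sides of the significance line purely because of where the interval’s upper bound sits relative to 100%.

```python
# Two hypothetical relative-risk estimates that differ trivially, yet receive
# opposite "significance" verdicts because one 95% CI barely excludes 100%
# and the other barely crosses it.
import numpy as np
from scipy import stats

def p_value_from_ci(rr, lower, upper):
    """Approximate two-sided p-value for H0: RR = 1, backed out of a 95% CI."""
    se = (np.log(upper) - np.log(lower)) / (2 * 1.96)  # CI width -> std. error
    z = np.log(rr) / se
    return 2 * stats.norm.sf(abs(z))

print(p_value_from_ci(0.60, 0.36, 0.99))  # upper bound just under 1 -> p ~ 0.048
print(p_value_from_ci(0.61, 0.37, 1.01))  # upper bound just over 1  -> p ~ 0.054
```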

Why is it flawed?

“Real science changes one’s mind. That’s one way to see that the proliferation of unpersuasive significance tests is not real science.”1

An arbitrary threshold

The 5% threshold is arbitrary. Despite that common acknowledgement, willful ignorance tends to prevail out of tradition and adherence to norms. The fact that the perceived significance of a result can flip on minute differences speaks to the lack of robustness in the logic. In the book, the authors frequently discuss the importance of a loss function, which focuses on the potential consequences and implications of the result for the real-world decisions to be made from the information, rather than on a predefined threshold based on sampling-error probability. In this sense, the allowable risk tolerance can’t be objectively or mechanically determined. It is context-dependent, and not all decisions are created equal. Yes, the p-value above was 31%, but that error rate, along with the plausible range of risks (and benefits), may be sufficient for someone needing to make a treatment decision now.

Risks are subjective

“It always depends on the loss, measured in side effects, treatment cost, death rates. The loss to a cool, scientific, impartial spectator will not be the same as the loss to the patient in question…[the balance between Type I/II errors] ‘must be left to the patient, friends, and family’.”1

Beyond the statistical significance of a result is the question of what to do about it. In the same article, the authors state the following, still in the context of misinformation:

“Claims that myocarditis was common in children who received the vaccine and that the risks of myocarditis outweighed the risk of vaccination were also unfounded.”2

Never mind the fact that the study they reference does show an increase in the monthly case volume of myocarditis and pericarditis between the pre- and post-vaccine periods, with the authors stating:

“Myocarditis developed rapidly in younger patients, mostly after the second vaccination. Pericarditis affected older patients later, after either the first or second dose.”4

The more important point is that the weight individuals place on statistical results to inform their decision-making is subjective. The risk may be low, maybe even lower than the alternative’s, but that doesn’t dictate how someone should weigh it.

“Imagine that you and your infant child are standing on a sidewalk near a busy street. You have just purchased a hot dog from the street vendor and have safely crossed the street. Scenario 1: You suddenly realize you have forgotten the mustard and if you scurry across the busy street, dodging vehicles, there is a 95% probability you’ll return safe with your mustard. Scenario 2: You forgot your child and you watch as she tries to cross the street herself, if you scurry across the busy street, dodging vehicles, there is a 95% probability you’ll return safe with your child. The sizeless scientist in effect declares ‘they are equally important reasons for crossing the street’”1

It can’t depend on sample size

“At high sample sizes, all null hypotheses are rejected, by mathematical fact, without having to look at the data.”1

One pretty simple argument is that of sample size. In most contexts, so long as the true effect is not exactly zero, a statistical test becomes ever more likely to be declared significant simply by amassing more data, regardless of how small that effect actually is. This behavior is completely mechanical and dissociated from the real-world context in which the test is being run. Thus, it prioritizes quantity over substance and, when blindly used, promotes results that may lack practical meaning.
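
A toy simulation (entirely made-up data) illustrates the point: hold the true effect fixed at a trivially small value, and statistical significance arrives anyway once enough data has been amassed.

```python
# Fix a tiny, practically meaningless true difference in means (0.01 standard
# deviations) and watch the t-test drift toward "significance" as N grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
tiny_effect = 0.01  # trivially small true difference

for n in [100, 10_000, 1_000_000, 5_000_000]:
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(tiny_effect, 1.0, size=n)
    p = stats.ttest_ind(treated, control).pvalue
    print(f"N per group = {n:>9,}: p = {p:.4f}")
```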

“…some cause of natural selection may have a high probability of replicability in additional samples but be trivial. Yet a cause may have a low probability of replicability but be important. This is what we mean when we say that a test of significance is neither necessary nor sufficient for a finding of importance”1

It also tends to shift the focus toward attaining statistical significance and using it as a filter, risking the loss of meaningful insights that didn’t reach that level.

We don’t believe in “zero-sized” effects

“Real scientists draw a line between what is large and small.”1

A major contradiction arises.

The typical hypothesis test is conducted under the assumption of a null hypothesis positing no effect. For example, in calculating the p-value above, it is assumed that there is no difference in all-cause mortality rates between the treatment groups. However, I would argue that in any practical context, it’s rare that someone would genuinely believe in the existence of precisely zero effect. Rather, it would stand to reason that what they really mean is “effectively zero” effect, something so small that it is considered inconsequential.

Herein lies the contradiction: they have now acknowledged some level of substantive significance, albeit undefined. Even if the true effect happens to be smaller than this threshold, as we just explained, the estimate will still eventually be declared statistically significant with mathematical certainty, no matter how minuscule the effect, thus inevitably crossing the unspoken threshold of substantive meaning. This calls into question the value of attaining statistical significance at all, in favor of explicit consideration of the real-world implications (i.e., the loss function). At the very least, the substantive threshold should be identified and reflected in the null hypothesis so that the p-value is calibrated for substance.
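
Here is a minimal sketch of what that calibration could look like (the data and the substantive threshold delta are made up for illustration): with a huge sample and a trivial true effect, the conventional point null is demolished while the substance-calibrated null is not.

```python
# Shift the null from "exactly zero effect" to "no bigger than the smallest
# effect anyone would actually care about" (delta, chosen arbitrarily here).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, true_effect, delta = 1_000_000, 0.01, 0.1  # delta = substantive threshold
x = rng.normal(true_effect, 1.0, size=n)

se = x.std(ddof=1) / np.sqrt(n)
p_point = 2 * stats.norm.sf(abs(x.mean()) / se)         # H0: effect = 0
p_substantive = stats.norm.sf((x.mean() - delta) / se)  # H0: effect <= delta

print(f"point null p = {p_point:.2e}, substantive null p = {p_substantive:.3f}")
```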

The fallacy of the transposed conditional

This is where it gets especially interesting. There are logical errors in the conclusions drawn from hypothesis testing. I think the best way to describe it is to jump into the classic example from Jacob Cohen’s The Earth Is Round (p < .05) from 1994:

“The incidence of schizophrenia in adults is about 2%. A proposed screening test is estimated to have at least 95% accuracy in making the positive diagnosis (sensitivity) and about 97% accuracy in declaring normality (specificity)…With a positive test for schizophrenia at hand, given the more than .95 assumed accuracy of the test, the probability of a positive test given that the case is normal is less than .05, that is, significant at p < .05. One would reject the hypothesis that the case is normal and conclude that the case has schizophrenia, as it happens mistakenly, but within the .05 alpha error. But that’s not the point. The probability of the case being normal, given a positive test, is not what has just been discovered however much it sounds like it and however much it is wished to be. It is not true that the probability that the case is normal is less than .05, nor is it even unlikely that it is a normal case. By a Bayesian maneuver, this inverse probability, the probability that the case is normal, given a positive test for schizophrenia, is about .60!”5

The desired interpretation of a statistically significant result induces a technical problem. The p-value provides the likelihood of observing the data under the assumption that the null hypothesis is true (a single state of the world), yet we want to interpret it as evidence about the parameter of interest given the data. After all, we did collect it, and want that to be the basis of our conclusions. But that is not the probability we have concerned ourselves with. Using the p-value as a singular basis to determine significance disregards all other possibilities for what the true parameter could be. When those possibilities are imbalanced (as they were here, since only 2% of the population had schizophrenia), it confuses which state of the world is most likely given the data with how likely the data is given a single state of the world (#5).
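
Cohen’s arithmetic is easy to verify; the following is nothing more than Bayes’ rule applied to the numbers in the quote.

```python
# P(normal | positive test) is driven by the 2% base rate, not by the test's
# error rates alone.
prevalence = 0.02   # P(schizophrenia)
sensitivity = 0.95  # P(positive | schizophrenia)
specificity = 0.97  # P(negative | normal)

p_positive = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
p_normal_given_positive = (1 - prevalence) * (1 - specificity) / p_positive

print(f"P(normal | positive test) = {p_normal_given_positive:.2f}")  # ~0.61, Cohen's "about .60"
```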

What to do instead?

“Real science, unlike significance-testing science, is difficult. If it were not, it would not be real science, but instead it would be already established routine. Real science asks you to make real scientific judgements and real scientific arguments within a community of other scientists. It asks you to be quantitatively persuasive, not to be irrelevantly mechanical. Life is hard.”1

It’s a scary thing to think about. Suppose statistical significance isn’t there to bail you out. What are you supposed to do? How do you know if your results matter or not? I think this passage gives a pretty clear answer:

“She can test her belief in the price effect by looking at the magnitudes, using, for example, the highly advanced technique common in data-heavy articles in physics journals: ‘interocular trauma’. That is, she can look and see if the result hits her between the eyes.”1

The premise of this article has been that the implications of statistical results are context-dependent, so there isn’t a one-size-fits-all alternative to replace statistical significance. Rather than seeking a systematic approach, the emphasis should be placed on cultivating understanding of the subject matter. It’s akin to relying on intuition, like a feeling of “knowing” that you’ve gotten what you needed. Take this simple analogy: a tape measure is a tool that quantifies information needed to inform subsequent action, and the precision of the measurement is tailored to the specific needs of the task at hand. Sometimes a rough estimate is sufficient, while other times meticulous precision is necessary. The goal is to reach the point where, intuitively, you “know” that you’ve obtained the necessary information to move forward confidently. I see statistics the same way: merely a tool used to quantify the desired information needed to inform (i.e., augment, not determine) a decision.

Now, I’m not going to claim that I haven’t repeatedly engaged in the practices I’m arguing against (it’s hard not to), but these are the things I’m going to focus on moving forward instead of p-values and statistical significance:

1. Estimation & magnitude

This is probably the easiest change to start making because it doesn’t require an overhaul of statistical methods, just a shift in focus to the magnitude of the estimates. By deliberately avoiding p-value calculations (and, when reading and consuming research, simply ignoring the concept of statistical significance altogether), the interpretation is governed by (a plausible range of) effect sizes, untainted by arbitrary, context-agnostic significance thresholds, which forces the scientific argument to be made on that basis. With a little extra brain power (and humility), this creates a much more contextually rich, informative interpretation.
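
As a sketch of what that shift looks like in practice (with made-up data), the output below is an effect size and a plausible range for it, full stop; any argument about whether it matters has to be made from those magnitudes.

```python
# Report the estimated risk ratio and a bootstrap range of plausible values;
# no p-value, no significance verdict.
import numpy as np

rng = np.random.default_rng(1)
treated = rng.binomial(1, 0.04, size=500)  # hypothetical event indicators
control = rng.binomial(1, 0.07, size=500)

def risk_ratio(t, c):
    return t.mean() / c.mean()

boot = [
    risk_ratio(rng.choice(treated, treated.size), rng.choice(control, control.size))
    for _ in range(5_000)
]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Estimated risk ratio {risk_ratio(treated, control):.2f} "
      f"(plausible range ~{lo:.2f} to {hi:.2f})")
```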

2. Bayesian thinking & causal modeling

Richard McElreath’s Statistical Rethinking really convinced me that causal inference powered by Bayesian estimation is probably the best framework out there for scientific modeling (and I’ve only made it through the first couple of chapters so far). It completely shifts the focus from the data itself to the data-generating process, putting the bulk of the hard work upfront, before data is collected, with a focus on mechanism and structure. It also addresses the fallacy problem. However, it’s definitely harder to start doing on a whim.

First of all, the math itself is different from typical frequentist methods, so there is a learning curve. More difficult, though, is navigating the practical complexities, such as properly eliciting the necessary subject-matter expertise and piecing it together into coherent prior distributions and causal models. Never mind the technical reasons why that is hard; it is simply more demanding from a time, brainpower, and collaboration perspective–and everyone is busy. Nevertheless, I think it is a worthy pursuit (#6).
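
McElreath’s full workflow can’t be compressed into a snippet, but the core prior-plus-data mechanic can. Here is a minimal conjugate Beta-Binomial sketch (all numbers hypothetical), which also illustrates the N=0 point raised in side note #6: before any data arrive, the estimate is simply the prior.

```python
# Start from a prior encoding existing knowledge about an event rate
# (centered near 20%), then let data nudge the posterior as it arrives.
from scipy import stats

a, b = 4, 16  # Beta prior: mean a / (a + b) = 0.20
for events, n in [(0, 0), (3, 20), (12, 100)]:
    posterior = stats.beta(a + events, b + (n - events))
    lo, hi = posterior.ppf([0.025, 0.975])
    print(f"N = {n:>3}: rate estimate {posterior.mean():.2f} "
          f"(95% credible interval {lo:.2f} to {hi:.2f})")
```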

3. Decision-making & course of action

This is where the loss function is most relevant.

Instead of contorting a generic statistical result to tenuously align with real-world implications, I want to be more deliberate. The first step is to identify and understand the tangible decision-making processes the estimates seek to inform, including the current standards, practical constraints, and nuances. Then, rather than passively applying standard techniques, derive tailored statistical methods to facilitate that use, which may prompt more rigor, customization, or a reframing of the statistical problem entirely to suit the specific context at hand. Estimation uncertainty can be fed as input into hypothetical scenarios to gain insight into where and what actions will be triggered, and their subsequent downstream effects on the hard outcomes intended to be impacted. At that point, the significance will be clear.
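
As a toy sketch of that idea (every number hypothetical): feed the estimation uncertainty into each candidate action’s loss and compare. Notably, the effect below would likely fail a significance test, since a sizable share of the plausible values sit on the harmful side, yet the expected-loss comparison can still favor acting now.

```python
# Let a loss function, rather than a significance threshold, drive the
# decision: propagate uncertainty about a treatment effect through each
# action's loss and pick the action with the smaller expected loss.
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical uncertainty about the effect: centered on a modest benefit
# (negative = fewer bad outcomes), with a real chance of harm.
effect_draws = rng.normal(loc=-0.05, scale=0.06, size=100_000)

def loss_treat(effect):
    return 1.0 + 50.0 * effect    # treatment cost plus realized harm/benefit

def loss_do_nothing(effect):
    return np.zeros_like(effect)  # forgo both the cost and the effect

print("E[loss | treat]      =", loss_treat(effect_draws).mean().round(2))
print("E[loss | do nothing] =", loss_do_nothing(effect_draws).mean().round(2))
```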

Focus on the end-product

I think a critical piece of this endeavor is to focus not only on the statistics but also on how they will be disseminated. This means specifying the vehicle that will deliver the information to the right person at the right time. The emphasis on something tangible surfaces practical and technological constraints that may otherwise go unnoticed when focusing solely on the math. Further, this perspective acknowledges that the statistical methods are only a fragment of the overall data product, which may be direct cause for further refinement of the statistical approach itself. That is, even with robust statistical methods or results, the information may lose its utility if poorly conveyed or implemented, whether due to data pipelines, visualization, deployment, or computing resources. It also enables more forward thinking about success measures and accountability/validation schemes, like continuous monitoring, to ensure a sustained and impactful presence in the intended decision-making context.

Some historical gold

To conclude, I wanted to highlight an excerpt from the chapter The Psychology of Psychological Significance Testing in the book (pages 142-143) that I found especially fascinating, about the propagation of statistical significance across university education in the United States:

“In this context the 5 percent science was promoted by the new leaders of quantitative psychology and education. European humanists can score themselves by how many generations they are removed from Hegel–that is, in being taught by a teacher who was taught by a teacher who was taught by a teacher who was taught by Hegel at the University of Berlin. Likewise, statisticians can score themselves by how many generations they are from Fisher. Quinn McNemar, for example, of Stanford University, was an important teacher of psychologists who had himself studied statistical methods at Stanford with Harold Hotelling, the chief American disciple of Fisher. Hotelling had worked directly with Fisher. McNemar then taught L.G. Humphreys, Allen Edwards, David Grant, and scores of others. As early as 1935 all graduate students in psychology at Stanford, following the model of Iowa State, were required to master Fisher’s crowning achievement, analysis of variance. Already by 1950, Gigerenzer et al. reckon, about half of the leading departments of psychology required training in Fisherian methods.

Even rebels against Fisher were close to him, starting with [William Sealy] Gosset himself. Palmer Johnson of the University of Minnesota studied with Fisher in England, though he later had the bad taste to write articles with Fisher’s erstwhile colleague and eternal enemy Jerzy Neyman, whom Fisher had cast into outer darkness. George Snedecor, an agricultural scientist at Iowa State University at Ames, was a cofounder of the first department of statistics in the United States. His important book Statistical Methods was influenced directly by Fisher himself, who somewhat surprisingly was in the 1930s a visiting professor of statistics at Iowa State. One can think of the Iowa schools then [1940s and 1950s] as one thinks of London’s Gower Street in the 1920s and 1930s–a crucial crossroads of statistical methods and training. In a eulogy for S.S. Wilks, a student in the late 1920s of Henry L. Rietz and Allen T. Craig at the University of Iowa, Frederick Mosteller said that Iowa was then “the center of statistical study in the United States of America”. Rietz, Craig, and Wilks worked closely with Fisher. E.F. Lindquist, the American leader of standardized testing for educators, also of the University of Iowa, was deeply influenced by Snedecor. Lindquist invented the Iowa Test of Basic Skills for schoolchildren. He too spent time with the great man.

Some psychologists knew about the work of Neyman and Pearson and some even about that of the Bayesian Harold Jeffreys. But textbook authors, editors, and teachers–inspirited by Fisher’s promise of raising their fields to the level of hard science–helped Fisher win the day. Statistical education narrowed at the same time as it spread. Decision theory and inverse probability, and Gosset’s views on substantive significance, alternative hypotheses, and power, were pushed aside. Too introspective for the hard-boiled.”

It seems as if Fisher’s mechanization of statistical significance is what ultimately enabled statistics to branch out as its own field of study (and that it took place in Iowa is a fun fact). It makes you wonder how this separation affected the subsequent growth of scientific inquiry, results, and knowledge by disrupting the synergy between the intuition held by the practitioner and the intricacies of statistical nuance. While the popular notion of “playing in everyone’s backyard” is commonly portrayed as an advantage (and it is pretty cool), upon closer reflection, it might be a fundamental issue. William Sealy Gosset, a.k.a. Student, and the inventor of the t-test, was, first and foremost, a brewer of Guinness beer, and clearly prioritized substantive meaning:

“Fisher, not the great transcendent, invented the 5 percent philosophy. By contrast, Gosset’s economic approach to uncertainty prevented him from being able to stop thinking at .05 for fear he’d lose too much information, and profits.”1

“World War I had been under way for more than a year when Gosset–who wanted to serve in the war but was rejected because of nearsightedness–wrote to his elderly friend, the great Karl Pearson: ‘My own war work is obviously to brew Guinness stout in such a way as to waste as little labor and material as possible, and I am hoping to help to do something fairly creditable in that way.’ It seems he did.”1

He had a problem to solve: “to brew the best tasting stout at a satisfying price.” My takeaway: be like Gosset.

Side notes

  1. I don’t think this has much to do with statistical advancement, but rather the experience of observing its implications over time.

  2. By error, I’m talking about the inevitable consequences of statistical analysis in the real world. Data is messy and inaccurate, samples contain unintended biases and nuances, and estimation methods always produce a much more simplified version of reality. It probably doesn’t need to be repeated, but as George Box famously said, “all models are wrong, but some are useful”.

  3. In the article, they defined COVID-19 misinformation as “assertions unsupported by or contradicting US Centers for Disease Control and Prevention (CDC) guidance on COVID-19 prevention and treatment during the period assessed or contradicting the existing state of scientific evidence for any topics not covered by the CDC”.

  4. To give them the benefit of the doubt, they also use a “certainty of evidence” criterion in their decision making, which is meant to rate the confidence they have in the result with respect to estimation accuracy, risk of bias, etc. However, the conclusion that there is “no effect” seems questionable to say the least, and deeming any suggestion otherwise to be misinformation is asinine.

  5. Search for the ‘Quinn is dead’ quote below for another intuitive example of the fallacy of the transposed conditional.

  6. A couple of other points on Bayesian modeling. First, on sample size. The required number of samples needed to estimate something is N=0. That is, I can get parameter estimates solely based on the prior distributions that are driven by what is already known. Thinking of it this way, the data becomes secondary to the model, and is merely collected as a way to nudge parameters one way or another as more of it comes in. The model always exists, relaying the best available information at that point in time, and I don’t need to wait to cross arbitrary sample size thresholds in order to obtain my estimates. This seems to naturally lend itself better to the scientific process. Second, a criticism of Bayesian modeling is that it is too subjective because individual judgement is being used to inform prior distributions. However, I see this as an unequivocal strength. Frequentist methods (and noninformative priors) are not “objective”; they carry assumptions that we probably wouldn’t see as realistic–it is just convenient to use them. In that sense, they become more arbitrary than utilizing pre-existing knowledge. There is an excellent podcast episode where this is discussed.

My favorite quotes

These are my favorite quotes and passages from the book:

  • “The sizeless scientists have adopted a method of deciding which numbers are significant that has little to do with humanly significant numbers…Imagine that you and your infant child are standing on a sidewalk near a busy street. You have just purchased a hot dog from the street vendor and have safely crossed the street. Scenario 1: You suddenly realize you have forgotten the mustard and if you scurry across the busy street, dodging vehicles, there is a 95% probability you’ll return safe with your mustard. Scenario 2: You forgot your child and you watch as she tries to cross the street herself, if you scurry across the busy street, dodging vehicles, there is a 95% probability you’ll return safe with your child. The sizeless scientist in effect declares ‘they are equally important reasons for crossing the street’” (chapter 0, page 10)

  • “…since the arrival of the desktop computer with its ability to invert big matrices at the punch of a key, ‘checking’ on sampling variability effortlessly…electronic computation of statistical significance has cheapened to near zero…‘Decision’ has become socialized and bureaucratized–heedless of the social margins.” (chapter 0, page 13)

  • “It’s hard to do, unlike calculating t-statistics, which is a simpleton’s parlor game. But actual science at the frontier is supposed to be difficult. If it wasn’t, you wouldn’t be at the frontier.” (chapter 0, page 16)

  • “…some cause of natural selection may have a high probability of replicability in additional samples but be trivial. Yet a cause may have a low probability of replicability but be important. This is what we mean when we say that a test of significance is neither necessary nor sufficient for a finding of importance” (chapter 1, page 26)

  • “Unreasoning anger is a quite common reaction to challenges to the Fisherian orthodoxy.” (chapter 1, page 31)

  • “Significance unfortunately is a useful means toward personal ends in the advance of science…Precision, knowledge, and control. In a narrow and cynical sense statistical significance is the way to achieve these. Design experiment. Then calculate statistical significance. Publish articles showing ‘significant’ results. Enjoy promotion.” (chapter 1, page 32)

  • “An arbitrary level of statistical significance is the only standard in force–regardless of size, of loss, of cost, of ethics, of scientific persuasiveness. That is, regardless of oomph.” (chapter 2, page 41)

  • “Gosset’s economic approach to uncertainty prevented him from being able to stop thinking at .05 for fear he’d lose too much information, and profits…[Fisher] turned away from Gosset and sought a mechanical, uniform, and bureaucratic line of demarcation–an ‘impenetrable’ end, to scientific argument. So the insecure sciences, eager to establish an ‘objective basis’ for their research ‘communicable to other rational minds’, were pleased and materially rewarded by Fisher’s 5 percent philosophy…With the low fee he set for them to rise to the rank of Sciences with a big S…” (chapter 3, page 46)

  • “Fisher’s procedure appeals to scientists uncomfortable with any sort of argument…To avoid debate they seek certitude such as statistical significance. The unhappy result is that mere opinion and unargued crankery are more likely to rule the sizeless sciences, not less…A technique that was supposed to end arguments has in fact merely concealed the arguments behind a facade of testing that does not test.” (chapter 3, page 47)

  • “‘The goal of an empirical economist should not be to determine the truthfulness of a model but rather the domain of its usefulness’ [Edward Leamer]” (chapter 3, page 52)

  • “Ten million tests of significance, in economics, done annually. If the ten million tests were in fact as conclusive as their own rhetoric requires, whether accepting or rejecting, then nearly every issue in economics would long since have been settled. By now there would therefore be far fewer tests per year, not, as is the case, more and more.” (chapter 3, page 53)

  • “Real scientists draw a line between what is large and small.” (chapter 3, page 54)

  • “Real science, unlike significance-testing science, is difficult. If it were not, it would not be real science, but instead it would be already established routine. Real science asks you to make real scientific judgements and real scientific arguments within a community of other scientists. It asks you to be quantitatively persuasive, not to be irrelevantly mechanical. Life is hard.” (chapter 3, page 55)

  • “…seems to be today’s prepublication attitude: merely increase the N [sample size] to get a still lower [standard error]…Notice the implication of such reasoning. It implies that something must be very wrong with the notion that statistical significance is necessary for substantive significance, a preliminary screen in which one puts one’s data.” (chapter 5, page 67)

  • “She can test her belief in the price effect by looking at the magnitudes, using, for example, the highly advanced technique common in data-heavy articles in physics journals: ‘interocular trauma’. That is, she can look and see if the result hits her between the eyes.” (chapter 5, page 72)

  • “‘Pushing’ an economically large though noisily estimated effect is not a misuse–or a ‘stretch’ of professional ethics. It is precisely the ethical thing to do. To argue otherwise is to fall into the mistaken belief that statistical significance can provide a screen through which the results can be put, to be examined then for substantive significance if they make it through the significance screen.” (chapter 7, page 86)

  • “‘Young people have to have careers’ [former editor of the American Economic Review]” (chapter 8, page 89)

  • “Any scientific hypothesis is a matter of being close enough. The decisions the scientist makes on what constitutes ‘closeness’ ‘depend entirely on the special purposes of the investigator’.” (chapter 8, page 97)

  • “Real scientific tests are always a matter of how close to zero or how close to large or how close to some parameter value, and the standard of how close must be a substantive one, inclusive of tolerable loss.” (chapter 9, page 98)

  • “…‘the overall benefit-cost ratio for the Employer Experiment is 4.29, but it is not statistically different from zero. The benefit-cost ratio for white women…however, is 7.07, and is statistically different from zero…The Employer Experiment affected only white women.’ The 7.07 ratio affects, they said, the 4.29 did not. This is a mistake. The best guess of the researchers was that the state got $4.29 for every dollar spent. The estimate was fuzzy, speaking of random sampling error alone. But that does not mean it is to be taken as zero.” (chapter 9, page 99)

  • “Notice the respect for the approximate nature of social statistics in his very phrasing of ‘around 0.4’ instead of the 0.40768934 that his computer undoubtedly spewed out.” (chapter 9, page 101)

  • “Real science changes one’s mind. That’s one way to see that the proliferation of unpersuasive significance tests is not real science.” (chapter 9, page 101)

  • “At high sample sizes, all null hypotheses are rejected, by mathematical fact, without having to look at the data. No magic of instrumental variables is going to change that.” (chapter 9, page 104)

  • “‘Caution, common sense, and patience…are quite likely to keep [the experimenter] more free from error…than the man of little caution and common sense who guides himself by a mechanical application of sampling rules. He will be more likely to remember that there are sources of error more important than fluctuations of sampling.’” (chapter 10, page 114)

  • “‘It is possible for a result to be useful and possess wide standard error. A result obtained by definitions and techniques drawn up with care, and carried out by excellent interviewing and supervision may have wide standard error because the sample was small; yet such a result might be well preferable to one obtained with a bigger sample, with a smaller standard error, but whose definitions, techniques, and interviewing were out of line with best practice and knowledge of the subject matter.’ [W. Edwards Deming]” (chapter 10, page 117)

  • “It’s embedded like a tax code in the bureaucracy of science.” (chapter 11, page 124)

  • “…why actually replicate when the logic of Fisherian procedures gives you a virtual replication without the bother and expense? Why not go ahead and use the alloys F1 and F2 in airplanes? After all, p<.05.” (chapter 11, page 127)

  • “In denying the plurality of overlapping hypotheses, the Fisherian tester asks very little of the data. She sees the world through the lens of one hypothesis–the null.” (chapter 12, page 133)

  • “If you are a Fisherian, the fact of a large sample becomes your problem. You’re deluded, thinking you’ve proved oomph before you’ve considered what it is.” (chapter 12, page 135)

  • “It always depends on the loss, measured in side effects, treatment cost, death rates. The loss to a cool, scientific, impartial spectator will not be the same as the loss to the patient in question…[the balance between Type I/II errors] ‘must be left to the patient, friends, and family’.” (chapter 12, page 137)

  • “Designing experiments to find the maximal and minimal effect size is a better way to get powerful results and to keep the focus where it should be, on the effect size itself…[William Sealy Gosset]: ‘We tend to think of effect size (when we think of it at all) as a fixed and immutable quantity that we attempt to detect. It may be more useful to think of effect size as a manipulable parameter that can, in a sense, be made larger through greater measurement accuracy.’” (chapter 12, page 139)

  • “Some psychologists knew about the work of Neyman and Pearson and some even about that of the Bayesian Harold Jeffreys. But textbook authors, editors, and teachers–inspirited by Fisher’s promise of raising their fields to the level of hard science–helped Fisher win the day. Statistical education narrowed at the same time as it spread. Decision theory and inverse probability, and Gosset’s views on substantive significance, alternative hypotheses, and power, were pushed aside. Too introspective for the hard-boiled.” (chapter 13, page 143)

  • “Fisher wrote in 1955, ‘In the US also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at, let us say, speeding production, or saving money’. Notice the sneer by the new aristocracy of merit, as the clerisy fancied itself. Bourgeois production and money making, Fisher avers, are not the appropriate currencies of science.” (chapter 13, page 145)

  • “Early on in an elementary statistics or psychometrics or econometrics book there might appear a loss function–‘what if it rains the day of the company picnic?’. But the loss function disappears when the book gets down to producing a formula for science.” (chapter 13, page 146)

  • “Power, simulation, a variety of experiments, triangulation, actual replication, and exploratory data analysis leading to interocular trauma from the effect of magnitudes are different modes of affirming the consequent and are more generally a reasonable program of Gosset or Bayesian and Feynman confirmationism than is the dogma of Fisherian or Popperian falsificationism.” (chapter 13, page 153)

  • “The Fisher test can shed light on the probability that ‘Quinn is dead’ given that ‘Quinn was hanged’. What the Fisher test wants to know and claims to measure is the opposite, the probability that Quinn was hanged, given that Quinn is dead…this probability is close to zero…In a nonhanging society people die for many reasons other than hanging…therefore being dead is very weak evidence indeed that Quinn was hanged…Being dead is ‘consistent with’ the hypothesis that Quinn was hanged as the positivist rhetoric of the Fisherian argument emphasizes. But so what? A myriad of other hypotheses…such as catching pneumonia or breaking your neck in a fall from your horse, are also consistent with it–‘it’ being the fact of being dead.” (chapter 14, page 155)

  • “One of us has an elderly aunt who can sit in the garden of a hot, Indiana summer evening untouched by mosquitoes. She chalks up her immunity to a side effect of a ‘nuclear treatment’ received at midcentury to attack a tumor…Well, who’s to deny her? Medical science since the arrival of Fisher’s methods has had a problem with narrative…people believed that the use of p’s and t’s in the design and evaluation of clinical trials would mark an advance over old wives’ tales, crankery, anecdote, folkways, and fast-talking patent medicine salesmen. The dream of mechanization was as compelling in medicine as it was in war, social work, and philosophy of mind…‘Let the table decide’. At 5 percent the medical scientists suddenly submitted eyes locked hard in a sizeless stare. But the new method is just a mutation of old husbands’ tales, statistical crankery, probabilistic anecdote, scientific folkways, and fast-talking, twenty-first-century, statistical patent medicine salesmen.” (chapter 14, page 160)

  • “Even the rare courageous Fisherians do not deign to make a case for their procedures. They merely complain that the procedures are being criticized…being comfortably in control, appear inclined to leave things as they are…If you don’t have any arguments for an intellectual habit of a lifetime perhaps it is best to keep quiet” (chapter 15, page 169)

  • “If one can see or hear the problem, one does not need to rely on correlations…doctors have lost many of their skills of physical assessment, even with the stethoscope (and certainly with their hands) and have come to rely on a medical literature deeply infected with Fisherianism.” (chapter 15, page 175)

  • “The Fisherian tests of significance, the only tests employed by the original authors of the seventy-one studies, literally could not see the beneficial effects of the therapies under study, though staring at them.” (chapter 16, page 179)

  • “The ‘sunshine herb’ [St. John’s wort] is frequently under attack (perhaps, one suspects, because it seems to be a cheap substitute for drugs)…the authors…concluded from the p-value that St. John’s-wort is not clinically effective. Doesn’t help, they said.” (chapter 16, page 182)

  • “‘…They were made on different days at different hours. They all relate to the same nest’. Since Edgeworth had collected his own data, he knew his observations intimately; for example, he controlled exactly for nest and time-of-day heterogeneity, reducing error in observations that cannot be matched with a mere test of statistical significance on a data set downloaded from the Internet, no matter how mathematically advanced the ‘correction’.” (chapter 17, page 189)

  • “Statistical significance can indicate the likelihood of the presence of an effect…But…so what?…Hoover and Siegler want to assign the responsibility to a man they call ‘practical’. Shades of Fisher: the scientist is replaced by a mechanical puppet who acknowledges a signal at p=.05, and the puppet–not the scientist who knows why it might matter–is called ‘practical’.” (chapter 17, page 191)

  • “Statistics was not by any means the primary science on the Gower Street agenda. Biometry, but especially eugenics, was…Pearson’s papers and the archives of the Biometric and Galton labs survive. One finds in them the ephemera of a scientific racism common to the age, and to which Galton, Pearson, and Fisher were leading contributors…Value judgements–arguments about the arguments–and Gosset’s personal probability, were to be kept out of the neighborhood of their new sciences. Pearson would write in the 1920s against Jewish migration to Britain, and Fisher would write in the 1930s against material relief for poor people and literally in favor of relief for the rich on eugenic grounds. Such stuff was in the air…” (chapter 18, page 199)

  • “An early case, applied to the eggs of the cuckoo bird, illustrates literally the feel of substantive as against statistical significance.” (chapter 19, page 203)

  • “There are ways other than getting inside the mind of the victim to know what matters to her. For instance, one could measure with some difficulty and sacrifice (but good science is difficult and sacrificial)…” (chapter 19, page 205)

  • “But Gosset in this study and others often found z or t beside the point. ‘You want to be able to say “if farmers [or whomever] in general do this [i.e., follow a certain experimental method] they will make money by it”’. A criterion of merely statistical significance could not satisfy such taste.” (chapter 20, page 209)

  • “‘Fisher was vague. Karl Pearson was vague. Egon Pearson vague. Neyman vague. Fisher and Neyman were fiery. Silly! Egon Pearson was on the outside. They were all jealous of one another, afraid somebody would get ahead. Gosset didn’t have a jealous bone in his body. He asked the question [about power and alternative hypotheses]. Egon Pearson to a certain extent rephrased the question which Gosset had asked in statistical parlance. Neyman solved the problem mathematically.’ [Florence Nightingale David]” (chapter 20, page 211)

  • “‘There must be essential similarity to ordinary practice…Experiments must be so arranged as to obtain the maximum possible correlation [not the maximum possible statistical significance] between figures which are to be compared [like Leamer and other oomph-ful scientists, Gosset thought in terms of upper and lower bound estimates, best and worst case scenarios]…Repetitions should be so arranged as to have the minimum possible correlation between repetitions (or the highest possible negative correlation)…There should be economy of effort [net pecuniary advantage in the 1905 sense]’ [Student (William Sealy Gosset)]. Fisher shrugged. The economic approach to the design of experiments was too difficult. He never did try Gosset’s way.” (chapter 21, page 216)

  • “An ethical life of science seems to require an emotional life outside of it. ‘…he [Fisher] is glad to discuss…things early in the morning or late at night. But he is not glad or even willing to have others work on the purely theoretical aspects of his work. He expects others to accept his discoveries without even questioning them. He does not admit that anything he ever said or wrote was wrong. But he goes much further than that. He does not admit even that the way he said anything or the nomenclature he used could be improved in any way.’ [Raymond Birge]. Birge told Deming that Fisher was the most conceited man he ever met.” (chapter 21, page 222)

  • “‘Though recognizable as a psychological condition of reluctance, or resistance to the acceptance of a proposition, the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to and verifiable by, other rational minds. The level of significance in such cases fulfils the conditions of a measure of the rational grounds for the disbelief it engenders.’ [R.A. Fisher]” (chapter 21, page 223)

  • “To evaluate size matters/how much would have forced Fisher to listen to and cooperate with others. Determining whether something matters to people depends on actually listening to people, as a heart surgeon listens to a radiologist, as a beer brewer listens to a customer. Admitting that size matters would have required Fisher to admit that regression coefficients ‘are capable of evaluation in any currency’. It would have put him in the unhappy position of having to communicate with others about the meaning of his findings. This, we have shown, he would not do.” (chapter 21, page 224)

  • “Scientists, Fisher said, should ‘not assume’ their research is ‘capable of evaluation’. They must not work to ‘maximize profit’, he said in 1955, only for ‘faith’–a secular faith, he means, in the possibility that another mechanically calculated output of p-values by themselves could contribute to scientific progress. The scientist should not worry…whether their samples are random: just test, test, test, as if random. A 5 percent level of Type I error is, when ‘formally’ considered, says Fisher, the final judge of Science.” (chapter 21, page 226)

  • “It is our experience that the more training a person has undergone in Fisherian methods the less easy it is for her to grasp our very elementary point…People who are highly trained in conventional economics have an especially difficult time. Most of them have no idea what we are talking about, though they are sure they do not approve. By contrast, undergraduates who have never had a statistics course, science and engineering professionals we work with or meet in our travels, businesspeople, musicians, activists, various colleagues in nonstatistical fields…as soon as they are able to grasp that we are not attacking statistics as such…these have no difficulty understanding our point and immediately begin wondering what the controversy is about.” (chapter 23, page 239)

  • “One can take null-hypothesis significance testing as a sort of astrology, giving ‘decisions’ mechanically, justified within the system of astrology itself…Fisherianism is bad input, straightforwardly misleading advice, erroneous astrology. Misleading advice is not made into good advice merely by its mechanical and pecuniary cheapness.” (chapter 23, page 241/242)

  • “‘Adherence to the rules originally conceived as a means, becomes transformed into an end-in-itself’ [Robert Merton]. That seems about right: statistical significance, originally conceived as a means to substantive significance, became transformed by Fisher and then by bureaucracies of science into an end in itself. A t-tested certified fact will be ‘equally convincing to all rational minds, irrespective of any intentions they may have in utilizing knowledge inferred’.” (chapter 23, page 243)

  • “If we were to assemble our socioeconomic observations into a single chain of thought its strongest link would be coupling Merton’s ‘bureaucracy’ with Hayek’s ‘scientism’. Scientism describes, ‘of course, an attitude which is decidedly unscientific in the true sense of the word, since it involves a mechanical and uncritical application of habits of thought to fields different from those in which they have been formed. The scientistic as distinguished from the scientific view is not an unprejudiced but a very prejudiced approach which, before it has considered its subject, claims to know what is the most appropriate way of investigating it’. [Hayek]. The trick is to unshackle the bureaucracy of scientism, to break its mechanical rules, change its prejudice incentives, create new rituals, train capacity. No simple trick.” (chapter 23, page 244)

  • “They need to acquire the virtues necessary for performing repeated experiments on the same material. They need to hear that random error is one out of many dozens of errors and seldom the biggest.” (chapter 24, page 246)

  • “In science, as against careerism or pure mathematics, it is better to be approximately correct and scientifically relevant than it is to be precisely correct but humanly irrelevant. Not even the fully specified power function, balancing the risk of errors from random sampling, provides a full solution to a scientific problem. In truth, as Kruskal never tired of remarking, statistical ‘significance’ poses no scientific problem at all. With the aid of a personal computer and a grant such significance is easy to achieve.” (chapter 24, page 246)

  • “Statistical scientists can teach substance without sacrificing the rigor they so passionately seek. Real rigor will rise with increased attention to substance.” (chapter 24, page 247)

  • “The textbooks are wrong. The teaching is wrong. The seminar you just attended is wrong. The most prestigious journal in your scientific field is wrong…Science is mainly a series of approximations to discovering the sources of error. Science is a systematic way of reducing wrongs or can be.” (chapter 24, page 251)

  • “Perhaps you feel frustrated by the random epistemology of the mainstream but don’t know what to do. Perhaps you’ve been sedated by significance and lulled into silence. Perhaps you sense that the power of a Rothamsted test against a plausible Dublin alternative is statistically speaking low but are dazzled by the one-sided rhetoric of statistical significance. Perhaps you feel oppressed by the instrumental variable one should dare not to wield. Perhaps you feel frazzled by the ‘social psychological rhetoric of fear’ that keeps the abuse of significance in circulation. You want to come out of it. But perhaps you are cowed by the prestige of Fisherian dogma. Or, worse thought, perhaps you are cynically willing to be corrupted if it will keep a nice job. Repent, we say. Embrace your inner Gosset…‘Who are you going to believe–us or your own lying eyes?’” (chapter 24, page 251)

References

  1. Ziliak ST, McCloskey DN. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press; 2008. https://doi.org/10.3998/mpub.186351 (subtitle quote: chapter 10, page 112)

  2. Sule S, DaCosta MC, DeCou E, Gilson C, Wallace K, Goff SL. Communication of COVID-19 Misinformation on Social Media by Physicians in the US. JAMA Netw Open. 2023;6(8):e2328928. doi:10.1001/jamanetworkopen.2023.28928

  3. Roman YM, Burela PA, Pasupuleti V, Piscoya A, Vidal JE, Hernandez AV. Ivermectin for the Treatment of Coronavirus Disease 2019: A Systematic Review and Meta-analysis of Randomized Controlled Trials. Clin Infect Dis. 2022 Mar 23;74(6):1022-1029. doi: 10.1093/cid/ciab591. PMID: 34181716; PMCID: PMC8394824.

  4. Diaz GA, Parsons GT, Gering SK, Meier AR, Hutchinson IV, Robicsek A. Myocarditis and Pericarditis After Vaccination for COVID-19. JAMA. 2021;326(12):1210–1212. doi:10.1001/jama.2021.13443

  5. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997