On the Creation of Classical Statistics
I used to have a somewhat cynical view of R.A. Fisher, especially regarding the motivation for statistical significance (see my previous article). He did, after all, explicitly advocate for the use of the 5% threshold:
“If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that higher values of [the statistic] indicate a real discrepancy.”1
and
“If one in twenty does not seem high enough, we may, if we prefer, draw the line at one in fifty (the 2 percent point), or one in a hundred (the 1 percent point). Personally, the writer prefers to set a low standard at the 5 percent point, and ignore entirely all results which fail this level.”2
After reading Erich Lehmann’s book, Fisher, Neyman, and the Creation of Classical Statistics, I realize there is much more nuance to it, and he probably meant well in his statistical work (his other work, maybe a different story). I’m fairly convinced that he never imagined, nor would approve of, how statistical significance would be used and abused since then.
An experimentation context
Much of the methodology related to hypothesis testing that Fisher developed (along with fundamental concepts he introduced in 1922, such as consistency, efficiency, and sufficiency) was motivated by the specific context he started out in as an agricultural statistician at Rothamsted Experimental Station in 1919: that of small-sample, randomized experimentation. It is clear in his writings, though perhaps implicit, that practical considerations played into the overall validity of inference, not only whether the p-value crossed a threshold:
“…it is not known whether heterogeneity [of the soil] will be more pronounced in the one or the other direction in which the field is ordinarily cultivated…The effects are sufficiently widespread to make apparent the importance of eliminating the major effects of soil heterogeneity not only in one direction across the field, but at the same time in the direction at right angles to it.”3
He wasn’t proposing that his methods be mechanically applied, or that the method itself is what guarantees valid inference. Rather, inherent in that quote is Fisher’s intuition about the subject matter, the “soil heterogeneity”, that made the implications of the design useful for that situation. This, combined with his obviously deep statistical knowledge, is what I believe ultimately made the use of seemingly arbitrary significance thresholds valid in Fisher’s eyes. It’s not that he didn’t want statistical analysis to be “easier” for researchers (and he was somewhat back and forth on this):
“However, his early recommendation and life-long practice prevailed. The desire for standardization trumped the advantages of considering each case on its own merit.”4
I think he simply put too much confidence in the implementers of his work, trusting them to be as critical, meticulous, and brilliant as he was. He never conceived of the erroneous ways his statistical and design principles would later be used.
He had a mentor
One of the most fascinating aspects of the history of classical statistics is the role of William Sealy Gosset (a.k.a. “Student”, as in Student’s t-test). For his entire career he was a beer brewer at Arthur Guinness Son and Co. (one of my favorites), yet he is credited with putting forth, through his own curiosity, intelligence, and need for practical solutions to quality-control problems, the ideas that Fisher would ultimately bring to fruition:
“After a small-sample (“exact”) approach to testing was initiated by Gosset (“Student”) in 1908 with his t-test, Fisher in the 1920’s, under frequent prodding by Gosset, developed a battery of such tests, all based on the assumption of normality. These tests today still constitute the bread and butter of much of statistical practice.”4
That “frequent prodding” Lehmann describes, together with the timeline, is why I characterize Gosset as more of a mentor. Fisher was 14 years younger, but incredibly gifted intellectually.
“He [Gosset] then had the crucial insight that exact results [for a t-test] could be obtained by making an additional assumption…although he was not able to give a rigorous proof. The first proof was obtained (although not published) by Fisher in 1912…as a result of constant prodding and urging by Gosset, he found a number of additional small-sample distributions, and in 1925 presented the totality of these results in his book…getting Fisher to develop this methodology much further than he (Fisher) had originally intended.”4
Fisher was only 22 years old in 1912. It seems Gosset’s wisdom helped him pinpoint the arguments he would come to make, and ultimately gave him the encouragement and motivation to see it through. Without that, who knows if any of it would have been done.
“This passage suggests that Fisher thought these problems to be difficult, and that he had no plans to work on them himself. However, in April 1922 he received two letters from Gosset that apparently changed his mind.”4
Beyond Gosset’s influence on Neyman’s (and Egon Pearson’s) foundational work regarding the “consideration of the alternatives (suggested by Gosset)”, Fisher himself acknowledged Gosset’s contributions and spoke highly of him.
“…an exact solution of the distribution of regression coefficients…has been outstanding for many years; but the need for its solution was recently brought home to the writer by correspondence with ‘Student’, whose brilliant researches in 1908 form the basis of the exact solution”5
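As a concrete aside, the small-sample “exact” approach Gosset initiated can be sketched as a one-sample t-test computed from first principles. This is my own illustration, not Gosset’s or Fisher’s derivation; the data are made up, and scipy is used only to evaluate the tail of Student’s t distribution:

```python
import math
from scipy.stats import t as t_dist

def one_sample_t(sample, mu0):
    """Student's one-sample t-test: exact under normality, even for tiny n."""
    n = len(sample)
    mean = sum(sample) / n
    # unbiased sample variance (n - 1 degrees of freedom)
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    t_stat = (mean - mu0) / math.sqrt(var / n)
    # two-sided p-value from Student's t distribution with n - 1 df
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value

# Made-up yields from a tiny field trial (n = 5), testing a baseline of 4.5
t_stat, p_value = one_sample_t([5.1, 4.8, 5.6, 5.3, 4.9], mu0=4.5)
```

The point is not the arithmetic but that the test is exact at any sample size under the normality assumption, which is precisely what made it so valuable for small agricultural trials.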
He was, in fact, a genius
Despite their “disdain” for one another:
“Both Fisher and Neyman believed that they had made important contributions to the philosophy of science, but each felt that the other’s views were completely wrong-headed.”4
Much of their foundational work was complementary: Fisher supplied the methodology, and Neyman put the rubber stamp on it with mathematical proofs.
One thing Lehmann mentions multiple times in the book, and which caught my attention, is the way Fisher came up with those methods.
“Fisher’s tests were solely based on his intuition. The right choice of test statistics was obvious to him. A theory that would justify his choices was developed by Neyman and Pearson in their papers in 1928 and 1933.”4
As you read about the progression of his work, all the fundamental statistical concepts pop up one by one, and you realize the breadth and depth of Fisher’s accomplishments. That all of this can be attributed to his “intuition” is just remarkable. It wasn’t just in testing, but also in design:
“…the designs in DOE [The Design of Experiments, 1935] were presented without much justification, based entirely on his intuitive understanding of what the situation demanded. But again later writers found justifications by showing that Fisher’s procedures possessed certain optimality properties.”4
Even when you read Fisher’s passages directly, you get the feeling that it all just rolled off his tongue, that he was simply writing down what flowed from his mind. Yet what he wrote turned out to be fundamental to statistical practice:
“…much caution should be used before claiming significance for special comparisons… Comparisons suggested by scrutiny of the results themselves are open to suspicion; for if the variants are numerous, a comparison of the highest with the lowest observed value will often appear to be significant, even from undifferentiated material.”3
In this case, the problem of multiple comparisons. This is the general tone of Fisher’s writings: nonchalantly bringing up things like power, block designs, etc. as “obvious” considerations.
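Fisher’s warning about comparing the highest and lowest observed values is easy to verify by simulation. Below is a minimal sketch of my own (not from the book): every group is drawn from the same distribution, yet naively testing the largest observed mean against the smallest “finds” a significant difference far more often than the nominal 5%:

```python
import math
import random

random.seed(0)

def naive_max_vs_min_rate(n_groups=10, n_experiments=10_000):
    """How often the largest-vs-smallest group mean looks 'significant'
    at the two-sided 5% level when all groups are undifferentiated.
    Each group mean is a standard normal draw (unit standard error)."""
    hits = 0
    for _ in range(n_experiments):
        means = [random.gauss(0, 1) for _ in range(n_groups)]
        # naive z statistic for a difference of two means, each with SE = 1
        z = (max(means) - min(means)) / math.sqrt(2)
        hits += z > 1.96
    return hits / n_experiments

rate = naive_max_vs_min_rate()  # well above the nominal 0.05
```

A valid test applied to a single pre-specified comparison would reject about 5% of the time here; cherry-picking the extremes after the fact inflates that rate dramatically, exactly as Fisher cautioned.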
Unfortunately, despite all Fisher did achieve, his stubbornness prevented him from achieving more.
“…Fisher rarely gave an inch. Those holding different views from his own had ‘misread’ him and their statements were ‘incorrect’.”4
And this even though Fisher hinted at the concept of power with his idea of “sensitiveness”:
“By not utilizing the idea of power, Fisher deprives himself of the ability to resolve one of the most important issues of experimental design, the determination of sample size.”4
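For what it’s worth, the sample-size calculation Fisher deprived himself of becomes routine once power enters the picture. Here is a minimal sketch for a one-sided z-test with known standard deviation, the textbook Neyman–Pearson-style calculation rather than anything of Fisher’s (the numbers are illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n so a one-sided z-test at level alpha detects a true
    mean shift of delta with the desired power (sigma known)."""
    z = NormalDist()  # standard normal quantiles
    n = ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / delta) ** 2
    return ceil(n)

# Detect a half-standard-deviation shift with 80% power at the 5% level
n = sample_size(delta=0.5, sigma=1.0)  # → 25
```

Halving the detectable shift quadruples the required sample size, which is exactly the kind of design trade-off the power framework makes explicit.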
It seems he grew bitter and resentful in older age. For one, Fisher, “the creator of modern statistics”, in his role at University College under Egon Pearson, was “not permitted to teach statistics”. Also, all the progress and innovation in statistics shifted to the United States after Neyman moved there in 1938. People appreciated his foundational work, but they were taking it in a different direction and he was too far away to continue having influence. Nevertheless, his legacy is set in stone.
References
1. Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd: Edinburgh.
2. Fisher, R.A. (1926). The arrangement of field experiments. J. Min. Agric. G. Br., 33: 503-513.
3. Fisher, R.A. (1935). The Design of Experiments. Oliver and Boyd: Edinburgh.
4. Lehmann, Erich L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer: New York, NY. https://doi.org/10.1007/978-1-4419-9500-1
5. Fisher, R.A. (1922). The goodness of fit of regression formulae, and the distribution of regression coefficients. J. Roy. Statist. Soc., 85: 597-612.