*Check out Bold Signals 3.01 for an audio version of this piece. Much of this material also appears in “Scenes from a Replication Crisis”.*

It’s 1925.

Ronald Fisher is a geneticist and statistician working at Rothamsted Experimental Station, an agricultural research institute located in the English countryside.

Before coming to Rothamsted, Fisher was instrumental in reconciling Charles Darwin’s notion of evolution by natural selection with Gregor Mendel’s Laws of Genetics. Basically, if you’ve ever wondered how Darwin’s observations of finches and Mendel’s experiments with pea plants led to our modern understanding of evolution, one of the people you have to thank for that is Ronald Fisher.

It’s also worth pointing out, given Fisher’s influence on genetics, that he was an outspoken eugenicist. After all, this was the early twentieth century, and the history of science is not exactly an unbroken line of admirable people or non-horrific views on society.

Anyway, back to the countryside.

Long-term experiments with wheat, grass, and roots abound at Rothamsted, giving Fisher a bumper crop of data to analyze. However, though the overall quantity of data is high, sample sizes are low. An influential study of the effects of rainfall on wheat incorporates data from just thirteen plots of land.

Concerned with generalizing the results of such experiments (after all, the point of this type of research is to increase crop production), Fisher synthesizes several recent advances in “small sample statistics” into a framework known as significance testing.

He takes a statistical test called Student’s t-test, which was initially developed by a statistician to monitor the quality of Guinness, and develops a complementary test, which he calls the Analysis of Variance (ANOVA).

To ensure these innovations are accessible to the research community beyond Rothamsted, Fisher publishes *Statistical Methods for Research Workers*. Central to the book, and to significance testing more generally, is the null hypothesis: the position that there is no real difference between groups of data. In Fisher’s conception, devices like t-tests and ANOVAs are tests of the null hypothesis. The results of such tests indicate the probability of observing a result at least as extreme as the one obtained when the null hypothesis is true. In quantitative terms, this probability is expressed as a p-value.

Fitting its origins in applied research, the utility of Fisher’s framework is best demonstrated with a practical example. Suppose Fisher and his colleagues want to study the effect of a particular method of fertilization on the growth of grass. To do this, they obtain yield measurements from ten plots that use the method and ten that do not. These numbers are small, but reflective of the time and effort that goes into harvesting good data. Before examining the two groups of data, Fisher reminds his colleagues that the null hypothesis stipulates that there is no difference between the fertilized and unfertilized plots. This is a really abstract way of talking about something as exciting as watching grass grow, so he reiterates that the null hypothesis is essentially that the fertilization method has no effect. Then, he runs a t-test.
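For a sense of what running that test involves, here is a minimal sketch in Python using only the standard library. The yield numbers are made up for illustration (they are not Rothamsted data), the helper names `t_pdf`, `two_sided_p`, and `pooled_t_test` are hypothetical, and the sketch assumes the classic equal-variance version of the two-sample t-test with a two-sided p-value obtained by numerically integrating the t distribution.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value P(|T| >= |t|), using Simpson's rule on [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * central_area)


def pooled_t_test(a, b):
    """Two-sample t-test assuming both groups share one variance."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df, two_sided_p(t, df)


# Hypothetical yields (arbitrary units), ten plots per group.
fertilized = [5.1, 4.8, 5.6, 5.3, 4.9, 5.7, 5.2, 5.0, 5.5, 5.4]
unfertilized = [4.7, 4.9, 4.5, 5.0, 4.6, 4.8, 4.4, 5.1, 4.6, 4.9]

t, df, p = pooled_t_test(fertilized, unfertilized)
print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")
```

The p-value printed at the end is the quantity the rest of this story revolves around: how often a difference at least this large would turn up if the fertilizer did nothing at all.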

A resulting p-value of 0.50 indicates that, assuming the fertilization method has no effect, the probability of Fisher and his colleagues observing a difference in yield at least as large as the one they measured is fifty percent. A resulting p-value of 0.10 indicates that the probability is ten percent. In *Statistical Methods for Research Workers*, Fisher introduces an informal criterion for rejecting the null hypothesis: p < 0.05.

*“The value for which p = 0.05, or 1 in 20, is 1.96 or nearly 2 ; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.”*
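The 1.96 in that passage is the number of standard deviations from the mean of a normal distribution beyond which, counting both tails, five percent of values fall. A quick check, sketched with Python's standard-library `statistics.NormalDist` (the variable names are just for illustration):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Deviation that leaves 2.5% in each tail, 5% in total.
cutoff = z.inv_cdf(1 - 0.05 / 2)

# Going the other way: the two-tailed probability beyond 1.96.
two_sided_tail = 2 * (1 - z.cdf(1.96))

print(round(cutoff, 2), round(two_sided_tail, 3))
```

Both directions land on the same pair of numbers Fisher quotes, which is why "nearly 2" standard deviations and "1 in 20" are so often mentioned together.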

We’ve finally arrived at p < 0.05.

Almost a decade after the publication of *Statistical Methods for Research Workers*, Jerzy Neyman and Egon Pearson address what they view as a fundamental asymmetry in Fisher’s framework. Namely, though it’s intended to help researchers evaluate the results of experiments, the focus on null hypotheses doesn’t really give researchers any way to evaluate experimental hypotheses. Basically, they argue with increasing volume, you can use Fisher’s methods to evaluate whether there’s a difference between two groups, but you can’t use them to make a statement about what’s causing it.

Though their “hypothesis testing” framework draws heavily from Fisher’s, Neyman and Pearson’s framework has a fundamentally different goal. Rather than giving researchers tools to evaluate the results of agricultural experiments, their goal is determining the optimal test for deciding between competing hypotheses. These hypotheses include Fisher’s null hypothesis, but also a variety of “alternative” or experimental hypotheses. To this end, they introduce three important concepts to the burgeoning field of research-oriented statistics: Type I error, the probability of incorrectly rejecting the null hypothesis; Type II error, the probability of incorrectly accepting the null hypothesis; and power, the probability of correctly rejecting the null hypothesis.
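These three quantities are easier to picture with a simulation. The sketch below is an illustration under stated assumptions, not Neyman and Pearson's own procedure: it uses a simple z-test with a known standard deviation instead of a t-test, ten plots per group, and a hypothetical true effect of one standard deviation. The function name `rejection_rate` and all of its parameters are invented for this example.

```python
import random
from statistics import NormalDist, mean


def rejection_rate(true_diff, n=10, sd=1.0, alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    se = sd * (2 / n) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        treated = [rng.gauss(true_diff, sd) for _ in range(n)]
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        z = (mean(treated) - mean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# Null hypothesis true: every rejection is a Type I error.
type_i_rate = rejection_rate(true_diff=0.0)

# Null hypothesis false: rejections are correct, so their rate is the power.
power = rejection_rate(true_diff=1.0)
type_ii_rate = 1 - power

print(f"Type I: {type_i_rate:.3f}, power: {power:.3f}, Type II: {type_ii_rate:.3f}")
```

When there is no real effect, the rejection rate hovers around the chosen cutoff of 0.05. When there is one, the rejection rate is the power, and whatever is left over is the Type II error rate.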

Disagreements between Fisher and Neyman and Pearson soon escalate into open antagonism. No, seriously: reading accounts of these debates, you get the sense that Fisher’s true talent wasn’t in biology or statistics, but in expressing his ego, mostly through yelling.

However, despite the controversy, the two frameworks are soon combined and presented as one in research methods textbooks. What emerges is an enormously and immediately influential model of statistical testing that incorporates Fisher’s null hypothesis, Neyman and Pearson’s alternative hypotheses, and a focus on observing p-values less than 0.05.

So when we talk about p-values and things like p-hacking, we’re talking about a method for evaluating the difference between groups of data that was designed by an evolutionary biologist and eugenicist for use in agriculture. We’re also talking about a debate about what this number means and how to use it that has been ongoing for more than ninety years.

**Additional Reading**

Box, J. F. (1987). Guinness, Gosset, Fisher, and small samples. *Statistical Science, 2*(1), 45-52.

Halpin, P. F., & Stam, H. J. (2006). Inductive inference or inductive behavior: Fisher and Neyman-Pearson approaches to statistical testing in psychological research (1940-1960). *The American Journal of Psychology, 119*(4), 625-653. doi: 10.2307/20445367

Fisher, R. A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. *Transactions of the Royal Society of Edinburgh, 52*, 399-433.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. *Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222*, 309-368. doi: 10.1098/rsta.1922.0009

Fisher, R. A. (1925). *Statistical methods for research workers*. Oliver and Boyd.

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. *Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231*, 289-337. doi: 10.1098/rsta.1933.0009

Lenhard, J. (2006). Models and statistical inference: The controversy between Fisher and Neyman–Pearson. *The British Journal for the Philosophy of Science, 57*(1), 69-91. doi: 10.1093/bjps/axi152

Student. (1908). The probable error of a mean. *Biometrika, 6*(1), 1-25.