Misconceptions About P Values and The History of Hypothesis Testing

1. Introduction

The misconceptions surrounding p values are an example of how knowing the history of a field, and the mathematical and philosophical principles behind it, can greatly aid understanding.

The classical statistical testing of today is a hybrid of the approaches taken by R. A. Fisher on the one hand and Jerzy Neyman and Egon Pearson on the other.

The p value is often confused with the Type I error rate {\alpha} of the Neyman-Pearson approach.

In statistics journals and research, the Neyman-Pearson approach replaced the significance-testing paradigm 60 years ago, but in empirical work Fisher’s approach remains pervasive.

The statistical testing approach found in statistics textbooks is a hybrid of the two.

2. Fisher’s Significance Testing

Fisher held the belief that statistics could be used for inductive inference, that “it is possible to argue from consequences to causes, from observations to hypothesis”, and that it is possible to draw inferences from the particular to the general.

Hence he rejected methods in which the probability of a hypothesis given the data, {\mathop{\mathbb P}(H \vert x)}, is used, in favour of ones in which the probability of the data given a particular hypothesis, {\mathop{\mathbb P}(x \vert H)}, is used.

In his approach, the discrepancies in the data are used to reject the null hypothesis. This is done as follows:

  • The researcher sets up a null hypothesis, which is the status quo belief.
  • The sampling distribution under the null hypothesis is known.
  • If the observed data deviates from the mean of the sampling distribution by more than a specified level, called the level of significance, then we reject the null hypothesis. Otherwise, we “fail to reject” the null hypothesis.

The p-value in this approach is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true.
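As a small numerical illustration (the numbers are hypothetical), suppose the test statistic has a standard normal distribution under the null hypothesis and we observe {x = 2.0}. The two-sided p value is then

\displaystyle  p = \mathop{\mathbb P}(|X| \geq 2.0 \vert H_0) = 2\,\Phi(-2.0) \approx 0.046,

that is, the probability, computed assuming the null hypothesis is true, of data at least as extreme as what was actually observed.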

There is a common misconception that if {p = .05}, then the null hypothesis has only a 5% chance of being true.

This is clearly false, as can be seen from the definition: the p value is calculated under the assumption that the null hypothesis is true, so it cannot be the probability that the null hypothesis is true (or false).
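In symbols, the two quantities condition on different things:

\displaystyle  p = \mathop{\mathbb P}(\text{data at least as extreme as observed} \vert H_0) \neq \mathop{\mathbb P}(H_0 \vert \text{observed data}).

Passing from the first quantity to the second requires additional assumptions that significance testing does not supply.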

Conversely, a p value being high merely means that a null effect is statistically consistent with the observed results.

It does not mean that the null hypothesis is true; under this data-based inductive approach we can only fail to reject it.

We need to consider the Type I and Type II error probabilities to draw such conclusions, which is done in the Neyman-Pearson approach.

3. Neyman-Pearson Theory

The main contribution of the Neyman-Pearson hypothesis testing framework (a name chosen to distinguish it from the inductive approach of Fisher’s significance testing) is the introduction of

  • the probabilities of committing two kinds of errors: false rejection of the null hypothesis (Type I error), denoted {\alpha}, and false acceptance of it (Type II error), denoted {\beta};
  • the power of a statistical test, defined as the probability of rejecting a false null hypothesis and equal to {1 - \beta} (a worked example follows this list).
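As a small illustrative example (the numbers are hypothetical), suppose we test {H_0: \mu = 0} against {H_1: \mu = 1} using the mean of {n = 9} observations from a normal distribution with known standard deviation {\sigma = 1}, rejecting for large values of the sample mean at {\alpha = 0.05}. The rejection threshold is {1.645 \cdot \sigma / \sqrt{n} \approx 0.548}, and the power of the test is

\displaystyle  1 - \beta = \mathop{\mathbb P}(\bar{X} > 0.548 \vert \mu = 1) = \mathop{\mathbb P}\!\left(Z > \frac{0.548 - 1}{1/3}\right) \approx 0.91,

so the probability of a Type II error is {\beta \approx 0.09}.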

Fisher’s theory relied on rejecting the null hypothesis on the basis of the data, under the assumption that the null hypothesis is true. In contrast, the Neyman-Pearson approach provides rules for making a decision to choose between the two hypotheses.

Neyman–Pearson theory, then, replaces the idea of inductive reasoning with what they called inductive behavior.

In Neyman’s own words, inductive behaviour was meant to imply: “The term ‘inductive behavior’ means simply the habit of humans and other animals (Pavlov’s dog, etc.) to adjust their actions to noticed frequencies of events, so as to avoid undesirable consequences”.

In this approach, then, the costs associated with Type I and Type II errors determine the decision to accept or reject. These costs vary from experiment to experiment, and this flexibility is the main advantage of the Neyman-Pearson approach over Fisher’s.

Thus, while designing the experiment, the researcher has to control the probabilities of Type I and Type II errors. The best test is the one that minimizes the Type II error probability given an upper bound on the Type I error probability.

Adding to the confusion is the fact that Neyman called the Type I error probability the level of significance, a term that Fisher used for p values.

The Development of Mathematical Analysis

1. Reasons Behind The Creation of Analysis

Newton had approached his calculus through fluxions and fluents (flowing quantities), while Leibniz had worked with differentials.

Both methods, however, had to deal with infinities and infinitely small quantities; specifically, they needed ways of combining infinitely many infinitely small quantities into finite quantities.

The reasoning Newton and Leibniz used to arrive at their methods was vague. For example, Leibniz correctly stated the result that:

\displaystyle  d(uv) = udv + vdu. \ \ \ \ \ (1)

The argument used to derive the result is as follows: Since,

\displaystyle  d(uv) = (u + du)(v + dv) - uv = u\,dv + v\,du + du\,dv \ \ \ \ \ (2)

The term {du\thinspace dv} is a second-order differential, extremely small compared to the first-order differentials, and is therefore treated as 0.

So there is an inconsistency in the way differentials are treated. The terms to be ignored are arrived at not by reasoning but by backtracking from the correct answer.

Another problem was the treatment of divergent series.

For example, it was proven that

\displaystyle  1 - x + x^2 - x^3 + \dots = 1 / (1 + x) \ \ \ \ \ (3)

But it was not clear why this result breaks down when {x = 1}:

\displaystyle  1 - 1 + 1 - 1 + 1 - 1 + \dots \neq 1/2 \ \ \ \ \ (4)

(Here the partial sums oscillate between 1 and 0 and so have no limit; the expansion (3) is valid only for {|x| < 1}, but the tools needed to make this precise did not yet exist.)

The works of the leading mathematicians of that time (before the development of what we know today as analysis), such as Euler, Daniel Bernoulli, Taylor and D’Alembert, included many arguments of suspect validity.

The mathematicians who solved these problems, and whose works led to the creation of analysis, were (in chronological order) Lagrange, Cauchy, Riemann and Weierstrass. Their contributions are discussed below.

2. Lagrange

The initial development of calculus had relied heavily on geometry and geometrical methods and visualization.

Lagrange brought about a shift in the treatment of calculus from the geometrical approach to the “algebraic analysis” of “analytic functions”.

“Analytic function” initially meant any function given by a single expression, but Lagrange redefined it as a function that has a convergent Taylor series representation.
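In modern notation, this says that {f} is analytic at a point {a} if

\displaystyle  f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n

for all {x} in some neighbourhood of {a}, with the series converging to the value of the function.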

While he was right in defining the analytic functions, the arguments he used turned out to be unextendable and thus untenable. This was corrected by Cauchy.

3. Cauchy

In the first decades of the nineteenth century, Cauchy revived the limit approach to analysis and gave the definitions of continuity and derivatives in terms of limits.

His greatest contribution to the field of analysis was that he gave clear definitions. For example, he defined the sum of an infinite series as the limit of the sequence of partial sums. This provided a unified approach for series of numbers as well as of functions.
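In modern notation, the sum {S} of the series {a_0 + a_1 + a_2 + \dots} is defined through its partial sums {s_n}:

\displaystyle  S = \lim_{n \rightarrow \infty} s_n, \qquad s_n = \sum_{k=0}^{n} a_k,

and the series converges precisely when this limit exists. The same definition applies unchanged when the terms {a_k} are functions, which is what gives the unified treatment mentioned above.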

Cauchy also gave the correct definition of continuity, defining it on an interval rather than at a single point.

His definition of the definite integral assumed continuity, a restriction that was later removed by Riemann.

Abel found an error in Cauchy’s work that led to the notion of uniform convergence as being distinct from (pointwise) convergence.

4. Riemann

As already noted, Riemann’s main contribution to analysis was the generalization of the definition of the definite integral to include discontinuous functions.

As an example, he exhibited a function that is discontinuous in every interval and yet has an integral.

Dirichlet had corrected an error in Cauchy’s work. While studying the conditions under which the Fourier series expansion of a function converges to the function, Dirichlet succeeded in proving such convergence for a function that has period {2\pi}, is integrable on an interval of that length, does not have infinitely many maxima and minima, and at jump discontinuities takes on the average of the two limiting values on either side.

Riemann was able to give the generalized definition of the definite integral by extending Dirichlet’s method to more cases.

His definition dispelled the idea that integration was the inverse of differentiation, although the relationship still held for continuous functions.

5. Weierstrass

The two main contributions by Weierstrass are the elimination of the idea of motion from limit processes and the representation of functions.

The new definition of limit without the idea of motion was based on what is now called the topology of the real line or complex plane.

He also developed different classes of functions by using their power series representations.

He also produced counterexamples to critique the work of other mathematicians and to point out errors.

One famous example is the nowhere differentiable but everywhere continuous function

\displaystyle  f(x) = \sum_{n=0}^{\infty} b^n \cos(a^n x), \ \ \ \ \ (5)

where {0 < b < 1} and {a > 1} is taken large enough that {ab} exceeds a suitable bound (in Weierstrass’s original construction, {a} is an odd integer and {ab > 1 + \frac{3\pi}{2}}).

His approach of producing pathological examples substantially raised the precision of hypotheses in analysis.