Type Theory – The Untyped Lambda Calculus

The Untyped Lambda Calculus, also called the pure lambda calculus, forms the computational substrate for most type systems.

Peter Landin observed that a complex programming language can be understood as a tiny core calculus together with a collection of derived forms that are defined by translation into that core.

Lambda Calculus can be viewed as a simple programming language or as a mathematical object about which statements can be proved.

Everything in the lambda calculus is a function: the arguments accepted by functions are functions, and the results returned by functions are functions.

Definition 1 Lambda Term: It refers to any term in the lambda calculus.

Definition 2 Lambda Abstraction: Lambda terms beginning with a {\lambda} are called lambda abstractions.

There are three kinds of terms in the lambda calculus.

  1. A variable {x} itself is a term.
  2. The abstraction of a variable from a term written as {\lambda{x.t}} is a term called a lambda abstraction.
  3. The application of a term to another term is written as {t t}.

The grammar can be summarized as:

Syntax:

\displaystyle  {t} ::= {x} \mid \lambda {x.t} \mid {t \thickspace t} \qquad \text{terms: variable, lambda abstraction, application} \ \ \ \ \ (1)

To save writing too many parentheses, the conventions are that:

  1. application associates to the left, that is, {s \thickspace t \thickspace u} stands for {(s \thickspace t) \thickspace u}.
  2. the body of an abstraction extends as far to the right as possible. For example, {\lambda x. \lambda y. x \thickspace y \thickspace x} stands for {\lambda x.(\lambda y.((x \thickspace y) \thickspace x))}.

1. Scope

An occurrence of a variable {x} is said to be bound when it occurs in the body {t} of an abstraction {\lambda{x.t}}.

An occurrence of {x} is said to be free if it is not bound by an enclosing abstraction on {x}.

A term with no free variables is said to be closed; closed terms are also called combinators.

2. Operational Semantics

In its pure form, the lambda calculus has no mechanism for computation other than function application: functions are applied to arguments, which are themselves functions. Each computation step rewrites an application by substituting the argument on the right-hand side for the bound variable in the body of the lambda abstraction on the left.

\displaystyle  (\lambda x.{t}_{12})\thickspace{t}_2 \rightarrow [{x} \mapsto {t}_2] {t}_{12} \ \ \ \ \ (2)

There can be many different evaluation strategies.

  • Full-beta Reduction: Any redex may be reduced at any time.

    \displaystyle  (\lambda x.x)((\lambda x.x)(\lambda z.(\lambda x.x) \thickspace z)) \ \ \ \ \ (3)

  • Normal Order Strategy: The leftmost, outermost redex is always reduced first. Under this strategy, one-step evaluation is a partial function: each term evaluates in at most one way, so every evaluation produces a unique result.
  • Call by Name: As in normal order, the outermost redex is reduced first, but no reductions are allowed inside abstractions.
  • Call by Need: Like call by name, except that the value of an argument is shared once it has been computed; this makes evaluation a reduction relation on abstract syntax graphs instead of abstract syntax trees.
  • Call by Value: Only outermost redexes are reduced, and a redex is reduced only when its right-hand side has already been reduced to a value. This is the strategy used by the evaluation rules later in this note.

    3. Multiple Arguments

    \displaystyle  \lambda (x \thickspace y).s = \lambda x. \thickspace \lambda y. \thickspace s \ \ \ \ \ (4)

    4. Church Booleans

    \displaystyle  {tru} \thickspace = \lambda {t.} \thickspace \lambda {f.} {t} \\ {fls} = \lambda {t.} \lambda {f.} {f} \\ {test} = \lambda {l.} \lambda {m.} \lambda {n.} {l m n} \\ {test \thickspace b \thickspace v \thickspace u} = {b \thickspace v \thickspace u} \\ {and} = \lambda {b.} \lambda {c.} {b \thickspace c \thickspace fls} \ \ \ \ \ (5)

    5. Pairs

    \displaystyle  {pair} = \lambda {f.} \lambda {s.} \lambda {b.} {b \thickspace f \thickspace s} \\ {fst} = \lambda {p.} {p \thickspace tru} \\ {snd} = \lambda {p.} {p \thickspace fls} \\ {fst \thickspace (pair \thickspace v \thickspace w)} = (\lambda {p.} {p \thickspace tru})((\lambda {f.} \lambda {s.} \lambda {b.} {b \thickspace f \thickspace s}) { v \thickspace w}) \\ = (\lambda {p.} {p \thickspace tru})((\lambda {s.} \lambda {b.} {b \thickspace v \thickspace s}) { w}) \\ = (\lambda {p.} {p \thickspace tru})(\lambda {b.} {b \thickspace v \thickspace w}) \\ = (\lambda {b.} {b v w}) { tru} \\ = ({tru \thickspace v \thickspace w}) \\ = {v} \ \ \ \ \ (6)
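
    These encodings can be tried out directly in any language with first-class functions. Below is a minimal sketch in Python; the names tru, fls, test, and_, pair, fst, snd mirror the definitions above (the underscore in and_ only avoids the Python keyword), and the strings "v" and "w" are arbitrary placeholders.

        # Church booleans: a boolean selects one of two arguments.
        tru = lambda t: lambda f: t
        fls = lambda t: lambda f: f
        test = lambda l: lambda m: lambda n: l(m)(n)
        and_ = lambda b: lambda c: b(c)(fls)

        # Pairs: a pair stores two values and hands them to a selector.
        pair = lambda f: lambda s: lambda b: b(f)(s)
        fst = lambda p: p(tru)
        snd = lambda p: p(fls)

        p = pair("v")("w")
        print(fst(p), snd(p))                      # v w
        print(test(and_(tru)(tru))("yes")("no"))   # yes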

    6. Church Numerals

    \displaystyle  {c}_0 = \lambda {s.} \lambda {z.} {z} \\ {c}_1 = \lambda {s.} \lambda {z.} {s \thickspace z} \\ {c}_2 = \lambda {s.} \lambda {z.} {s \thickspace (s \thickspace z)} \\ {c}_3 = \lambda {s.} \lambda {z.} {s \thickspace (s \thickspace (s \thickspace z))} \\ {scc} = \lambda {n.} \lambda {s.} \lambda {z.} {s \thickspace (n \thickspace s \thickspace z)}. \ \ \ \ \ (7)

    A number n is encoded by the action of applying a function s to z n times.
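
    Under this reading, a Church numeral can be converted to an ordinary integer by instantiating {s} and {z} concretely. A small Python sketch (the helper name to_int is mine):

        c0 = lambda s: lambda z: z
        c1 = lambda s: lambda z: s(z)
        c2 = lambda s: lambda z: s(s(z))
        scc = lambda n: lambda s: lambda z: s(n(s)(z))

        # Convert a Church numeral to a Python int by taking s = (+1) and z = 0.
        to_int = lambda n: n(lambda k: k + 1)(0)

        print(to_int(c2))        # 2
        print(to_int(scc(c2)))   # 3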

    7. Recursion

    \displaystyle  {omega} = (\lambda {x.} {x \thickspace x})(\lambda {x.} {x \thickspace x}) \\ {fix} = \lambda {f.} ( \lambda {x.} {f} ( \lambda {y.} {x \thickspace x \thickspace y} ) ) ( \lambda {x.} {f} ( \lambda {y.} {x \thickspace x \thickspace y} ) ) \ \ \ \ \ (8)

    \displaystyle  {factorial} = {body \thickspace containing \thickspace factorial} \\ {g} = \lambda {fct.\thickspace body \thickspace containing \thickspace fct} \\ {g} = \lambda {fct.\thickspace}\lambda {n.\thickspace if \thickspace (cnd) \thickspace then \thickspace (c_1) \thickspace else \thickspace (times \thickspace n \thickspace (fct \thickspace (prd \thickspace n)))} \\ {factorial} = {fix \thickspace g} \\ {h1} = \lambda {x.\thickspace g \thickspace (\lambda {y.\thickspace x \thickspace x \thickspace y})} \\ {fix \thickspace g} = {h1 \thickspace h1} \\ {fct} \thickspace \text{unfolds to} \thickspace \lambda {y.\thickspace h1 \thickspace h1 \thickspace y} \\ {factorial \thickspace c_3} = {fix \thickspace g \thickspace c_3} = {h1 \thickspace h1 \thickspace c_3} \ \ \ \ \ (9)

    factorial = fix (lambda fct. body containing fct)
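
    The call-by-value fixed-point combinator translates directly into Python, which is also call-by-value; the eta-expansion {\lambda y. x \thickspace x \thickspace y} is exactly what keeps the definition from looping. A hedged sketch, using ordinary Python integers for the body rather than Church numerals:

        # fix = λf.(λx. f (λy. x x y)) (λx. f (λy. x x y))
        fix = lambda f: (lambda x: f(lambda y: x(x)(y)))(lambda x: f(lambda y: x(x)(y)))

        # g = λfct. λn. if n == 0 then 1 else n * fct(n - 1)
        g = lambda fct: lambda n: 1 if n == 0 else n * fct(n - 1)

        factorial = fix(g)
        print(factorial(3))   # 6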

    8. Formalities

    Definition 3 The set of terms {\mathcal{T}} is the smallest set such that

    1. If {x \in \mathcal{V}}, then {x \in \mathcal{T}}.
    2. If {x \in \mathcal{V}} and {t \in \mathcal{T}}, then {\lambda {x.t} \in \mathcal{T}}.
    3. If {u \in \mathcal{T}} and {v \in \mathcal{T}}, then {{u \thickspace v} \in \mathcal{T}}.

    Definition 4 The set of free variables of a term {t}, written {FV(t)}, is defined as follows:

    1. {FV(x) = \{x\}}
    2. {FV(\lambda x.t) = FV(t) \setminus \{x\}}
    3. {FV(t_1 \thickspace t_2) = FV(t_1) \cup FV(t_2)}.

    Definition 5 Substitution:

    \displaystyle  (x \mapsto s)x = s \ \ \ \ \ (10)

    \displaystyle  (x \mapsto s)y = y \qquad \text{if } y \neq x \ \ \ \ \ (11)

    \displaystyle  (x \mapsto s)(\lambda y.t) = \lambda y.(x \mapsto s)t \qquad \text{if } y \neq x \text{ and } y \notin FV(s) \ \ \ \ \ (12)

    \displaystyle  (x \mapsto s)(t_1 \thinspace t_2) = ((x \mapsto s)t_1 \thinspace (x \mapsto s)t_2) \ \ \ \ \ (13)
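
    Definitions 4 and 5 can be rendered as a short Python sketch. The representation of terms as tagged tuples ('var', x), ('abs', x, t), ('app', t1, t2) is my own choice, and, like Definition 5 itself, the sketch simply assumes the side conditions hold (bound variables have been renamed apart beforehand) rather than performing alpha-conversion.

        def fv(t):
            """Free variables of a term (Definition 4)."""
            tag = t[0]
            if tag == 'var':
                return {t[1]}
            if tag == 'abs':
                return fv(t[2]) - {t[1]}
            return fv(t[1]) | fv(t[2])          # application

        def subst(x, s, t):
            """[x ↦ s]t (Definition 5), assuming no variable capture."""
            tag = t[0]
            if tag == 'var':
                return s if t[1] == x else t
            if tag == 'abs':
                y, body = t[1], t[2]
                if y == x or y in fv(s):
                    raise ValueError("rename the bound variable first")
                return ('abs', y, subst(x, s, body))
            return ('app', subst(x, s, t[1]), subst(x, s, t[2]))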

    Syntax:

    \displaystyle  {t} ::= {x} \mid \lambda {x.t} \mid {t \thickspace t} \qquad \text{terms: variable, lambda abstraction, application} \\ {v} ::= \lambda{x.t} \qquad \text{values: abstraction value} \ \ \ \ \ (14)

    Evaluation:

    \displaystyle  \frac{ t_1 \thickspace \rightarrow \thickspace t'_1 }{ t_1\thinspace t_2 \rightarrow t'_1\thinspace t_2 } \text{E-App1} \\ \frac{ t_2 \rightarrow t'_2 }{ v_1\thinspace t_2 \rightarrow v_1\thinspace t'_2 } \text{E-App2} \\ (\lambda x.t_1)\thinspace v_2 \rightarrow [x \mapsto v_2]t_1 \text{E-AppAbs} \ \ \ \ \ (15)
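
    Rules E-App1, E-App2 and E-AppAbs read off a one-step call-by-value evaluator almost verbatim. A sketch, reusing the tuple representation and subst from the previous snippet (the names is_value and step are mine):

        def is_value(t):
            return t[0] == 'abs'                 # values are abstractions

        def step(t):
            """One-step call-by-value evaluation; returns None if no rule applies."""
            if t[0] != 'app':
                return None
            t1, t2 = t[1], t[2]
            if not is_value(t1):                 # E-App1
                t1p = step(t1)
                return ('app', t1p, t2) if t1p else None
            if not is_value(t2):                 # E-App2
                t2p = step(t2)
                return ('app', t1, t2p) if t2p else None
            return subst(t1[1], t2, t1[2])       # E-AppAbs

        # (λx.x) (λz.z)  →  λz.z
        ident = ('abs', 'x', ('var', 'x'))
        print(step(('app', ident, ('abs', 'z', ('var', 'z')))))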

    Type Theory – Untyped Arithmetic Expressions

    The following are present in this first language:

    • true and false.
    • conditional expressions.
    • The numeric constant 0.
    • The arithmetic operators succ and pred.
    • A testing operation iszero.

    t ::= true
        | false
        | if t then t else t
        | 0
        | succ t
        | pred t
        | iszero t

    Programs

    A program in the above language is a term built up from the forms given in the grammar.

    if false then 0 else 1

    {\triangleright} 1

    (Here 1 abbreviates succ 0.)

    1. Syntax

    The grammar above is a way to specify the syntax of the language. One other way to do so is to specify it inductively using set theory.

    Definition 1 Terms, Inductively: The set of terms is the smallest set { \mathcal{T} } such that

    1. {true, false, 0} {\subseteq \mathcal{T}}.
    2. if {t_1 \in \mathcal{T}} then succ({t_1}), pred({t_1}), iszero({t_1}) {\in \mathcal{T}}.
    3. if {t_1} {\in \mathcal{T}}, {t_2 \in \mathcal{T}} and {t_3 \in \mathcal{T}} then if {t_1} then {t_2} else {t_3 \in \mathcal{T}}.

    Definition 2 Terms, By Inference Rules: The set of terms is defined by the following rules:

    Rules set 1:

    \displaystyle  true \in \mathcal{T} \ \ \ \ \ (1)

    \displaystyle  false \in \mathcal{T} \ \ \ \ \ (2)

    \displaystyle  0 \in \mathcal{T} \ \ \ \ \ (3)

    Rules set 2:

    \displaystyle  \frac{t_1 \in \mathcal{T}}{succ(t_1) \in \mathcal{T}} \ \ \ \ \ (4)

    \displaystyle  \frac{t_1 \in \mathcal{T}}{pred(t_1) \in \mathcal{T}} \ \ \ \ \ (5)

    \displaystyle  \frac{t_1 \in \mathcal{T}}{iszero(t_1) \in \mathcal{T}} \ \ \ \ \ (6)

    Rules set 3:

    \displaystyle  \frac{ t_1 \in \mathcal{T}, \thickspace t_2 \in \mathcal{T}, \thickspace t_3 \in \mathcal{T} } {if \thickspace t_1 \thickspace then \thickspace t_2 \thickspace else \thickspace t_3 \in \mathcal{T}}. \ \ \ \ \ (7)

    Each rule is read “If we have established the statements in the premises above the line, then we may derive the conclusion below the line”. Rules with no premises are called axioms.

    Definition 3 Terms, Concretely: The set of terms can also be given by constructing sets of successively bigger sizes.

    For each natural number {i} we define a set {\mathcal{S}_i} as follows:

    \displaystyle  \mathcal{S}_0 = \emptyset \ \ \ \ \ (8)

    \displaystyle  \mathcal{S}_{i + 1 } = \{true, \thickspace false, \thickspace 0\} \\ \cup \quad \{succ(t_1), \thickspace pred(t_1), \thickspace iszero(t_1) \thickspace | \thickspace t_1 \in \mathcal{S}_i\} \\ \cup \quad \{if \thickspace t_1 \thickspace then \thickspace t_2 \thickspace else \thickspace t_3 \thickspace | \thickspace t_1, \thickspace t_2, \thickspace t_3 \in \mathcal{S}_i\}. \ \ \ \ \ (9)

    Finally, let

    \displaystyle  \mathcal{S} = \bigcup_{i} \mathcal{S}_i \ \ \ \ \ (10)

    Inference rules are just another way of writing the set theoretic definition.

    Each inference rule is really a rule scheme: it stands for the infinite set of concrete rules obtained by replacing its metavariables with terms.

    Proposition 4 {\mathcal{T} = \mathcal{S}}.

    The inductive definition above shows that every term is built by one of three clauses. This can be used in two ways:

    • to give inductive definitions of functions over set of terms.
    • in proving theorems about the properties of terms.

    The consts, size and depth functions can be defined by induction over the set of terms, and facts relating them (for example, that the number of distinct constants appearing in a term is at most its size) can be proved by structural induction on terms. A sketch of size and depth is given below.
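
    A sketch of size and depth defined by induction on the structure of terms, with terms again written as tagged tuples; the representation and these particular clauses are my own rendering of the usual definitions:

        def size(t):
            """Number of nodes in a term of the arithmetic language."""
            tag = t[0]
            if tag in ('true', 'false', 'zero'):
                return 1
            if tag in ('succ', 'pred', 'iszero'):
                return 1 + size(t[1])
            if tag == 'if':
                return 1 + size(t[1]) + size(t[2]) + size(t[3])

        def depth(t):
            """Depth of the abstract syntax tree of a term."""
            tag = t[0]
            if tag in ('true', 'false', 'zero'):
                return 1
            if tag in ('succ', 'pred', 'iszero'):
                return 1 + depth(t[1])
            if tag == 'if':
                return 1 + max(depth(t[1]), depth(t[2]), depth(t[3]))

        t = ('if', ('iszero', ('zero',)), ('zero',), ('succ', ('zero',)))
        print(size(t), depth(t))    # 6 3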

    2. Semantic Styles

    2.1. Operational Semantics

    The behaviour of the program is described by specifying an abstract machine for it.

    The machine is abstract in the sense that it uses the terms of the language themselves as its machine code, rather than some low-level instruction set.

    The terms of the language are the states of the abstract machine. The meaning of a term is the final state that the machine reaches, that is, the value left when the machine is started with that term as its initial state and run to completion.

    It is customary to give two or more different operational semantics for a single language: one closer to the language in question, and another closer to the machine on which it is implemented.

    Proving that the above two operational semantics correspond in some suitable sense amounts to proving that the implementation is correct.

    2.2. Denotational Semantics

    Denotational semantics takes a more abstract view of meaning: instead of defining the meaning of a term as a sequence of machine states, the meaning of a term is taken to be some mathematical object, such as a function or a number.

    3. Evaluation

    Consider a language with grammar as below:

    Syntax:

    \displaystyle  t = \thickspace \text{terms} \\ = true \thickspace \text{constant true} \\ = false \thickspace \text{constant false} \\ = if \thickspace t \thickspace then \thickspace t \thickspace else \thickspace t \thickspace \text{conditional} \\ \text{v} = \thickspace \text{values} \\ = true \thickspace \text{true value} \\ = false \thickspace \text{false value} \ \ \ \ \ (11)

    Values, a subset of terms, are the possible final outcomes of evaluation.

    The evaluation relation is a binary relation on the set of terms.

    Evaluation: t {\rightarrow} t

    \displaystyle  if \thickspace true \thickspace then \thickspace t \thickspace else \thickspace u \thickspace \rightarrow \thickspace t \thickspace \text{E-IfTrue} \\ if \thickspace false \thickspace then \thickspace t \thickspace else \thickspace u \thickspace \rightarrow u \thickspace \text{E-IfFalse} \\ \frac{t \rightarrow u} { if \thickspace t \thickspace then \thickspace v \thickspace else \thickspace w \thickspace \rightarrow \thickspace if \thickspace u \thickspace then \thickspace v \thickspace else \thickspace w } \text{\textsc{E-If}} \ \ \ \ \ (12)

    Definition 5 Evaluation Strategy: The interplay between the rules determines a particular order of evaluation.

    Definition 6 Computation Rules: Rules such as E-IfTrue and E-IfFalse, which perform the actual step of computation.

    Definition 7 Congruence Rules: Rules such as E-If, which do no computation themselves but determine where a computation rule may be applied next.

    Definition 8 An instance of an inference rule is obtained by consistently replacing each metavariable by the same term in the rule's premises and conclusion.

    Definition 9 A rule is satisfied by a relation if, for each instance of the rule, either the conclusion is in the relation or one of the premises is not.

    Definition 10 A one-step evaluation relation {\rightarrow} is the smallest binary relation on the set of terms which satisfies all the evaluation inference rules for the set of terms.

    Definition 11 When a pair (t, t’) is in the one-step evaluation relation, we say that the evaluation statement t {\rightarrow} t’ is derivable.

    Theorem 12 (Determinacy of One-Step Evaluation) If t {\rightarrow} t’ and t {\rightarrow} t”, then t’ {=} t”.

    Definition 13 A term t is in normal form if no evaluation rule applies to it, i.e., if there is no t’ such that t {\rightarrow} t’.

    Theorem 14 Every value is in normal form.

    Theorem 15 If t is in normal form, then t is a value.

    Definition 16 The multi-step evaluation relation {\rightarrow^{*}} is the reflexive, transitive closure of the one-step evaluation relation. In particular, if t {\rightarrow^{*}} t’ and t’ {\rightarrow^{*}} t”, then t {\rightarrow^{*}} t”.

    Consider a language with grammar as below:

    Syntax:

    \displaystyle  t = \text{terms} \\ = 0 \thickspace \text{constant 0} \\ = succ \thickspace t \thickspace \text{successor} \\ = pred \thickspace t \thickspace \text{predecessor} \\ = iszero \thickspace t \thickspace \text{zero test} \\ = true \thickspace \text{constant true} \\ = false \thickspace \text{constant false} \\ = if \thickspace t \thickspace then \thickspace t \thickspace else \thickspace t \thickspace \text{conditional} \\ \text{v} = \text{values} \\ = true \thickspace \text{true value} \\ = false \thickspace \text{false value} \\ = nv \text{numeric value} \\ \text{nv} = \text{numeric values} \\ = 0 \thickspace \text{0 value} \\ = succ \thickspace nv \thickspace \text{successor value} \ \ \ \ \ (13)

    Values, a subset of terms, are the possible final outcomes of evaluation.

    The evaluation relation is a binary relation on the set of terms.

    Evaluation: t {\rightarrow} t

    \displaystyle  if \thickspace true \thickspace then \thickspace t \thickspace else \thickspace u \thickspace \rightarrow \thickspace t \thickspace \text{E-IfTrue} \\ if \thickspace false \thickspace then \thickspace t \thickspace else \thickspace u \thickspace \rightarrow \thickspace u \thickspace \text{E-IfFalse} \\ \frac{t \thickspace \rightarrow \thickspace u}{if \thickspace t \thickspace then \thickspace v \thickspace else \thickspace w \thickspace \rightarrow \thickspace if \thickspace u \thickspace then \thickspace v \thickspace else \thickspace w} \text{\textsc{E-If}} \ \ \ \ \ (14)
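
    Equation (14) only repeats the boolean rules; an evaluator for the full grammar also needs the standard rules for succ, pred and iszero (usually named E-Succ, E-PredZero, E-PredSucc, E-Pred, E-IsZeroZero, E-IsZeroSucc and E-IsZero), which are not spelled out here. A sketch in Python over tagged tuples, under that assumption:

        def is_nv(t):
            return t[0] == 'zero' or (t[0] == 'succ' and is_nv(t[1]))

        def step(t):
            """One-step evaluation; returns None when t is in normal form."""
            tag = t[0]
            if tag == 'if':
                c, a, b = t[1], t[2], t[3]
                if c[0] == 'true':  return a                    # E-IfTrue
                if c[0] == 'false': return b                    # E-IfFalse
                cp = step(c)                                    # E-If
                return ('if', cp, a, b) if cp else None
            if tag == 'succ':                                   # E-Succ
                tp = step(t[1])
                return ('succ', tp) if tp else None
            if tag == 'pred':
                if t[1][0] == 'zero': return ('zero',)          # E-PredZero
                if t[1][0] == 'succ' and is_nv(t[1][1]):
                    return t[1][1]                              # E-PredSucc
                tp = step(t[1])                                 # E-Pred
                return ('pred', tp) if tp else None
            if tag == 'iszero':
                if t[1][0] == 'zero': return ('true',)          # E-IsZeroZero
                if t[1][0] == 'succ' and is_nv(t[1][1]):
                    return ('false',)                           # E-IsZeroSucc
                tp = step(t[1])                                 # E-IsZero
                return ('iszero', tp) if tp else None
            return None                                         # values: no rule applies

        def evaluate(t):
            """Multi-step evaluation: apply step until a normal form is reached."""
            while True:
                tp = step(t)
                if tp is None:
                    return t
                t = tp

        # if iszero (pred (succ 0)) then true else false  →*  true
        print(evaluate(('if', ('iszero', ('pred', ('succ', ('zero',)))), ('true',), ('false',))))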

    Type Theory – Mathematical Preliminaries

    1. Sets, Relations and Functions

    Definition 1 Countable Set: A set is countable if it can be placed in one-to-one correspondence with the set of natural numbers {\mathbb{N}}.

    Definition 2 n-place Relation: An n-place relation on a collection of sets {\mathcal{S}_1, \mathcal{S}_2, \dotsc, \mathcal{S}_n} is a set {R} of tuples {R \subseteq \mathcal{S}_1 \times \mathcal{S}_2 \times \dotsc \times \mathcal{S}_n}. We say that elements {s_1 \in \mathcal{S}_1} through {s_n \in \mathcal{S}_n} are related by {R} if {(s_1, s_2, \dotsc, s_n) \in R}.

    Definition 3 Predicates: A one place relation is also called a predicate. One-place relation on a set {\mathcal{S}} is a subset {P} of {\mathcal{S}}. It is written as {P(s)} instead of {s \in P}.

    Definition 4 Binary Relation: A two place relation on sets {\mathcal{S}} and {\mathcal{T}} is also called a binary relation on the two sets. It is written as {s} {R} {t} instead of {(s, t) \in R}.

    Three or more place relations are written using mixfix syntax, with the elements separated by symbols. For example, {\Gamma \vdash \mathsf{s} : \mathsf{T}} stands for the triple {(\Gamma, \mathsf{s}, \mathsf{T})}.

    Definition 5 Domain: The domain of a relation {R} on sets {\mathcal{S}} and {\mathcal{T}} is the set of elements of {\mathcal{S}} that appear as the first component of some tuple in {R}.

    Definition 6 Range: The range of a relation {R} on sets {\mathcal{S}} and {\mathcal{T}} is the set of elements of {\mathcal{T}} that appear as the second component of some tuple in {R}.

    Definition 7 Partial Function: A binary relation {R} is a partial function if, whenever {(s, t_1) \in R} and {(s, t_2) \in R}, then, we have {t_1 = t_2 }.

    Definition 8 Total Function: A partial function {R} on {\mathcal{S}} and {\mathcal{T}} is a total function if its domain is all of {\mathcal{S}}.

    Definition 9 A partial function {f} on {\mathcal{S}} and {\mathcal{T}} is said to be defined on an element {s \in \mathcal{S}} if {(s, t) \in f} for some {t \in \mathcal{T}}. Otherwise, we write {f(s)\uparrow} or {f(s) = \uparrow} to mean “{f} is undefined on {s}”.

    2. Ordered Sets

    Definition 10 Reflexive Relation: A binary relation {R} on {\mathcal{S}} is said to be reflexive if {(s, s) \in R} for all {s \in \mathcal{S}}.

    Reflexiveness only makes sense for binary relations defined on a single set.

    Definition 11 Symmetric Relation: A binary relation {R} on {\mathcal{S}} is said to be symmetric if {(s, t) \in R \Rightarrow (t, s) \in R} for all {s, t \in \mathcal{S}}.

    Definition 12 Transitive Relation: A binary relation {R} on {\mathcal{S}} is said to be transitive if {(s, t) \text{ and } (t, u) \in R \Rightarrow (s, u) \in R} for all {s, t, u \in \mathcal{S}}.

    Definition 13 Anti-Symmetric Relation: A binary relation {R} on {\mathcal{S}} is said to be anti-symmetric if {(s, t) \in R \text{ and } (t, s) \in R \Rightarrow s = t} for all {s, t \in \mathcal{S}}.

    Definition 14 Preorder: A reflexive and transitive relation on a set is called a preorder on that set. Preorders are usually written using symbols such as {\leq} or {\sqsubseteq}.

    Definition 15 Partial Order: A preorder which is also anti-symmetric is called a partial order.

    Definition 16 Total Order: A partial order {\leq} on {\mathcal{S}} with the property that for every two elements {s, t \in \mathcal{S}} we have either {s \leq t} or {t \leq s} is called a total order.

    Definition 17 Join of two elements: The join (least upper bound) of {s} and {t}, if it exists, is the smallest element {j} with {s \leq j} and {t \leq j}; that is, {j \leq k} for every {k} satisfying {s \leq k} and {t \leq k}.

    Definition 18 Meet of two elements: The meet (greatest lower bound) of {s} and {t}, if it exists, is the largest element {m} with {m \leq s} and {m \leq t}.

    Definition 19 Equivalence Relation: A binary relation on a set is called an equivalence relation if it is reflexive, symmetric and transitive.

    Definition 20 Reflexive Closure: The reflexive closure of a binary relation {R} on a set {\mathcal{S}} is the smallest reflexive relation {R'} which contains {R}. (Smallest in the sense that if some other reflexive relation {R''} also contains {R}, then {R' \subseteq R''}.)

    Definition 21 Transitive Closure: The transitive closure of a binary relation {R} on a set {\mathcal{S}} is the smallest transitive relation {R'} which contains {R}. (Smallest in the sense that if some other transitive relation {R''} also contains {R}, then {R' \subseteq R''}.) It is written as {R^{+}}.

    Definition 22 Sequence: A sequence of elements of a set {\mathcal{S}} is an ordered (finite or infinite) list {s_1, s_2, s_3, \dotsc} of elements of {\mathcal{S}}.

    Definition 23 Permutation: A permutation of a finite sequence is a sequence containing exactly the same elements, possibly in a different order.

    Definition 24 Decreasing Chain: Given a preorder {\leq} on {\mathcal{S}}, a decreasing chain is a sequence {s_1, s_2, s_3, \dotsc} of elements of {\mathcal{S}} in which each {s_{i+1}} is strictly smaller than {s_i} (that is, {s_{i+1} \leq s_i} and {s_{i+1} \neq s_i}).

    Definition 25 Well Founded Preorder: A preorder on {\mathcal{S}} is well founded if it admits no infinite decreasing chains. For example, the usual order on the natural numbers is well founded, while the usual order on the integers is not.

    3. Induction

    Axiom (Principle of Ordinary Induction) Suppose that {P} is a predicate defined on natural numbers. Then:

    If {P(0)}

    and, for all {i}, {P(i)} implies {P(i + 1)},

    then, {P(n)} holds for all {n \in \mathbb{N}}.

    Hypothesis Testing and p-values

    1. Introduction

    Hypothesis testing is a method of inference.

    Definition 1 A hypothesis is a statement about a population parameter.

    Definition 2 Null and Alternate Hypothesis: We partition the parameter space {\Theta} into two disjoint sets {\Theta_0} and {\Theta_1} and we wish to test:

    \displaystyle  H_0 \colon \theta \in \Theta_0 \\ H_1 \colon \theta \in \Theta_1. \ \ \ \ \ (1)

    We call {H_0} the null hypothesis and {H_1} the alternate hypothesis.

    Definition 3 Rejection Region: Let {X} be a random variable and let {\mathcal{X}} be its range. Let {R \subset \mathcal{X}} be the rejection region.

    We accept {H_0} when {X} does not belong to {R} and reject when it does.

    Hypothesis testing is like a legal trial: we accept {H_0} unless the evidence suggests otherwise. Falsely convicting the accused when he is not guilty is a type I error, and letting the accused go free when he is in fact guilty is a type II error.

    Definition 4 Power Function of a Test: It is the probability of {X} being in the rejection region, expressed as a function of {\theta}.

    \displaystyle  \beta(\theta) = \mathbb{P}_{\theta}(X \in R) \ \ \ \ \ (2)

    Definition 5 Size of a Test: It is the supremum of the power function of the test over the null parameter space {\Theta_0}.

    \displaystyle  \alpha = \underset{\theta \in \Theta_0}{\text{sup}}(\beta(\theta)). \ \ \ \ \ (3)

    Definition 6 Level {\alpha} Test: A test with size less than or equal to {\alpha} is said to be a level {\alpha} test.

    Definition 7 Simple Hypothesis: A hypothesis of the form {\theta = \theta_0} is called a simple hypothesis.

    Definition 8 Composite Hypothesis: A hypothesis of the form {\theta < \theta_0} or {\theta > \theta_0} is called a composite hypothesis.

    Definition 9 Two Sided Test: A test of the form

    \displaystyle  H_0 \colon \theta = \theta_0 \\ H_1 \colon \theta \neq \theta_0. \ \ \ \ \ (4)

    is called a two-sided test.

    Definition 10 One Sided Test: A test of the form

    \displaystyle  H_0 \colon \theta < \theta_0 \\ H_1 \colon \theta > \theta_0. \ \ \ \ \ (5)

    or

    \displaystyle  H_0 \colon \theta > \theta_0 \\ H_1 \colon \theta < \theta_0. \ \ \ \ \ (6)

    is called a one-sided test.

    Example 1 (Hypothesis Testing on Normal Distribution:) Let {X_1, \dotsc, X_n \sim N(\mu, \sigma^2)} where {\sigma} is known. We want to test {H_0 \colon \mu \leq 0} versus {H_1 \colon \mu > 0}. Hence {\Theta_0 = (- \infty, 0]} and {\Theta_1 = ( 0, \infty)}.

    Consider the test:

    \displaystyle  \text{reject } H_0 \text{ if } T > c \ \ \ \ \ (7)

    where {T = \overline{X}}.

    Then the rejection region is

    \displaystyle  R = \{(x_1, \dotsc, x_n) : \overline{X} > c \} \ \ \ \ \ (8)

    The power function of the test is

    \displaystyle  \beta(\mu) = \mathbb{P}_{\mu}(X \in R) \\ = \mathbb{P}_{\mu}(\overline{X} > c) \\ = \mathbb{P}_{\mu}\left( \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) \\ = \mathbb{P}_{\mu}\left( Z > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) \\ = 1 - \Phi\left(\frac{\sqrt{n}(c - \mu)}{\sigma}\right). \ \ \ \ \ (9)

    The size of the test is

    \displaystyle  \text{size} = \underset{\mu \leq 0}{\text{sup}}(\beta(\mu)) \\ = \underset{\mu \leq 0}{\text{sup}}\left(1 - \Phi\left(\frac{\sqrt{n}(c - \mu)}{\sigma}\right)\right) \\ = \beta(0) \\ = 1 - \Phi\left(\frac{c\sqrt{n}}{\sigma}\right). \ \ \ \ \ (10)

    Equating with {\alpha} we obtain

    \displaystyle  \alpha = 1 - \Phi\left(\frac{c\sqrt{n}}{\sigma}\right). \ \ \ \ \ (11)

    Hence

    \displaystyle  c = \frac{\sigma \thinspace \Phi^{-1}(1 - \alpha)}{\sqrt{n}}. \ \ \ \ \ (12)

    We reject when {\overline{X} > c}. For a test of size {\alpha = 0.05}, {\Phi^{-1}(1 - \alpha) = \Phi^{-1}(0.95) \approx 1.645}, so

    \displaystyle  c = \frac{1.645 \thinspace \sigma}{\sqrt{n}}. \ \ \ \ \ (13)

    that is, we reject when

    \displaystyle  \overline{X} > \frac{1.645 \thinspace \sigma}{\sqrt{n}}. \ \ \ \ \ (14)
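
    The calculation can be checked numerically. A sketch using scipy, where norm.ppf and norm.cdf play the roles of {\Phi^{-1}} and {\Phi}; the particular values of {\sigma} and {n} are made up for illustration:

        import numpy as np
        from scipy.stats import norm

        sigma, n, alpha = 2.0, 100, 0.05
        c = sigma * norm.ppf(1 - alpha) / np.sqrt(n)      # critical value, ≈ 0.329

        def power(mu):
            """β(μ) = P_μ(X̄ > c) for X̄ ~ N(μ, σ²/n)."""
            return 1 - norm.cdf(np.sqrt(n) * (c - mu) / sigma)

        print(c, power(0.0))    # the power at μ = 0 equals the size α = 0.05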

    2. The Wald Test

    Let {X_1, \dotsc, X_n} be \textsc{iid} random variables whose distribution function {F_X(x; \theta)} depends on a parameter {\theta}. Let {\hat{\theta}} be an estimate of {\theta} and let {\widehat{\textsf{se}}} be an estimate of the standard deviation of the sampling distribution of {\hat{\theta}}.

    Definition 11 (The Wald Test)

    Consider testing a two-sided hypothesis:

    \displaystyle  H_0 \colon \theta = \theta_0 \\ H_1 \colon \theta \neq \theta_0. \ \ \ \ \ (15)

    Assume that {\hat{\theta}} has an asymptotically normal distribution.

    \displaystyle  \frac{\hat { \theta } - \theta }{\widehat{\textsf{se}} } \rightsquigarrow N(0, 1). \ \ \ \ \ (16)

    Then, the size-{\alpha} Wald test is: reject {H_0} if {|W| > z_{\alpha/2}}, where

    \displaystyle  W = \frac{\hat { \theta } - \theta_0 }{\widehat{\textsf{se}}}. \ \ \ \ \ (17)

    Theorem 12 Asymptotically, the Wald test has size {\alpha}.

    \displaystyle  size \\ = \underset{\theta \in \Theta_0}{\text{sup}}(\beta(\theta)) \\ = \underset{\theta \in \Theta_0}{\text{sup}}\mathbb{P}_{\theta}(X \in R) \\ = \mathbb{P}_{\theta_0}(X \in R) \\ = \mathbb{P}_{\theta_0}(|W| > z_{\alpha/2}) \\ = \mathbb{P}_{\theta_0}\left(\left| \frac{\hat { \theta } - \theta_0 }{\widehat{\textsf{se}}}\right| > z_{\alpha/2}\right) \\ \rightarrow \mathbb{P}(|Z| > z_{\alpha/2}) \\ = \alpha. \ \ \ \ \ (18)

    Example 2 Two experiments are conducted to test two prediction algorithms.

    The prediction algorithms are applied {n} and {m} times, respectively, and predict successfully with probabilities {p_1} and {p_2}, respectively.

    Let {\delta = p_1 - p_2}.

    Consider testing a two-sided hypothesis:

    \displaystyle  H_0 \colon \delta = 0 \\ H_1 \colon \delta \neq 0. \ \ \ \ \ (19)
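
    The example can be carried through with the Wald statistic of Definition 11, taking {\hat{\theta} = \hat{\delta} = \hat{p}_1 - \hat{p}_2} and the usual plug-in standard error {\widehat{\textsf{se}} = \sqrt{\hat{p}_1(1-\hat{p}_1)/n + \hat{p}_2(1-\hat{p}_2)/m}}. That particular {\widehat{\textsf{se}}} formula, and the made-up counts below, are assumptions of this sketch rather than part of the text:

        import numpy as np
        from scipy.stats import norm

        # Hypothetical outcomes of the two experiments.
        n, successes1 = 200, 160        # algorithm 1: p̂1 = 0.80
        m, successes2 = 250, 180        # algorithm 2: p̂2 = 0.72

        p1_hat, p2_hat = successes1 / n, successes2 / m
        delta_hat = p1_hat - p2_hat
        se_hat = np.sqrt(p1_hat * (1 - p1_hat) / n + p2_hat * (1 - p2_hat) / m)

        W = delta_hat / se_hat                       # Wald statistic for H0: δ = 0
        p_value = 2 * (1 - norm.cdf(abs(W)))         # two-sided p-value

        alpha = 0.05
        print(W, p_value, abs(W) > norm.ppf(1 - alpha / 2))   # reject H0 if True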

    3. The Likelihood Ratio Test

    This test can be used to test vector valued parameters as well.

    Definition 13 The likelihood ratio test statistic for testing {H_0 \colon \theta \in \Theta_0} versus {H_1 \colon \theta \in \Theta_1} is

    \displaystyle  \lambda(x) = \frac{ \underset{\theta \in \Theta_0}{\text{sup}} (L( \theta|\mathbf{x} )) }{ \underset{\theta \in \Theta}{\text{sup}} (L( \theta|\mathbf{x} )) }. \ \ \ \ \ (20)

    Parametric Inference

    There are two methods of estimating {\theta}.

    1. Method of Moments

    It is a method of generating parametric estimators. These estimators are not optimal but they are easy to compute. They are also used to generate starting values for other numerical parametric estimation methods.

    Definition 1 Moments and Sample Moments:

    Suppose that the parameter {\theta} has {k} components: {\theta = (\theta_1,\dotsc,\theta_k)}. For {1 \leq j \leq k},

    Define {j^{th}} moment as

    \displaystyle  \alpha_j \equiv \alpha_j(\theta) = \mathbb{E}_\theta(X^j) = \int \mathrm{x}^{j}\,\mathrm{d}F_{\theta}(x). \ \ \ \ \ (1)

    Define {j^{th}} sample moment as

    \displaystyle  \hat{\alpha_j} = \frac{1}{n}\sum_{i=1}^n X_i^j. \ \ \ \ \ (2)

    Definition 2

    The method of moments estimator {\hat{\theta_n}} is the value of {\theta} which satisfies

    \displaystyle  \alpha_1(\hat{\theta_n}) = \hat{\alpha_1} \\ \alpha_2(\hat{\theta_n}) = \hat{\alpha_2} \\ \vdots \\ \alpha_k(\hat{\theta_n}) = \hat{\alpha_k} \ \ \ \ \ (3)

    Why the above method works: the method of moments estimator is obtained by equating the {j^{th}} theoretical moment with the {j^{th}} sample moment for each {j}. Since there are {k} of them, we get {k} equations in the {k} unknown parameters. This works because the theoretical moments can be written in terms of the unknown parameters, while the sample moments can be computed numerically from the observed sample, as illustrated in the sketch below.
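
    A concrete sketch for the two-parameter Normal model {N(\mu, \sigma^2)}, where {\alpha_1 = \mu} and {\alpha_2 = \mu^2 + \sigma^2}, so the two moment equations can be solved in closed form; the simulated data below are only for illustration:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(loc=5.0, scale=2.0, size=1000)

        # Sample moments.
        a1_hat = np.mean(X)           # (1/n) Σ X_i
        a2_hat = np.mean(X ** 2)      # (1/n) Σ X_i²

        # Solve α1(θ) = â1 and α2(θ) = â2 with α1 = μ, α2 = μ² + σ².
        mu_hat = a1_hat
        sigma2_hat = a2_hat - a1_hat ** 2

        print(mu_hat, np.sqrt(sigma2_hat))   # close to 5.0 and 2.0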

    2. Maximum Likelihood Method

    It is the most common method for estimating parameters in a parametric model.

    Definition 3 Likelihood Function: Let {X_1, \dotsc, X_n} have a \textsc{pdf} { f_X(x;\theta)}. The likelihood function is defined as

    \displaystyle  \mathcal{L}_n(\theta) = \prod_{i=1}^n f_X(X_i;\theta). \ \ \ \ \ (4)

    The log-likelihood function is defined as {\ell_n(\theta) = \log(\mathcal{L}_n(\theta))}.

    The likelihood function is the joint density of the data. We treat it as a function of the parameter {\theta}. Thus {\mathcal{L}_n \colon \Theta \rightarrow [0, \infty)}.

    Definition 4 Maximum Likelihood Estimator: It is the value of {\theta}, {\hat{\theta}} which maximizes the likelihood function.

    Example 1 Let {X_1, \dotsc, X_n} be \textsc{iid} random variables with {\text{Unif}(0, \theta)} distribution.

    \displaystyle  f_X(x; \theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (5)

    If {X_{max} = \max\{X_1, \dotsc, X_n \}} and {X_{max} > \theta}, then {\mathcal{L}_n(\theta) = 0}. Otherwise {\mathcal{L}_n(\theta) = (\frac{1}{\theta})^n }, which is a decreasing function of {\theta}. Hence the likelihood is maximized at {\hat{\theta} = X_{max}}.
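
    A numerical check of this argument; the data are simulated and the grid search is only there to confirm the analytic answer:

        import numpy as np

        rng = np.random.default_rng(1)
        theta_true = 3.0
        X = rng.uniform(0.0, theta_true, size=50)

        def likelihood(theta):
            """L_n(θ) for the Unif(0, θ) model: θ^(-n) if all X_i < θ, else 0."""
            return 0.0 if X.max() > theta else theta ** (-len(X))

        grid = np.linspace(0.1, 5.0, 2000)
        theta_mle = grid[np.argmax([likelihood(t) for t in grid])]

        print(X.max(), theta_mle)   # the grid maximizer sits just above max X_i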

    3. Properties of MLE

  • Consistent: The MLE is consistent: the estimate converges in probability to the true parameter value.
  • Equivariant: The MLE is equivariant: if {\hat{\theta}} is the MLE of {\theta}, then {g(\hat{\theta})} is the MLE of {g(\theta)}.
  • Asymptotically Normal: The MLE is asymptotically normal.
  • Asymptotically Optimal: Roughly, among well-behaved estimators the MLE has the smallest asymptotic variance.
  • Bayes Estimator: The MLE is also approximately the Bayes estimator.

    4. Consistency of MLE

    Definition 5 Kullback-Leibler Distance: If {f} and {g} are \textsc{pdf}s, the Kullback-Leibler distance between them is defined as

    \displaystyle  D(f, g) = \int f(x) \log \left( \frac{f(x) }{g(x) } \right) dx. \ \ \ \ \ (6)

    5. Equivariance of MLE

    6. Asymptotic Normality of MLE

    The distribution of {\hat{\theta}} is asymptotically normal. We need the following definitions to prove it.

    Definition 6 Score Function: Let {X} be a random variable with \textsc{pdf} {f_X(x; \theta)}. Then the score function is defined as

    \displaystyle  s(X; \theta) = \frac{\partial \log f_X(X; \theta) }{\partial \theta}. \ \ \ \ \ (7)

    Definition 7 Fisher Information: The Fisher Information is defined as

    \displaystyle  I_n(\theta) = \mathbb{V}_{\theta}\left( \sum_{i=1}^n s(X_i; \theta) \right) \\ = \sum_{i=1}^n \mathbb{V}_{\theta}\left(s(X_i; \theta) \right). \ \ \ \ \ (8)

    Theorem 8

    \displaystyle  \mathbb{E}_{\theta}(s(X; \theta)) = 0. \ \ \ \ \ (9)

    Theorem 9

    \displaystyle  \mathbb{V}_{\theta}(s(X; \theta)) = \mathbb{E}_{\theta}(s^2(X; \theta)). \ \ \ \ \ (10)

    Theorem 10

    \displaystyle  I_n(\theta) = nI(\theta). \ \ \ \ \ (11)

    Theorem 11

    \displaystyle  I(\theta) = -\mathbb{E}_{\theta}\left( \frac{\partial^2 \log f_X(x; \theta) }{\partial \theta^2} \right) \\ = -\int \left( \frac{\partial^2 \log f_X(x; \theta) }{\partial \theta^2} \right)f_X(x; \theta) dx. \ \ \ \ \ (12)

    Definition 12 Let { \textsf{se} = \sqrt{\mathbb{V}(\hat{\theta} ) } }.

    Theorem 13

    \displaystyle  \textsf{se} \approx \sqrt{1/I_n(\theta)}. \ \ \ \ \ (13)

    Theorem 14

    \displaystyle  \frac{\hat { \theta } - \theta }{\textsf{se} } \rightsquigarrow N(0, 1). \ \ \ \ \ (14)

    Theorem 15 Let { \hat{\textsf{se}} = \sqrt{1/I_n(\hat{\theta})} }.

    \displaystyle  \frac{\hat { \theta } - \theta }{\hat{\textsf{se}}} \rightsquigarrow N(0, 1). \ \ \ \ \ (15)

    A {1 - \alpha} confidence interval for {\theta} is an interval {C_n = (a, b)}, computed from the data, such that {\mathbb{P}_\theta(\theta \in C_n) \geq 1 - \alpha}. In words, {(a, b)} traps {\theta} with probability {1 - \alpha}. We call {1 - \alpha} the coverage of the confidence interval. {C_n} is random and {\theta} is fixed. Commonly, people use 95 percent confidence intervals, which corresponds to choosing {\alpha} = 0.05. If {\theta} is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

    Theorem 16 (Normal Based Confidence Intervals)

    Let {\hat{\theta} \approx N(\theta,\hat{\textsf{se}}^2 )}.

    and let {\Phi} be the \textsc{cdf} of a random variable {Z} with standard normal distribution and

    \displaystyle  z_{\alpha / 2} = \Phi^{- 1 } ( 1 - \alpha / 2) \\ \mathbb{P}(Z > z _{\alpha / 2} ) = \alpha / 2 \\ \mathbb{P}(- z _{\alpha / 2} < Z < z _{\alpha / 2}) = 1 - \alpha. \ \ \ \ \ (16)

    and let

    \displaystyle  C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right) \ \ \ \ \ (17)

    Then,

    \displaystyle  \mathbb{P}_{\theta} (\theta \in C _n ) \rightarrow 1 - \alpha. \ \ \ \ \ (18)

    For 95% confidence intervals { 1 - \alpha} is .95, {\alpha} is .05, {z _{\alpha / 2}} is 1.96 and the interval is thus { C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right) = \left( \hat{\theta} - 1.96\hat{\textsf{se}} , \hat{\theta} + 1.96\hat{\textsf{se}} \right) } .
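
    For instance, for the Bernoulli MLE {\hat{p}} one can take {\hat{\textsf{se}} = \sqrt{\hat{p}(1-\hat{p})/n}}, which equals {\sqrt{1/I_n(\hat{p})}} for this model, and form the interval directly. A sketch with simulated data (the true {p} and the sample size are arbitrary):

        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(2)
        n, p_true = 400, 0.3
        X = rng.binomial(1, p_true, size=n)

        p_hat = X.mean()                              # MLE of p
        se_hat = np.sqrt(p_hat * (1 - p_hat) / n)     # √(1/I_n(p̂)) for Bernoulli

        alpha = 0.05
        z = norm.ppf(1 - alpha / 2)                   # ≈ 1.96
        C_n = (p_hat - z * se_hat, p_hat + z * se_hat)

        print(p_hat, C_n)    # a 95% confidence interval for p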

    Introduction to Statistical Inference

    1. Introduction

    We assume that the data we are looking at comes from a probability distribution with some unknown parameters that control the exact shape of the distribution.

    Definition 1 Statistical Inference: It is the process of using given data to infer the properties of the distribution (for example the values of the parameters) which generated the data. It is also called ‘learning’ in computer science.

    Definition 2 Statistical Models: A statistical model is a set of distributions.

    When we find out the form of the distribution (the equations that describe it) and the parameters used in the form we gain more understanding of the source of our data.

    2. Parametric Models

    Definition 3 Parametric Models: A parametric model is a statistical model which is parameterized by a finite number of parameters. A general form of a parametric model is

    \displaystyle  \mathfrak{F} = \{f(x;\theta) : \theta \in \Theta\} \ \ \ \ \ (1)

    where {\theta} is an unknown parameter (or vector of parameters) that can take values in the parameter space {\Theta}.

    Example 1 An example of a parametric model is:

    \displaystyle  \mathfrak{F} = \{f(x;\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}exp\{-\frac{1}{2\sigma^2}(x-\mu)^2\}, \mu \in \mathbb{R}, \sigma > 0\} \ \ \ \ \ (2)

    3. Non-Parametric Models

    Definition 4 Non-Parametric Models: A non-parametric model is one that cannot be parameterized by a finite number of parameters, for example {\mathfrak{F}_{ALL} = \{\text{all CDFs}\}}.

    3.1. Non-Paramateric Estimation of Functionals

    Definition 5 Sobolev Space: Usually it is not possible to estimate a probability density from data just by assuming that it exists; we need to restrict the space of candidate densities. One way is to require the density to be a suitably smooth function. Such a restricted space of smooth functions is called a Sobolev space.

    Definition 6 Statistical Functional: Any function of the \textsc{cdf} {F} is called a statistical functional.

    Example 2 Statistical Functionals: The mean, variance and median can be thought of as functions of {F}:

    The mean {\mu} is given as:

    \displaystyle  \mu = T(F) = \int x dF(x) \ \ \ \ \ (3)

    The variance is given as:

    \displaystyle  T(F) = \int x^2 dF(x) - \left(\int xdF(x)\right)^2 \ \ \ \ \ (4)

    The median is given as:

    \displaystyle  T(F) = F^{-1}(1/2) \ \ \ \ \ (5)

    4. Regression

    Definition 7 Independent and Dependent Variables: We observe pairs of data: {(X_1, Y_1),\dotsc,(X_n, Y_n)}. {Y} is assumed to depend on {X} which is assumed to be the independent variable. The other names for these are, for:

  • {X}: predictor, regressor, feature or independent variable.
  • {Y}: response variable, outcome or dependent variable.
    Definition 8 Regression Function: The regression function is

    \displaystyle  r(x) = \mathbb{E}(Y|X=x) \ \ \ \ \ (6)

    Definition 9 Parametric and Non-Parametric Regression Models: If we assume that {r \in \mathfrak{F} \text{ and } \mathfrak{F}} is finite dimensional, then the model is a parametric regression model, otherwise it is a non-parametric regression model.

    There can be three categories of regression, based on the purpose for which it was done:

    • Prediction,
    • Classification and
    • Curve Estimation

    Definition 10 Prediction: The goal of predicting {Y} based on the value of {X} is called prediction.

    Definition 11 Classification: If {Y} is discrete then prediction is instead called classification.

    Definition 12 Curve Estimation: If our goal is to estimate the function {r}, then we call this regression or curve estimation.

    Writing {\epsilon = Y - r(X)}, where {r(x) = \mathbb{E}(Y|X=x)} is the regression function, the relationship between {Y} and {X} can be expressed in the form

    \displaystyle  Y = r(X) + \epsilon \ \ \ \ \ (7)

    where {\mathbb{E}(\epsilon) = 0}.

    If {\mathfrak{F} = \{f(x;\theta) : \theta \in \Theta\}} is a parametric model, then we write {P_{\theta}(X \in A) = \int_A f_X(x; \theta) \thinspace dx } to denote the probability that {X} belongs to {A} when the parameter is {\theta}. The subscript does not mean that we are averaging over {\theta}; it means that the probability is calculated assuming the parameter equals {\theta}.

    5. Fundamental Concepts in Inference

    Many inferential problems can be identified as being one of three types: estimation, confidence sets, or hypothesis testing.

    5.1. Point Estimates

    Definition 13 Point Estimation: Point estimation refers to providing a single “best guess” of some quantity of interest. The quantity of interest could be

    • a parameter in a parametric model,
    • a \textsc{cdf} {F},
    • a probability density function {f},
    • a regression function {r}, or
    • a prediction for a future value {Y} of some random variable.

    By convention, we denote a point estimate of {\theta \text{ by } \hat{\theta}}. Since {\theta} is a fixed, unknown quantity and the estimate {\hat{\theta}} depends on the data, {\hat{\theta}} is a random variable.

    Definition 14 Point Estimator {\hat{\theta}_n}: Formally, let {X_1, \dotsc, X_n} be n \textsc{iid} data points from some distribution {F}. Then, a point estimator {\hat { \theta } } of {\theta} is some function of {X_1, \dotsc, X_n}:

    \displaystyle  \hat { \theta } = g( X_1, \dotsc, X_n ). \ \ \ \ \ (8)

    Definition 15 Bias of an Estimator: The bias of an estimator is defined as:

    \displaystyle  \textsf{bias}(\hat{\theta}) = \mathbb{E}( \hat{\theta} ) - \theta. \ \ \ \ \ (9)

    Definition 16 Consistent Estimator: A point estimator {\hat { \theta } } of {\theta} is consistent if: {\hat{\theta} \xrightarrow{P} \theta}.

    Definition 17 Sampling Distribution: The distribution of {\hat{\theta}} is called sampling distribution.

    Definition 18 Standard Error: The standard deviation of the sampling distribution is called standard error denoted by \textsf{se}.

    \displaystyle  \textsf{se} = \textsf{se}(\hat{\theta}) = \sqrt{\mathbb{V}(\hat{\theta})}. \ \ \ \ \ (10)

    In some cases, \textsf{se} depends upon the unknown distribution {F}. Its estimate is denoted by {\widehat{\textsf{se}}}.

    Definition 19 Mean Squared Error: It is used to evaluate the quality of a point estimator. It is defined as

    \displaystyle  \textsf{\textsc{mse}}(\hat{\theta}) = \mathbb{E}_{ \theta } ( \hat{\theta} - \theta)^2. \ \ \ \ \ (11)

    Example 3 Let {X_1, \dotsc, X_n} be \textsc{iid} random variables with Bernoulli(p) distribution and let {\hat { p } = \frac{1}{n} \sum_{i = 1}^nX_i }. Then {\mathbb{E}( \hat { p }) = p}, so {\hat { p }} is unbiased.
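
    A quick simulation illustrating bias, standard error and \textsc{mse} for this Bernoulli example; the parameter values and replication count are arbitrary choices:

        import numpy as np

        rng = np.random.default_rng(3)
        n, p, reps = 50, 0.4, 20000

        # Sampling distribution of p̂ = (1/n) Σ X_i, approximated by simulation.
        p_hats = rng.binomial(n, p, size=reps) / n

        bias = p_hats.mean() - p            # ≈ 0: p̂ is unbiased
        se = p_hats.std()                   # ≈ √(p(1-p)/n)
        mse = np.mean((p_hats - p) ** 2)    # ≈ se², since the bias is ≈ 0

        print(bias, se, np.sqrt(p * (1 - p) / n), mse)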

    Definition 20 Asymptotically Normal Estimator: An estimator is asymptotically normal if

    \displaystyle  \frac{\hat { \theta } - \theta }{\textsf{se} } \rightsquigarrow N(0, 1). \ \ \ \ \ (12)

    5.2. Confidence Sets

    Definition 21 A {1 - \alpha} confidence interval for a parameter {\theta} is an interval {C_n = (a, b)} (where {a = a(X_1,\dotsc, X_n ) } and {b = b(X_1,\dotsc, X_n ) } are functions of the data), such that

    \displaystyle  \mathbb{P}_\theta(\theta \in C_n) \geq 1 - \alpha, \forall \: \theta \in \Theta. \ \ \ \ \ (13)

    In words, {(a, b)} traps {\theta} with probability {1 - \alpha}. We call {1 - \alpha} the coverage of the confidence interval. {C_n} is random and {\theta} is fixed. Commonly, people use 95 percent confidence intervals, which corresponds to choosing {\alpha} = 0.05. If {\theta} is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

    Theorem 22 (Normal Based Confidence Intervals)

    Let {\hat{\theta} \approx N(\theta,\hat{\textsf{se}}^2 )}.

    and let {\Phi} be the \textsc{cdf} of a random variable {Z} with standard normal distribution and

    \displaystyle  z_{\alpha / 2} = \Phi^{- 1 } ( 1 - \alpha / 2) \\ \mathbb{P}(Z > z _{\alpha / 2} ) = \alpha / 2 \\ \mathbb{P}(- z _{\alpha / 2} < Z < z _{\alpha / 2}) = 1 - \alpha. \ \ \ \ \ (14)

    and let

    \displaystyle  C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right) \ \ \ \ \ (15)

    Then,

    \displaystyle  \mathbb{P}_{\theta} (\theta \in C _n ) \rightarrow 1 - \alpha. \ \ \ \ \ (16)

    For 95% confidence intervals { 1 - \alpha} is .95, {\alpha} is .05, {z _{\alpha / 2}} is 1.96 and the interval is thus { C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right) = \left( \hat{\theta} - 1.96\hat{\textsf{se}} , \hat{\theta} + 1.96\hat{\textsf{se}} \right) } .

    5.3. Hypothesis Testing

    In hypothesis testing, we start with some default theory – called a null hypothesis – and we ask if the data provide sufficient evidence to reject the theory. If not we retain the null hypothesis.

    Convergence of Random Variables

    1. Introduction

    There are two main ideas in this article.

  • Law of Large Numbers:
    This states that the mean of the sample {\overline{X}_n} converges in probability to the distribution mean {\mu} as {n} increases.

  • Central Limit Theorem:
    This states that the distribution of the sample mean converges in distribution to a normal distribution as {n} increases.

    2. Types of Convergence

    Let {X_1, \dotsc, X_n} be a sequence of random variables with distributions {F_n} and let {X} be a random variable with distribution {F}.

    Definition 1 Convergence in Probability: {X_n} converges to {X} in probability, written as {X_n \overset{P}{\longrightarrow} X}, if for every {\epsilon > 0}, we have

    \displaystyle  \mathbb{P}(|X_n - X| > \epsilon) \rightarrow 0 \ \ \ \ \ (1)

    as {n \rightarrow \infty}.

    Definition 2 Convergence in Distribution: {X_n} converges to {X} in distribution, written as {X_n \rightsquigarrow X}, if

    \displaystyle  \underset{n \rightarrow \infty}{\text{lim}} F_n(t) = F(t) \ \ \ \ \ (2)

    at all {t} for which {F} is continuous.

    3. The Law of Large Numbers

    Let {X_1, X_2, \dotsc} be \textsc{iid} with mean {\mu = \mathbb{E}(X_1)} and variance {\sigma^2 = \mathbb{V}(X_1)}. Let sample mean be defined as {\overline{X}_n = (1/n)\sum_{i=1}^n X_i}. It can be shown that {\mathbb{E}(\overline{X}_n) = \mu} and {\mathbb{V}(\overline{X}_n) = \sigma^2/n}.

    Theorem 3 Weak Law of Large Numbers: If {X_1, \dotsc, X_n} are \textsc{iid} random variables, then {\overline{X}_n \overset{P}{\longrightarrow} \mu}.

    4. The Central Limit Theorem

    The law of large numbers says that the distribution of the sample mean, {\overline{X}_n}, piles up near the true distribution mean, {\mu}. The central limit theorem further adds that the distribution of the sample mean approaches a Normal distribution as n gets large. It even gives the mean and the variance of the normal distribution.

    Theorem 4 The Central Limit Theorem: Let {X_1, \dotsc, X_n} be \textsc{iid} random variables with mean {\mu} and standard deviation {\sigma}. Let sample mean be defined as {\overline{X}_n = (1/n)\sum_{i=1}^n X_i}. Then the asymptotic behaviour of the distribution of the sample mean is given by

    \displaystyle  Z_n = \frac{ (\overline{X}_n - \mu) } { \sqrt{ \mathbb{V}( \overline{X}_n ) } } = \frac{ \sqrt{n}(\overline{X}_n - \mu) } { \sigma } \rightsquigarrow N(0,1) \ \ \ \ \ (3)

    The other ways of expressing the above equation are

    \displaystyle  Z_n \approx N(0, 1) \\ \overline{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \\ (\overline{X}_n - \mu) \approx N(0, \sigma^2/n) \\ \sqrt{n}(\overline{X}_n - \mu) \approx N(0,\sigma^2) \\ \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \approx N(0,1). \ \ \ \ \ (4)

    Definition 5 As has been defined in the Expectation chapter, if {X_1, \dotsc, X_n} are random variables, then we define the sample variance as

    \displaystyle  S_n^2 = \frac{1}{n - 1}\left(\sum_{i=1}^n (\overline{X}_n - X_i)^2\right). \ \ \ \ \ (5)

    Theorem 6 Assuming the conditions of the CLT,

    \displaystyle  \frac{\sqrt{n}(\overline{X}_n - \mu)}{S_n} \approx N(0,1). \ \ \ \ \ (6)
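
    A simulation illustrating the theorem: standardized sample means of a decidedly non-normal distribution (exponential) behave like draws from {N(0, 1)}. The choice of distribution, sample size and number of replications is arbitrary:

        import numpy as np

        rng = np.random.default_rng(4)
        n, reps = 100, 10000
        mu = 1.0                                   # mean of Exponential(1)

        samples = rng.exponential(scale=1.0, size=(reps, n))
        xbar = samples.mean(axis=1)                # sample means X̄_n
        S = samples.std(axis=1, ddof=1)            # sample standard deviations S_n

        Z = np.sqrt(n) * (xbar - mu) / S           # Theorem 6: approximately N(0, 1)
        print(Z.mean(), Z.std())                   # ≈ 0 and ≈ 1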

    5. The Delta Method

    Theorem 7 Let {Y_n} be a sequence of random variables satisfying {\sqrt{n}(Y_n - \mu)/\sigma \rightsquigarrow N(0,1)}, and let {g} be a differentiable function with {g'(\mu) \neq 0}. Then,

    \displaystyle  Y_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \\ \Longrightarrow \quad g(Y_n) \approx N\left(g(\mu), (g'(\mu))^2\frac{\sigma^2}{n}\right). \ \ \ \ \ (7)