Type Theory – The Untyped Lambda Calculus

The Untyped Lambda Calculus, or the pure lambda calculus, forms the computational substrate for most type systems.

Peter Landin observed that a complex programming language can be understood as a tiny core language together with a collection of convenient derived features that can be translated into that core.

Lambda Calculus can be viewed as a simple programming language or as a mathematical object about which statements can be proved.

Everything is a function.

Definition 1 Lambda Term: Any term of the lambda calculus is called a lambda term.

Definition 2 Lambda Abstraction: Lambda terms beginning with a {\lambda} are called lambda abstractions.

There are three kinds of terms in the lambda calculus.

  1. A variable {x} itself is a term.
  2. The abstraction of a variable from a term written as {\lambda{x.t}} is a term called a lambda abstraction.
  3. The application of a term to another term, written {t \thickspace t}, is a term.

The grammar can be summarized as:

Syntax:

\displaystyle  {t} := \text{terms} \\ := {x} \thickspace \text{variable} \\ := \lambda {x.t} \thickspace \text{lambda abstraction} \\ := {t \thickspace t} \thickspace \text{application} \ \ \ \ \ (1)

To save writing too many parentheses, two conventions are used:

  1. application associates to the left, that is, {s \thickspace t \thickspace u} stands for {(s \thickspace t) \thickspace u}.
  2. the body of an abstraction extends as far to the right as possible. For example, {\lambda{x.} \lambda{y.x \thickspace y \thickspace x}} stands for {\lambda{x.(} \lambda{y.((x \thickspace y) \thickspace x))}}.

1. Scope

An occurrence of a variable {x} is said to be bound when it occurs in the body {t} of an abstraction {\lambda{x.t}}.

An occurrence of {x} is said to be free if it is not bound by an enclosing abstraction on {x}.

An abstraction with no free variables is said to be closed or a combinator.

2. Operational Semantics

In its pure form, the lambda calculus has no mechanism for computation other than function application: functions are applied to arguments which are themselves functions. Each step of computation rewrites an application by substituting the argument on the right-hand side of the application for the bound variable in the body of the lambda abstraction.

\displaystyle  (\lambda x.{t}_{12})\thickspace{t}_2 \rightarrow [{x} \mapsto {t}_2] {t}_{12} \ \ \ \ \ (2)

There can be many different evaluation strategies.

  • Full-beta Reduction: Any redex may be reduced at any time.

    \displaystyle  (\lambda x.x)((\lambda x.x)(\lambda z.(\lambda x.x) \thickspace z)) \ \ \ \ \ (3)

  • Normal Order Strategy: The leftmost, outermost redex is always reduced first. The reduction relation is then a partial function: each term steps to at most one term, so every evaluation yields a unique result.
  • Call by Name: In addition to the outermost redex being reduced first, no evaluation is allowed inside abstractions.
  • Call by Need: A reduction relation on abstract syntax graph, instead of abstract syntax trees.
  • Call by Value: Only outermost redexes are reduced, and a redex is reduced only when its right-hand side has already been reduced to a value. Most programming languages use this strategy, and it is the one used in the operational semantics below.

    3. Multiple Arguments

    Functions of several arguments are represented by currying, abbreviating nested abstractions:

    \displaystyle  \lambda {x} \thickspace {y.} \thickspace {s} = \lambda x. \thickspace \lambda y. \thickspace s \ \ \ \ \ (4)

    4. Church Booleans

    \displaystyle  {tru} \thickspace = \lambda {t.} \thickspace \lambda {f.} {t} \\ {fls} = \lambda {t.} \lambda {f.} {f} \\ {test} = \lambda {l.} \lambda {m.} \lambda {n.} {l m n} \\ {test \thickspace b \thickspace v \thickspace u} = {b \thickspace v \thickspace u} \\ {and} = \lambda {b.} \lambda {c.} {b \thickspace c \thickspace fls} \ \ \ \ \ (5)

    5. Pairs

    \displaystyle  {pair} = \lambda {f.} \lambda {s.} \lambda {b.} {b \thickspace f \thickspace s} \\ {fst} = \lambda {p.} {p \thickspace tru} \\ {snd} = \lambda {p.} {p \thickspace fls} \\ {fst \thickspace (pair \thickspace v \thickspace w)} = (\lambda {p.} {p \thickspace tru})((\lambda {f.} \lambda {s.} \lambda {b.} {b \thickspace f \thickspace s}) { v \thickspace w}) \\ = (\lambda {p.} {p \thickspace tru})((\lambda {s.} \lambda {b.} {b \thickspace v \thickspace s}) { w}) \\ = (\lambda {p.} {p \thickspace tru})(\lambda {b.} {b \thickspace v \thickspace w}) \\ = (\lambda {b.} {b v w}) { tru} \\ = ({tru \thickspace v \thickspace w}) \\ = {v} \ \ \ \ \ (6)
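    The pair encoding can be checked the same way; a Python sketch (illustrative only, with strings standing in for the terms {v} and {w}):

```python
# Church pairs: pair packages f and s; applying the pair to a Church
# boolean selects one component.
tru = lambda t: lambda f: t
fls = lambda t: lambda f: f
pair = lambda f: lambda s: lambda b: b(f)(s)   # pair = λf. λs. λb. b f s
fst = lambda p: p(tru)                         # fst  = λp. p tru
snd = lambda p: p(fls)                         # snd  = λp. p fls

p = pair("v")("w")
print(fst(p))  # v
print(snd(p))  # w
```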

    6. Church Numerals

    \displaystyle  {c}_0 = \lambda {s.} \lambda {z.} {z} \\ {c}_1 = \lambda {s.} \lambda {z.} {s \thickspace z} \\ {c}_2 = \lambda {s.} \lambda {z.} {s \thickspace (s \thickspace z)} \\ {c}_3 = \lambda {s.} \lambda {z.} {s \thickspace (s \thickspace (s \thickspace z))} \\ {scc} = \lambda {n.} \lambda {s.} \lambda {z.} {s \thickspace (n \thickspace s \thickspace z)}. \ \ \ \ \ (7)

    A number n is encoded by the action of applying a function s to z n times.
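    This is easy to check mechanically. In the Python sketch below, `to_int` (a helper name of my own) recovers an ordinary integer by instantiating {s} as successor and {z} as 0:

```python
# Church numerals: c_n applies s to z exactly n times.
c0 = lambda s: lambda z: z
c1 = lambda s: lambda z: s(z)
c2 = lambda s: lambda z: s(s(z))
scc = lambda n: lambda s: lambda z: s(n(s)(z))  # scc = λn. λs. λz. s (n s z)

to_int = lambda n: n(lambda x: x + 1)(0)  # instantiate s := (+1), z := 0
print(to_int(c2))       # 2
print(to_int(scc(c2)))  # 3
```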

    7. Recursion

    \displaystyle  {omega} = (\lambda {x.} {x \thickspace x})(\lambda {x.} {x \thickspace x}) \\ {fix} = \lambda {f.} ( \lambda {x.} {f} ( \lambda {y.} {x \thickspace x \thickspace y} ) ) ( \lambda {x.} {f} ( \lambda {y.} {x \thickspace x \thickspace y} ) ) \ \ \ \ \ (8)

    \displaystyle  {factorial} = {body \thickspace containing \thickspace factorial} \\ {g} = \lambda {fct.} \thickspace {body \thickspace containing \thickspace fct} \\ {g} = \lambda {fct.} \thickspace \lambda {n.} \thickspace {if \thickspace (cnd) \thickspace then \thickspace (c1) \thickspace else \thickspace (times \thickspace n \thickspace (fct \thickspace (prd \thickspace n)))} \\ {factorial} = {fix \thickspace g} \\ {h1} = \lambda {x.} \thickspace {g \thickspace (} \lambda {y.} \thickspace {x \thickspace x \thickspace y)} \\ {fix \thickspace g} = {h1 \thickspace h1} \\ {factorial \thickspace c_3} = {fix \thickspace g \thickspace c}_3 = {h1 \thickspace h1 \thickspace c}_3 \\ = {g \thickspace (} \lambda {y.} \thickspace {h1 \thickspace h1 \thickspace y) \thickspace c}_3, \thickspace \text{so inside the body} \thickspace {fct} = \lambda {y.} \thickspace {h1 \thickspace h1 \thickspace y} \ \ \ \ \ (9)

    factorial = fix (lambda fct. body containing fct)
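    Since {fix} is the call-by-value fixed-point combinator, it also works in a strict language such as Python. A sketch in which native numbers, `if`, `*` and `- 1` stand in for cnd, c1, times and prd (an abbreviation of mine, not the full Church encoding):

```python
# fix = λf. (λx. f (λy. x x y)) (λx. f (λy. x x y))
fix = lambda f: (lambda x: f(lambda y: x(x)(y)))(lambda x: f(lambda y: x(x)(y)))

# g = λfct. λn. if n = 0 then 1 else n * fct (n - 1)
g = lambda fct: lambda n: 1 if n == 0 else n * fct(n - 1)

factorial = fix(g)
print(factorial(3))  # 6
```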

    8. Formalities

    Definition 3 The set of terms is the smallest set such that

    1. If {x \in \mathcal{V}}, then {x \in \mathcal{T}}.
    2. If {x \in \mathcal{V}} and {t \in \mathcal{T}}, then {\lambda {x.t} \in \mathcal{T}}.
    3. If {{u, v} \in \mathcal{T}}, then {{u v} \in \mathcal{T}}.

    Definition 4 The set of free variables of a term {t}, written as {FV(t)} is defined as the set

    1. {FV(x) = \{x\}}
    2. {FV(\lambda x.t) = FV(t) \setminus \{x\}}
    3. {FV(t_1 t_2) = FV(t_1) \cup FV(t_2)}.
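    Definition 4 translates directly into a recursive function. A Python sketch over terms encoded as nested tuples `('var', x)`, `('lam', x, t)`, `('app', t1, t2)` (the encoding is mine, chosen for brevity):

```python
# FV(t): the free variables of a term.
def fv(t):
    tag = t[0]
    if tag == 'var':
        return {t[1]}                 # FV(x) = {x}
    if tag == 'lam':
        return fv(t[2]) - {t[1]}      # FV(λx.t) = FV(t) \ {x}
    return fv(t[1]) | fv(t[2])        # FV(t1 t2) = FV(t1) ∪ FV(t2)

term = ('lam', 'x', ('app', ('var', 'x'), ('var', 'y')))  # λx. x y
print(fv(term))  # {'y'}
```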

    Definition 5 Substitution:

    \displaystyle  [x \mapsto s]\thinspace x = s \ \ \ \ \ (10)

    \displaystyle  [x \mapsto s]\thinspace y = y \quad \text{if } y \neq x \ \ \ \ \ (11)

    \displaystyle  [x \mapsto s](\lambda y.t) = \lambda y.\thinspace [x \mapsto s]\thinspace t \quad \text{if } y \neq x \text{ and } y \notin FV(s) \ \ \ \ \ (12)

    \displaystyle  [x \mapsto s](t_1 \thinspace t_2) = ([x \mapsto s]\thinspace t_1) \thinspace ([x \mapsto s]\thinspace t_2) \ \ \ \ \ (13)
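    The substitution clauses can likewise be transcribed into Python over tuple-encoded terms. This sketch simply refuses the λ case when the side condition ({y \neq x}, {y \notin FV(s)}) fails; a complete implementation would α-rename first:

```python
def fv(t):
    if t[0] == 'var':
        return {t[1]}
    if t[0] == 'lam':
        return fv(t[2]) - {t[1]}
    return fv(t[1]) | fv(t[2])

# subst(x, s, t) computes [x ↦ s] t.
def subst(x, s, t):
    if t[0] == 'var':
        return s if t[1] == x else t          # [x ↦ s] x = s ; [x ↦ s] y = y
    if t[0] == 'lam':
        y, body = t[1], t[2]
        if y == x or y in fv(s):              # side condition violated
            raise ValueError("alpha-rename first")
        return ('lam', y, subst(x, s, body))
    return ('app', subst(x, s, t[1]), subst(x, s, t[2]))

print(subst('x', ('var', 'z'), ('lam', 'y', ('var', 'x'))))
# ('lam', 'y', ('var', 'z'))
```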

    Syntax:

    \displaystyle  {t} := \thickspace \text{terms} \\ := {x} \thickspace \text{variable} \\ := \lambda {x.t} \thickspace \text{lambda abstraction} \\ := {t \thickspace t} \thickspace \text{application} \\ {v} := \thickspace \text{values} \\ := \lambda{x.t} \thickspace \text{abstraction value} \\ \ \ \ \ \ (14)

    Evaluation:

    \displaystyle  \frac{ t_1 \thickspace \rightarrow \thickspace t'_1 }{ t_1\thinspace t_2 \rightarrow t'_1\thinspace t_2 } \text{E-App1} \\ \frac{ t_2 \rightarrow t'_2 }{ v_1\thinspace t_2 \rightarrow v_1\thinspace t'_2 } \text{E-App2} \\ (\lambda x.t_1)\thinspace v_2 \rightarrow [x \mapsto v_2]t_1 \text{E-AppAbs} \ \ \ \ \ (15)
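    The three rules correspond line-for-line to a small-step call-by-value interpreter. A Python sketch on tuple-encoded terms (capture is ignored in `subst`, which is safe for the closed example below):

```python
def subst(x, s, t):                    # naive [x ↦ s] t, no α-renaming
    if t[0] == 'var':
        return s if t[1] == x else t
    if t[0] == 'lam':
        return t if t[1] == x else ('lam', t[1], subst(x, s, t[2]))
    return ('app', subst(x, s, t[1]), subst(x, s, t[2]))

def is_value(t):                       # the only values are abstractions
    return t[0] == 'lam'

def step(t):                           # one step of →, or None if normal
    if t[0] != 'app':
        return None
    t1, t2 = t[1], t[2]
    if not is_value(t1):
        s1 = step(t1)
        return ('app', s1, t2) if s1 else None       # E-App1
    if not is_value(t2):
        s2 = step(t2)
        return ('app', t1, s2) if s2 else None       # E-App2
    return subst(t1[1], t2, t1[2])                   # E-AppAbs

ident = ('lam', 'x', ('var', 'x'))
print(step(('app', ident, ident)))  # ('lam', 'x', ('var', 'x'))
```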

  • Type Theory – Mathematical Preliminaries

    1. Sets, Relations and Functions

    Definition 1 Countable Set: A set is countable if it can be placed in one-to-one correspondence with the set of natural numbers {\mathbb{N}}.

    Definition 2 n-place Relation: An n-place relation on a collection of sets {\mathcal{S}_1, \mathcal{S}_2, \dotsc, \mathcal{S}_n} is a set {R} of tuples {R \subseteq \mathcal{S}_1 \times \mathcal{S}_2 \times \dotsc \times \mathcal{S}_n}. We say that elements {s_1 \in \mathcal{S}_1} through {s_n \in \mathcal{S}_n} are related by {R} if {(s_1, s_2, \dotsc, s_n) \in R}.

    Definition 3 Predicates: A one place relation is also called a predicate. One-place relation on a set {\mathcal{S}} is a subset {P} of {\mathcal{S}}. It is written as {P(s)} instead of {s \in P}.

    Definition 4 Binary Relation: A two place relation on sets {\mathcal{S}} and {\mathcal{T}} is also called a binary relation on the two sets. It is written as {s} {R} {t} instead of {(s, t) \in R}.

    Three or more place relations are written using mixfix syntax, with the elements separated by symbols. For example, {\Gamma \vdash \mathsf{s} : \mathsf{T}} stands for the triple {(\Gamma, \mathsf{s}, \mathsf{T})}.

    Definition 5 Domain: The domain of a relation is the set of elements from the first set which are present in the tuples in the relation set.

    Definition 6 Range: The range of a relation is the set of elements from the second set which are present in the tuples in the relation set.

    Definition 7 Partial Function: A binary relation {R} is a partial function if, whenever {(s, t_1) \in R} and {(s, t_2) \in R}, then, we have {t_1 = t_2 }.

    Definition 8 Total Function: A partial function {R} from {\mathcal{S}} to {\mathcal{T}} is also a total function if its domain is the whole of {\mathcal{S}}.

    Definition 9 A partial function {f} from {\mathcal{S}} to {\mathcal{T}} is said to be defined on an element {s \in \mathcal{S}} if {(s, t) \in f} for some {t \in \mathcal{T}}. Otherwise, we write {f(s)\uparrow} or {f(s) = \uparrow} to mean “{f} is undefined on {s}”.

    2. Ordered Sets

    Definition 10 Reflexive Relation: A binary relation {R} on {\mathcal{S}} is said to be reflexive if {(s, s) \in R} for all {s \in \mathcal{S}}.

    Reflexiveness only makes sense for binary relations defined on a single set.

    Definition 11 Symmetric Relation: A binary relation {R} on {\mathcal{S}} is said to be symmetric if {(s, t) \in R \Rightarrow (t, s) \in R} for all {s, t \in \mathcal{S}}.

    Definition 12 Transitive Relation: A binary relation {R} on {\mathcal{S}} is said to be transitive if {(s, t) \text{ and } (t, u) \in R \Rightarrow (s, u) \in R} for all {s, t, u \in \mathcal{S}}.

    Definition 13 Anti-Symmetric Relation: A binary relation {R} on {\mathcal{S}} is said to be anti-symmetric if {(s, t) \in R \text{ and } (t, s) \in R \Rightarrow s = t} for all {s, t \in \mathcal{S}}.

    Definition 14 Preorder: A reflexive and transitive relation on a set is called a preorder on that set. Preorders are usually written using symbols such as {\leq} or {\sqsubseteq}.

    Definition 15 Partial Order: A preorder which is also anti-symmetric is called a partial order.

    Definition 16 Total Order: A partial order with the property that for every two elements {s, t \in \mathcal{S}} we have either {s \leq t} or {t \leq s} is called a total order.

    Definition 17 Join of two elements: The join of two elements {s, t \in \mathcal{S}}, if it exists, is their least upper bound with respect to the partial order: the smallest element {j} such that {s \leq j} and {t \leq j}.

    Definition 18 Meet of two elements: The meet of two elements {s, t \in \mathcal{S}}, if it exists, is their greatest lower bound with respect to the partial order: the largest element {m} such that {m \leq s} and {m \leq t}.

    Definition 19 Equivalence Relation: A binary relation on a set is called an equivalence relation if it is reflexive, symmetric and transitive.

    Definition 20 Reflexive Closure: A reflexive closure of a binary relation {R} on a set {\mathcal{S}} is the smallest reflexive relation {R'} which contains {R}. (Smallest in the sense that if some other reflexive relation {R''} also contains {R}, then {R' \subseteq R''}.)

    Definition 21 Transitive Closure: A transitive closure of a binary relation {R} on a set {\mathcal{S}} is the smallest transitive relation {R'} which contains {R}. (Smallest in the sense that if some other transitive relation {R''} also contains {R}, then {R' \subseteq R''}.) It is written as {R^{+}}.
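    For a finite relation the transitive closure can be computed by iterating composition to a fixed point; a small illustrative Python sketch (not from the notes):

```python
# R+ : keep adding pairs (a, d) whenever (a, b) and (b, d) are present.
def transitive_closure(r):
    closure = set(r)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

print(transitive_closure({(1, 2), (2, 3)}))  # the closure also contains (1, 3)
```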

    Definition 22 Sequence: A sequence of elements of a set {\mathcal{S}} is a (finite or infinite) list {s_1, s_2, \dotsc} of elements of {\mathcal{S}}.

    Definition 23 Permutation: A permutation of a sequence is a sequence containing the same elements, possibly in a different order.

    Definition 24 Decreasing Chain: A decreasing chain with respect to a preorder {\leq} on {\mathcal{S}} is a sequence {s_1, s_2, \dotsc} of elements of {\mathcal{S}} such that each {s_{i+1}} is strictly below {s_i}, that is, {s_{i+1} \leq s_i} and {s_{i+1} \neq s_i}.

    Definition 25 Well Founded Preorder: A preorder {\leq} on {\mathcal{S}} is said to be well founded if {\mathcal{S}} contains no infinite decreasing chains.

    3. Induction

    Axiom (Principle of Ordinary Induction) Suppose that {P} is a predicate defined on natural numbers. Then:

    If {P(0)}

    and, for all {i}, {P(i)} implies {P(i + 1)},

    then, {P(n)} holds for all {n \in \mathbb{N}}.

    Hypothesis Testing and p-values

    1. Introduction

    Hypothesis testing is a method of inference.

    Definition 1 A hypothesis is a statement about a population parameter.

    Definition 2 Null and Alternate Hypothesis: We partition the parameter space {\Theta} into two disjoint sets {\Theta_0} and {\Theta_1} and we wish to test:

    \displaystyle  H_0 \colon \theta \in \Theta_0 \\ H_1 \colon \theta \in \Theta_1. \ \ \ \ \ (1)

    We call {H_0} the null hypothesis and {H_1} the alternate hypothesis.

    Definition 3 Rejection Region: Let {X} be a random variable and let {\mathcal{X}} be its range. Let {R \subset \mathcal{X}} be the rejection region.

    We accept {H_0} when {X} does not belong to {R} and reject when it does.

    Hypothesis testing is like a legal trial. We accept {H_0} unless the evidence suggests otherwise. Falsely convicting the accused when he is not guilty is a type I error; letting the accused go free when he is in fact guilty is a type II error.

    Definition 4 Power Function of a Test: It is the probability of {X} being in the rejection region, expressed as a function of {\theta}.

    \displaystyle  \beta(\theta) = \mathbb{P}_{\theta}(X \in R) \ \ \ \ \ (2)

    Definition 5 Size of a Test: It is the maximum of the power function of a test when {\theta} is restricted to the {\Theta_0} parameter space.

    \displaystyle  \alpha = \underset{\theta \in \Theta_0}{\text{sup}}(\beta(\theta)). \ \ \ \ \ (3)

    Definition 6 Level {\alpha} Test: A test with size less than or equal to {\alpha} is said to be a level {\alpha} test.

    Definition 7 Simple Hypothesis: A hypothesis of the form {\theta = \theta_0} is called a simple hypothesis.

    Definition 8 Composite Hypothesis: A hypothesis of the form {\theta < \theta_0} or {\theta > \theta_0} is called a composite hypothesis.

    Definition 9 Two Sided Test: A test of the form

    \displaystyle  H_0 \colon \theta = \theta_0 \\ H_1 \colon \theta \neq \theta_0. \ \ \ \ \ (4)

    is called a two-sided test.

    Definition 10 One Sided Test: A test of the form

    \displaystyle  H_0 \colon \theta < \theta_0 \\ H_1 \colon \theta > \theta_0. \ \ \ \ \ (5)

    or

    \displaystyle  H_0 \colon \theta > \theta_0 \\ H_1 \colon \theta < \theta_0. \ \ \ \ \ (6)

    is called a one-sided test.

    Example 1 (Hypothesis Testing on Normal Distribution:) Let {X_1, \dotsc, X_n \sim N(\mu, \sigma^2)} where {\sigma} is known. We want to test {H_0 \colon \mu \leq 0} versus {H_1 \colon \mu > 0}. Hence {\Theta_0 = (- \infty, 0]} and {\Theta_1 = ( 0, \infty)}.

    Consider the test:

    \displaystyle  \text{reject } H_0 \text{ if } T > c \ \ \ \ \ (7)

    where {T = \overline{X}}.

    Then the rejection region is

    \displaystyle  R = \{(x_1, \dotsc, x_n) : \overline{X} > c \} \ \ \ \ \ (8)

    The power function of the test is

    \displaystyle  \beta(\mu) = \mathbb{P}_{\mu}(X \in R) \\ = \mathbb{P}_{\mu}(\overline{X} > c) \\ = \mathbb{P}_{\mu}\left( \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) \\ = \mathbb{P}_{\mu}\left( Z > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) \\ = 1 - \Phi\left(\frac{\sqrt{n}(c - \mu)}{\sigma}\right). \ \ \ \ \ (9)

    The size of the test is

    \displaystyle  \text{size} = \underset{\mu \leq 0}{\text{sup}}(\beta(\mu)) \\ = \underset{\mu \leq 0}{\text{sup}}\left(1 - \Phi\left(\frac{\sqrt{n}(c - \mu)}{\sigma}\right)\right) \\ = \beta(0) \\ = 1 - \Phi\left(\frac{c\sqrt{n}}{\sigma}\right). \ \ \ \ \ (10)

    Equating with {\alpha} we obtain

    \displaystyle  \alpha = 1 - \Phi\left(\frac{c\sqrt{n}}{\sigma}\right). \ \ \ \ \ (11)

    Hence

    \displaystyle  c = \frac{\sigma \thinspace \Phi^{-1}(1 - \alpha)}{\sqrt{n}}. \ \ \ \ \ (12)

    We reject when {\overline{X} > c}. For a test of size {\alpha = 0.05}, {\Phi^{-1}(0.95) \approx 1.645}, so

    \displaystyle  c = \frac{1.645 \thinspace \sigma}{\sqrt{n}}. \ \ \ \ \ (13)

    that is, we reject when

    \displaystyle  \overline{X} > \frac{1.645 \thinspace \sigma}{\sqrt{n}}. \ \ \ \ \ (14)
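    Equation (12), {c = \sigma \Phi^{-1}(1 - \alpha)/\sqrt{n}}, is easy to verify numerically with the standard library's normal distribution; a sketch with illustrative values {\sigma = 1}, {n = 100}:

```python
from statistics import NormalDist

alpha, sigma, n = 0.05, 1.0, 100
# c = σ Φ⁻¹(1 − α) / √n
c = sigma * NormalDist().inv_cdf(1 - alpha) / n ** 0.5
print(round(c, 4))   # 0.1645, since Φ⁻¹(0.95) ≈ 1.645

# The size β(0) = 1 − Φ(c √n / σ) recovers α:
size = 1 - NormalDist().cdf(c * n ** 0.5 / sigma)
print(round(size, 4))  # 0.05
```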

    2. The Wald Test

    Let {X_1, \dotsc, X_n} be \textsc{iid} random variables whose distribution function {F_X(x; \theta)} depends on a parameter {\theta}. Let {\hat{\theta}} be an estimate of {\theta} and let {\widehat{\textsf{se}}} be the estimated standard error of {\hat{\theta}}.

    Definition 11 (The Wald Test)

    Consider testing a two-sided hypothesis:

    \displaystyle  H_0 \colon \theta = \theta_0 \\ H_1 \colon \theta \neq \theta_0. \ \ \ \ \ (15)

    Assume that {\hat{\theta}} has an asymptotically normal distribution.

    \displaystyle  \frac{\hat { \theta } - \theta }{\widehat{\textsf{se}} } \rightsquigarrow N(0, 1). \ \ \ \ \ (16)

    Then, the size {\alpha} Wald test is: reject {H_0} if {|W| > z_{\alpha/2}}, where

    \displaystyle  W = \frac{\hat { \theta } - \theta_0 }{\widehat{\textsf{se}}}. \ \ \ \ \ (17)

    Theorem 12 Asymptotically, the Wald test has size {\alpha}.

    \displaystyle  \text{size} = \underset{\theta \in \Theta_0}{\text{sup}}(\beta(\theta)) \\ = \underset{\theta \in \Theta_0}{\text{sup}}\mathbb{P}_{\theta}(X \in R) \\ = \mathbb{P}_{\theta_0}(X \in R) \\ = \mathbb{P}_{\theta_0}(|W| > z_{\alpha/2}) \\ = \mathbb{P}_{\theta_0}\left(\left| \frac{\hat { \theta } - \theta_0 }{\widehat{\textsf{se}}}\right| > z_{\alpha/2}\right) \\ \rightarrow \mathbb{P}(|Z| > z_{\alpha/2}) \\ = \alpha. \ \ \ \ \ (18)

    Example 2 Two experiments are conducted to test two prediction algorithms.

    The prediction algorithms are used to predict the outcomes {n} and {m} times, respectively, and have success probabilities {p_1} and {p_2}, respectively.

    Let {\delta = p_1 - p_2}.

    Consider testing a two-sided hypothesis:

    \displaystyle  H_0 \colon \delta = 0 \\ H_1 \colon \delta \neq 0. \ \ \ \ \ (19)
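    The example stops short of the statistic itself; one natural way to complete it (an assumption on my part, using the usual plug-in standard error for a difference of proportions) is sketched below:

```python
from statistics import NormalDist

# Wald test for H0: δ = p1 − p2 = 0, with the plug-in standard error
# se = sqrt(p̂1(1 − p̂1)/n + p̂2(1 − p̂2)/m).
def wald_two_proportions(k1, n, k2, m, alpha=0.05):
    p1, p2 = k1 / n, k2 / m
    delta = p1 - p2
    se = (p1 * (1 - p1) / n + p2 * (1 - p2) / m) ** 0.5
    w = delta / se
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return w, abs(w) > z            # reject H0 when |W| > z_{α/2}

# e.g. 60/100 successes for one algorithm vs 40/100 for the other:
w, reject = wald_two_proportions(60, 100, 40, 100)
print(round(w, 3), reject)
```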

    3. The Likelihood Ratio Test

    This test can be used to test vector valued parameters as well.

    Definition 13 The likelihood ratio test statistic for testing {H_0 \colon \theta \in \Theta_0} versus {H_1 \colon \theta \in \Theta_1} is

    \displaystyle  \lambda(x) = \frac{ \underset{\theta \in \Theta_0}{\text{sup}} (L( \theta|\mathbf{x} )) }{ \underset{\theta \in \Theta}{\text{sup}} (L( \theta|\mathbf{x} )) }. \ \ \ \ \ (20)

    Parametric Inference

    There are two methods of estimating {\theta}.

    1. Method of Moments

    It is a method of generating parametric estimators. These estimators are not optimal but they are easy to compute. They are also used to generate starting values for other numerical parametric estimation methods.

    Definition 1 Moments and Sample Moments:

    Suppose that the parameter {\theta} has {k} components: {\theta = (\theta_1,\dotsc,\theta_k)}. For {1 \leq j \leq k},

    Define {j^{th}} moment as

    \displaystyle  \alpha_j \equiv \alpha_j(\theta) = \mathbb{E}_\theta(X^j) = \int \mathrm{x}^{j}\,\mathrm{d}F_{\theta}(x). \ \ \ \ \ (1)

    Define {j^{th}} sample moment as

    \displaystyle  \hat{\alpha_j} = \frac{1}{n}\sum_{i=1}^n X_i^j. \ \ \ \ \ (2)

    Definition 2

    The method of moments estimator {\hat{\theta_n}} is the value of {\theta} which satisfies

    \displaystyle  \alpha_1(\hat{\theta_n}) = \hat{\alpha_1} \\ \alpha_2(\hat{\theta_n}) = \hat{\alpha_2} \\ \vdots \\ \alpha_k(\hat{\theta_n}) = \hat{\alpha_k} \ \ \ \ \ (3)

    Why The Above Method Works: The method of moments estimator is obtained by equating the {j^{th}} moment with the {j^{th}} sample moment. Since there are k of them we get k equations in k unknowns (the unknowns are the k parameters). This works because we can find out the {j^{th}} moments in terms of the unknown parameters and we can find the {j^{th}} sample moments numerically, since we know the sample values.
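    As a concrete sketch for {N(\mu, \sigma^2)} (a standard example, stated here as an illustration): the first two moments are {\alpha_1 = \mu} and {\alpha_2 = \sigma^2 + \mu^2}, so solving the two sample-moment equations gives {\hat{\mu} = \hat{\alpha}_1} and {\hat{\sigma}^2 = \hat{\alpha}_2 - \hat{\alpha}_1^2}:

```python
# Method of moments for N(μ, σ²) from a sample xs.
def mom_normal(xs):
    n = len(xs)
    a1 = sum(xs) / n                  # first sample moment
    a2 = sum(x * x for x in xs) / n   # second sample moment
    return a1, a2 - a1 * a1           # μ̂ = α̂1,  σ̂² = α̂2 − α̂1²

mu_hat, var_hat = mom_normal([2.0, 4.0, 6.0])
print(mu_hat, var_hat)  # 4.0 and ≈ 2.667
```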

    2. Maximum Likelihood Method

    It is the most common method for estimating parameters in a parametric model.

    Definition 3 Likelihood Function: Let {X_1, \dotsc, X_n} have a \textsc{pdf} { f_X(x;\theta)}. The likelihood function is defined as

    \displaystyle  \mathcal{L}_n(\theta) = \prod_{i=1}^n f_X(X_i;\theta). \ \ \ \ \ (4)

    The log-likelihood function is defined as {\ell_n(\theta) = \log(\mathcal{L}_n(\theta))}.

    The likelihood function is the joint density of the data. We treat it as a function of the parameter {\theta}. Thus {\mathcal{L}_n \colon \Theta \rightarrow [0, \infty)}.

    Definition 4 Maximum Likelihood Estimator: It is the value of {\theta}, {\hat{\theta}} which maximizes the likelihood function.

    Example 1 Let {X_1, \dotsc, X_n} be \textsc{iid} random variables with {\text{Unif}(0, \theta)} distribution.

    \displaystyle  f_X(x; \theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (5)

    If {X_{max} = \max\{X_1, \dotsc, X_n \}} and {X_{max} > \theta}, then {\mathcal{L}_n(\theta) = 0}. Otherwise {\mathcal{L}_n(\theta) = (\frac{1}{\theta})^n}, which is a decreasing function of {\theta}. Hence {\hat{\theta} = \underset{\theta}{\text{argmax}} \thinspace \mathcal{L}_n(\theta) = X_{max}}.
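    A quick numerical check that the likelihood is maximised at {X_{max}} (the sample and candidate values below are made up for illustration):

```python
# L_n(θ) for Unif(0, θ): zero if any observation exceeds θ, else (1/θ)^n.
def likelihood(theta, xs):
    return 0.0 if max(xs) > theta else theta ** -len(xs)

xs = [0.3, 0.9, 0.5]
thetas = [0.5, 0.9, 1.2, 2.0]
best = max(thetas, key=lambda t: likelihood(t, xs))
print(best)  # 0.9, which is max(xs)
```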

    3. Properties of MLE

  • Consistent: The MLE is consistent: the estimate converges to the true value in probability.
  • Equivariant: The MLE is equivariant: functions of the estimate are the MLE of the corresponding functions of the true parameter.
  • Asymptotically Normal: The MLE is asymptotically normal.
  • Asymptotically Optimal: It has, asymptotically, the smallest variance among all well-behaved estimators.
  • Bayes Estimator: The MLE is also approximately the Bayes estimator.

    4. Consistency of MLE

    Definition 5 Kullback Leibler Distance:
    If {f} and {g} are \textsc{pdf}, the Kullback Leibler distance between them is defined as

    \displaystyle  D(f, g) = \int f(x) \log \left( \frac{f(x) }{g(x) } \right) dx. \ \ \ \ \ (6)

    5. Equivariance of MLE

    6. Asymptotic Normality of MLE

    The distribution of {\hat{\theta}} is asymptotically normal. We need the following definitions to prove it.

    Definition 6 Score Function: Let {X} be a random variable with \textsc{pdf} {f_X(x; \theta)}. Then the score function is defined as

    \displaystyle  s(X; \theta) = \frac{\partial \log f_X(x; \theta) }{\partial \theta}. \ \ \ \ \ (7)

    Definition 7 Fisher Information: The Fisher Information is defined as

    \displaystyle  I_n(\theta) = \mathbb{V}_{\theta}\left( \sum_{i=1}^n s(X_i; \theta) \right) \\ = \sum_{i=1}^n \mathbb{V}_{\theta}\left(s(X_i; \theta) \right). \ \ \ \ \ (8)

    Theorem 8

    \displaystyle  \mathbb{E}_{\theta}(s(X; \theta)) = 0. \ \ \ \ \ (9)

    Theorem 9

    \displaystyle  \mathbb{V}_{\theta}(s(X; \theta)) = \mathbb{E}_{\theta}(s^2(X; \theta)). \ \ \ \ \ (10)

    Theorem 10

    \displaystyle  I_n(\theta) = nI(\theta). \ \ \ \ \ (11)

    Theorem 11

    \displaystyle  I(\theta) = -\mathbb{E}_{\theta}\left( \frac{\partial^2 \log f_X(x; \theta) }{\partial \theta^2} \right) \\ = -\int \left( \frac{\partial^2 \log f_X(x; \theta) }{\partial \theta^2} \right)f_X(x; \theta) dx. \ \ \ \ \ (12)

    Definition 12 Let { \textsf{se} = \sqrt{\mathbb{V}(\hat{\theta} ) } }.

    Theorem 13

    \displaystyle  \textsf{se} \approx \sqrt{1/I_n(\theta)}. \ \ \ \ \ (13)

    Theorem 14

    \displaystyle  \frac{\hat { \theta } - \theta }{\textsf{se} } \rightsquigarrow N(0, 1). \ \ \ \ \ (14)

    Theorem 15 Let { \hat{\textsf{se}} = \sqrt{1/I_n(\hat{\theta})} }.

    \displaystyle  \frac{\hat { \theta } - \theta }{\hat{\textsf{se}}} \rightsquigarrow N(0, 1). \ \ \ \ \ (15)

    A {1 - \alpha} confidence interval for {\theta} is an interval {C_n = (a, b)}, where {a} and {b} are functions of the data, such that {\mathbb{P}_{\theta}(\theta \in C_n) \geq 1 - \alpha}. In words, {(a, b)} traps {\theta} with probability {1 - \alpha}. We call {1 - \alpha} the coverage of the confidence interval. {C_n} is random and {\theta} is fixed. Commonly, people use 95 percent confidence intervals, which corresponds to choosing {\alpha = 0.05}. If {\theta} is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

    Theorem 16 (Normal Based Confidence Intervals)

    Let {\hat{\theta} \approx N(\theta,\hat{\textsf{se}}^2 )}.

    and let {\Phi} be the \textsc{cdf} of a random variable {Z} with standard normal distribution and

    \displaystyle  z_{\alpha / 2} = \Phi^{- 1 } ( 1 - \alpha / 2) \\ \mathbb{P}(Z > z _{\alpha / 2} ) = \alpha / 2 \\ \mathbb{P}(- z _{\alpha / 2} < Z < z _{\alpha / 2}) = 1 - \alpha. \ \ \ \ \ (16)

    and let

    \displaystyle  C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right) \ \ \ \ \ (17)

    Then,

    \displaystyle  \mathbb{P}_{\theta} (\theta \in C _n ) \rightarrow 1 - \alpha. \ \ \ \ \ (18)

    For 95% confidence intervals { 1 - \alpha} is .95, {\alpha} is .05, {z _{\alpha / 2}} is 1.96 and the interval is thus { C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right) = \left( \hat{\theta} - 1.96\hat{\textsf{se}} , \hat{\theta} + 1.96\hat{\textsf{se}} \right) } .
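    The normal-based interval is a two-liner; a sketch for a Bernoulli mean with the plug-in standard error (the data values are made up for illustration):

```python
from statistics import NormalDist

# C_n = (θ̂ − z_{α/2} ŝe, θ̂ + z_{α/2} ŝe)
def normal_ci(theta_hat, se_hat, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{α/2} ≈ 1.96 for α = 0.05
    return theta_hat - z * se_hat, theta_hat + z * se_hat

p_hat = 0.55                                  # e.g. 55 successes in 100 trials
se_hat = (p_hat * (1 - p_hat) / 100) ** 0.5
lo, hi = normal_ci(p_hat, se_hat)
print(round(lo, 3), round(hi, 3))
```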

  • Sicp Exercise 3.23

    Exercise 3.23.  A deque (“double-ended queue”) is a sequence in which items can be inserted and deleted at either the front or the rear. Operations on deques are the constructor make-deque, the predicate empty-deque?, selectors front-deque and rear-deque, and mutators front-insert-deque!, rear-insert-deque!, front-delete-deque!, and rear-delete-deque!. Show how to represent deques using pairs, and give implementations of the operations. All operations should be accomplished in O(1) steps.

     

    (define (make-deque-cell)
      (cons (cons '() '())
            '()))

    (define (value-deque-cell cell)
      (car (car cell)))

    (define (prev-deque-cell cell)
      (cdr (car cell)))

    (define (next-deque-cell cell)
      (cdr cell))

    (define (set-value-cell! cell value)
      (set-car! (car cell) value)
      (newline)
      (display (list 'cell-value-has-been-set-to value))
      (newline))

    (define (set-prev-cell! cell prev)
      (set-cdr! (car cell) prev)
      (newline)
      (display (list 'cell-prev-has-been-set-to (value-deque-cell prev)))
      (newline))

    (define (set-next-cell! cell next)
      (set-cdr! cell next)
      (newline)
      (display (list 'cell-next-has-been-set-to (value-deque-cell next)))
      (newline))

    (define (print-cell cell)
      (value-deque-cell cell))

    (define (front-ptr deque)
      (car deque))

    (define (rear-ptr deque)
      (cdr deque))

    (define (set-front-ptr! deque item)
      (set-car! deque item))

    (define (set-rear-ptr! deque item)
      (set-cdr! deque item))

    (define (empty-deque? deque)
      (null? (front-ptr deque)))

    (define (make-deque) (cons '() '()))

    (define (front-deque deque)
      (if (empty-deque? deque)
          (error "FRONT called with an empty deque" deque)
          (value-deque-cell (front-ptr deque))))

    (define (print-deque deque)
      (let ((cell-iter (front-ptr deque)))
        (define (print-deque-at-start start-cell)
          (cond ((eq? (next-deque-cell cell-iter) cell-iter)
                 (newline)
                 (display (list 'printing (value-deque-cell cell-iter)))
                 (newline)
                 (display (list 'next-of-cell-iter-is (value-deque-cell (next-deque-cell cell-iter))))
                 (newline)
                 (display (list 'cell-iter-is (value-deque-cell cell-iter))))
                (else
                 (newline)
                 (display (list 'printing (value-deque-cell cell-iter)))
                 (newline)
                 (set! cell-iter (next-deque-cell cell-iter))
                 (newline)
                 (display (list 'cell-iter-is (value-deque-cell cell-iter)))
                 (newline)
                 (print-deque-at-start cell-iter))))
        (print-deque-at-start cell-iter)))

    (define (rear-insert-deque! deque item)
      (let ((new-cell (make-deque-cell)))
        (cond ((empty-deque? deque)
               (set-value-cell! new-cell item)
               (set-prev-cell! new-cell new-cell)
               (set-next-cell! new-cell new-cell)
               (set-front-ptr! deque new-cell)
               (set-rear-ptr! deque new-cell)
               (print-deque deque))
              (else
               (display 'queue-not-empty)
               (set-value-cell! new-cell item)
               'done-setting
               (set-prev-cell! new-cell (rear-ptr deque))
               (set-next-cell! new-cell new-cell)
               (newline)
               (display (list 'rear-ptr-of-deque-is (value-deque-cell (rear-ptr deque))))
               (newline)
               (display (list 'next-of-rear-ptr-of-deque-was (value-deque-cell (next-deque-cell (rear-ptr deque)))))
               (newline)
               (set-next-cell! (rear-ptr deque) new-cell)
               (display (list 'next-of-rear-ptr-of-deque-now-is (value-deque-cell (next-deque-cell (rear-ptr deque)))))
               (newline)
               (set-rear-ptr! deque new-cell)
               (display (list 'rear-ptr-of-deque-now-is (value-deque-cell (rear-ptr deque))))
               (newline)
               (print-deque deque)))))

    (define q1 (make-deque))
    (rear-insert-deque! q1 'a)
    (rear-insert-deque! q1 'b)
    (rear-insert-deque! q1 'c)
    (rear-insert-deque! q1 'd)

     

    Sicp Exercise 3.50

    Exercise 3.50.  Complete the following definition, which generalizes stream-map to allow procedures that take multiple arguments, analogous to map in section 2.2.3, footnote 12.

    (define (stream-map proc . argstreams)
      (if (<??> (car argstreams))
          the-empty-stream
          (<??> (apply proc (map <??> argstreams))
                (apply stream-map
                       (cons proc (map <??> argstreams))))))

     

    (define (stream-map proc . argstreams)
      (if (stream-null? (car argstreams))
          the-empty-stream
          (cons-stream
           (apply proc (map stream-car argstreams))
           (apply stream-map
                  (cons proc (map stream-cdr argstreams))))))

     

     

    Sicp Exercise 1.09

    Exercise 1.9. Each of the following two procedures defines a method for adding two positive integers in terms of the procedures inc, which increments its argument by 1, and dec, which decrements its argument by 1.

    (define (+ a b)
      (if (= a 0)
          b
          (inc (+ (dec a) b))))

    (define (+ a b)
      (if (= a 0)
          b
          (+ (dec a) (inc b))))

    Using the substitution model, illustrate the process generated by each procedure in evaluating (+ 4 5). Are these processes iterative or recursive?

    (define (+ a b)
      (if (= a 0)
          b
          (inc (+ (dec a) b))))

    (+ 4 5)
    (inc (+ (dec 4) 5))
    (inc (+ 3 5))
    (inc (inc (+ (dec 3) 5)))
    (inc (inc (+ 2 5)))
    (inc (inc (inc (+ (dec 2) 5))))
    (inc (inc (inc (+ 1 5))))
    (inc (inc (inc (inc (+ (dec 1) 5)))))
    (inc (inc (inc (inc (+ 0 5)))))
    (inc (inc (inc (inc 5))))
    (define (+ a b)
      (if (= a 0)
          b
          (+ (dec a) (inc b))))

    (+ 4 5)
    (+ (dec 4) (inc 5))
    (+ 3 6)
    (+ (dec 3) (inc 6))
    (+ 2 7)