Convergence of Random Variables

1. Introduction

There are two main ideas in this article.

  • Law of Large Numbers:
    This states that the sample mean {\overline{X}_n} converges in probability to the distribution mean {\mu} as {n} increases.

  • Central Limit Theorem:
    This states that the distribution of the sample mean converges in distribution to a normal distribution as {n} increases.

    2. Types of Convergence

    Let {X_1, \dotsc, X_n} be a sequence of random variables with distributions {F_n} and let {X} be a random variable with distribution {F}.

    Definition 1 Convergence in Probability: {X_n} converges to {X} in probability, written as {X_n \overset{P}{\longrightarrow} X}, if for every {\epsilon > 0}, we have

    \displaystyle  \mathbb{P}(|X_n - X| > \epsilon) \rightarrow 0 \ \ \ \ \ (1)

    as {n \rightarrow \infty}.

    Definition 2 Convergence in Distribution: {X_n} converges to {X} in distribution, written as {X_n \rightsquigarrow X}, if

    \displaystyle  \underset{n \rightarrow \infty}{\text{lim}} F_n(t) = F(t) \ \ \ \ \ (2)

    at all {t} for which {F} is continuous.

    3. The Law of Large Numbers

    Let {X_1, X_2, \dotsc} be \textsc{iid} with mean {\mu = \mathbb{E}(X_1)} and variance {\sigma^2 = \mathbb{V}(X_1)}. Let the sample mean be defined as {\overline{X}_n = (1/n)\sum_{i=1}^n X_i}. It can be shown that {\mathbb{E}(\overline{X}_n) = \mu} and {\mathbb{V}(\overline{X}_n) = \sigma^2/n}.

    Theorem 3 Weak Law of Large Numbers: If {X_1, \dotsc, X_n} are \textsc{iid} random variables, then {\overline{X}_n \overset{P}{\longrightarrow} \mu}.
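    The theorem is easy to see in simulation. A minimal sketch with Python's standard library, using Bernoulli({p}) draws as an illustrative choice (the function and variable names are mine):

```python
import random

random.seed(0)

def bernoulli_sample_mean(n, p=0.5):
    """Mean of n simulated Bernoulli(p) draws."""
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n

# The sample mean concentrates around mu = p as n grows.
means = {n: bernoulli_sample_mean(n) for n in (10, 1000, 100000)}
```

    For large {n}, the computed mean sits within a small neighbourhood of {\mu = p}, exactly as the theorem predicts.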

    4. The Central Limit Theorem

    The law of large numbers says that the distribution of the sample mean, {\overline{X}_n}, piles up near the true distribution mean, {\mu}. The central limit theorem further adds that the distribution of the sample mean approaches a Normal distribution as {n} gets large, and it also identifies the mean and the variance of that Normal distribution.

    Theorem 4 The Central Limit Theorem: Let {X_1, \dotsc, X_n} be \textsc{iid} random variables with mean {\mu} and standard deviation {\sigma}. Let the sample mean be defined as {\overline{X}_n = (1/n)\sum_{i=1}^n X_i}. Then the asymptotic behaviour of the distribution of the sample mean is given by

    \displaystyle  Z_n = \frac{ (\overline{X}_n - \mu) } { \sqrt{ \mathbb{V}( \overline{X}_n ) } } = \frac{ \sqrt{n}(\overline{X}_n - \mu) } { \sigma } \rightsquigarrow N(0,1) \ \ \ \ \ (3)

    Other equivalent ways of expressing the above equation are

    \displaystyle  Z_n \approx N(0, 1) \\ \overline{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \\ (\overline{X}_n - \mu) \approx N(0, \sigma^2/n) \\ \sqrt{n}(\overline{X}_n - \mu) \approx N(0,\sigma^2) \\ \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \approx N(0,1). \ \ \ \ \ (4)
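    The standardization in (3) can be checked by simulation. A sketch using Uniform(0, 1) draws (an illustrative choice; {\mu = 1/2} and {\sigma^2 = 1/12} for this distribution, and the names below are mine):

```python
import math
import random

random.seed(1)
n, reps = 50, 20000
mu, sigma = 0.5, math.sqrt(1 / 12)   # Uniform(0, 1): mean 1/2, variance 1/12

def z_n():
    """One draw of Z_n = sqrt(n) * (Xbar_n - mu) / sigma."""
    xbar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

zs = [z_n() for _ in range(reps)]
# If Z_n is approximately N(0, 1), about 95% of draws land in (-1.96, 1.96).
coverage = sum(1 for z in zs if -1.96 < z < 1.96) / reps
```

    The empirical coverage of the interval {(-1.96, 1.96)} comes out close to 0.95, the standard Normal value.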

    Definition 5 As has been defined in the Expectation chapter, if {X_1, \dotsc, X_n} are random variables, then we define the sample variance as

    \displaystyle  S_n^2 = \frac{1}{n - 1}\left(\sum_{i=1}^n (\overline{X}_n - X_i)^2\right). \ \ \ \ \ (5)

    Theorem 6 Assuming the conditions of the CLT,

    \displaystyle  \frac{\sqrt{n}(\overline{X}_n - \mu)}{S_n} \approx N(0,1). \ \ \ \ \ (6)

    5. The Delta Method

    Theorem 7 Let {Y_n} be a random variable satisfying the conditions of the CLT, and let {g(x)} be a differentiable function with {g'(\mu) \neq 0}. Then,

    \displaystyle  Y_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \\ \Longrightarrow \quad g(Y_n) \approx N\left(g(\mu), (g'(\mu))^2\frac{\sigma^2}{n}\right). \ \ \ \ \ (7)
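    A numerical sanity check of (7), with {Y_n} the mean of {n} Uniform(0, 1) draws and {g(x) = e^x} as an illustrative smooth function (all names below are mine):

```python
import math
import random

random.seed(2)
n, reps = 200, 20000
mu, sigma2 = 0.5, 1 / 12           # Uniform(0, 1): mean 1/2, variance 1/12

g = math.exp                        # g'(mu) = exp(mu), which is nonzero

vals = []
for _ in range(reps):
    ybar = sum(random.random() for _ in range(n)) / n
    vals.append(g(ybar))

mean_g = sum(vals) / reps
var_g = sum((v - mean_g) ** 2 for v in vals) / reps
# Delta-method prediction for the variance: (g'(mu))^2 * sigma^2 / n
predicted = math.exp(mu) ** 2 * sigma2 / n
```

    The simulated mean of {g(Y_n)} lands near {g(\mu) = e^{1/2}} and the simulated variance matches the delta-method prediction.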

    Expectation of Random Variables

    1. Expectation of a Random Variable

    The expectation of a random variable {X} is its average value.

    Definition 1 The expectation, mean or first moment of {X} is defined to be

    \displaystyle  \mathbb{E}(X) = \begin{cases} \sum_x x f(x) & X \text{ is discrete} \\ \int x f(x) \thinspace dx & X \text{ is continuous}. \end{cases} \ \ \ \ \ (1)

    The following notations are also used.

    \displaystyle  \mathbb{E}(X) = \mathbb{E}X = \int x \thinspace dF(x) = \mu_X = \mu \ \ \ \ \ (2)

    Theorem 2 The Rule of the Lazy Statistician: Let {Y = r(X)}, then the expectation of Y is

    \displaystyle  \mathbb{E}(Y) = \int r(x) \thinspace dF_X(x). \ \ \ \ \ (3)

    2. Properties of Expectation

    Theorem 3 If {X_1, \dotsc, X_n} are random variables and {a_1, \dotsc, a_n} are constants, then

    \displaystyle  \mathbb{E}\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i=1}^na_i\mathbb{E}(X_i). \ \ \ \ \ (4)

    Theorem 4 If {X_1, \dotsc, X_n} are independent random variables, then

    \displaystyle  \mathbb{E}\left(\prod_{i=1}^n X_i\right) = \prod_{i=1}^n\mathbb{E}(X_i). \ \ \ \ \ (5)
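    Both properties can be checked by simulation. A sketch with two independent Uniform(0, 1) variables (an illustrative choice; {\mathbb{E}(X) = \mathbb{E}(Y) = 1/2}, and the names below are mine):

```python
import random

random.seed(4)
reps = 100000
# X, Y independent Uniform(0, 1).
pairs = [(random.random(), random.random()) for _ in range(reps)]

# Linearity: E(2X + 3Y) = 2 E(X) + 3 E(Y) = 2.5.
mean_lin = sum(2 * x + 3 * y for x, y in pairs) / reps
# Independence: E(XY) = E(X) E(Y) = 0.25.
mean_prod = sum(x * y for x, y in pairs) / reps
```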

    3. Variance and Covariance

    Definition 5 Let {X} be a random variable with mean {\mu}. The variance of {X}, denoted by {\mathbb{V}(X)}, {\mathbb{V}X}, or {\sigma^2} or {\sigma_X^2} is defined by:

    \displaystyle  \mathbb{V}(X) = \mathbb{E}((X - \mu)^2) = \int (x - \mu)^2 \thinspace dF(x) \ \ \ \ \ (6)

    assuming the variance exists. The standard deviation is the square root of the variance.

    Definition 6 If {X_1, \dotsc, X_n} are random variables, then we define the sample mean as

    \displaystyle  \overline{X}_n = \frac{1}{n}\left(\sum_{i=1}^n X_i\right). \ \ \ \ \ (7)

    Definition 7 If {X_1, \dotsc, X_n} are random variables, then we define the sample variance as

    \displaystyle  S_n^2 = \frac{1}{n - 1}\left(\sum_{i=1}^n (\overline{X}_n - X_i)^2\right). \ \ \ \ \ (8)

    4. Properties of Variance

    Theorem 8

    \displaystyle  \mathbb{V}(X) = \mathbb{E}(X^2) - \mu^2. \ \ \ \ \ (9)

    Theorem 9 If {a} and {b} are constants, then

    \displaystyle  \mathbb{V}(aX + b) = a^2 \mathbb{V}(X). \ \ \ \ \ (10)

    Theorem 10 If {X_1, \dotsc, X_n} are random variables and {a_1, \dotsc, a_n} are constants, then

    \displaystyle  \mathbb{V}\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i=1}^n{a_i}^2\mathbb{V}(X_i). \ \ \ \ \ (11)

    Theorem 11 If {X_1, \dotsc, X_n} are \textsc{iid} random variables with mean {\mu} and variance {\sigma^2}, then

    \displaystyle  \mathbb{E}(\overline{X}_n) = \mu, \quad \mathbb{V}(\overline{X}_n) = \frac{\sigma^2}{n} \quad \text{ and } \quad \mathbb{E}(S_n^2) = \sigma^2. \ \ \ \ \ (12)
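    The {1/(n-1)} factor in the sample variance is what makes {\mathbb{E}(S_n^2) = \sigma^2} hold exactly, even for small {n}. A simulation sketch using Uniform(0, 1) samples of size 5 (an illustrative choice; {\sigma^2 = 1/12}, and the names are mine):

```python
import random

random.seed(3)
n, reps = 5, 50000

def sample_variance(xs):
    """S_n^2 with the 1/(n - 1) factor."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

# E(S_n^2) should equal sigma^2 = 1/12 for Uniform(0, 1), even for n = 5.
avg_s2 = sum(sample_variance([random.random() for _ in range(n)])
             for _ in range(reps)) / reps
```

    Dividing by {n} instead of {n - 1} would bias the average visibly low at this sample size.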

    Random Variables

    1. Introduction

    Definition 1

  • Random Variable: A random variable {X} is a mapping

    \displaystyle  X \colon \Omega \rightarrow \mathbb{R} \ \ \ \ \ (1)

    which assigns real numbers {X(\omega)} to outcomes {\omega} in {\Omega}.

    2. Distribution Functions

    Definition 2

  • Distribution Function: Given a random variable {X}, the cumulative distribution function (also called the \textsc{cdf}) is a function {F_X \colon \mathbb{R} \rightarrow [0,1]} defined by:

    \displaystyle  F_X(x) = \mathbb{P}(X \leq x) \ \ \ \ \ (2)

    Theorem 3 Let {X} have \textsc{cdf} {F} and let {Y} have \textsc{cdf} {G}. If {F(x) = G(x)} for all {x \in \mathbb{R}}, then {\mathbb{P}(X \in A) = \mathbb{P}(Y \in A)} for all {A}.

    Definition 4 {X} is discrete if it takes countably many values.

    We define the probability function or the probability mass function for {X} by {f_X(x) = \mathbb{P}(X = x)}.

    Definition 5 A random variable {X} is said to be continuous if there exists a function {f_X} such that

  • {f_X(x) \geq 0} for all {x \in \mathbb{R}},
  • {\int_{-\infty}^{\infty}f_X(x)dx = 1}
  • for all {a, b \in \mathbb{R}} with {a \leq b} we have

    \displaystyle  \int_a^b f_X(x)dx = \mathbb{P}(a < X < b) \ \ \ \ \ (3)

    The function {f_X} is called the probability density function and we have

    \displaystyle  F_X(x) = \int_{-\infty}^x f_X(t)dt \ \ \ \ \ (4)

    and {f_X(x) = F_X'(x)} at all points for which {F_X} is differentiable.

    3. Important Discrete Random Variables

    Remark 1 We write {X \sim F} to denote that the random variable {X} has a \textsc{cdf} {F}.

    3.1. The Point Mass Distribution

    \textsc{The Point Mass Distribution}. X has a point mass distribution at {a}, written {X \sim \delta_a} if {\mathbb{P}(X = a) = 1}. Hence {F_X} is

    \displaystyle  F_X(x) = \begin{cases} 0& x < a \\ 1& x \geq a. \end{cases} \ \ \ \ \ (5)

    3.2. The Discrete Uniform Distribution

    \textsc{The Discrete Uniform Distribution}. Let {k > 1} be a given integer. Let {X} have a probability mass function given by:

    \displaystyle  f_X(x) = \begin{cases} 1/k & 1 \leq x \leq k \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (6)

    Then {X} has a discrete uniform distribution on {\{1, \dotsc, k\}}.

    3.3. The Bernoulli Distribution

    \textsc{The Bernoulli Distribution}. Let {X} be a random variable with {\mathbb{P}(X = 1) = p} and {\mathbb{P}(X = 0) = 1 - p} for some {p \in [0, 1]}. We say that {X} has a Bernoulli Distribution written as {X \sim \text{Bernoulli}(p)}. The probability function {f_X} is given by {f_X(x) = p^x(1 - p)^{(1 - x)} \text{ for } x \in \{0, 1\}}.

    3.4. The Binomial Distribution

    \textsc{The Binomial Distribution}. Flip a coin {n} times and let {X} denote the number of heads. If {p} denotes the probability of getting heads in a single coin toss and the tosses are assumed to be independent, then the probability function of {X} can be shown to be:

    \displaystyle  f_X(x) = \begin{cases} \begin{pmatrix} n \\ x \end{pmatrix} p^x(1 - p)^{(n-x)} & 0 \leq x \leq n \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (7)
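    Formula (7) translates directly into code using `math.comb` for the binomial coefficient (the function name is mine):

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# The probabilities sum to 1 over x = 0, ..., n.
total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
```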

    3.5. The Geometric Distribution

    \textsc{The Geometric Distribution}. {X} has a geometric distribution with parameter {p \in [0, 1]}, written as {X \sim \text{Geom}(p)} if

    \displaystyle  f_X(x) = p(1 - p)^{(x - 1)} \text{ for } x \in \{1, 2, 3, \dotsc\}. \ \ \ \ \ (8)

    {X} is the number of flips needed until the first head appears.

    3.6. The Poisson Distribution

    \textsc{The Poisson Distribution}. {X} has a Poisson distribution with parameter {\lambda > 0}, written as {X \sim \text{Poisson}(\lambda)} if

    \displaystyle  f_X(x) = e^{-\lambda}\frac{\lambda^{x}}{x!} \text{ for } x \in \{0, 1, 2, \dotsc\}. \ \ \ \ \ (9)

    {X} models the number of events occurring in a fixed interval of time and/or space when these events occur with a known average rate and independently of the time since the last event.
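    A small sketch of the Poisson pmf, checking numerically (by truncating the infinite sum) that it sums to 1 and that its mean equals {\lambda} (the function name is mine):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.0
# Truncating at x = 99 loses a negligible tail for lam = 3.
total = sum(poisson_pmf(x, lam) for x in range(100))
mean = sum(x * poisson_pmf(x, lam) for x in range(100))
```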

    4. Important Continuous Random Variables

    4.1. The Uniform Distribution

    \textsc{The Uniform Distribution}. For {a, b \in \mathbb{R} \text{ and } a < b}, X has a uniform distribution over {(a, b)}, written {X \sim \text{Uniform}(a, b)}, if

    \displaystyle  f_X(x) = \begin{cases} \frac{1}{b - a} & x \in [a, b] \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (10)

    4.2. The Normal Distribution

    \textsc{The Normal Distribution}. We say that {X} has a normal (or Gaussian) distribution with parameters {\mu} and {\sigma}, written as {X \sim N(\mu, \sigma^2)} if

    \displaystyle  f_X(x;\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}, \text{ where } \mu, \sigma \in \mathbb{R}, \sigma > 0. \ \ \ \ \ (11)

    The parameter {\mu} is the “center” (or mean) of the distribution and {\sigma} is the “spread” (or standard deviation) of the distribution. We say that {X} has a standard Normal distribution if {\mu = 0} and {\sigma = 1}. A standard Normal random variable is denoted by {Z}. The \textsc{pdf} and \textsc{cdf} of a standard Normal are denoted by {\phi(z)} and {\Phi(z)}; the \textsc{pdf} is plotted in the figure. There is no closed-form expression for {\Phi}. Here are some useful facts:

  • If {X \sim N(\mu, \sigma^2)}, then {Z = (X - \mu) / \sigma \sim N(0, 1)}.
  • If {Z \sim N(0, 1)}, then {X = \mu + \sigma Z \sim N(\mu, \sigma^2)}.
  • If {X_i \sim N(\mu_i, \sigma_i^2)} for {i = 1, \dotsc , n} are independent, then we have

    \displaystyle  \sum_{i = 1}^nX_i \sim N\left(\sum_{i=1}^n \mu_i, \sum_{i=1}^n \sigma_i^2\right). \ \ \ \ \ (12)

    It follows from {(i)} that if {X \sim N(\mu, \sigma^2)}, then

    \displaystyle  \mathbb{P}\left(a < X < b\right) = \mathbb{P}\left(a < \mu + \sigma Z < b\right) \\ = \mathbb{P}\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right) \\ = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right). \ \ \ \ \ (13)
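    Since {\Phi} has no closed form, in practice one uses tables or a library. A sketch using Python's `statistics.NormalDist`, whose `cdf` plays the role of {\Phi} (the helper name is mine):

```python
from statistics import NormalDist

def normal_interval_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), via the standard normal cdf Phi."""
    phi = NormalDist().cdf
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)
```

    For example, `normal_interval_prob(-1.96, 1.96, 0, 1)` gives the familiar 95% central probability of a standard Normal.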

    Example 1 Suppose that {X \sim N(3, 5)}. Find {P(X > 1)}.

    Solution:

    \displaystyle  \mathbb{P}\left(X > 1\right) = \mathbb{P}\left(3 + Z\sqrt{5} > 1\right) \\ = \mathbb{P}\left( Z > \frac{-2}{\sqrt{5}}\right) \\ = 1 - \Phi\left(\frac{-2}{\sqrt{5}}\right) \\ = \Phi\left(\frac{2}{\sqrt{5}}\right) = \Phi\left(0.894427\right) \approx 0.81. \ \ \ \ \ (17)

    Example 2 For the above problem, also find the value {x} of {X} such that {\mathbb{P}(X < x) = .2}. Solution:

    \displaystyle  0.2 = \mathbb{P}\left(X < x\right) \\ = \mathbb{P}\left(3 + Z\sqrt{5} < x\right) \\ = \mathbb{P}\left(Z < \frac{x - 3}{\sqrt{5}}\right) \\ = \Phi\left(\frac{x - 3}{\sqrt{5}}\right) \ \ \ \ \ (18)

    From the normal table, we have that {\Phi(-0.8416) = 0.2}

    \displaystyle  \Phi(-0.8416) = \Phi\left(\frac{x - 3}{\sqrt{5}}\right) \\ -0.8416 = \left(\frac{x - 3}{\sqrt{5}}\right) \\ x = \left(3 - 0.8416\times\sqrt{5}\right) \\ x = 1.1181. \ \ \ \ \ (19)
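    Both examples can be verified with Python's `statistics.NormalDist`, whose `cdf` and `inv_cdf` play the roles of {\Phi} and {\Phi^{-1}} (the variable names below are mine):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()              # standard normal: cdf is Phi, inv_cdf is Phi^{-1}
mu, sigma = 3, sqrt(5)        # X ~ N(3, 5)

p_gt_1 = 1 - Z.cdf((1 - mu) / sigma)    # Example 1: P(X > 1)
x_20 = mu + sigma * Z.inv_cdf(0.2)      # Example 2: x with P(X < x) = 0.2
```

    This reproduces both answers without consulting a normal table.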

    4.3. The Exponential Distribution

    \textsc{The Exponential Distribution}. {X} has an exponential distribution with parameter {\beta > 0}, written as {X \sim \text{Exp}(\beta)}, if

    \displaystyle  f_X(x) = \frac{1}{\beta}e^{-x/\beta}, \text{ for } x > 0. \ \ \ \ \ (20)

    4.4. The Gamma Distribution

    \textsc{The Gamma Distribution}. For {\alpha > 0}, the Gamma function is defined as

    \displaystyle  \Gamma(\alpha) = \int_0^\infty y^{\alpha - 1} e^{-y} dy. \ \ \ \ \ (21)

    {X} has a Gamma distribution with parameters {\alpha} and {\beta} (where {\alpha, \beta > 0}), written as {X \sim \text{Gamma}(\alpha, \beta)}, if

    \displaystyle  f_X(x) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)}x^{\alpha - 1}e^{-x/\beta}, \text{ for } x > 0. \ \ \ \ \ (22)

    When {\alpha} is a positive integer, {X} can be interpreted as the waiting time until the {\alpha}-th event in a Poisson process with mean inter-arrival time {\beta}; in particular, {\text{Gamma}(1, \beta) = \text{Exp}(\beta)}.

    5. Bivariate Distributions

    Definition 6 Given a pair of discrete random variables {X} and {Y}, their joint mass function is defined as {f_{X, Y}(x,y) = \mathbb{P}(X = x, Y = y)}

    Definition 7 For two continuous random variables, {X} and {Y}, we call a function {f_{X,Y}} a \textsc{pdf} of random variables {(X, Y)} if

  • {f_{X, Y}(x, y) \geq 0} for all {(x, y) \in \mathbb{R}^2},
  • {\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x, y) \thinspace dx \thinspace dy = 1}
  • For any set {A \subset \mathbb{R}^2} we have

    \displaystyle  \int \int_A f_{X,Y}(x, y) \thinspace dx \thinspace dy = \mathbb{P}((X,Y) \in A). \ \ \ \ \ (23)

    Example 3 For {-1 \leq x \leq 1}, let {(X, Y)} have density

    \displaystyle  f_{X, Y}(x,y) = \begin{cases} cx^2y & x^2 \leq y \leq 1, \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (24)

    Find the value of {c} .

    Solution: We equate the integral of {f} over {\mathbb{R}^2} to {1} and find {c}.

    \displaystyle  1 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x, y) \thinspace dy \thinspace dx \\ = \int_{-1}^{1}\int_{x^2}^{1}f_{X,Y}(x, y) \thinspace dy \thinspace dx \\ = \int_{-1}^{1}\int_{x^2}^{1}cyx^2 \thinspace dy \thinspace dx \\ = \int_{-1}^{1}c\left(\frac{1 - x^4}{2}\right)x^2 \thinspace dx \\ = \left(\frac{c}{2}\right)\left(\int_{-1}^{1}x^2 \thinspace dx - \int_{-1}^{1}x^6 \thinspace dx \right)\\ = \left(\frac{c}{2}\right)\left( \frac{2}{3} - \frac{2}{7}\right)\\ = \left(\frac{4c}{21}\right) \\ c = \frac{21}{4} \ \ \ \ \ (25)
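    The answer can be double-checked numerically by integrating the density over the region with a midpoint rule (plain Python, no libraries; the step counts are arbitrary):

```python
# Midpoint-rule check that f(x, y) = (21/4) x^2 y integrates to 1
# over the region -1 <= x <= 1, x^2 <= y <= 1.
c = 21 / 4
steps = 400
total = 0.0
hx = 2 / steps
for i in range(steps):
    x = -1 + (i + 0.5) * hx           # midpoint in x
    lo = x * x                         # lower limit of y depends on x
    hy = (1 - lo) / steps
    for j in range(steps):
        y = lo + (j + 0.5) * hy        # midpoint in y
        total += c * x * x * y * hx * hy
```

    The accumulated `total` comes out very close to 1, confirming {c = 21/4}.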

    6. Marginal Distributions

    Definition 8 For the discrete case, if {X, Y} have a joint mass function {f_{X, Y}}, then the marginal distribution of {X} is given by

    \displaystyle  f_X(x) = \mathbb{P}(X = x) = \sum_y \mathbb{P}(X = x, Y = y) = \sum_y f_{X,Y}(x, y) \ \ \ \ \ (26)

    and of {Y} is given by

    \displaystyle  f_Y(y) = \mathbb{P}(Y = y) = \sum_x \mathbb{P}(X = x, Y = y) = \sum_x f_{X,Y}(x, y) \ \ \ \ \ (27)

    Definition 9 For the continuous case, if {X, Y} have a joint probability density function {f_{X, Y}}, then the marginal distribution of {X} is given by

    \displaystyle  f_X(x) = \int f_{X,Y}(x, y) \thinspace dy \ \ \ \ \ (28)

    and of {Y} is given by

    \displaystyle  f_Y(y) = \int f_{X,Y}(x, y) \thinspace dx \ \ \ \ \ (29)

    7. Independent Random Variables

    Definition 10 Two random variables, {X} and {Y} are said to be independent if for every {A} and {B} we have

    \displaystyle  \mathbb{P}(X \in A, Y \in B) = \mathbb{P}(X \in A)\mathbb{P}(Y \in B) \ \ \ \ \ (30)

    Theorem 11 Let {X} and {Y} have a joint \textsc{pdf} {f_{X, Y}}. Then {X \amalg Y} if and only if {f_{X, Y}(x, y) = f_X(x)f_Y(y)} for all values of {x} and {y}.

    8. Conditional Distributions

    Definition 12 Let {X} and {Y} have a joint \textsc{pdf} {f_{X, Y}}. Then the conditional distribution of {X} given {Y = y} is defined as

    \displaystyle  f_{X|Y}(x|y) = \frac{f_{X, Y}(x, y)}{f_Y(y)} \ \ \ \ \ (31)

    9. Multivariate Distributions and \textsc{iid} Samples

    Definition 13 Independence of {n} random variables: Let {X = \begin{pmatrix} X_1, \dotsc, X_n \end{pmatrix}} where {X_1, \dotsc, X_n} are random variables. Let {f(x_1, x_2, \dotsc, x_n)} denote their \textsc{pdf}. We say that {X_1, \dotsc, X_n} are independent if for every {A_1, \dotsc, A_n},

    \displaystyle  \mathbb{P}(X_1 \in A_1, \dotsc, X_n \in A_n) = \prod_{i=1}^n \mathbb{P}(X_i \in A_i) \ \ \ \ \ (32)

    Definition 14 If {X_1, \dotsc, X_n} are independent random variables with the same marginal distribution {F}, we say that {X_1, \dotsc, X_n} are \textsc{iid} (independent and identically distributed) random variables and we write:

    \displaystyle  \begin{pmatrix} X_1, \dotsc, X_n \end{pmatrix} \sim F \ \ \ \ \ (33)

    If {F} has density {f} then we also write {\begin{pmatrix} X_1, \dotsc, X_n \end{pmatrix} \sim f}. We also call {X_1, \dotsc, X_n} a random sample of size {n} from {F}.

    10. The Multivariate Normal Distribution

    \textsc{The Multivariate Normal Distribution} In the multivariate normal distribution, the parameter {\mu} is a vector and the parameter {\sigma} is replaced by a matrix {\Sigma}. Let

    \displaystyle  Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_k \end{pmatrix} \ \ \ \ \ (34)

    where {Z_1, \dotsc, Z_k \sim N(0, 1)} are independent. The joint density of {Z} is

    \displaystyle  f_Z(z) = \frac{1}{{(2\pi)}^{k/2}}\text{exp}\biggl\{-\frac{1}{2}\sum_{j=1}^k {z_j}^2\biggr\} = \frac{1}{{(2\pi)}^{k/2}}\text{exp}\left\{-\frac{1}{2}z^Tz\right\} \ \ \ \ \ (35)

    More generally, we say {X} has a multivariate Normal distribution with mean vector {\mu} and covariance matrix {\Sigma}, written {X \sim N(\mu, \Sigma)}, if its density is

    \displaystyle  f_X(x;\mu, \Sigma) = \frac{1}{{(2\pi)}^{k/2}{|\Sigma |}^{1/2}}\text{exp}\left\{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right\} \ \ \ \ \ (36)

    Frequentists and Bayesians

    1. Interpretations of Probability

    1.1. Bayesians and Frequentists

    There are two possible ways to interpret the meaning of probability.

    1.2. Maps and Territories

    Here ‘territory’ refers to the world as it exists or the reality as it is.

    The map refers to our model of the world or the way we see and interpret it.

    We are constantly building ‘maps’ or models of the territory. The better our maps the closer we are to the ‘truth’.

    1.3. PDFs – Existence in Maps or Territory

    The main contention between frequentists and bayesians is the question: “Where do probability density functions exist – Do they exist in the map or in the territory?”

    Frequentists hold that the probability density functions exist in the territory.

    Bayesians believe that they exist only in our map of reality.

    For example, suppose we toss a coin. Frequentists believe that there is a probability density function, independent of our maps and interpretations of reality, which forms the basis of the randomness observed in the coin toss. Bayesians believe that if we knew the force applied by the fingers tossing the coin, the mass, shape and orientation of the coin, and the movement of the molecules in the air at the time of the toss (in short, if we had a sufficiently accurate model of reality), then the probability density function would change, perhaps to the point where the question of the coin landing heads or tails becomes deterministic.

    1.4. Many Worlds Interpretation of Quantum Mechanics

    This states that there are multiple universes, each corresponding to some combination of values that the various probability distributions of position and momentum happen to generate. If the many-worlds interpretation is correct, then our world is deterministic, and probability can exist only in the map and not in the territory.

    Probability

    1. Introduction

    Probability is the mathematical language for quantifying uncertainty.

    2. Sample Space and Events

    The setup begins with an experiment being conducted. It can have a number of outcomes. The following are then defined:

    Definition 1

  • Sample Space: The sample space {\Omega} is the set of all possible outcomes.
    Definition 2

  • Realizations, Sample Outcomes or Elements: These refer to points {\omega} in {\Omega}.
    Definition 3

  • Events: Subsets of sample space are called events.
  • Example 1 If we toss a coin twice then {\Omega = \lbrace HT, TH, HH, TT\rbrace} and the event that the first toss is heads is {A = \lbrace HH, HT\rbrace}.

    The complements, unions, intersections and differences of event sets can be defined and interpreted trivially. {\Omega} is the true event and {\emptyset} is the false event.

    Definition 4

  • Disjoint or Mutually Exclusive Events: {A_1, A_2, \dotsc,} are mutually exclusive events if {A_i \bigcap A_j = \emptyset} whenever {i \neq j}.
    Definition 5

  • Partition of {\Omega}: A partition of {\Omega} is a sequence of disjoint sets such that their union is {\Omega}.
    3. Probability

    Definition 7

  • Probability Distribution or a Probability Measure: A function {\mathbb{P}} is called a probability measure or a probability distribution if it satisfies the following three axioms:
  • Axiom 1: {\mathbb{P}(\Omega) = 1}.
  • Axiom 2: {\mathbb{P}(A) \geq 0} for every {A}.
  • Axiom 3: If {A_1, A_2, \dotsc, } are disjoint, then:

    \displaystyle  \mathbb{P}\left(\bigcup \limits_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty}\mathbb{P}(A_i) \ \ \ \ \ (2)

    4. Properties of Probability Distributions

    One can derive many properties from the definition of probability distribution (Definition 7).

    \displaystyle  \mathbb{P}(\emptyset) = 0 \ \ \ \ \ (3)

    \displaystyle  A \subset B \Longrightarrow \mathbb{P}(A) \leq \mathbb{P}(B) \ \ \ \ \ (4)

    \displaystyle  0 \leq \mathbb{P}(A) \leq 1 \ \ \ \ \ (5)

    \displaystyle  \mathbb{P}(A^c) = 1 - \mathbb{P}(A) \ \ \ \ \ (6)

    \displaystyle  A \bigcap B = \emptyset \Longrightarrow \mathbb{P}\left(A \bigcup B\right) = \mathbb{P}(A) + \mathbb{P}(B) \ \ \ \ \ (7)

    Lemma 8 If {A} and {B} are two events, then

    \displaystyle  \mathbb{P}\left(A \bigcup B\right) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}\left(A \bigcap B\right) \ \ \ \ \ (8)

    Theorem 9 Continuity of Probabilities: If {A_n \rightarrow A}, then

    \displaystyle  \mathbb{P}(A_n) \rightarrow \mathbb{P}(A) \ \ \ \ \ (9)

    as {n \rightarrow \infty}.

    5. Probability on Finite Sample Spaces

    If the sample space {\Omega = \{\omega_1, \omega_2, \dotsc, \omega_n\}} is finite and each outcome is equally likely, then:

    \displaystyle  \mathbb{P}(A) = \frac{|A|}{|\Omega|} \ \ \ \ \ (10)

    Given {n} objects, the number of ways of arranging or permuting them is

    \displaystyle  n! = 1 \times 2 \times \dotsb \times (n - 1) \times n \ \ \ \ \ (11)

    Given {n} objects, the number of ways of selecting or choosing {k \text{ (where } 1 \leq k \leq n)} out of them is

    \displaystyle  \begin{pmatrix} n \\ k \end{pmatrix} = \frac{n!}{k!(n-k)!} \ \ \ \ \ (12)

    For example, the number of ways to choose 3 students out of a class of 20 is

    \displaystyle  \begin{pmatrix} 20 \\ 3 \end{pmatrix} = \frac{20!}{3!(17)!} = \frac{20 \times 19 \times 18}{1 \times 2 \times 3} = 1140 \ \ \ \ \ (13)
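    Python's `math` module exposes these counting operations directly; a quick check of the computation above (variable names are mine):

```python
from math import comb, factorial

# 20 choose 3, directly and via the factorial formula.
n_direct = comb(20, 3)
n_formula = factorial(20) // (factorial(3) * factorial(17))
```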

    6. Independent Events

    Definition 10

  • Independent Events: Two events, {A} and {B} are said to be independent if

    \displaystyle  \mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B) \ \ \ \ \ (14)

    A set of events {\{A_i : i \in I\} } is independent if

    \displaystyle  \mathbb{P}\left(\bigcap_{i \in J} A_i\right) = \prod_{i \in J} \left(\mathbb{P}(A_i)\right) \ \ \ \ \ (15)

    for every finite subset {J} of {I}.

    Independence can be of two types – assumed or derived.

    Two disjoint events, each having positive probability, cannot be independent.

    7. Conditional Probability

    Definition 11

  • Conditional Probability: Assuming {\mathbb{P}(B) > 0}, the conditional probability of {A} given that {B} has occurred is

    \displaystyle  \mathbb{P}(A|B) = \frac{\mathbb{P}(A \bigcap B)}{\mathbb{P}(B)}. \ \ \ \ \ (16)

    Remark 1 {\mathbb{P}(A|B)} is the fraction of times {A} occurs in cases when {B} has occurred.

    Lemma 12 If {A} and {B} are independent events then {\mathbb{P}(A|B) = \mathbb{P}(A)}. Also, for any pair of events {A} and {B}

    \displaystyle  \mathbb{P}(AB) = \mathbb{P}(A|B)\mathbb{P}(B) = \mathbb{P}(B|A)\mathbb{P}(A). \ \ \ \ \ (17)

    8. Bayes’ Theorem

    Theorem 13

  • The Law of Total Probability: Let {A_1, A_2, \dotsc, A_n} be a partition of {\Omega} and let {B} be any event, then:

    \displaystyle  \mathbb{P}(B) = \sum_{i=1}^n \mathbb{P}(B|A_i)\mathbb{P}(A_i). \ \ \ \ \ (18)

  • Overview of Total Probability Theorem:

    • We are given
      • a partition of the sample space and
      • any other event B.
    • We have found a relation between
      • the probability of the single event B and
      • the probabilities of the events comprising the partition and the conditional probabilities of the single event B given the events in the partition.

    Theorem 14

  • Bayes’ Theorem: Let {A_1, A_2, \dotsc, A_n} be a partition of {\Omega} such that {\mathbb{P}(A_i) > 0 } for each {i}. If {\mathbb{P}(B) > 0}, then for each {i = 1, \dotsc, n}:

    \displaystyle  \mathbb{P}(A_i|B) = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\sum_{j=1}^n \mathbb{P}(B|A_j)\mathbb{P}(A_j)}. \ \ \ \ \ (19)

  • Overview of Bayes’ Theorem:

    • We are given
      • a partition of the sample space: a set of {n} events covering the sample space, and
      • another event {B}, which is not part of the partition.
    • We have found a relation between
      • the probability of {A_i|B}: the probability of the partition events given the single event {B} has occurred, and
      • the probability of {B|A_i}: the probability of the single event {B} given the partition events have occurred.

    Example 2 Suppose that {A_1, A_2 \text{ and } A_3} are the events that an email is spam, low priority or high priority, respectively. Let {\mathbb{P}(A_1) = .7, \thinspace \mathbb{P}(A_2) = .2, \text{ and } \mathbb{P}(A_3) = .1 }.

    Let {B} be the event that the email contains the word “free”.

    Let {\mathbb{P}(B|A_1) = .9, \thinspace \mathbb{P}(B|A_2) = .02, \text{ and } \mathbb{P}(B|A_3) = .01 }.

    If the email received has the word “free”, what is the probability that it is spam?

    Here,

    \displaystyle  \mathbb{P}(\text{Spam Email} \mid \text{Email has Word Free}) = \mathbb{P}(A_1|B) \ \ \ \ \ (20)

    \displaystyle  \mathbb{P}(A_1|B) = \frac{\mathbb{P}(B|A_1)\mathbb{P}(A_1)}{\sum_{j=1}^3 \mathbb{P}(B|A_j)\mathbb{P}(A_j)} = \frac{.9 \times .7}{.9 \times .7 + .02 \times .2 + .01 \times .1} = \frac{.63}{.635} \approx .992. \ \ \ \ \ (21)
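    The same computation in a few lines of plain Python, using the stated priors and likelihoods (the variable names are mine):

```python
# Bayes' theorem for the spam example.
priors = [0.7, 0.2, 0.1]          # P(A_1), P(A_2), P(A_3)
likelihoods = [0.9, 0.02, 0.01]   # P(B | A_i)

# Law of total probability: P(B) = sum_i P(B | A_i) P(A_i).
evidence = sum(p * l for p, l in zip(priors, likelihoods))
# Posterior probability that an email containing "free" is spam.
posterior_spam = priors[0] * likelihoods[0] / evidence
```

    Even though only 70% of emails are spam a priori, the word “free” is so much more likely under spam that the posterior exceeds 99%.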