1. Introduction
We assume that the data we are looking at comes from a probability distribution with some unknown parameters that control the exact shape of the distribution.
Definition 1 Statistical Inference: Statistical inference is the process of using given data to infer the properties of the distribution that generated the data (for example, the values of its parameters). In computer science it is also called ‘learning’.
Definition 2 Statistical Models: A statistical model is a set of distributions.
When we find out the form of the distribution (the equations that describe it) and the values of its parameters, we gain more understanding of the source of our data.
2. Parametric Models
Definition 3 Parametric Models: A parametric model is a statistical model which is parameterized by a finite number of parameters. A general form of a parametric model is
$$\mathfrak{F} = \{ f(x; \theta) : \theta \in \Theta \}$$
where $\theta$ is an unknown parameter (or vector of parameters) that can take values in the parameter space $\Theta$.
Example 1 An example of a parametric model is the set of all Normal densities, parameterized by $\theta = (\mu, \sigma)$:
$$\mathfrak{F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) : \mu \in \mathbb{R},\ \sigma > 0 \right\}$$
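As a quick illustrative sketch (the data and true parameter values here are simulated, not from the text): for the two-parameter Normal model, the maximum likelihood estimates of $\mu$ and $\sigma$ are the sample mean and the (biased, divide-by-$n$) sample standard deviation.

```python
# Fit the parametric Normal model f(x; mu, sigma) to simulated data.
# theta = (mu, sigma) is a finite parameter vector, so this is a
# parametric model. True parameters below are chosen arbitrarily.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # true theta = (5, 2)

mu_hat = data.mean()    # MLE of mu: the sample mean
sigma_hat = data.std()  # MLE of sigma: divides by n, not n - 1
print(mu_hat, sigma_hat)  # close to the true values (5, 2)
```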
3. Non-Parametric Models
Definition 4 Non-Parametric Models: A non-parametric model is one in which $\mathfrak{F}$ cannot be parameterized by a finite number of parameters.
3.1. Non-Parametric Estimation of Functionals
Definition 5 Sobolev Space: Usually, it is not possible to estimate the probability distribution from the data by just assuming that it exists. We need to restrict the space of possible solutions. One way is to assume that the density function is a smooth function, for example one whose second derivative is square integrable, $\int (f''(x))^2 \, dx < \infty$. The restricted space is called a Sobolev space.
Definition 6 Statistical Functional: Any function $T(F)$ of the \textsc{cdf} $F$ is called a statistical functional.
Example 2 Statistical Functionals: The mean, variance and median can be thought of as functions of $F$:
The mean $\mu$ is given as:
$$\mu = T(F) = \int x \, dF(x)$$
The variance is given as:
$$\sigma^2 = T(F) = \int (x - \mu)^2 \, dF(x)$$
The median is given as:
$$m = T(F) = F^{-1}(1/2)$$
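Replacing $F$ with the empirical \textsc{cdf} gives plug-in estimates of these functionals, which for the mean, variance and median reduce to the familiar sample quantities. A minimal sketch, with simulated Exponential(1) data chosen for illustration (true mean 1, variance 1, median $\ln 2$):

```python
# Plug-in estimates of three statistical functionals T(F), computed
# by substituting the empirical CDF for F. Data is simulated.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=50_000)  # Exp(1) sample

mean_hat = np.mean(x)                   # T(F) = integral of x dF(x)
var_hat = np.mean((x - mean_hat) ** 2)  # T(F) = integral of (x - mu)^2 dF(x)
median_hat = np.median(x)               # T(F) = F^{-1}(1/2)
print(mean_hat, var_hat, median_hat)    # near 1, 1, ln 2 = 0.693
```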
4. Regression
Definition 7 Independent and Dependent Variables: We observe pairs of data: $(X_1, Y_1), \ldots, (X_n, Y_n)$. $Y$ is assumed to depend on $X$, which is assumed to be the independent variable. The other names for these are, for:
$X$: predictor, regressor, feature or independent variable.
$Y$: response variable, outcome or dependent variable.
Definition 8 Regression Function: The regression function is
$$r(x) = \mathbb{E}(Y \mid X = x)$$
Definition 9 Parametric and Non-Parametric Regression Models: If we assume that $r \in \mathfrak{F}$ where $\mathfrak{F}$ is finite dimensional, then the model is a parametric regression model; otherwise it is a non-parametric regression model.
There can be three categories of regression, based on the purpose for which it was done:
- Prediction,
- Classification and
- Curve Estimation
Definition 10 Prediction: The goal of predicting $Y$ based on the value of $X$ is called prediction.
Definition 11 Classification: If $Y$ is discrete then prediction is instead called classification.
Definition 12 Curve Estimation: If our goal is to estimate the function $r$, then we call this regression or curve estimation.
The regression function can be algebraically manipulated to express it in the form
$$Y = r(X) + \epsilon$$
where $\mathbb{E}(\epsilon) = 0$.
If $\mathfrak{F}$ is a parametric model, then we write $P_\theta(X \in A)$ to denote the probability that $X$ belongs to $A$. It does not mean that we are averaging over $\theta$; it means that the probability is calculated assuming the parameter is $\theta$.
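A minimal sketch of parametric regression under the model $Y = r(X) + \epsilon$: here the true regression function $r(x) = 2 + 3x$ and the noise level are illustrative assumptions, and the finite-dimensional model is the family of straight lines fit by least squares.

```python
# Simulate data from Y = r(X) + eps with r(x) = 2 + 3x and
# E(eps) = 0, then estimate r within the parametric (linear) model.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
X = rng.uniform(0, 1, size=n)
eps = rng.normal(0, 0.5, size=n)  # mean-zero noise
Y = 2 + 3 * X + eps               # Y = r(X) + eps

# Least-squares fit; np.polyfit returns (slope, intercept) for deg=1.
slope, intercept = np.polyfit(X, Y, deg=1)
print(intercept, slope)  # close to the true (2, 3)
```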
5. Fundamental Concepts in Inference
Many inferential problems can be identified as being one of three types: estimation, confidence sets, or hypothesis testing.
5.1. Point Estimates
Definition 13 Point Estimation: Point estimation refers to providing a single “best guess” of some quantity of interest. The quantity of interest could be
- a parameter $\theta$ in a parametric model,
- a \textsc{cdf} $F$,
- a probability density function $f$,
- a regression function $r$, or
- a prediction for a future value $Y$ of some random variable.
By convention, we denote a point estimate of $\theta$ by $\hat{\theta}$ or $\hat{\theta}_n$. Since $\theta$ is a fixed, unknown quantity, while the estimate $\hat{\theta}$ depends on the data, $\hat{\theta}$ is a random variable.
Definition 14 Point Estimator of $\theta$: Formally, let $X_1, \ldots, X_n$ be $n$ \textsc{iid} data points from some distribution $F$. Then, a point estimator $\hat{\theta}_n$ of $\theta$ is some function of $X_1, \ldots, X_n$:
$$\hat{\theta}_n = g(X_1, \ldots, X_n)$$
Definition 15 Bias of an Estimator: The bias of an estimator is defined as:
$$\mathsf{bias}(\hat{\theta}_n) = \mathbb{E}_\theta(\hat{\theta}_n) - \theta$$
Definition 16 Consistent Estimator: A point estimator $\hat{\theta}_n$ of $\theta$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$.
Definition 17 Sampling Distribution: The distribution of $\hat{\theta}_n$ is called the sampling distribution.
Definition 18 Standard Error: The standard deviation of the sampling distribution is called the standard error, denoted by \textsf{se}:
$$\textsf{se} = \textsf{se}(\hat{\theta}_n) = \sqrt{\mathbb{V}(\hat{\theta}_n)}$$
In some cases, \textsf{se} depends upon the unknown distribution $F$. Its estimate is denoted by $\widehat{\textsf{se}}$.
Definition 19 Mean Squared Error: It is used to evaluate the quality of a point estimator. It is defined as
$$\textsc{mse} = \mathbb{E}_\theta(\hat{\theta}_n - \theta)^2$$
Example 3 Let $X_1, \ldots, X_n$ be \textsc{iid} random variables with Bernoulli($p$) distribution. Then $\hat{p}_n = \frac{1}{n} \sum_{i=1}^n X_i$. Then, $\mathbb{E}(\hat{p}_n) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}(X_i) = p$. Hence, $\hat{p}_n$ is unbiased.
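This can be checked by simulation: drawing many independent Bernoulli datasets and averaging $\hat{p}_n$ across them approximates $\mathbb{E}(\hat{p}_n)$, while the spread of $\hat{p}_n$ across datasets approximates its standard error $\sqrt{p(1-p)/n}$. The values of $p$, $n$ and the repetition count below are arbitrary choices for illustration.

```python
# Simulation check that p_hat = (1/n) * sum(X_i) is unbiased for
# Bernoulli(p), and that its sampling std matches sqrt(p(1-p)/n).
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 100, 20_000

samples = rng.binomial(1, p, size=(reps, n))  # reps iid datasets
p_hats = samples.mean(axis=1)                 # one p_hat per dataset

print(p_hats.mean())  # near p = 0.3 (unbiasedness)
print(p_hats.std())   # near sqrt(0.3 * 0.7 / 100), about 0.046
```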
Definition 20 Asymptotically Normal Estimator: An estimator is asymptotically normal if
$$\frac{\hat{\theta}_n - \theta}{\textsf{se}} \rightsquigarrow N(0, 1)$$
5.2. Confidence Sets
Definition 21 A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$ (where $a = a(X_1, \ldots, X_n)$ and $b = b(X_1, \ldots, X_n)$ are functions of the data), such that
$$\mathbb{P}_\theta(\theta \in C_n) \geq 1 - \alpha \quad \text{for all } \theta \in \Theta$$
In words, $C_n$ traps $\theta$ with probability $1 - \alpha$. We call $1 - \alpha$ the coverage of the confidence interval. $C_n$ is random and $\theta$ is fixed. Commonly, people use 95 percent confidence intervals, which corresponds to choosing $\alpha = 0.05$. If $\theta$ is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.
Theorem 22 (Normal Based Confidence Intervals)
Let $\hat{\theta}_n \approx N(\theta, \widehat{\textsf{se}}^2)$. Let $\Phi$ be the \textsc{cdf} of a random variable $Z$ with standard normal distribution, let
$$z_{\alpha/2} = \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right), \quad \text{that is, } \mathbb{P}(Z > z_{\alpha/2}) = \frac{\alpha}{2},$$
and let
$$C_n = \left( \hat{\theta}_n - z_{\alpha/2}\,\widehat{\textsf{se}},\ \hat{\theta}_n + z_{\alpha/2}\,\widehat{\textsf{se}} \right)$$
Then,
$$\mathbb{P}_\theta(\theta \in C_n) \to 1 - \alpha$$
For 95% confidence intervals, $1 - \alpha$ is .95, $\alpha$ is .05, $z_{\alpha/2}$ is 1.96 and the interval is thus $C_n = (\hat{\theta}_n - 1.96\,\widehat{\textsf{se}},\ \hat{\theta}_n + 1.96\,\widehat{\textsf{se}})$.
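A minimal simulation sketch of the coverage claim, using the Bernoulli case where $\hat{p}_n$ is asymptotically normal with $\widehat{\textsf{se}} = \sqrt{\hat{p}_n(1-\hat{p}_n)/n}$; the choices of $p$, $n$ and the number of repetitions are illustrative assumptions.

```python
# Coverage check for the normal-based 95% interval
# C_n = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat).
# Over many simulated datasets, C_n should trap the true p
# roughly 95% of the time.
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.4, 500, 10_000

samples = rng.binomial(1, p, size=(reps, n))
p_hat = samples.mean(axis=1)
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

covered = (p_hat - 1.96 * se_hat <= p) & (p <= p_hat + 1.96 * se_hat)
print(covered.mean())  # near 0.95
```

Note that each interval either contains $p$ or it does not; the 95% refers to the long-run frequency across repeated samples, matching the statement that $C_n$ is random while $\theta$ is fixed.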
5.3. Hypothesis Testing
In hypothesis testing, we start with some default theory – called a null hypothesis – and we ask if the data provide sufficient evidence to reject the theory. If not, we retain the null hypothesis.