Parametric Inference

There are two common methods of estimating {\theta}.

1. Method of Moments

The method of moments is a way of generating parametric estimators. These estimators are generally not optimal, but they are easy to compute and are often used as starting values for other numerical estimation methods.

Definition 1 Moments and Sample Moments:

Suppose that the parameter {\theta} has {k} components: {\theta = (\theta_1,\dotsc,\theta_k)}. For {1 \leq j \leq k},

Define the {j^{th}} moment as

\displaystyle  \alpha_j \equiv \alpha_j(\theta) = \mathbb{E}_\theta(X^j) = \int x^{j}\,\mathrm{d}F_{\theta}(x). \ \ \ \ \ (1)

Define the {j^{th}} sample moment as

\displaystyle  \hat{\alpha_j} = \frac{1}{n}\sum_{i=1}^n X_i^j. \ \ \ \ \ (2)

Definition 2

The method of moments estimator {\hat{\theta_n}} is the value of {\theta} which satisfies

\displaystyle  \alpha_1(\hat{\theta_n}) = \hat{\alpha_1} \\ \alpha_2(\hat{\theta_n}) = \hat{\alpha_2} \\ \vdots \\ \alpha_k(\hat{\theta_n}) = \hat{\alpha_k} \ \ \ \ \ (3)

Why the above method works: the method of moments estimator is obtained by equating the {j^{th}} moment with the {j^{th}} sample moment for {j = 1, \dotsc, k}. Since there are k of them, we get k equations in k unknowns (the unknowns are the k components of {\theta}). The equations can be solved because the moments {\alpha_j(\theta)} are known functions of the parameters, while the sample moments {\hat{\alpha_j}} can be computed directly from the data.
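For concreteness, here is a minimal numerical sketch in Python (assuming NumPy is available), using a Gamma model with shape {a} and rate {\lambda} purely as an illustration: its first two moments are {a/\lambda} and {a(a+1)/\lambda^2}, so the two moment equations can be solved in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data from a Gamma model with shape a and rate lam
# (hypothetical values, chosen only for this sketch).
a_true, lam_true = 3.0, 2.0
x = rng.gamma(shape=a_true, scale=1.0 / lam_true, size=10_000)

# Sample moments, equation (2): alpha_hat_j = (1/n) * sum(X_i ** j).
m1 = np.mean(x)       # first sample moment
m2 = np.mean(x ** 2)  # second sample moment

# Population moments: E[X] = a/lam and E[X^2] = a(a+1)/lam^2, so equating
# them with m1 and m2 (the system (3) with k = 2) and solving gives:
lam_hat = m1 / (m2 - m1 ** 2)
a_hat = m1 * lam_hat

print(a_hat, lam_hat)  # should be close to the true values 3.0 and 2.0
```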

2. Maximum Likelihood Method

It is the most common method for estimating parameters in a parametric model.

Definition 3 Likelihood Function: Let {X_1, \dotsc, X_n} be \textsc{iid} with \textsc{pdf} { f_X(x;\theta)}. The likelihood function is defined as

\displaystyle  \mathcal{L}_n(\theta) = \prod_{i=1}^n f_X(X_i;\theta). \ \ \ \ \ (4)

The log-likelihood function is defined as {\ell_n(\theta) = \log(\mathcal{L}_n(\theta))}.

The likelihood function is the joint density of the data. We treat it as a function of the parameter {\theta}. Thus {\mathcal{L}_n \colon \Theta \rightarrow [0, \infty)}.

Definition 4 Maximum Likelihood Estimator: The maximum likelihood estimator (MLE) {\hat{\theta}} is the value of {\theta} that maximizes the likelihood function {\mathcal{L}_n(\theta)} (equivalently, the log-likelihood {\ell_n(\theta)}).
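In most models the maximization is done numerically on the log-likelihood. As a sketch (assuming NumPy and SciPy, and using an Exponential({\lambda}) model chosen purely as an illustration), the optimizer's answer can be checked against the closed-form MLE {1/\bar{X}}:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Illustrative data from an Exponential model with rate lam (hypothetical value).
lam_true = 1.5
x = rng.exponential(scale=1.0 / lam_true, size=500)

# Negative log-likelihood of Exponential(lam): -l(lam) = -(n*log(lam) - lam*sum(x)).
def neg_log_lik(lam):
    if lam <= 0:
        return np.inf
    return -(x.size * np.log(lam) - lam * x.sum())

# Maximize l(lam) by minimizing -l(lam) numerically.
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")

print(res.x)            # numerical MLE
print(1.0 / x.mean())   # closed-form MLE 1/x_bar, for comparison
```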

Example 1 Let {X_1, \dotsc, X_n} be \textsc{iid} random variables with {\text{Unif}(0, \theta)} distribution.

\displaystyle  f_X(x; \theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (5)

Let {X_{max} = \max\{X_1, \dotsc, X_n \}}. If {\theta < X_{max}}, then {\mathcal{L}_n(\theta) = 0}. Otherwise {\mathcal{L}_n(\theta) = (\frac{1}{\theta})^n }, which is a decreasing function of {\theta}. Hence the likelihood is maximized at the smallest admissible value of {\theta}, and {\hat{\theta} = X_{max}}.
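A quick numerical check of this example (assuming NumPy; the true {\theta} below is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 4.0  # hypothetical true parameter

x = rng.uniform(0.0, theta_true, size=1000)

# L_n(theta) = (1/theta)^n for theta >= max(X_i) and 0 otherwise,
# so the likelihood is maximized at the sample maximum.
theta_hat = x.max()
print(theta_hat)  # slightly below theta_true
```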

3. Properties of MLE

  • Consistent: The MLE is consistent: the estimate converges in probability to the true parameter value.
  • Equivariant: The MLE is equivariant: if {\hat{\theta}} is the MLE of {\theta}, then {g(\hat{\theta})} is the MLE of {g(\theta)}.
  • Asymptotically Normal: The MLE is asymptotically normal.
  • Asymptotically Optimal: Among well-behaved estimators, the MLE has the smallest asymptotic variance.
  • Approximately Bayes: The MLE is approximately the Bayes estimator.

4. Consistency of MLE

Consistency means that the MLE {\hat{\theta_n}} converges in probability to the true value of {\theta}. The proof uses the following notion of distance between densities.

Definition 5 Kullback Leibler Distance:
If {f} and {g} are \textsc{pdf}s, the Kullback Leibler distance between them is defined as

\displaystyle  D(f, g) = \int f(x) \log \left( \frac{f(x) }{g(x) } \right) dx. \ \ \ \ \ (6)
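As an illustration (assuming NumPy and SciPy), the integral in (6) can be evaluated numerically for two normal densities and compared against the known closed form for normals; the particular means and standard deviations below are arbitrary.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative densities: f = N(0, 1) and g = N(1, 2^2) (arbitrary choices).
f = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=2.0)

# Equation (6): D(f, g) = integral of f(x) * log(f(x) / g(x)) dx.
integrand = lambda t: f.pdf(t) * (f.logpdf(t) - g.logpdf(t))
d_numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed form for two normals: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2*s2^2) - 1/2.
d_closed = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

print(d_numeric, d_closed)  # both approximately 0.443
```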

5. Equivariance of MLE

Equivariance means that if {\hat{\theta}} is the MLE of {\theta} and {g} is a function of {\theta}, then {g(\hat{\theta})} is the MLE of {g(\theta)}.

6. Asymptotic Normality of MLE

The distribution of {\hat{\theta}} is asymptotically normal. We need the following definitions to prove it.

Definition 6 Score Function: Let {X} be a random variable with \textsc{pdf} {f_X(x; \theta)}. Then the score function is defined as

\displaystyle  s(X; \theta) = \frac{\partial \log f_X(X; \theta) }{\partial \theta}. \ \ \ \ \ (7)

Definition 7 Fisher Information: The Fisher Information is defined as

\displaystyle  I_n(\theta) = \mathbb{V}_{\theta}\left( \sum_{i=1}^n s(X_i; \theta) \right) \\ = \sum_{i=1}^n \mathbb{V}_{\theta}\left(s(X_i; \theta) \right), \ \ \ \ \ (8)

where the second equality holds because the {X_i} are independent.

Theorem 8

\displaystyle  \mathbb{E}_{\theta}(s(X; \theta)) = 0. \ \ \ \ \ (9)

Theorem 9

\displaystyle  \mathbb{V}_{\theta}(s(X; \theta)) = \mathbb{E}_{\theta}(s^2(X; \theta)). \ \ \ \ \ (10)

Theorem 10

\displaystyle  I_n(\theta) = nI(\theta). \ \ \ \ \ (11)

Theorem 11

\displaystyle  I(\theta) = -\mathbb{E}_{\theta}\left( \frac{\partial^2 \log f_X(X; \theta) }{\partial \theta^2} \right) \\ = -\int \left( \frac{\partial^2 \log f_X(x; \theta) }{\partial \theta^2} \right)f_X(x; \theta) dx. \ \ \ \ \ (12)
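The following sketch (assuming NumPy, and using a Bernoulli({p}) model purely as an illustration) checks Theorems 8, 9 and 11 by simulation: the score has mean zero, and its variance matches both {-\mathbb{E}_{\theta}(\partial^2 \log f / \partial \theta^2)} and the closed form {I(p) = 1/(p(1-p))}.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3  # illustrative Bernoulli parameter

# For Bernoulli(p): log f(x; p) = x*log(p) + (1-x)*log(1-p),
# so the score is s(x; p) = x/p - (1-x)/(1-p).
x = rng.binomial(1, p, size=200_000)
score = x / p - (1 - x) / (1 - p)

print(np.mean(score))                              # ~ 0      (Theorem 8)
print(np.var(score))                               # ~ I(p)   (Theorem 9)
print(np.mean(x / p**2 + (1 - x) / (1 - p)**2))    # ~ I(p)   (Theorem 11)
print(1.0 / (p * (1 - p)))                         # closed form, about 4.76
```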

Definition 12 The standard error of {\hat{\theta}} is { \textsf{se} = \sqrt{\mathbb{V}(\hat{\theta} ) } }.

Theorem 13

\displaystyle  \textsf{se} \approx \sqrt{1/I_n(\theta)}. \ \ \ \ \ (13)

Theorem 14

\displaystyle  \frac{\hat{\theta} - \theta }{\textsf{se} } \rightsquigarrow N(0, 1). \ \ \ \ \ (14)

Theorem 15 Let { \hat{\textsf{se}} = \sqrt{1/I_n(\hat{\theta})} }.

\displaystyle  \frac{\hat{\theta} - \theta }{\hat{\textsf{se}}} \rightsquigarrow N(0, 1). \ \ \ \ \ (15)
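A simulation sketch of Theorems 13-15 (assuming NumPy; Bernoulli({p}) is again the illustrative model, where the MLE is the sample mean and {I_n(p) = n/(p(1-p))}), showing that {(\hat{\theta} - \theta)/\hat{\textsf{se}}} is approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.3, 500, 20_000  # hypothetical settings

# Each row is one sample of size n; for Bernoulli(p) the MLE is the sample mean.
x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)

# Estimated standard error, Theorem 15: se_hat = sqrt(1 / I_n(p_hat))
# with I_n(p) = n / (p * (1 - p)) for the Bernoulli model.
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

z = (p_hat - p) / se_hat
print(z.mean(), z.std())  # approximately 0 and 1, consistent with N(0, 1)
```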

A {1 - \alpha} confidence interval for {\theta} is a random interval {C_n = (a, b)}, where the endpoints {a} and {b} are functions of the data, such that {\mathbb{P}_{\theta}(\theta \in C_n) \geq 1 - \alpha}. In words, {(a, b)} traps {\theta} with probability {1 - \alpha}. We call {1 - \alpha} the coverage of the confidence interval. Note that {C_n} is random and {\theta} is fixed. Commonly, people use 95 percent confidence intervals, which corresponds to choosing {\alpha = 0.05}. If {\theta} is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

Theorem 16 (Normal Based Confidence Intervals)

Suppose that {\hat{\theta} \approx N(\theta,\hat{\textsf{se}}^2 )}. Let {\Phi} be the \textsc{cdf} of a random variable {Z} with a standard normal distribution, and let

\displaystyle  z_{\alpha / 2} = \Phi^{- 1 } ( 1 - \alpha / 2), \\ \mathbb{P}(Z > z _{\alpha / 2} ) = \alpha / 2, \\ \mathbb{P}(- z _{\alpha / 2} < Z < z _{\alpha / 2}) = 1 - \alpha. \ \ \ \ \ (16)

Let

\displaystyle  C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right). \ \ \ \ \ (17)

Then,

\displaystyle  \mathbb{P}_{\theta} (\theta \in C_n ) \rightarrow 1 - \alpha. \ \ \ \ \ (18)

For 95% confidence intervals {1 - \alpha} is .95, {\alpha} is .05, {z_{\alpha / 2}} is 1.96, and the interval is thus { C_n = \left( \hat{\theta} - z_{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z_{\alpha / 2}\hat{\textsf{se}} \right) = \left( \hat{\theta} - 1.96\hat{\textsf{se}} , \hat{\theta} + 1.96\hat{\textsf{se}} \right) }.
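Finally, a coverage check of the normal-based interval (17) (assuming NumPy; Bernoulli({p}) as the illustrative model with hypothetical settings): across many replications the interval should contain the true {p} about 95 percent of the time, as in (18).

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 0.3, 500, 10_000  # hypothetical settings
z = 1.96                       # z_{alpha/2} for alpha = 0.05

x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

# C_n = (p_hat - z * se_hat, p_hat + z * se_hat), equation (17).
lower, upper = p_hat - z * se_hat, p_hat + z * se_hat
coverage = np.mean((lower < p) & (p < upper))
print(coverage)  # close to 0.95, as in equation (18)
```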