Parametric Inference

There are two common methods of estimating {\theta}.

1. Method of Moments

The method of moments is a way of generating parametric estimators. These estimators are generally not optimal, but they are easy to compute and are often used as starting values for other numerical estimation methods.

Definition 1 Moments and Sample Moments:

Suppose that the parameter {\theta} has {k} components: {\theta = (\theta_1,\dotsc,\theta_k)}. For {1 \leq j \leq k},

Define the {j^{th}} moment as

\displaystyle  \alpha_j \equiv \alpha_j(\theta) = \mathbb{E}_\theta(X^j) = \int x^{j}\,\mathrm{d}F_{\theta}(x). \ \ \ \ \ (1)

Define the {j^{th}} sample moment as

\displaystyle  \hat{\alpha_j} = \frac{1}{n}\sum_{i=1}^n X_i^j. \ \ \ \ \ (2)

Definition 2

The method of moments estimator {\hat{\theta_n}} is the value of {\theta} which satisfies

\displaystyle  \alpha_1(\hat{\theta_n}) = \hat{\alpha_1} \\ \alpha_2(\hat{\theta_n}) = \hat{\alpha_2} \\ \vdots \\ \alpha_k(\hat{\theta_n}) = \hat{\alpha_k} \ \ \ \ \ (3)

Why the above method works: the method of moments estimator is obtained by equating the {j^{th}} moment with the {j^{th}} sample moment for {j = 1, \dotsc, k}. Since there are k of them, we get k equations in k unknowns (the unknowns are the k components of {\theta}). The equations can be solved because the moments {\alpha_j(\theta)} are known functions of the parameters, while the sample moments {\hat{\alpha_j}} can be computed directly from the data.
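For concreteness, here is a minimal numerical sketch in Python (assuming NumPy is available), using a Gamma model with shape {a} and rate {\lambda} purely as an illustration: its first two moments are {a/\lambda} and {a(a+1)/\lambda^2}, so the two moment equations can be solved in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data from a Gamma model with shape a and rate lam
# (hypothetical values, chosen only for this sketch).
a_true, lam_true = 3.0, 2.0
x = rng.gamma(shape=a_true, scale=1.0 / lam_true, size=10_000)

# Sample moments, equation (2): alpha_hat_j = (1/n) * sum(X_i ** j).
m1 = np.mean(x)       # first sample moment
m2 = np.mean(x ** 2)  # second sample moment

# Population moments: E[X] = a/lam and E[X^2] = a(a+1)/lam^2, so equating
# them with m1 and m2 (the system (3) with k = 2) and solving gives:
lam_hat = m1 / (m2 - m1 ** 2)
a_hat = m1 * lam_hat

print(a_hat, lam_hat)  # should be close to the true values 3.0 and 2.0
```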

2. Maximum Likelihood Method

It is the most common method for estimating parameters in a parametric model.

Definition 3 Likelihood Function: Let {X_1, \dotsc, X_n} be \textsc{iid} with \textsc{pdf} { f_X(x;\theta)}. The likelihood function is defined as

\displaystyle  \mathcal{L}_n(\theta) = \prod_{i=1}^n f_X(X_i;\theta). \ \ \ \ \ (4)

The log-likelihood function is defined as {\ell_n(\theta) = \log(\mathcal{L}_n(\theta))}.

The likelihood function is the joint density of the data. We treat it as a function of the parameter {\theta}. Thus {\mathcal{L}_n \colon \Theta \rightarrow [0, \infty)}.

Definition 4 Maximum Likelihood Estimator: The maximum likelihood estimator (MLE) {\hat{\theta}} is the value of {\theta} that maximizes the likelihood function {\mathcal{L}_n(\theta)} (equivalently, the log-likelihood {\ell_n(\theta)}).
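In most models the maximization is done numerically on the log-likelihood. As a sketch (assuming NumPy and SciPy, and using an Exponential({\lambda}) model chosen purely as an illustration), the optimizer's answer can be checked against the closed-form MLE {1/\bar{X}}:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Illustrative data from an Exponential model with rate lam (hypothetical value).
lam_true = 1.5
x = rng.exponential(scale=1.0 / lam_true, size=500)

# Negative log-likelihood of Exponential(lam): -l(lam) = -(n*log(lam) - lam*sum(x)).
def neg_log_lik(lam):
    if lam <= 0:
        return np.inf
    return -(x.size * np.log(lam) - lam * x.sum())

# Maximize l(lam) by minimizing -l(lam) numerically.
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")

print(res.x)            # numerical MLE
print(1.0 / x.mean())   # closed-form MLE 1/x_bar, for comparison
```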

Example 1 Let {X_1, \dotsc, X_n} be \textsc{iid} random variables with {\text{Unif}(0, \theta)} distribution.

\displaystyle  f_X(x; \theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{otherwise}. \end{cases} \ \ \ \ \ (5)

Let {X_{max} = \max\{X_1, \dotsc, X_n \}}. If {\theta < X_{max}}, then {\mathcal{L}_n(\theta) = 0}. Otherwise {\mathcal{L}_n(\theta) = (\frac{1}{\theta})^n }, which is a decreasing function of {\theta}. Hence the likelihood is maximized at the smallest admissible value of {\theta}, and {\hat{\theta} = X_{max}}.
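A quick numerical check of this example (assuming NumPy; the true {\theta} below is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 4.0  # hypothetical true parameter

x = rng.uniform(0.0, theta_true, size=1000)

# L_n(theta) = (1/theta)^n for theta >= max(X_i) and 0 otherwise,
# so the likelihood is maximized at the sample maximum.
theta_hat = x.max()
print(theta_hat)  # slightly below theta_true
```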

3. Properties of MLE

  • Consistent: The MLE is consistent: the estimate converges in probability to the true parameter value.
  • Equivariant: The MLE is equivariant: if {\hat{\theta}} is the MLE of {\theta}, then {g(\hat{\theta})} is the MLE of {g(\theta)}.
  • Asymptotically Normal: The MLE is asymptotically normal.
  • Asymptotically Optimal: Among well-behaved estimators, the MLE has the smallest asymptotic variance.
  • Approximately Bayes: The MLE is approximately the Bayes estimator.

4. Consistency of MLE

Consistency means that the MLE {\hat{\theta_n}} converges in probability to the true value of {\theta}. The proof uses the following notion of distance between densities.

Definition 5 Kullback Leibler Distance:
If {f} and {g} are \textsc{pdf}s, the Kullback Leibler distance between them is defined as

\displaystyle  D(f, g) = \int f(x) \log \left( \frac{f(x) }{g(x) } \right) dx. \ \ \ \ \ (6)
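As an illustration (assuming NumPy and SciPy), the integral in (6) can be evaluated numerically for two normal densities and compared against the known closed form for normals; the particular means and standard deviations below are arbitrary.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative densities: f = N(0, 1) and g = N(1, 2^2) (arbitrary choices).
f = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=2.0)

# Equation (6): D(f, g) = integral of f(x) * log(f(x) / g(x)) dx.
integrand = lambda t: f.pdf(t) * (f.logpdf(t) - g.logpdf(t))
d_numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed form for two normals: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2*s2^2) - 1/2.
d_closed = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

print(d_numeric, d_closed)  # both approximately 0.443
```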

5. Equivariance of MLE

Equivariance means that if {\hat{\theta}} is the MLE of {\theta} and {g} is a function of {\theta}, then {g(\hat{\theta})} is the MLE of {g(\theta)}.

6. Asymptotic Normality of MLE

The distribution of {\hat{\theta}} is asymptotically normal. We need the following definitions to prove it.

Definition 6 Score Function: Let {X} be a random variable with \textsc{pdf} {f_X(x; \theta)}. Then the score function is defined as

\displaystyle  s(X; \theta) = \frac{\partial \log f_X(X; \theta) }{\partial \theta}. \ \ \ \ \ (7)

Definition 7 Fisher Information: The Fisher Information is defined as

\displaystyle  I_n(\theta) = \mathbb{V}_{\theta}\left( \sum_{i=1}^n s(X_i; \theta) \right) \\ = \sum_{i=1}^n \mathbb{V}_{\theta}\left(s(X_i; \theta) \right), \ \ \ \ \ (8)

where the second equality holds because the {X_i} are independent.

Theorem 8

\displaystyle  \mathbb{E}_{\theta}(s(X; \theta)) = 0. \ \ \ \ \ (9)

Theorem 9

\displaystyle  \mathbb{V}_{\theta}(s(X; \theta)) = \mathbb{E}_{\theta}(s^2(X; \theta)). \ \ \ \ \ (10)

Theorem 10

\displaystyle  I_n(\theta) = nI(\theta). \ \ \ \ \ (11)

Theorem 11

\displaystyle  I(\theta) = -\mathbb{E}_{\theta}\left( \frac{\partial^2 \log f_X(X; \theta) }{\partial \theta^2} \right) \\ = -\int \left( \frac{\partial^2 \log f_X(x; \theta) }{\partial \theta^2} \right)f_X(x; \theta) dx. \ \ \ \ \ (12)
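The following sketch (assuming NumPy, and using a Bernoulli({p}) model purely as an illustration) checks Theorems 8, 9 and 11 by simulation: the score has mean zero, and its variance matches both {-\mathbb{E}_{\theta}(\partial^2 \log f / \partial \theta^2)} and the closed form {I(p) = 1/(p(1-p))}.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3  # illustrative Bernoulli parameter

# For Bernoulli(p): log f(x; p) = x*log(p) + (1-x)*log(1-p),
# so the score is s(x; p) = x/p - (1-x)/(1-p).
x = rng.binomial(1, p, size=200_000)
score = x / p - (1 - x) / (1 - p)

print(np.mean(score))                              # ~ 0      (Theorem 8)
print(np.var(score))                               # ~ I(p)   (Theorem 9)
print(np.mean(x / p**2 + (1 - x) / (1 - p)**2))    # ~ I(p)   (Theorem 11)
print(1.0 / (p * (1 - p)))                         # closed form, about 4.76
```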

Definition 12 The standard error of {\hat{\theta}} is { \textsf{se} = \sqrt{\mathbb{V}(\hat{\theta} ) } }.

Theorem 13

\displaystyle  \textsf{se} \approx \sqrt{1/I_n(\theta)}. \ \ \ \ \ (13)

Theorem 14

\displaystyle  \frac{\hat{\theta} - \theta }{\textsf{se} } \rightsquigarrow N(0, 1). \ \ \ \ \ (14)

Theorem 15 Let { \hat{\textsf{se}} = \sqrt{1/I_n(\hat{\theta})} }.

\displaystyle  \frac{\hat{\theta} - \theta }{\hat{\textsf{se}}} \rightsquigarrow N(0, 1). \ \ \ \ \ (15)
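A simulation sketch of Theorems 13-15 (assuming NumPy; Bernoulli({p}) is again the illustrative model, where the MLE is the sample mean and {I_n(p) = n/(p(1-p))}), showing that {(\hat{\theta} - \theta)/\hat{\textsf{se}}} is approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.3, 500, 20_000  # hypothetical settings

# Each row is one sample of size n; for Bernoulli(p) the MLE is the sample mean.
x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)

# Estimated standard error, Theorem 15: se_hat = sqrt(1 / I_n(p_hat))
# with I_n(p) = n / (p * (1 - p)) for the Bernoulli model.
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

z = (p_hat - p) / se_hat
print(z.mean(), z.std())  # approximately 0 and 1, consistent with N(0, 1)
```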

A {1 - \alpha} confidence interval for {\theta} is a random interval {C_n = (a, b)}, where the endpoints {a} and {b} are functions of the data, such that {\mathbb{P}_{\theta}(\theta \in C_n) \geq 1 - \alpha}. In words, {(a, b)} traps {\theta} with probability {1 - \alpha}. We call {1 - \alpha} the coverage of the confidence interval. Note that {C_n} is random and {\theta} is fixed. Commonly, people use 95 percent confidence intervals, which corresponds to choosing {\alpha = 0.05}. If {\theta} is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

Theorem 16 (Normal Based Confidence Intervals)

Suppose that {\hat{\theta} \approx N(\theta,\hat{\textsf{se}}^2 )}. Let {\Phi} be the \textsc{cdf} of a random variable {Z} with a standard normal distribution, and let

\displaystyle  z_{\alpha / 2} = \Phi^{- 1 } ( 1 - \alpha / 2), \\ \mathbb{P}(Z > z _{\alpha / 2} ) = \alpha / 2, \\ \mathbb{P}(- z _{\alpha / 2} < Z < z _{\alpha / 2}) = 1 - \alpha. \ \ \ \ \ (16)

Let

\displaystyle  C_n = \left( \hat{\theta} - z _{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z _{\alpha / 2}\hat{\textsf{se}} \right). \ \ \ \ \ (17)

Then,

\displaystyle  \mathbb{P}_{\theta} (\theta \in C_n ) \rightarrow 1 - \alpha. \ \ \ \ \ (18)

For 95% confidence intervals {1 - \alpha} is .95, {\alpha} is .05, {z_{\alpha / 2}} is 1.96, and the interval is thus { C_n = \left( \hat{\theta} - z_{\alpha / 2}\hat{\textsf{se}} , \hat{\theta} + z_{\alpha / 2}\hat{\textsf{se}} \right) = \left( \hat{\theta} - 1.96\hat{\textsf{se}} , \hat{\theta} + 1.96\hat{\textsf{se}} \right) }.
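Finally, a coverage check of the normal-based interval (17) (assuming NumPy; Bernoulli({p}) as the illustrative model with hypothetical settings): across many replications the interval should contain the true {p} about 95 percent of the time, as in (18).

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 0.3, 500, 10_000  # hypothetical settings
z = 1.96                       # z_{alpha/2} for alpha = 0.05

x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

# C_n = (p_hat - z * se_hat, p_hat + z * se_hat), equation (17).
lower, upper = p_hat - z * se_hat, p_hat + z * se_hat
coverage = np.mean((lower < p) & (p < upper))
print(coverage)  # close to 0.95, as in equation (18)
```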