Statistics/Point Estimation

Introduction

Usually, a random variable $X$ resulting from a random experiment is assumed to follow a certain distribution with an unknown (but fixed) parameter (vector) ^[1] $\theta \in \mathbb {R} ^{k}$ ^[2] ( $k$ is a positive integer, and its value depends on the distribution), taking values in a set $\Theta$ , called the parameter space.

Remark.

In the context of frequentist statistics (the context here), parameters are regarded as fixed.
On the other hand, in the context of Bayesian statistics, parameters are regarded as random variables.

For example, suppose the random variable $X$ is assumed to follow a normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Then, in this case, the parameter vector $\theta =(\mu ,\sigma )\in \Theta$ is unknown, and the parameter space is $\Theta =\{(\mu ,\sigma ):\mu \in \mathbb {R} ,\sigma >0\}$ . It is often useful to estimate those unknown parameters in some way to "understand" the random variable $X$ better. We would like the estimation to be "good" ^[3] enough, so that the understanding is more accurate.

Intuitively, the (realization of) random sample $X_{1},\dotsc ,X_{n}$ should be useful. Indeed, the estimators introduced in this chapter are all based on the random sample in some sense, and this is what point estimates mean. To be more precise, let us define point estimation and point estimates.

Definition. (Point estimation) Point estimation is a process of using the value of a statistic to give a single value estimate (which can be interpreted as a point) of an unknown parameter.

Remark.

Recall that statistics are functions of a random sample.
We call the unknown parameter a population parameter (since the underlying distribution corresponding to the parameter is called a population).
The statistic is called a point estimator, and its realization is called a point estimate.

The notation of point estimator commonly has a ${\hat {}}$ .

Point estimation will be contrasted with interval estimation, which uses the value of a statistic to estimate an interval of plausible values of the unknown parameter.

Example. Suppose $X_{1},\dotsc ,X_{n}$ is a random sample of size $n$ from the normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ .

We may use the statistic ${\overline {X}}={\frac {X_{1}+\dotsb +X_{n}}{n}}$ to estimate $\mu$ , which is an intuitive choice. Here, ${\overline {X}}$ is the point estimator, and its realization ${\overline {x}}$ is the point estimate.
Alternatively, we may simply use the statistic $X_{1}$ (although it does not involve $X_{2},\dotsc ,X_{n}$ , it can still be regarded as a function of $X_{1},\dotsc ,X_{n}$ ) to estimate $\mu$ . That is, we use the value of the first observation from the normal distribution as the point estimate of the mean of the distribution! Intuitively, such an estimator seems quite "bad".

Such an estimator, which just takes one observation directly, is called a single observation estimator.
We will later discuss how to evaluate how "good" a point estimator is.
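To make the intuition in the example above concrete, the following is a minimal simulation sketch. It assumes hypothetical values $\mu =5$ , $\sigma =2$ and $n=20$ (none of which come from the text), and it uses the average squared distance from $\mu$ as one informal yardstick of how "good" an estimator is; the function name compare_estimators is made up for illustration.

```python
import random
import statistics


def compare_estimators(mu=5.0, sigma=2.0, n=20, trials=2000, seed=0):
    """Return the average squared errors of the sample mean and of X_1.

    All default values are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    se_mean = 0.0   # accumulated squared error of the sample mean
    se_first = 0.0  # accumulated squared error of the single observation X_1
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        se_mean += (statistics.fmean(sample) - mu) ** 2
        se_first += (sample[0] - mu) ** 2
    return se_mean / trials, se_first / trials


mse_mean, mse_first = compare_estimators()
print(f"average squared error of the sample mean: {mse_mean:.3f}")
print(f"average squared error of X_1:             {mse_first:.3f}")
```

Under these assumptions, the sample mean should land much closer to $\mu$ on average than the single observation does, which matches the intuition that $X_{1}$ alone is a poor estimator.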

In the following, we will introduce two well-known point estimators, which are actually quite "good", namely the maximum likelihood estimator and the method of moments estimator.

Maximum likelihood estimator (MLE)

As suggested by the name of this estimator, it is the estimator that maximizes some kind of "likelihood". Now, we would like to know which "likelihood" we should maximize to estimate the unknown parameter(s) (in a "good" way). Also, as mentioned in the introduction section, the estimator is based on the random sample in some sense. Hence, this "likelihood" should also be based on the random sample in some sense.

To motivate the definition of maximum likelihood estimator, consider the following example.

Example. In a random experiment, a (fair or unfair) coin is tossed once. Let the random variable $X=1$ if head comes up, and $0$ otherwise. Then, the pmf of $X$ is $f(x;p)=p^{x}(1-p)^{1-x},\quad x\in \{0,1\}$ , in which the unknown parameter $p$ represents the probability that head comes up, and $p\in \Theta =\{p:p\in (0,1)\}$ .

Now, suppose you get a random sample $X_{1},X_{2},\dotsc ,X_{n}$ by tossing that coin $n$ times independently (such a random sample is called an independent random sample, since the random variables involved are independent), with corresponding realizations $x_{1},x_{2},\dotsc ,x_{n}$ . Then, the probability that $X_{1}=x_{1},X_{2}=x_{2},\dotsc ,{\text{ and }}X_{n}=x_{n}$ , i.e., that the random sample has exactly these realizations, is ${\begin{aligned}\mathbb {P} (X_{1}=x_{1}\cap X_{2}=x_{2}\cap \dotsb \cap X_{n}=x_{n})&=\mathbb {P} (X_{1}=x_{1})\mathbb {P} (X_{2}=x_{2})\dotsb \mathbb {P} (X_{n}=x_{n})&{\text{by independence}}\\&=f(x_{1};p)f(x_{2};p)\dotsb f(x_{n};p)\\&=p^{x_{1}}(1-p)^{1-x_{1}}p^{x_{2}}(1-p)^{1-x_{2}}\dotsb p^{x_{n}}(1-p)^{1-x_{n}}\\&=p^{x_{1}+x_{2}+\dotsb +x_{n}}(1-p)^{n-x_{1}-x_{2}-\dotsb -x_{n}}.\end{aligned}}$

Remark.

Remark on notation: You may observe that there is an additional " $;p$ " in the pmf of $X$ . Such notation means the pmf is with the parameter value $p$ . It is included to emphasize the parameter value we are referring to.
In general, we write $f(\cdot ;\theta )$ for pmf/pdf with the parameter value $\theta$ ( $\theta$ may be a vector).

There are some alternative notations with the same meaning: $f(\cdot |\theta ),f_{\theta }(\cdot ),\dotsc$ .

Similarly, we have similar notations, e.g. $\mathbb {P} _{\theta }(A),\mathbb {P} (A|\theta ),\mathbb {P} (A;\theta ),\dotsc$ , to mean the probability for event $A$ to happen, with the parameter value $\theta$ . (It is more common to use the first notation: $\mathbb {P} _{\theta }(A)$ .)
We also have similar notations for mean, variance, covariance, etc., like $\mathbb {E} _{\theta }[\cdot ],\operatorname {Var} _{\theta }(\cdot ),\operatorname {Cov} _{\theta }(\cdot ),\dotsc$

Intuitively, with these particular realizations (fixed), we would like to find a value of $p$ that maximizes this probability, i.e., makes the realizations obtained the ones that are "most probable" or "with maximum likelihood". Now, let us formally define the terms related to MLE.

Definition. (Likelihood function) Let $X_{1},\dotsc ,X_{n}$ be a random sample with a joint pmf or pdf $f$ , and the parameter (vector) $\theta \in \Theta$ ( $\Theta$ is the parameter space). Suppose $x_{1},\dotsc ,x_{n}$ are the corresponding realizations of the random sample $X_{1},\dotsc ,X_{n}$ . Then, the likelihood function, denoted by ${\mathcal {L}}(\theta ;x_{1},\dotsc ,x_{n})$ , is the function $\theta \mapsto f(x_{1},\dotsc ,x_{n};\theta )$ ( $\theta$ is a variable, and $x_{1},\dotsc ,x_{n}$ are fixed).

Remark.

For simplicity, we may use the notation ${\mathcal {L}}(\theta ;\mathbf {x} )$ instead of ${\mathcal {L}}(\theta ;x_{1},\dotsc ,x_{n})$ . Sometimes, we may also just write " ${\mathcal {L}}(\theta )$ " for convenience.

When we replace $x_{1},\dotsc ,x_{n}$ by $X_{1},\dotsc ,X_{n}$ , then the resulting "likelihood function" becomes a random variable, and we denote it by ${\mathcal {L}}(\theta ;X_{1},\dotsc ,X_{n})$ or ${\mathcal {L}}(\theta ;\mathbf {X} )$ .

The likelihood function is in contrast with the joint pmf or pdf itself, where $\theta$ is fixed and $x_{1},\dotsc ,x_{n}$ are variables.
When the random sample comes from a discrete distribution, then the value of likelihood function is the probability $\mathbb {P} (X_{1}=x_{1}\cap \dotsb \cap X_{n}=x_{n})$ at the parameter vector $\theta$ . That is, the probability for getting this specific realization exactly.
When the random sample comes from a continuous distribution, then the value of likelihood function is not a probability. Instead, it is only the value of the joint pdf at $(x_{1},\dotsc ,x_{n})$ (which can be greater than one). However, the value can still be used to "reflect" the probability for getting "very close to" this specific realization, where the probability can be obtained by integrating the joint pdf over a "very small" region around $(x_{1},\dotsc ,x_{n})$ .
The natural logarithm of the likelihood function, $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ (or $\ln {\mathcal {L}}(\theta ;\mathbf {X} )$ sometimes), is called the log-likelihood function.
Notice that the "expression" of the likelihood function is actually the same as that of the joint pdf, and just the inputs are different. So, one may still integrate/sum the likelihood function with respect to $x_{1},\dotsc ,x_{n}$ (which changes the likelihood function to the joint pdf/pmf in such context in some sense) as if it is the joint pdf/pmf to get probabilities.

Definition. (Maximum likelihood estimate) Given a likelihood function ${\mathcal {L}}(\theta ;\mathbf {x} )$ , a maximum likelihood estimate of the parameter $\theta$ is a value ${\hat {\theta }}(\mathbf {x} )$ at which ${\mathcal {L}}(\theta ;\mathbf {x} )$ is maximized.

Remark.

The maximum likelihood estimator (MLE) of $\theta$ is ${\hat {\theta }}(\mathbf {X} )$ (obtained by replacing " $x$ " in ${\hat {\theta }}(\mathbf {x} )$ by " $X$ ").

In some other places, the abbreviation MLE can also mean maximum likelihood estimate depending on the context. However, we will just use the abbreviation MLE when we are talking about maximum likelihood estimator here.

Since ${\frac {d}{dy}}\ln y={\frac {1}{y}}>0$ (the domain of natural logarithm function is the set of all positive real numbers), the natural logarithm function is strictly increasing, i.e., the output is larger when the input is larger. Thus, when we find a value at which $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ is maximized, ${\mathcal {L}}(\theta ;\mathbf {x} )$ is also maximized at the same value.

Now, let us find the MLE of the unknown parameter $p$ in the previous coin flipping example.

Example. (Motivating example revisited) Recall that we use a coin flipping example to motivate maximum likelihood estimation. $X$ follows the Bernoulli distribution with success probability $p$ . The pmf of $X$ is $f(x;p)=p^{x}(1-p)^{1-x}$ . $X_{1},\dotsc ,X_{n}$ is a random sample from the distribution.

The likelihood function ${\mathcal {L}}(p)$ is the joint pmf of $X_{1},\dotsc ,X_{n}$ ,

${\begin{aligned}\mathbb {P} (X_{1}=x_{1}\cap \dotsb \cap X_{n}=x_{n})&=\prod _{i=1}^{n}f(x_{i};p)&{\text{by independence}}\\&=\prod _{i=1}^{n}p^{x_{i}}(1-p)^{1-x_{i}}\\\end{aligned}}$

The log-likelihood function $\ln {\mathcal {L}}(p)$ is thus

${\begin{aligned}\ln {\mathcal {L}}(p)&=\sum _{i=1}^{n}\ln(p^{x_{i}}(1-p)^{1-x_{i}})\\&=\sum _{i=1}^{n}(\ln(p^{x_{i}})+\ln((1-p)^{1-x_{i}}))\\&=\sum _{i=1}^{n}(x_{i}\ln(p)+(1-x_{i})\ln(1-p))\\&=\sum _{i=1}^{n}(x_{i}\ln(p))+\sum _{i=1}^{n}((1-x_{i})\ln(1-p))\\&=\ln(p)\sum _{i=1}^{n}(x_{i})+\ln(1-p)\sum _{i=1}^{n}(1-x_{i})\\&=\ln(p)\sum _{i=1}^{n}(x_{i})+\ln(1-p)\left(n-\sum _{i=1}^{n}(x_{i})\right)\\\end{aligned}}$

To find the maximum of the log-likelihood function, we may use the derivative tests from calculus. Differentiating $\ln {\mathcal {L}}(p)$ with respect to $p$ gives

${\begin{aligned}{\frac {d\ln {\mathcal {L}}(p)}{dp}}&={\frac {1}{\color {blue}p}}\underbrace {\sum _{i=1}^{n}x_{i}} _{{\text{constant wrt }}p}-{\frac {1}{\color {red}1-p}}\underbrace {\left(n-\sum _{i=1}^{n}x_{i}\right)} _{{\text{constant wrt }}p}\\&={\frac {{\color {red}(1-p)}\sum _{i=1}^{n}x_{i}-n{\color {blue}p}+{\color {blue}p}\sum _{i=1}^{n}x_{i}}{{\color {blue}p}{\color {red}(1-p)}}}\\&={\frac {(1-p)n{\overline {x}}-np+pn{\overline {x}}}{p(1-p)}}&\left(\sum _{i=1}^{n}x_{i}=n{\overline {x}}=n\cdot {\frac {\sum _{i=1}^{n}x_{i}}{n}}\right)\\&={\frac {n({\overline {x}}-p)}{p(1-p)}}\end{aligned}}$

To find critical point(s) of $\ln {\mathcal {L}}(p)$ , we set ${\frac {d\ln {\mathcal {L}}(p)}{dp}}=0\implies {\frac {n({\overline {x}}-p)}{p(1-p)}}=0\implies p={\overline {x}}$ (we have $p(1-p)\neq 0$ )
To verify that $\ln {\mathcal {L}}(p)$ actually attains a maximum (instead of a minimum) at $p={\overline {x}}$ , we need to perform a derivative test. In this case, we use the first derivative test.
When $p<{\overline {x}}$ , we have ${\overline {x}}-p>0$ , and thus ${\frac {d\ln {\mathcal {L}}(p)}{dp}}>0$ . On the other hand, when $p>{\overline {x}}$ , we have ${\overline {x}}-p<0$ , and thus ${\frac {d\ln {\mathcal {L}}(p)}{dp}}<0$ . As a result, we can conclude that $\ln {\mathcal {L}}(p)$ attains its maximum at $p={\overline {x}}$ . It follows that the MLE of $p$ is ${\overline {X}}$ (not ${\overline {x}}$ , which is instead the maximum likelihood estimate!).
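As a numerical sanity check of this derivation, the sketch below evaluates the Bernoulli log-likelihood on a grid of candidate values of $p$ and compares the maximizer with the sample mean. The ten tosses in xs are made-up data, and the function name bernoulli_log_likelihood is chosen only for illustration.

```python
import math


def bernoulli_log_likelihood(p, xs):
    """ln L(p) = ln(p) * sum(x_i) + ln(1 - p) * (n - sum(x_i))."""
    s = sum(xs)
    n = len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)


# Hypothetical realizations of ten tosses (7 heads, 3 tails).
xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# Candidate values 0.001, 0.002, ..., 0.999 inside the parameter space (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))
x_bar = sum(xs) / len(xs)

print("grid maximizer of ln L(p):", p_hat)
print("sample mean:              ", x_bar)
```

A grid search is only an approximation in general; here the sample mean happens to lie on the grid, so the two values agree.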

Exercise. Use second derivative test to verify that $\ln {\mathcal {L}}(p)$ attains maximum at $p={\overline {x}}$ .

Solution

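We have ${\frac {d^{2}\ln {\mathcal {L}}(p)}{dp^{2}}}={\frac {-np(1-p)-n({\overline {x}}-p)(1-2p)}{p^{2}(1-p)^{2}}}$ . At the critical point $p={\overline {x}}$ , the second term of the numerator vanishes, so the numerator equals $-n{\overline {x}}(1-{\overline {x}})$ , which is negative (assuming $0<{\overline {x}}<1$ ), while the denominator is positive. Thus, ${\frac {d^{2}\ln {\mathcal {L}}(p)}{dp^{2}}}<0$ at $p={\overline {x}}$ . By the second derivative test, this means $\ln {\mathcal {L}}(p)$ attains a maximum at $p={\overline {x}}$ .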

Sometimes, a constraint is imposed on the parameter when we are finding its MLE. The MLE of the parameter in this case is called a restricted MLE. We will illustrate this in the following example.

Example. Continue from the previous coin flipping example. Suppose we have a constraint on $p$ where $0\leq p\leq {\frac {1}{2}}$ . Find the MLE of $p$ in this case.

Solution: The steps for deriving the likelihood function and the log-likelihood function are the same in this case. Without the restriction, the MLE of $p$ is ${\overline {X}}$ . Now, with the restriction, the MLE of $p$ is ${\overline {X}}$ only when ${\overline {X}}\leq {\frac {1}{2}}$ (we always have ${\overline {X}}\geq 0$ since $X\geq 0$ ).

If ${\overline {X}}>{\frac {1}{2}}$ (and thus ${\overline {x}}>1/2$ ), even though $\ln {\mathcal {L}}(p)$ is maximized at $p={\overline {x}}$ , we cannot set the MLE to be ${\overline {X}}$ due to the restriction on $p$ : $0\leq p\leq {\frac {1}{2}}$ . In this case, ${\frac {d\ln {\mathcal {L}}(p)}{dp}}>0$ when $p\leq {\frac {1}{2}}<{\overline {x}}$ (we have ${\frac {d\ln {\mathcal {L}}(p)}{dp}}>0$ when $p<{\overline {x}}$ from the previous example), i.e., $\ln {\mathcal {L}}(p)$ is strictly increasing when $p\leq {\frac {1}{2}}$ . Thus, $\ln {\mathcal {L}}(p)$ is maximized at $p={\frac {1}{2}}$ under the restriction. As a result, the MLE of $p$ is ${\frac {1}{2}}$ (the MLE can be a constant, which can still be regarded as a function of $X_{1},\dotsc ,X_{n}$ ).

Therefore, the MLE of $p$ can be written as a piecewise function: ${\hat {p}}={\begin{cases}{\overline {X}},&{\overline {X}}\leq {\frac {1}{2}}\\{\frac {1}{2}},&{\overline {X}}>{\frac {1}{2}}\end{cases}}$ , or it can be written as ${\hat {p}}=\min \left\{{\overline {X}},{\frac {1}{2}}\right\}$ .
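The restricted estimate above can be sketched in a few lines. The code below computes $\min\{{\overline {x}},1/2\}$ and, as a rough check, also maximizes the log-likelihood over a grid confined to $(0,1/2]$ . Both data sets are made up, and the helper names are hypothetical.

```python
import math


def log_likelihood(p, xs):
    """Bernoulli log-likelihood ln L(p) for realizations xs."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)


def restricted_mle(xs, upper=0.5):
    """Maximum likelihood estimate of p under the restriction 0 <= p <= upper."""
    return min(sum(xs) / len(xs), upper)


# Grid of candidate values 0.001, ..., 0.500 that respects the restriction.
grid = [i / 1000 for i in range(1, 501)]

# Two hypothetical data sets: sample mean 0.7 (above 1/2) and 0.3 (below 1/2).
for xs in ([1, 0, 1, 1, 0, 1, 0, 1, 1, 1], [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]):
    by_grid = max(grid, key=lambda p: log_likelihood(p, xs))
    print("closed form:", restricted_mle(xs), " grid search:", by_grid)
```

When the sample mean exceeds $1/2$ , the grid maximizer sits at the boundary $1/2$ , in line with the argument that the log-likelihood is strictly increasing on the restricted range.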

Exercise. Find the MLE of $p$ when ${\frac {1}{2}}\leq p\leq 1$ .

Solution

When ${\overline {X}}<{\frac {1}{2}}$ , we cannot set the MLE to be ${\overline {X}}$ due to the restriction. In this case, we know that ${\frac {d\ln {\mathcal {L}}(p)}{dp}}<0$ when $p\geq {\frac {1}{2}}>{\overline {X}}$ , i.e., $\ln {\mathcal {L}}(p)$ is strictly decreasing when ${\frac {1}{2}}\leq p\leq 1$ . Thus, $\ln {\mathcal {L}}(p)$ is maximized at $p={\frac {1}{2}}$ , and so the MLE of $p$ is ${\frac {1}{2}}$ .
When ${\overline {X}}\geq {\frac {1}{2}}$ , we can set the MLE to be ${\overline {X}}$ at which $\ln {\mathcal {L}}(p)$ is maximized, and so ${\overline {X}}$ is the MLE of $p$ in this case.
Therefore, the MLE of $p$ is ${\hat {p}}=\max \left\{{\overline {X}},{\frac {1}{2}}\right\}$ .

To find the MLE, we sometimes use methods other than derivative test, and we do not need to find the log-likelihood function. Let us illustrate this in the following example.

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the uniform distribution ${\mathcal {U}}[0,\beta ]$ . Find the MLE of $\beta$ .

Solution: The pdf of the uniform distribution is $f(x;\beta )={\frac {1}{\beta }}\mathbf {1} \{0\leq x\leq \beta \}$ . Thus, the likelihood function is ${\mathcal {L}}(\beta )=\prod _{i=1}^{n}{\frac {1}{\beta }}\mathbf {1} \{0\leq x_{i}\leq \beta \}={\frac {1}{\beta ^{n}}}\prod _{i=1}^{n}\mathbf {1} \{0\leq x_{i}\leq \beta \}$ .

In order for ${\mathcal {L}}(\beta )$ to attain maximum, first, we need to ensure that $0\leq x_{i}\leq \beta$ for each $i\in \{1,\dotsc ,n\}$ , so that the product of the indicator functions in the likelihood function is nonzero (the value is actually one in this case). Apart from that, since $\beta \mapsto {\frac {1}{\beta ^{n}}}$ is a strictly decreasing function of $\beta$ (because ${\frac {d}{d\beta }}\left({\frac {1}{\beta ^{n}}}\right)={\frac {-n}{\beta ^{n+1}}}<0$ (we have $n,\beta >0$ )), we should pick a $\beta$ that is as small as possible so that ${\frac {1}{\beta ^{n}}}$ , and hence ${\mathcal {L}}(\beta )$ , is as large as possible.

As a result, we should choose a $\beta$ that is as small as possible, subject to the constraint that $0\leq x_{i}\leq \beta$ for each $i\in \{1,\dotsc ,n\}$ , which means that $\beta \geq x_{i}$ (it is always the case that $x_{i}\geq 0$ , regardless of the choice of $\beta$ ) for each $i\in \{1,\dotsc ,n\}$ . It follows that ${\mathcal {L}}(\beta )$ attains maximum when $\beta$ is the maximum of $x_{1},\dotsc ,x_{n}$ . Hence, the MLE of $\beta$ is ${\hat {\beta }}=\max\{X_{1},\dotsc ,X_{n}\}$ .
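The argument above can be illustrated numerically. The following sketch evaluates the likelihood ${\mathcal {L}}(\beta )$ at a few candidate values around the sample maximum; the data in xs are invented for illustration, and uniform_likelihood is a hypothetical helper name.

```python
def uniform_likelihood(beta, xs):
    """Likelihood of beta for realizations xs assumed to come from U[0, beta]."""
    if beta <= 0 or any(x < 0 or x > beta for x in xs):
        return 0.0  # at least one indicator function is zero
    return beta ** (-len(xs))  # all indicators equal one


# Hypothetical realizations.
xs = [0.8, 2.5, 1.1, 3.2, 0.4]
beta_hat = max(xs)  # maximum likelihood estimate from the derivation

for beta in (beta_hat - 0.1, beta_hat, beta_hat + 0.1, beta_hat + 1.0):
    print(f"beta = {beta:.1f}: L(beta) = {uniform_likelihood(beta, xs):.6f}")
```

Any candidate below the sample maximum gives likelihood zero, and the likelihood decreases as the candidate moves above the sample maximum.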

Exercise. Show that the MLE of $\beta$ does not exist if the uniform distribution becomes ${\mathcal {U}}[0,\beta )$ .

Solution

Proof. In this case, the constraint from the indicator functions becomes $0\leq x_{i}<\beta$ for each $i\in \{1,\dotsc ,n\}$ . By a similar argument, for the MLE of $\beta$ , we should choose a $\beta$ that is as small as possible subject to this constraint, which means $\beta >x_{i}$ for each $i\in \{1,\dotsc ,n\}$ . However, in this case, we cannot set $\beta$ to be the maximum of $x_{1},\dotsc ,x_{n}$ , or else the constraint will not be satisfied and the likelihood function becomes zero due to the indicator function. Instead, we should set $\beta$ to be slightly greater than the maximum of $x_{1},\dotsc ,x_{n}$ , so that the constraint is still satisfied and $\beta$ is quite small. However, for each such $\beta >\max\{x_{1},\dotsc ,x_{n}\}$ , we can always choose a smaller $\beta$ that still satisfies the constraint. For example, for each $\beta$ , the smaller value $\beta '$ can be selected as $\max\{x_{1},\dotsc ,x_{n}\}+{\frac {\beta -\max\{x_{1},\dotsc ,x_{n}\}}{2}}>\max\{x_{1},\dotsc ,x_{n}\}$ ^[4]. Hence, we cannot find a minimum value of $\beta$ subject to this constraint. Thus, there is no maximum point for ${\mathcal {L}}(\beta )$ , and hence the MLE does not exist.

$\Box$

In the following example, we will find the MLE of a parameter vector.

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the normal distribution with mean $\theta _{1}$ and variance $\theta _{2}$ , ${\mathcal {N}}(\theta _{1},\theta _{2})$ . Find the MLE of $(\theta _{1},\theta _{2})$ .

Solution: Let $\theta =(\theta _{1},\theta _{2})$ . The likelihood function is ${\mathcal {L}}(\theta ;\mathbf {x} )=\prod _{i=1}^{n}{\frac {1}{\sqrt {2\pi \theta _{2}}}}\exp \left(-{\frac {(x_{i}-\theta _{1})^{2}}{2\theta _{2}}}\right)=(2\pi \theta _{2})^{-n/2}\exp \left(-\sum _{i=1}^{n}{\frac {(x_{i}-\theta _{1})^{2}}{2\theta _{2}}}\right)$ , and hence the log-likelihood function is $\ln {\mathcal {L}}(\theta ;\mathbf {x} )=-{\frac {n}{2}}\ln(2\pi \theta _{2})-\sum _{i=1}^{n}{\frac {(x_{i}-\theta _{1})^{2}}{2\theta _{2}}}$ . Since this function is multivariate, we may use the second partial derivative test from multivariable calculus to find maximum point(s). However, in this case, we actually do not need to use such test. Instead, we fix the variables one by one to make the function univariate, so that we can use the derivative test for univariate function to find maximum point (with another variable fixed).

We have ${\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{1}}}={\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})$ and ${\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}}}=-{\frac {2n\pi }{4\pi \theta _{2}}}+{\frac {1}{2\theta _{2}^{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}=-{\frac {n}{2\theta _{2}}}+{\frac {1}{2\theta _{2}^{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}$ .

Also, ${\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{1}}}=0\implies \sum _{i=1}^{n}(x_{i}-\theta _{1})=0\implies -n\theta _{1}+\sum _{i=1}^{n}x_{i}=0\implies \theta _{1}={\frac {\sum _{i=1}^{n}x_{i}}{n}}={\overline {x}}$ , which is independent from $\theta _{2}$ (this is important for us to use this kind of method) and ${\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}}}=0\implies {\frac {n}{2\theta _{2}}}={\frac {1}{2\theta _{2}^{2}}}\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)\implies n={\frac {1}{\theta _{2}}}\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)\implies \theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}{n}}$ .

Since ${\frac {\partial ^{2}\ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{1}^{2}}}={\frac {\partial }{\partial \theta _{1}}}\left({\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})\right)={\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(-1)={\frac {-n}{\theta _{2}}}<0$ , by the second derivative test (for univariate function), $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ is maximized at $\theta _{1}={\overline {x}}$ , given any fixed $\theta _{2}$ .

On the other hand, ${\frac {\partial ^{2}\ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}^{2}}}={\frac {n}{2\theta _{2}^{2}}}-{\frac {1}{\theta _{2}^{3}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}$ , and thus $\left.{\frac {\partial ^{2}\ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}^{2}}}\right\vert _{\theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}{n}}}={\frac {n^{3}}{2\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)^{2}}}-{\frac {n^{3}}{\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)^{2}}}={\frac {-n^{3}}{2\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)^{2}}}<0$ .

Thus, by the second derivative test, $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ is maximized at $\theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}{n}}$ , given any fixed $\theta _{1}$ .

So, now we fix $\theta _{1}={\overline {x}}$ , and thus we have $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ is maximized at $\theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}}{n}}=s^{2}$ , where $s^{2}$ is the realization of the sample variance $S^{2}$ . Now, fix $\theta _{2}$ to be $s^{2}$ , and we know that $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ attains maximum at $\theta _{1}={\overline {x}}$ for each fixed $\theta _{2}$ , including this fixed $\theta _{2}=s^{2}$ . As a result, $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ is maximized at $(\theta _{1},\theta _{2})=({\overline {x}},s^{2})$ . Hence, the MLE of $(\theta _{1},\theta _{2})$ is $({\overline {X}},S^{2})$ .
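The result can be checked numerically on a small data set. The sketch below computes $({\overline {x}},s^{2})$ with the divisor $n$ used in the text, and compares the log-likelihood there with its value at a few nearby points. The data and the perturbation sizes are arbitrary choices made for illustration.

```python
import math


def normal_log_likelihood(theta1, theta2, xs):
    """ln L for N(theta1, theta2), where theta2 is the variance."""
    n = len(xs)
    squared_dev = sum((x - theta1) ** 2 for x in xs)
    return -n / 2 * math.log(2 * math.pi * theta2) - squared_dev / (2 * theta2)


def normal_mle(xs):
    """Return (x_bar, s2), where s2 uses the divisor n as in the text."""
    n = len(xs)
    x_bar = sum(xs) / n
    s2 = sum((x - x_bar) ** 2 for x in xs) / n
    return x_bar, s2


# Hypothetical realizations.
xs = [4.1, 5.3, 6.0, 4.7, 5.9, 5.2]
x_bar, s2 = normal_mle(xs)
best = normal_log_likelihood(x_bar, s2, xs)
print(f"x_bar = {x_bar:.4f}, s2 = {s2:.4f}, ln L = {best:.4f}")

# The log-likelihood at nearby parameter values should not exceed `best`.
for d1, d2 in [(-0.1, 0.0), (0.1, 0.0), (0.0, -0.05), (0.0, 0.05), (0.1, 0.05)]:
    value = normal_log_likelihood(x_bar + d1, s2 + d2, xs)
    print(f"shift ({d1:+.2f}, {d2:+.2f}): ln L = {value:.4f}")
```

Checking a handful of neighbours does not prove that the point is a maximum; it only agrees with what the derivative tests establish.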

Exercise.

(a) Calculate the determinant of the Hessian matrix of $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ at $(\theta _{1},\theta _{2})=({\overline {x}},s^{2})$ , which can be expressed as ${\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{1}^{2}}}({\overline {x}},s^{2}){\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}^{2}}}({\overline {x}},s^{2})-\left({\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}\partial \theta _{1}}}({\overline {x}},s^{2})\right)^{2}$ .

(b) Hence, verify that $(\theta _{1},\theta _{2})=({\overline {x}},s^{2})$ is the maximum point of $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ using the second partial derivative test.

Solution

(a) First,

${\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{1}^{2}}}({\overline {x}},s^{2}){\overset {\text{above}}{=}}\left.{\frac {-n}{\theta _{2}}}\right\vert _{\theta _{2}=s^{2}}={\frac {-n}{s^{2}}}$
${\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}^{2}}}({\overline {x}},s^{2}){\overset {\text{above}}{=}}\left.{\frac {n}{2\theta _{2}^{2}}}-{\frac {1}{\theta _{2}^{3}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right\vert _{(\theta _{1},\theta _{2})=({\overline {x}},s^{2})}={\frac {n}{2(s^{2})^{2}}}-{\frac {1}{(s^{2})^{3}}}\cdot ns^{2}={\frac {n}{2(s^{2})^{2}}}-{\frac {n}{(s^{2})^{2}}}={\frac {-n}{2(s^{2})^{2}}}$
${\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}\partial \theta _{1}}}({\overline {x}},s^{2}){\overset {\text{above}}{=}}\left.{\frac {\partial }{\partial \theta _{2}}}\left({\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})\right)\right\vert _{(\theta _{1},\theta _{2})=({\overline {x}},s^{2})}=\left.-{\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})}{\theta _{2}^{2}}}\right\vert _{(\theta _{1},\theta _{2})=({\overline {x}},s^{2})}=-{\frac {\sum _{i=1}^{n}(x_{i}-{\overline {x}})}{(s^{2})^{2}}}=-{\frac {\sum _{i=1}^{n}(x_{i})-n{\overline {x}}}{(s^{2})^{2}}}=-{\frac {n{\overline {x}}-n{\overline {x}}}{(s^{2})^{2}}}=0$

As a result, the determinant of the Hessian matrix is ${\frac {-n}{s^{2}}}\cdot {\frac {-n}{2(s^{2})^{2}}}={\frac {n^{2}}{2(s^{2})^{3}}}$ .

(b) From (a), the determinant of the Hessian matrix is positive. Also, ${\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{1}^{2}}}({\overline {x}},s^{2})=-{\frac {n}{s^{2}}}<0$ . Thus, by the second partial derivative test, $\ln {\mathcal {L}}(\theta ;\mathbf {x} )$ attains maximum at $(\theta _{1},\theta _{2})=({\overline {x}},s^{2})$ .

Exercise. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the exponential distribution with rate parameter $\lambda$ , with pdf $f(x;\lambda )=\lambda e^{-\lambda x},\quad x\geq 0$ , where $\lambda >0$ . Show that the MLE of $\lambda$ is ${\frac {1}{\overline {X}}}$ .

Solution

Proof. The likelihood function is ${\mathcal {L}}(\lambda )=\prod _{i=1}^{n}(\lambda e^{-\lambda x_{i}})=\lambda ^{n}\exp \left(-\lambda \sum _{i=1}^{n}x_{i}\right)$ . Thus, the log-likelihood function is $\ln {\mathcal {L}}(\lambda )=n\ln \lambda -\lambda \sum _{i=1}^{n}x_{i}$ . Differentiating the log-likelihood function with respect to $\lambda$ gives ${\frac {d}{d\lambda }}\ln {\mathcal {L}}(\lambda )={\frac {n}{\lambda }}-\sum _{i=1}^{n}x_{i}$ . Setting the derivative to be zero, we get ${\frac {n}{\lambda }}-\sum _{i=1}^{n}x_{i}=0\implies {\frac {n}{\lambda }}-n{\overline {x}}=0\implies {\frac {1}{\lambda }}={\overline {x}}\implies \lambda ={\frac {1}{\overline {x}}}$ . It remains to verify that $\ln {\mathcal {L}}(\lambda )$ attains maximum at $\lambda ={\frac {1}{\overline {x}}}$ . Since ${\frac {d^{2}}{d\lambda ^{2}}}\ln {\mathcal {L}}(\lambda )=-{\frac {n}{\lambda ^{2}}}<0$ , this is verified. Hence, the MLE of $\lambda$ is ${\frac {1}{\overline {X}}}$ .

$\Box$
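As a quick check of the exponential result, the sketch below maximizes $\ln {\mathcal {L}}(\lambda )=n\ln \lambda -\lambda \sum x_{i}$ over a grid of rates and compares the maximizer with $1/{\overline {x}}$ . The five realizations are made up, and the grid range is an arbitrary assumption.

```python
import math


def exponential_log_likelihood(lam, xs):
    """ln L(lambda) = n * ln(lambda) - lambda * sum(x_i)."""
    return len(xs) * math.log(lam) - lam * sum(xs)


# Hypothetical realizations with sample mean 1.0.
xs = [0.5, 1.2, 0.3, 2.0, 1.0]
x_bar = sum(xs) / len(xs)

# Candidate rates 0.01, 0.02, ..., 5.00.
grid = [i / 100 for i in range(1, 501)]
lam_hat = max(grid, key=lambda lam: exponential_log_likelihood(lam, xs))

print("grid maximizer:", lam_hat)
print("1 / x_bar:     ", 1 / x_bar)
```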

Example. (Application of maximum likelihood estimation) Suppose you are given a box which contains four balls, with unknown number of red and black balls. Now, you draw three balls out of the box, and find out that you get two red balls and one black ball. Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

Solution: Given the color of the balls drawn, we know that the box contains at least two red balls and at least one black ball. This means the box contains either two red balls or three red balls. Let $r$ be the number of red balls inside the box. Then, the number of black balls inside the box is $4-r$ . The possible values of parameter $r$ are 2 and 3.

Now, we compare the probability of getting such result from drawing three balls when $r=2$ and $r=3$ .

For $r=2$ , the probability is ${\frac {{\binom {2}{2}}{\binom {2}{1}}}{\binom {4}{3}}}={\frac {1}{2}}$ (consider the pmf of hypergeometric distribution).
For $r=3$ , the probability is ${\frac {{\binom {3}{2}}{\binom {1}{1}}}{\binom {4}{3}}}={\frac {3}{4}}$ .

Hence, the maximum likelihood estimate of $r$ is 3. Thus, the estimated number of the red balls is 3, and that of the black balls is 1.
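The comparison above can be sketched with the hypergeometric pmf and Python's math.comb. The function name draw_probability and its default arguments are chosen here for illustration only.

```python
from math import comb


def draw_probability(r, total=4, drawn_red=2, drawn_black=1):
    """Probability of drawing the observed colours when the box holds r red balls.

    math.comb(n, k) returns 0 when k > n, so impossible values of r get
    probability zero automatically.
    """
    black = total - r
    drawn = drawn_red + drawn_black
    return comb(r, drawn_red) * comb(black, drawn_black) / comb(total, drawn)


# Evaluate the likelihood at every conceivable number of red balls.
likelihoods = {r: draw_probability(r) for r in range(0, 5)}
r_hat = max(likelihoods, key=likelihoods.get)

print(likelihoods)
print("maximum likelihood estimate of r:", r_hat)
```

Scanning all values of $r$ from 0 to 4, instead of only 2 and 3, is a design choice of this sketch: the impossible values simply receive likelihood zero.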

Exercise. Suppose the box now contains 100 balls, with unknown number of red and black balls. Now, you draw 99 balls out of the box, and find out that you get 98 red balls and one black ball. Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

Solution

Similarly, the box contains at least 98 red balls and at least one black ball. We use the same notation as in the above example. Then, the number of black balls is $100-r$ , and the possible values of the parameter $r$ are 98 and 99.

For $r=98$ , the probability is ${\frac {{\binom {98}{98}}{\binom {2}{1}}}{\binom {100}{99}}}=0.02$
For $r=99$ , the probability is ${\frac {{\binom {99}{98}}{\binom {1}{1}}}{\binom {100}{99}}}=0.99$

Thus, the maximum likelihood estimate of $r$ is 99. Hence, the estimated number of red balls is 99, and that of black balls is 1.

Remark.

The difference of the probabilities between two possible values of $r$ becomes much larger in this case.
Intuitively, when you have such draw result, you will think that it is quite unlikely that the box has two black balls inside, i.e., the ball that is not drawn is actually black, and somehow you draw all red balls out, but not the black ball.

Method of moments estimator (MME)

For maximum likelihood estimation, we need to utilize the likelihood function, which is found from the joint pmf or pdf of the random sample from a distribution. However, we may not know exactly the pmf or pdf of the distribution in practice. Instead, we may just know some information about the distribution, e.g. mean, variance, and some moments (the $r$ th moment of a random variable $X$ is $\mathbb {E} [X^{r}]$ , and we denote it by $\mu _{r}$ for simplicity). Such moments often contain information about the unknown parameter. For example, for a normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ , we know that $\mu =\mu _{1}$ and $\sigma ^{2}=\mu _{2}-(\mu _{1})^{2}$ . Because of this, when we want to estimate the parameters, we can do this through estimating the moments.

Now, we would like to know how to estimate the moments. We let $m_{r}={\frac {\sum _{i=1}^{n}X_{i}^{r}}{n}}$ be the $r$ th sample moment ^[5], where the $X_{i}$ 's are independent and identically distributed. By the weak law of large numbers (assuming its conditions are satisfied), we have

${\overline {X}}=m_{1}\;{\overset {p}{\to }}\;\mathbb {E} [X]=\mu _{1}$
$m_{2}\;{\overset {p}{\to }}\;\mathbb {E} [X^{2}]=\mu _{2}$ (this can be seen by replacing " $X$ " by " $X^{2}$ " in the weak law of large numbers: the conditions are still satisfied, and so we can still apply the weak law of large numbers)

In general, we have $m_{r}\;{\overset {p}{\to }}\;\mu _{r}$ , since the conditions are still satisfied after replacing " $X$ " by " $X^{r}$ " in the weak law of large numbers.

Because of these results, we can estimate the $r$ -th moment $\mu _{r}$ using the $r$ -th sample moment $m_{r}$ , and the estimation is "better" when $n$ is large. For example, in the above normal distribution example, we can estimate $\mu$ by $m_{1}$ and $\sigma ^{2}$ by $m_{2}-(m_{1})^{2}$ , and these estimators are actually called the method of moments estimators.
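The convergence of sample moments can be illustrated by simulation. The sketch below draws from a normal distribution with assumed values $\mu =1$ and $\sigma =2$ (so that $\mu _{1}=1$ and $\mu _{2}=\sigma ^{2}+\mu ^{2}=5$ ) and prints the first two sample moments for increasing $n$ . The seed, the sample sizes and the name sample_moment are arbitrary choices.

```python
import random


def sample_moment(xs, r):
    """r-th sample moment m_r = (1/n) * sum of x_i ** r."""
    return sum(x ** r for x in xs) / len(xs)


rng = random.Random(1)
mu, sigma = 1.0, 2.0  # assumed true parameters, for illustration only
xs = [rng.gauss(mu, sigma) for _ in range(50000)]

print("population moments: mu_1 =", mu, " mu_2 =", sigma ** 2 + mu ** 2)
for n in (100, 1000, 50000):
    m1 = sample_moment(xs[:n], 1)
    m2 = sample_moment(xs[:n], 2)
    print(f"n = {n:>5}: m_1 = {m1:.3f}, m_2 = {m2:.3f}")
```

With a large $n$ , the printed sample moments should be close to the population moments, although any single simulation run only illustrates the convergence and does not prove it.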

To be more precise, we have the following definition of the method of moments:

Definition. (Method of moments) Let $X_{1},\dotsc ,X_{n}$ be a random sample from a distribution with pdf or pmf $f(x;\theta _{1},\dotsc ,\theta _{k})$ . Write $k$ moment(s), e.g. $\mu _{1},\dotsc ,\mu _{k}$ , as function(s) of $\theta _{1},\dotsc ,\theta _{k}$ : $g_{1}(\theta _{1},\dotsc ,\theta _{k}),\dotsc ,g_{k}(\theta _{1},\dotsc ,\theta _{k})$ respectively. Then, the method of moments estimator (MME) of $\theta _{1},\dotsc ,\theta _{k}$ , ${\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k}$ respectively, is given by the solution (in the form of ${\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k}$ in terms of $m_{1},\dotsc ,m_{k}$ , corresponding to the $k$ moments $\mu _{1},\dotsc ,\mu _{k}$ ) to the following system of equations: ${\begin{cases}m_{1}=g_{1}({\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k})\\\vdots \\m_{k}=g_{k}({\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k})\\\end{cases}}$

Remark.

When there are $k$ unknown parameters, we need to solve a system of $k$ equations, involving $k$ sample moments.
Usually, we use the first $k$ moments, as in the definition. But this is not necessary, and we may choose other moments, including fractional moments (e.g. $\mathbb {E} [X^{1/2}]$ , for which we use $m_{1/2}$ ).

Because of this, the method of moments estimator is not unique.

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Find the MME of $\mu$ and $\sigma ^{2}$ .

Solution: First, there are two unknown parameters. Thus, we need to solve a system of 2 equations, involving 2 sample moments and 2 moments. Since $\mu _{1}=\mu$ and $\mu _{2}=\sigma ^{2}+(\mu )^{2}$ , consider the following system of equations: ${\begin{cases}m_{1}={\hat {\mu }}&(1)\\m_{2}={\widehat {\sigma ^{2}}}+({\hat {\mu }})^{2}&(2)\\\end{cases}}$ Substituting $(1)$ into $(2)$ , we get $m_{2}={\widehat {\sigma ^{2}}}+(m_{1})^{2}\Leftrightarrow {\widehat {\sigma ^{2}}}=m_{2}-(m_{1})^{2}$ . Hence, the MME of $\mu$ is ${\hat {\mu }}=m_{1}$ and the MME of $\sigma ^{2}$ is ${\widehat {\sigma ^{2}}}=m_{2}-(m_{1})^{2}$ .
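As a rough illustration of the solution above, the following Python sketch computes the two MMEs from a list of observations. The function names `sample_moment` and `mme_normal` are chosen here for illustration only and do not come from the text.

```python
def sample_moment(xs, r):
    """Return the r-th sample moment m_r = (1/n) * sum(x_i ** r)."""
    return sum(x ** r for x in xs) / len(xs)


def mme_normal(xs):
    """Method of moments estimates (mu_hat, sigma2_hat) for a normal sample."""
    m1 = sample_moment(xs, 1)
    m2 = sample_moment(xs, 2)
    # mu_hat = m_1 and sigma2_hat = m_2 - (m_1)^2, as derived in the solution.
    return m1, m2 - m1 ** 2
```

For the observations 1, 2, 3, 4 we have $m_{1}=2.5$ and $m_{2}=7.5$ , so the sketch returns the estimates 2.5 and $7.5-2.5^{2}=1.25$ .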

Remark.

We can see that the process of finding the MME of $\mu$ and $\sigma ^{2}$ is much easier than finding the MLE of $\mu$ and $\sigma ^{2}$ . This is because the expression of the first and second moment in terms of parameters is simple in this case. However, when the expression is more complicated, finding the MME of the parameters can be quite complicated.

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the exponential distribution with rate parameter $\lambda$ . Find the MME of $\lambda$ and compare it to the MLE of $\lambda$ .

Solution: Since $\mu _{1}={\frac {1}{\lambda }}$ , consider the following equation: $m_{1}={\frac {1}{\hat {\lambda }}}$ . We then have ${\hat {\lambda }}={\frac {1}{m_{1}}}$ . Hence, the MME of $\lambda$ is ${\hat {\lambda }}={\frac {1}{m_{1}}}={\frac {1}{\overline {X}}}$ , which coincides with the MLE of $\lambda$ .

Exercise. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the uniform distribution ${\mathcal {U}}[a,b]$ . Show that the MMEs of $a$ and $b$ are ${\hat {a}}=m_{1}-{\sqrt {3(m_{2}-m_{1}^{2})}}$ and ${\hat {b}}=m_{1}+{\sqrt {3(m_{2}-m_{1}^{2})}}$ respectively.

Solution

Proof. Since $\mu _{1}={\frac {a+b}{2}}$ and $\mu _{2}={\frac {(b-a)^{2}}{12}}+\left({\frac {a+b}{2}}\right)^{2}={\frac {b^{2}-2ab+a^{2}+3a^{2}+6ab+3b^{2}}{12}}={\frac {4a^{2}+4b^{2}+4ab}{12}}={\frac {a^{2}+b^{2}+ab}{3}}$ , consider the following system of equations: ${\begin{cases}m_{1}=({\hat {a}}+{\hat {b}})/2&(1)\\m_{2}=({\hat {a}}^{2}+{\hat {a}}{\hat {b}}+{\hat {b}}^{2})/3&(2)\\\end{cases}}$ From $(1)$ , we have ${\hat {b}}=2m_{1}-{\hat {a}}$ . Substituting it into $(2)$ , we have $m_{2}={\big (}{\hat {a}}^{2}+{\hat {a}}(2m_{1}-{\hat {a}})+(2m_{1}-{\hat {a}})^{2}{\big )}/3={\big (}{\hat {a}}^{2}+2m_{1}{\hat {a}}-{\hat {a}}^{2}+4m_{1}^{2}-4m_{1}{\hat {a}}+{\hat {a}}^{2}{\big )}/3\Leftrightarrow {\hat {a}}^{2}-2m_{1}{\hat {a}}+4m_{1}^{2}=3m_{2}\Leftrightarrow {\hat {a}}^{2}-2m_{1}{\hat {a}}+4m_{1}^{2}-3m_{2}=0$ . Solving this equation by the quadratic formula, we get ${\hat {a}}={\frac {2m_{1}\pm {\sqrt {4m_{1}^{2}-4(4m_{1}^{2}-3m_{2})}}}{2}}={\frac {2m_{1}\pm {\sqrt {12m_{2}-12m_{1}^{2}}}}{2}}={\frac {2m_{1}\pm 2{\sqrt {3(m_{2}-m_{1}^{2})}}}{2}}=m_{1}\pm {\sqrt {3(m_{2}-m_{1}^{2})}}$ .

When ${\hat {a}}=m_{1}+{\sqrt {3(m_{2}-m_{1}^{2})}}$ , ${\hat {b}}=m_{1}-{\sqrt {3(m_{2}-m_{1}^{2})}}<{\hat {a}}$ . However, from the definition of the uniform distribution, we need to have ${\hat {a}}<{\hat {b}}$ , and thus this case is rejected.

When ${\hat {a}}=m_{1}-{\sqrt {3(m_{2}-m_{1}^{2})}}$ , ${\hat {b}}=m_{1}+{\sqrt {3(m_{2}-m_{1}^{2})}}>{\hat {a}}$ , which satisfies the definition of the uniform distribution.

Thus, we have the desired result.

$\Box$
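The closed-form result can be checked numerically. The following Python sketch, in which the function name `mme_uniform` is a hypothetical choice made here, computes ${\hat {a}}$ and ${\hat {b}}$ from a list of observations using the formulas just derived.

```python
import math


def mme_uniform(xs):
    """Method of moments estimates (a_hat, b_hat) for a U[a, b] sample."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    # m2 - m1**2 is nonnegative in exact arithmetic; clamp at 0 to guard
    # against tiny negative values caused by floating-point rounding.
    half_width = math.sqrt(3 * max(m2 - m1 ** 2, 0.0))
    return m1 - half_width, m1 + half_width
```

By construction, the returned pair satisfies both moment equations $(1)$ and $(2)$ up to rounding error, with ${\hat {a}}\leq {\hat {b}}$ .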

Properties of estimator

In this section, we will introduce some criteria for evaluating how "good" a point estimator is, namely unbiasedness, efficiency and consistency.

Unbiasedness

For ${\hat {\theta }}$ to be a "good" estimator of a parameter $\theta$ , a desirable property of ${\hat {\theta }}$ is that its expected value equals the value of the parameter $\theta$ , or is at least close to it. Because of this, we introduce a quantity, namely bias, to measure how close the mean of ${\hat {\theta }}$ is to $\theta$ .

Definition. (Bias) The bias of an estimator ${\hat {\theta }}$ is $\operatorname {Bias} ({\hat {\theta }})=\mathbb {E} [{\hat {\theta }}]-\theta .$

We will also define some terms related to bias.

Definition. ((Un)biased estimator) An estimator ${\hat {\theta }}$ is an unbiased estimator of a parameter $\theta$ if $\operatorname {Bias} ({\hat {\theta }})=0$ . Otherwise, the estimator is called a biased estimator.

Definition. (Asymptotically unbiased estimator) An estimator ${\hat {\theta }}$ is an asymptotically unbiased estimator of a parameter $\theta$ if $\lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0$ where $n$ is the sample size.

Remark.

An unbiased estimator must be an asymptotically unbiased estimator, but the converse is not true, i.e., an asymptotically unbiased estimator may not be an unbiased estimator. Thus, a biased estimator may be an asymptotically unbiased estimator.
When we discuss the goodness of estimators in terms of unbiasedness, an unbiased estimator is better than an asymptotically unbiased estimator, which in turn is better than a biased estimator that is not asymptotically unbiased.

However, there are also other criteria for evaluating the goodness of estimators apart from unbiasedness, so when we also account for other criteria, a biased estimator may be somehow "better" than an unbiased estimator overall.

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the Bernoulli distribution with success probability $p$ . Show that the MLE of $p$ , ${\overline {X}}$ , is an unbiased estimator of $p$ .

Proof. Since $\mathbb {E} [{\overline {X}}]={\frac {1}{n}}\cdot \mathbb {E} \left[\sum _{i=1}^{n}X_{i}\right]={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} [X_{i}]={\frac {1}{n}}\cdot \sum _{i=1}^{n}p={\frac {np}{n}}=p$ , the result follows.

$\Box$

Exercise. Suppose the Bernoulli distribution is replaced by binomial distribution with $n$ trials and success probability $p$ . Show that ${\overline {X}}$ is a biased estimator of $p$ . Modify this estimator such that it is an unbiased estimator of $p$ .

Solution

Proof. Since $\mathbb {E} [{\overline {X}}]={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} [X_{i}]={\frac {1}{n}}\sum _{i=1}^{n}np=np\neq p$ , ${\overline {X}}$ is a biased estimator of $p$ .

$\Box$

We can modify this estimator to ${\frac {\overline {X}}{n}}$ , and then its mean is ${\frac {np}{n}}=p$ . Alternatively, we may choose the estimator to be ${\frac {X_{i}}{n}}$ ( $i\in \{1,\dotsc ,n\}$ ), whose mean is also $p$ (Other estimators whose mean is $p$ are also fine).

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Show that the MLE of $\mu$ , ${\overline {X}}$ , is an unbiased estimator of $\mu$ , and the MLE of $\sigma ^{2}$ , $S^{2}$ , is an asymptotically unbiased estimator of $\sigma ^{2}$ .

Proof. First, since $\mathbb {E} [{\overline {X}}]={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} [X_{i}]={\frac {1}{n}}\sum _{i=1}^{n}\mu =\mu$ , ${\overline {X}}$ is an unbiased estimator of $\mu$ .

On the other hand, ${\begin{aligned}\mathbb {E} [S^{2}]&={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} \left[(X_{i}-{\overline {X}})^{2}\right]\\&={\frac {1}{n}}\sum _{i=1}^{n}\operatorname {Var} \left(X_{i}-{\overline {X}}\right)&{\text{since }}\mathbb {E} [X_{i}-{\overline {X}}]=\mathbb {E} [X_{i}]-\mathbb {E} [{\overline {X}}]=\mu -\mu =0\\&={\frac {1}{n}}\sum _{i=1}^{n}\operatorname {Var} \left(X_{i}-{\frac {X_{1}+\dotsb +X_{i-1}+X_{i}+X_{i+1}+\dotsb +X_{n}}{n}}\right)\\&={\frac {1}{n}}\sum _{i=1}^{n}\operatorname {Var} \left({\frac {{\color {blue}n}X_{i}}{n}}-{\frac {X_{1}+\dotsb +X_{i-1}+X_{i}+X_{i+1}+\dotsb +X_{n}}{n}}\right)\\&={\frac {1}{n}}\sum _{i=1}^{n}\operatorname {Var} \left({\frac {({\color {blue}n}-1)X_{i}}{n}}-{\frac {X_{1}+\dotsb +X_{i-1}+X_{i+1}+\dotsb +X_{n}}{n}}\right)\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[\operatorname {Var} \left({\frac {(n-1)X_{i}}{n}}\right)+\operatorname {Var} \left({\frac {X_{1}+\dotsb +X_{i-1}+X_{i+1}+\dotsb +X_{n}}{n}}\right)\right]&{\text{by independence}}\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[{\frac {(n-1)^{2}}{n^{2}}}\sigma ^{2}+{\frac {1}{n^{2}}}\operatorname {Var} \left(X_{1}+\dotsb +X_{i-1}+X_{i+1}+\dotsb +X_{n}\right)\right]\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[{\frac {(n-1)^{2}}{n^{2}}}\sigma ^{2}+{\frac {n-1}{n^{2}}}\sigma ^{2}\right]&{\text{by iid}}\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[{\frac {\sigma ^{2}}{n^{2}}}(n^{2}-2n+1+n-1)\right]\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[{\frac {(n^{2}-n)\sigma ^{2}}{n^{2}}}\right]\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[{\frac {(n-1)\sigma ^{2}}{n}}\right]\\&={\frac {1}{n}}\cdot n\cdot {\frac {(n-1)\sigma ^{2}}{n}}\\&={\frac {n-1}{n}}\sigma ^{2}\\\end{aligned}}$ Thus, $\lim _{n\to \infty }\mathbb {E} [S^{2}]=\lim _{n\to \infty }\left({\frac {n-1}{n}}\sigma ^{2}\right)=\sigma ^{2}\lim _{n\to \infty }\left(1-{\frac {1}{n}}\right)=\sigma ^{2}\left(1-\lim _{n\to \infty }{\frac {1}{n}}\right)=\sigma ^{2}$ , as desired.

$\Box$
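The derivation of $\mathbb {E} [S^{2}]={\frac {n-1}{n}}\sigma ^{2}$ uses only that the $X_{i}$ 's are independent and identically distributed with variance $\sigma ^{2}$ , so the identity can be checked exactly on a small discrete distribution, where the expectation is a finite sum. The following Python sketch does this by enumerating every possible sample; the function name and the choice of a Bernoulli distribution with $p=1/3$ are assumptions made here for illustration.

```python
from fractions import Fraction
from itertools import product


def exact_mean_of_biased_variance(values, probs, n):
    """Exact E[S^2] for S^2 = (1/n) * sum (X_i - Xbar)^2, by enumerating all samples."""
    total = Fraction(0)
    for idxs in product(range(len(values)), repeat=n):
        # Probability of this particular sample (independence: multiply).
        p = Fraction(1)
        for i in idxs:
            p *= probs[i]
        xs = [values[i] for i in idxs]
        xbar = Fraction(sum(xs), n)
        s2 = sum((x - xbar) ** 2 for x in xs) / n
        total += p * s2
    return total
```

For a Bernoulli distribution with $p=1/3$ (so $\sigma ^{2}=2/9$ ) and $n=3$ , calling `exact_mean_of_biased_variance([0, 1], [Fraction(2, 3), Fraction(1, 3)], 3)` gives ${\frac {2}{3}}\cdot {\frac {2}{9}}={\frac {4}{27}}$ , in agreement with the formula.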

Exercise. Modify the estimator of $\sigma ^{2}$ such that it becomes an unbiased estimator.

Solution

The estimator can be modified as ${\frac {n}{n-1}}S^{2}={\frac {\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}{n-1}}$ .

Efficiency

We have discussed how to evaluate the unbiasedness of estimators. Now, if we are given two unbiased estimators, ${\hat {\theta }}$ and ${\tilde {\theta }}$ , how should we compare their goodness? In terms of unbiasedness alone, they are equally good, so we need another criterion in this case. One possible way is to compare their variances: the one with the smaller variance is better, since on average it deviates less from its mean, which is the value of the unknown parameter by the definition of unbiased estimator. Indeed, an unbiased estimator can still have a large variance, and thus deviate a lot from its mean. Such an estimator is unbiased only because the positive and negative deviations cancel each other out on average. This is the idea of efficiency.

Definition. (Efficiency) Suppose ${\color {blue}{\hat {\theta }}}$ and ${\color {red}{\tilde {\theta }}}$ are two unbiased estimators of an unknown parameter $\theta$ . The efficiency of ${\color {blue}{\hat {\theta }}}$ relative to ${\color {red}{\tilde {\theta }}}$ is $\operatorname {Eff} ({\color {blue}{\hat {\theta }}},{\color {red}{\tilde {\theta }}})={\frac {\operatorname {Var} ({\color {red}{\tilde {\theta }}})}{\operatorname {Var} ({\color {blue}{\hat {\theta }}})}}$ . If $\operatorname {Eff} ({\color {blue}{\hat {\theta }}},{\color {red}{\tilde {\theta }}})>1$ , then we say that ${\color {blue}{\hat {\theta }}}$ is relatively more efficient than ${\color {red}{\tilde {\theta }}}$ .

Remark.

Since $\operatorname {Eff} ({\color {blue}{\hat {\theta }}},{\color {red}{\tilde {\theta }}})>1\Leftrightarrow \operatorname {Var} ({\color {blue}{\hat {\theta }}})<\operatorname {Var} ({\color {red}{\tilde {\theta }}})$ , the estimator with smaller variance is relatively more efficient than the estimator with larger variance.
Normally, the variance in the denominator is nonzero, so the efficiency is well defined.
Sometimes, it is also called relative efficiency, since it describes how many times as large $\operatorname {Var} ({\color {red}{\tilde {\theta }}})$ is compared with $\operatorname {Var} ({\color {blue}{\hat {\theta }}})$ .
One may ask why the ratio of variances, rather than the difference in variances, is used in the definition. A possible reason is that the ratio of variances does not have any unit (the units of the variances, if any, cancel out), but the difference in variances can have a unit. Also, using the ratio of variances allows us to compare numerically different efficiencies calculated from different variances.

Actually, since the mean of an unbiased estimator is the unknown parameter $\theta$ , its variance measures the mean of the squared deviation from $\theta$ , and we have a specific term for this quantity, namely mean squared error (MSE).

Definition. (Mean squared error) Suppose ${\hat {\theta }}$ is an estimator of a parameter $\theta$ . The mean squared error (MSE) of ${\hat {\theta }}$ is $\operatorname {MSE} ({\hat {\theta }})=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]$ .

Remark.

From this definition, $\operatorname {MSE} ({\hat {\theta }})$ is the mean value of the square of error ${\hat {\theta }}-\theta$ , and hence the name mean squared error.

Notice that in the definition of MSE, we do not require ${\hat {\theta }}$ to be an unbiased estimator. Thus, ${\hat {\theta }}$ in the definition may be biased. We have mentioned that when ${\hat {\theta }}$ is unbiased, its variance is actually its MSE. In the following, we will give a more general relationship between $\operatorname {MSE} ({\hat {\theta }})$ and $\operatorname {Var} ({\hat {\theta }})$ , not just for unbiased estimators.

Proposition. (Relationship between mean squared error and variance) If $\operatorname {Var} ({\hat {\theta }})$ exists, then $\operatorname {MSE} ({\hat {\theta }})=\operatorname {Var} ({\hat {\theta }})+[\operatorname {Bias} ({\hat {\theta }})]^{2}$ .

Proof. By definition, we have $\operatorname {MSE} ({\hat {\theta }})=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]$ and $\operatorname {Var} ({\hat {\theta }})=\mathbb {E} \left[({\hat {\theta }}-\mathbb {E} [{\hat {\theta }}])^{2}\right]$ . From these, we are motivated to write ${\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]\\&=\mathbb {E} \left[{\big (}({\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]})+({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta ){\big )}^{2}\right]\\&=\mathbb {E} [({\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]})^{2}+2({\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]})\underbrace {({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta )} _{\text{constant}}+({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta )^{2}]\\&=\operatorname {Var} ({\hat {\theta }})+2({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta )\underbrace {\mathbb {E} [{\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}]} _{=\mathbb {E} [{\hat {\theta }}]-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}=0}+[\operatorname {Bias} ({\hat {\theta }})]^{2}\\&=\operatorname {Var} ({\hat {\theta }})+[\operatorname {Bias} ({\hat {\theta }})]^{2},\end{aligned}}$ as desired.

$\Box$
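The decomposition can be verified exactly for a biased estimator on a small discrete distribution. The following Python sketch is only an illustration under assumptions made here: the helper names are hypothetical, and the estimator is the biased variance estimator $S^{2}={\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}$ applied to a Bernoulli sample.

```python
from fractions import Fraction
from itertools import product


def exact_mse_var_bias(estimator, values, probs, n, theta):
    """Return exact (MSE, Var, Bias) of estimator(sample) by enumerating all samples."""
    outcomes = []  # pairs (probability of the sample, value of the estimator)
    for idxs in product(range(len(values)), repeat=n):
        p = Fraction(1)
        for i in idxs:
            p *= probs[i]
        outcomes.append((p, estimator([values[i] for i in idxs])))
    mean = sum(p * t for p, t in outcomes)
    var = sum(p * (t - mean) ** 2 for p, t in outcomes)
    mse = sum(p * (t - theta) ** 2 for p, t in outcomes)
    return mse, var, mean - theta


def biased_variance(xs):
    """S^2 = (1/n) * sum (x_i - xbar)^2, computed with exact fractions."""
    xbar = Fraction(sum(xs), len(xs))
    return sum((x - xbar) ** 2 for x in xs) / len(xs)
```

With $p=1/3$ , $n=3$ and $\theta =\sigma ^{2}=p(1-p)=2/9$ , the three returned values satisfy $\operatorname {MSE} =\operatorname {Var} +\operatorname {Bias} ^{2}$ exactly, and the bias equals ${\frac {4}{27}}-{\frac {2}{9}}=-{\frac {2}{27}}$ .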

Example. Let $X_{1},\dotsc ,X_{n}$ ( $n>1$ ) be a random sample from ${\mathcal {N}}(\mu ,\sigma ^{2})$ .

(a) Show that the single observation estimator $X_{1}$ is an unbiased estimator for $\mu$ .

(b) Calculate the MSE of $X_{1}$ and ${\overline {X}}$ respectively.

(c) Which of $X_{1}$ and ${\overline {X}}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency?

Solution:

(a) Since $\mathbb {E} [X_{1}]=\mu$ , the result follows.

(b) $\operatorname {MSE} (X_{1})=\operatorname {Var} (X_{1})+0^{2}=\sigma ^{2}$ , and $\operatorname {MSE} ({\overline {X}})=\operatorname {Var} ({\overline {X}})={\frac {1}{n^{2}}}\sum _{i=1}^{n}\operatorname {Var} (X_{i})={\frac {n\sigma ^{2}}{n^{2}}}={\frac {\sigma ^{2}}{n}}$ .

(c) Since $\operatorname {MSE} ({\overline {X}})<\operatorname {MSE} (X_{1})\Leftrightarrow \operatorname {Var} ({\overline {X}})<\operatorname {Var} (X_{1})$ , ${\overline {X}}$ is relatively more efficient than $X_{1}$ . Since both $X_{1}$ and ${\overline {X}}$ are unbiased estimators of $\mu$ , we conclude that ${\overline {X}}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency.

Exercise. In addition to the random sample with sample size $n$ in the example, suppose we take another random sample with sample size $m$ . Let ${\overline {X}}^{(n)}$ and ${\overline {X}}^{(m)}$ denote the sample mean for the sample with sample size $n$ and $m$ respectively.

(a) Calculate $\operatorname {Eff} \left({\overline {X}}^{(n)},{\overline {X}}^{(m)}\right)$ .

(b) State the condition on the sample sizes $m$ and $n$ under which ${\overline {X}}^{(n)}$ is relatively more efficient than ${\overline {X}}^{(m)}$ .

Solution

(a) Since $\operatorname {Var} \left({\overline {X}}^{(n)}\right)={\frac {\sigma ^{2}}{n}}$ (from example), and $\operatorname {Var} \left({\overline {X}}^{(m)}\right)={\frac {\sigma ^{2}}{m}}$ (by similar arguments as in the example), $\operatorname {Eff} \left({\overline {X}}^{(n)},{\overline {X}}^{(m)}\right)={\frac {\sigma ^{2}/m}{\sigma ^{2}/n}}={\frac {n}{m}}$ .

(b) Since ${\frac {n}{m}}>1\Leftrightarrow n>m$ , the condition is $n>m$ .

Remark.

This shows that the sample mean with a larger sample size is relatively more efficient than the one with smaller sample size.
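The formula $\operatorname {Var} ({\overline {X}})=\sigma ^{2}/n$ , and hence the efficiency $n/m$ , holds for any independent and identically distributed sample with finite variance, so it can be checked exactly on a small discrete distribution. The following Python sketch does so; the function names and the choice of a Bernoulli distribution with $p=1/2$ are assumptions made here for illustration.

```python
from fractions import Fraction
from itertools import product


def exact_var_of_sample_mean(values, probs, n):
    """Exact Var(Xbar) for an iid sample of size n from a finite discrete distribution."""
    first = Fraction(0)   # accumulates E[Xbar]
    second = Fraction(0)  # accumulates E[Xbar^2]
    for idxs in product(range(len(values)), repeat=n):
        p = Fraction(1)
        for i in idxs:
            p *= probs[i]
        xbar = Fraction(sum(values[i] for i in idxs), n)
        first += p * xbar
        second += p * xbar ** 2
    return second - first ** 2


def efficiency(var_hat, var_tilde):
    """Eff(hat, tilde) = Var(tilde) / Var(hat), for two unbiased estimators."""
    return var_tilde / var_hat
```

For $p=1/2$ (so $\sigma ^{2}=1/4$ ), the variances for sample sizes $n=4$ and $m=2$ are $1/16$ and $1/8$ respectively, and the efficiency of the former relative to the latter is $2=n/m$ .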

Proposition. $\lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})=0$ if and only if $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})=0$ and $\lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0$ .

Proof.

"if" part is simple. Assume $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})=0$ and $\lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0$ . Then, $\lim _{n\to \infty }(\operatorname {Var} ({\hat {\theta }})+(\operatorname {Bias} ({\hat {\theta }}))^{2})=0\Rightarrow \lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})=0$ .
"only if" part: we can use proof by contrapositive, i.e., proving that if $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})\neq 0$ or $\lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})\neq 0$ , then $\lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})\neq 0$ .

Case 1: when $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})\neq 0$ , it means $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})>0$ since the variance is nonnegative. Also, $\lim _{n\to \infty }(\operatorname {Bias} ({\hat {\theta }}))^{2}\geq 0$ . It follows that $\lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})>0$ , i.e., the MSE does not equal zero.
Case 2: when $\lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})\neq 0$ , it means $\lim _{n\to \infty }(\operatorname {Bias} ({\hat {\theta }}))^{2}>0$ . Also, $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})\geq 0$ . It follows that $\lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})>0$ , i.e., the MSE does not equal zero.

$\Box$

Remark.

As a result, if we know that $\lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})=0$ , then we know that $\lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0$ , i.e., ${\hat {\theta }}$ is an asymptotically unbiased estimator (and it may even be an unbiased estimator), and also that $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})=0$ .

Uniformly minimum-variance unbiased estimator

Now, we know that the smaller the variance of an unbiased estimator, the more efficient (and "better") it is. Thus, it is natural that we want to know what is the most efficient (i.e., the "best") unbiased estimator, i.e., the unbiased estimator with the smallest variance. We have a specific name for such unbiased estimator, namely uniformly minimum-variance unbiased estimator (UMVUE) ^[6]. To be more precise, we have the following definition for UMVUE:

Definition. (Uniformly minimum-variance unbiased estimator) The uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator with the smallest variance among all unbiased estimators.

Indeed, UMVUE is unique, i.e., there is exactly one unbiased estimator with the smallest variance among all unbiased estimators, and we will prove it in the following.

Proposition. (Uniqueness of UMVUE) If $W$ is an UMVUE of a function of a parameter $\tau (\theta )$ , then $W$ is unique.

Proof. Assume that $W$ is an UMVUE of $\tau (\theta )$ , and $W'$ is another UMVUE of $\tau (\theta )$ . Define the estimator $W^{*}={\frac {1}{2}}(W+W')$ . Since $\mathbb {E} [W^{*}]={\frac {1}{2}}(\mathbb {E} [W]+\mathbb {E} [W'])={\frac {1}{2}}(\tau (\theta )+\tau (\theta ))=\tau (\theta )$ , $W^{*}$ is an unbiased estimator of $\tau (\theta )$ .

Now, we consider the variance of $W^{*}$ . ${\begin{aligned}\operatorname {Var} (W^{*})&={\frac {1}{4}}\operatorname {Var} (W+W')\\&={\frac {1}{4}}\left[\operatorname {Var} (W)+\operatorname {Var} (W')+2\operatorname {Cov} (W,W')\right]\\&\leq {\frac {1}{4}}\operatorname {Var} (W)+{\frac {1}{4}}\operatorname {Var} (W')+{\frac {1}{2}}{\sqrt {\operatorname {Var} (W)\operatorname {Var} (W')}}&({\text{covariance inequality}})\\&={\frac {1}{4}}\operatorname {Var} (W)+{\frac {1}{4}}\operatorname {Var} (W)+{\frac {1}{2}}{\sqrt {(\operatorname {Var} (W))^{2}}}&(\operatorname {Var} (W)=\operatorname {Var} (W'){\text{ since }}W{\text{ and }}W'{\text{ are both UMVUE}})\\&={\frac {1}{2}}\operatorname {Var} (W)+{\frac {1}{2}}\operatorname {Var} (W)&(\operatorname {Var} (W)>0)\\&=\operatorname {Var} (W).\end{aligned}}$ Thus, we now have either $\operatorname {Var} (W^{*})<\operatorname {Var} (W)$ or $\operatorname {Var} (W^{*})=\operatorname {Var} (W)$ . If the former is true, then $W$ is not an UMVUE of $\tau (\theta )$ by definition, since we can find another unbiased estimator, namely $W^{*}$ , with smaller variance than it. Hence, we must have the latter, i.e., $\operatorname {Var} (W^{*})=\operatorname {Var} (W).$ This implies when we apply the covariance inequality, the equality holds, i.e., $\operatorname {Cov} (W,W')={\sqrt {\operatorname {Var} (W)\operatorname {Var} (W')}}\iff \rho (W',W)=1,$ which means $W'$ is increasing linearly with $W$ , i.e., we can write $W'=aW+b$ for some constants $a>0$ and $b$ .

Now, we consider the covariance $\operatorname {Cov} (W,W')$ . $\operatorname {Cov} (W,W'){\overset {\text{ above }}{=}}\operatorname {Cov} (W,aW+b){\overset {\text{ properties }}{=}}a\operatorname {Cov} (W,W){\overset {\text{ property }}{=}}a\operatorname {Var} (W).$ On the other hand, since the equality holds in the covariance inequality, and $\operatorname {Var} (W)=\operatorname {Var} (W')$ (since they are both UMVUE), $\operatorname {Cov} (W,W')={\sqrt {\operatorname {Var} (W)\operatorname {Var} (W')}}={\sqrt {(\operatorname {Var} (W))^{2}}}=\operatorname {Var} (W).$ Thus, we have $a=1$ .

It remains to show that $b=0$ to prove that $W=W'$ , and therefore conclude that $W$ is unique.

From above, we currently have $W'=W+b\implies \mathbb {E} [W']=\mathbb {E} [W]+b\implies \tau (\theta )=\tau (\theta )+b\implies b=0$ , as desired.

$\Box$

Remark.

Thus, when we are able to find an UMVUE, it is the unique one, and the variance of every other possible unbiased estimator is strictly greater than the variance of the UMVUE.

Cramer-Rao lower bound

Without using some results, it is quite difficult to determine the UMVUE, since there are many (perhaps even infinitely many) possible unbiased estimators, so it is quite hard to ensure that one particular unbiased estimator is relatively more efficient than every other possible unbiased estimator.

Therefore, we will introduce some approaches that help us to find the UMVUE. For the first approach, we find a lower bound ^[7] on the variances of all possible unbiased estimators. After getting such a lower bound, if we can find an unbiased estimator whose variance is exactly equal to the lower bound, then the lower bound is the minimum value of the variances, and hence such an unbiased estimator is an UMVUE by definition.

Remark.

There are many possible lower bounds, but when the lower bound is greater, it is closer to the actual minimum value of the variances, and hence "better".
An unbiased estimator can still be an UMVUE even if its variance does not achieve the lower bound.

A common way to find such lower bound is to use the Cramer-Rao lower bound (CRLB), and we get the CRLB through Cramer-Rao inequality. Before stating the inequality, let us define some related terms.

Definition. (Fisher information) The Fisher information about a parameter $\theta$ with sample size $n$ is ${\mathcal {I}}_{n}(\theta )=\mathbb {E} \left[\left({\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}\right)^{2}\right]$ where $\ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )=\ln {\mathcal {L}}({\boldsymbol {\theta }};X_{1},\dotsc ,X_{n})$ is the log-likelihood function (as a random variable).

Remark.

${\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}$ is called the score function, and is denoted by $S(\theta ;\mathbf {X} )$ .
The " ${\boldsymbol {\theta }}$ " may or may not be a parameter vector. If it is just a single parameter (usually the case here), then it is the same as " $\theta$ ". We use " ${\boldsymbol {\theta }}$ " instead of " $\theta$ " to emphasize that the " $\theta$ " in ${\mathcal {I}}_{n}(\theta )$ and $S(\theta ;\mathbf {X} )$ is referring to the " $\theta$ " in " ${\frac {\partial }{\partial \theta }}$ "
It is possible to define "Fisher information about a parameter vector", but in this case the Fisher information takes the form of a matrix instead of a single number, and it is called Fisher information matrix. However, since it is more complicated, we will not discuss it here.
Since the expected value of the score function is

$\mathbb {E} [S(\theta ;\mathbf {X} )]=\mathbb {E} \left[{\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}\right]=\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}\cdot {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}=\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\frac {\frac {\partial {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}{{\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}}\cdot {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}=\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\frac {\partial {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}\,dx_{n}\cdots \,dx_{1},$

and under some regularity conditions which allow interchange of derivative and integral, this equals ${\frac {\partial }{\partial \theta }}\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}={\frac {\partial }{\partial \theta }}(1)=0$ , the Fisher information about $\theta$ is also the variance of the score function, i.e., ${\mathcal {I}}_{n}(\theta )=\operatorname {Var} (S(\theta ;\mathbf {X} ))=\operatorname {Var} \left({\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}\right)$ .

The regularity conditions which allow interchange of derivative and integral include the following:

the partial derivatives involved should exist, i.e., the (natural log) of the functions involved is differentiable
the integrals involved should be differentiable
the support does not depend on the parameter(s) involved

We have some results that assist us to compute the Fisher information.

Proposition. Let $X_{1},\dotsc ,X_{n}$ be a random sample from a distribution with pdf or pmf $f$ . Also, let ${\mathcal {I}}(\theta )=\mathbb {E} \left[\left({\frac {\partial \ln f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right)^{2}\right]$ , the Fisher information about $\theta$ with sample size one. Then, under some regularity conditions which allow interchange of derivative and integral, ${\mathcal {I}}_{n}(\theta )=n{\mathcal {I}}(\theta )$ .

Proof. ${\begin{aligned}{\mathcal {I}}_{n}(\theta )&=\mathbb {E} \left[\left({\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}\right)^{2}\right]\\&=\operatorname {Var} \left({\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}\right)&{\text{by above remark}}\\&=\operatorname {Var} \left({\frac {\partial }{\partial \theta }}\left(\ln \prod _{i=1}^{n}f(X_{i};{\boldsymbol {\theta }})\right)\right)&\left({\mathcal {L}}(\theta ;\mathbf {x} )=\prod _{i=1}^{n}f(x_{i};\theta )\right)\\&=\operatorname {Var} \left({\frac {\partial }{\partial \theta }}\left(\sum _{i=1}^{n}\ln f(X_{i};{\boldsymbol {\theta }})\right)\right)\\&=\operatorname {Var} \left(\sum _{i=1}^{n}{\frac {\partial }{\partial \theta }}\ln f(X_{i};{\boldsymbol {\theta }})\right)&{\text{by linearity of differentiation}}\\&=\sum _{i=1}^{n}\operatorname {Var} \left({\frac {\partial }{\partial \theta }}\ln f(X_{i};{\boldsymbol {\theta }})\right)&{\text{by independence}}\\&=n\operatorname {Var} \left({\frac {\partial }{\partial \theta }}\ln f(X_{i};{\boldsymbol {\theta }})\right)&{\text{by identically distributed property}}\\&=n\mathbb {E} \left[\left({\frac {\partial \ln f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right)^{2}\right]&{\text{by above remark}}\\&=n{\mathcal {I}}(\theta ).\end{aligned}}$

$\Box$

Proposition. Under some regularity conditions which allow interchange of derivative and integral, ${\mathcal {I}}(\theta )=-\mathbb {E} \left[{\frac {\partial ^{2}\ln f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\right]$ .

Proof. ${\begin{aligned}\mathbb {E} \left[{\frac {\partial ^{2}\ln f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\right]&=\mathbb {E} \left[{\frac {\partial }{\partial \theta }}\left({\frac {\partial \ln f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right)\right]\\&=\mathbb {E} \left[{\frac {\partial }{\partial \theta }}\left({\frac {1}{f(X;{\boldsymbol {\theta }})}}\cdot {\frac {\partial f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right)\right]\\&=\mathbb {E} \left[{\frac {1}{f(X;{\boldsymbol {\theta }})}}\cdot {\frac {\partial ^{2}f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}-{\frac {\partial f(X;{\boldsymbol {\theta }})}{\partial \theta }}\cdot {\frac {1}{(f(X;{\boldsymbol {\theta }}))^{2}}}\cdot {\frac {\partial f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right]\\&=\mathbb {E} \left[{\frac {1}{f(X;{\boldsymbol {\theta }})}}\cdot {\frac {\partial ^{2}f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}-\left({\frac {\partial f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right)^{2}\cdot {\frac {1}{(f(X;{\boldsymbol {\theta }}))^{2}}}\right]\\&=\mathbb {E} \left[{\frac {1}{f(X;{\boldsymbol {\theta }})}}\cdot {\frac {\partial ^{2}f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\right]-\mathbb {E} \left[\left({\frac {\partial \ln f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right)^{2}\right]\\&=\mathbb {E} \left[{\frac {1}{f(X;{\boldsymbol {\theta }})}}\cdot {\frac {\partial ^{2}f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\right]-{\mathcal {I}}(\theta )\\\end{aligned}}$ Now, it suffices to prove that $\mathbb {E} \left[{\frac {1}{f(X;{\boldsymbol {\theta }})}}\cdot {\frac {\partial ^{2}f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\right]=0$ , which is true since ${\begin{aligned}\mathbb {E} \left[{\frac {1}{f(X;{\boldsymbol {\theta }})}}\cdot {\frac {\partial ^{2}f(X;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\right]&=\int _{-\infty }^{\infty }{\frac {1}{f(x;{\boldsymbol {\theta }})}}\cdot {\frac {\partial 
^{2}f(x;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\cdot f(x;{\boldsymbol {\theta }})\,dx\\&=\int _{-\infty }^{\infty }{\frac {\partial ^{2}f(x;{\boldsymbol {\theta }})}{\partial \theta ^{2}}}\,dx\\&={\frac {\partial ^{2}}{\partial \theta ^{2}}}\int _{-\infty }^{\infty }f(x;{\boldsymbol {\theta }})\,dx\\&={\frac {\partial ^{2}}{\partial \theta ^{2}}}(1)\\&=0.\\\end{aligned}}$

$\Box$

Remark.

This proposition can be quite useful, since after partially differentiating $\ln f(X;{\boldsymbol {\theta }})$ , it is likely that many $X$ 's will vanish, and thus the computation of the expectation will be easier.
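As a quick sanity check of the proposition, the following sketch computes the Fisher information of a single observation in both ways for a Bernoulli distribution with success probability $p$. The Bernoulli model and all function names are assumptions chosen for illustration only; both expectations are finite sums over $x\in \{0,1\}$ , so no simulation is needed.

```python
# Illustrative sketch (assumed model): Bernoulli pmf f(x; p) = p^x (1-p)^(1-x).

def bernoulli_pmf(x, p):
    """pmf of a Bernoulli(p) random variable at x in {0, 1}."""
    return p if x == 1 else 1 - p

def score(x, p):
    """First derivative of ln f(x; p) with respect to p."""
    return x / p - (1 - x) / (1 - p)

def second_derivative(x, p):
    """Second derivative of ln f(x; p) with respect to p."""
    return -x / p**2 - (1 - x) / (1 - p) ** 2

def fisher_info_squared_score(p):
    """E[(d/dp ln f(X; p))^2], computed as an exact sum over x."""
    return sum(bernoulli_pmf(x, p) * score(x, p) ** 2 for x in (0, 1))

def fisher_info_second_derivative(p):
    """-E[d^2/dp^2 ln f(X; p)], computed as an exact sum over x."""
    return -sum(bernoulli_pmf(x, p) * second_derivative(x, p) for x in (0, 1))

p = 0.3
print(fisher_info_squared_score(p))      # squared-score form
print(fisher_info_second_derivative(p))  # second-derivative form
print(1 / (p * (1 - p)))                 # closed form 1 / (p(1-p)) for comparison
```

The three printed values should agree up to floating-point rounding, and the second-derivative form needs no squaring of the score, which is the convenience the remark refers to.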

Theorem. (Cramer-Rao inequality) Let $X_{1},\dotsc ,X_{n}$ be a random sample from a distribution, and let $W$ be an unbiased estimator of $\tau (\theta )$ (a function of $\theta$ ). Then, under some regularity conditions which allow interchange of derivative and integral, $\operatorname {Var} (W)\geq {\frac {(\tau '(\theta ))^{2}}{{\mathcal {I}}_{n}(\theta )}}$ .

Proof. Since $W$ is an unbiased estimator of $\tau (\theta )$ , we have by definition $\mathbb {E} [W]=\tau (\theta )$ . By definition of expectation, we have $\mathbb {E} [W]=\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }w{\mathcal {L}}(\theta ;\mathbf {x} )\,dx_{n}\cdots \,dx_{1}$ where ${\mathcal {L}}(\theta ;\mathbf {x} )$ is the likelihood function. Thus, ${\begin{aligned}&&\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }w{\mathcal {L}}(\theta ;\mathbf {x} )\,dx_{n}\cdots \,dx_{1}&=\tau (\theta )\\&\Rightarrow &{\frac {\partial }{\partial \theta }}\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }w{\mathcal {L}}(\theta ;\mathbf {x} )\,dx_{n}\cdots \,dx_{1}&={\frac {\partial }{\partial \theta }}\tau (\theta )\\&\Rightarrow &\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\frac {\partial }{\partial \theta }}\left(w{\mathcal {L}}(\theta ;\mathbf {x} )\right)\,dx_{n}\cdots \,dx_{1}&=\tau '(\theta )\\&\Rightarrow &\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }w{\frac {\partial }{\partial \theta }}\left({\mathcal {L}}(\theta ;\mathbf {x} )\right)\cdot {\frac {1}{{\mathcal {L}}(\theta ;\mathbf {x} )}}\cdot {\mathcal {L}}(\theta ;\mathbf {x} )\,dx_{n}\cdots \,dx_{1}&=\tau '(\theta )\\&\Rightarrow &\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }w{\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta }}{\mathcal {L}}(\theta ;\mathbf {x} )\,dx_{n}\cdots \,dx_{1}&=\tau '(\theta )\\&\Rightarrow &\mathbb {E} \left[W\cdot {\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta }}\right]&=\tau '(\theta )\\&\Rightarrow &\mathbb {E} \left[WS(\theta ;\mathbf {X} )\right]&=\tau '(\theta )&\left(S(\theta ;\mathbf {X} )={\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta }}\right)\\&\Rightarrow &\mathbb {E} \left[WS(\theta ;\mathbf {X} )\right]-\mathbb {E} [W]\underbrace {\mathbb {E} [S(\theta ;\mathbf {X} )]} _{=0}&=\tau '(\theta )&(\mathbb {E} 
[S(\theta ;\mathbf {X} )]=0{\text{ by remark about Fisher information}})\\&\Rightarrow &\operatorname {Cov} (W,S(\theta ;\mathbf {X} ))&=\tau '(\theta )\\\end{aligned}}$ Consider the covariance inequality: $(\operatorname {Cov} (X,Y))^{2}\leq \operatorname {Var} (X)\operatorname {Var} (Y)$ . We have ${\big (}\operatorname {Cov} (W,S(\theta ;\mathbf {X} )){\big )}^{2}\leq \operatorname {Var} (W)\operatorname {Var} (S(\theta ;\mathbf {X} ))\implies (\tau '(\theta ))^{2}\leq \operatorname {Var} (W)\operatorname {Var} (S(\theta ;\mathbf {X} ))\implies \operatorname {Var} (W)\geq {\frac {(\tau '(\theta ))^{2}}{\operatorname {Var} (S(\theta ;\mathbf {X} ))}}={\frac {(\tau '(\theta ))^{2}}{{\mathcal {I}}_{n}(\theta )}}.$ ( ${\mathcal {I}}_{n}(\theta )=\operatorname {Var} (S(\theta ;\mathbf {X} ))$ by remark about Fisher information)

$\Box$

Remark.

${\frac {(\tau '(\theta ))^{2}}{{\mathcal {I}}_{n}(\theta )}}$ is called the Cramer-Rao lower bound (CRLB).
When $\tau (\theta )=\theta$ , meaning that $W$ is an unbiased estimator of $\theta$ , since $(\tau '(\theta ))^{2}=1^{2}=1$ , the CRLB becomes ${\frac {1}{{\mathcal {I}}_{n}(\theta )}}$ .

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Show that the MLE of $\mu$ , ${\overline {X}}$ , is an UMVUE of $\mu$ .

Proof. First, we can see that the regularity conditions are satisfied in this case. So, we may consider the CRLB of $\mu$ as follows. Since $\ln f(X;\mu ,\sigma ^{2})=\ln {\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {(X-\mu )^{2}}{2\sigma ^{2}}}\right)=-{\frac {1}{2}}(\ln 2\pi \sigma ^{2})-{\frac {(X-\mu )^{2}}{2\sigma ^{2}}}=-{\frac {1}{2}}(\ln 2\pi \sigma ^{2})-{\frac {X^{2}-2\mu X+\mu ^{2}}{2\sigma ^{2}}},$ we have ${\mathcal {I}}(\mu )=-\mathbb {E} \left[{\frac {\partial ^{2}}{\partial \mu ^{2}}}\left(-{\frac {1}{2}}(\ln 2\pi \sigma ^{2})-{\frac {X^{2}-2\mu X+\mu ^{2}}{2\sigma ^{2}}}\right)\right]=-\mathbb {E} \left[{\frac {\partial }{\partial \mu }}\left(-{\frac {-2X+2\mu }{2\sigma ^{2}}}\right)\right]=-\mathbb {E} \left[-\left({\frac {2}{2\sigma ^{2}}}\right)\right]=-\mathbb {E} [\underbrace {-{\frac {1}{\sigma ^{2}}}} _{{\text{constant wrt }}X}]={\frac {1}{\sigma ^{2}}}.$ Thus, the CRLB of $\mu$ is ${\frac {1}{n{\mathcal {I}}(\mu )}}={\frac {1}{n(1/\sigma ^{2})}}={\frac {\sigma ^{2}}{n}}$ .

On the other hand, the variance of ${\overline {X}}$ is $\operatorname {Var} ({\overline {X}})={\frac {\sigma ^{2}}{n}}$ (which is shown in a previous example), which equals the CRLB of $\mu$ . It follows that ${\overline {X}}$ is an UMVUE of $\mu$ .

$\Box$
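The equality $\operatorname {Var} ({\overline {X}})={\frac {\sigma ^{2}}{n}}$ can also be checked by simulation. The sketch below is only an illustration: the parameter values, the number of replications and the function name are arbitrary assumptions, and the simulated variance will match the CRLB only approximately.

```python
import random
import statistics

def simulate_sample_mean_variance(mu, sigma, n, reps, seed=0):
    """Estimate Var(sample mean) from `reps` simulated normal samples of size n."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.pvariance(means)

mu, sigma, n = 1.0, 2.0, 10          # hypothetical parameter values
crlb = sigma**2 / n                  # CRLB of mu derived in the example
estimate = simulate_sample_mean_variance(mu, sigma, n, reps=20000)
print("CRLB:", crlb)
print("simulated Var(sample mean):", estimate)
```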

Exercise. A student claims that ${\frac {X_{1}}{\sqrt {n}}}$ is another UMVUE of $\mu$ , since $\operatorname {Var} \left({\frac {X_{1}}{\sqrt {n}}}\right)={\frac {\sigma ^{2}}{({\sqrt {n}})^{2}}}={\frac {\sigma ^{2}}{n}}$ , which equals the CRLB of $\mu$ as well. Is the claim correct? Why?

Solution

Recall that UMVUE is an unbiased estimator.

The claim is wrong, since ${\frac {X_{1}}{\sqrt {n}}}$ is not an unbiased estimator in general. This is because $\mathbb {E} \left[{\frac {X_{1}}{\sqrt {n}}}\right]={\frac {\mu }{\sqrt {n}}}\neq \mu$ unless $n=1$ . But if $n=1$ , then this estimator is simply $X_{1}$ , which is exactly the same as ${\overline {X}}={\frac {X_{1}}{1}}=X_{1}$ . So, the estimator is not another UMVUE in this case.

Sometimes, we cannot use the CRLB method for finding UMVUE, because

the regularity conditions may not be satisfied, and thus we cannot use the Cramer-Rao inequality, and
the variance of the unbiased estimator may not be equal to the CRLB, but we cannot conclude that it is not an UMVUE, because it may be the case that the CRLB is not attainable at all, and the smallest variance among all unbiased estimators is actually the variance of that estimator, which is larger than the CRLB.

We will illustrate some examples for these two cases in the following.

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the uniform distribution ${\mathcal {U}}[0,\beta ]$ . If we want to find the UMVUE of $\beta$ , we cannot use the Cramer-Rao inequality to find it, since the support $[0,\beta ]$ depends on the parameter $\beta$ .

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from the normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ . It is given that in this case, ${\frac {nS^{2}}{\sigma ^{2}}}\sim \chi _{n-1}^{2}$ , where $\chi _{k}^{2}$ is the chi-squared distribution with $k$ degrees of freedom, and its variance is $2k$ . Calculate $\operatorname {Var} \left({\frac {n}{n-1}}\cdot S^{2}\right)$ , and the CRLB of $\sigma ^{2}$ .

Solution: By the given information, we have $\operatorname {Var} \left({\frac {nS^{2}}{\sigma ^{2}}}\right)=2(n-1)\implies {\frac {n^{2}}{\sigma ^{4}}}\operatorname {Var} (S^{2})=2(n-1)\implies \operatorname {Var} (S^{2})={\frac {2(n-1)\sigma ^{4}}{n^{2}}}.$ Hence, $\operatorname {Var} \left({\frac {n}{n-1}}\cdot S^{2}\right)={\frac {n^{2}}{(n-1)^{2}}}\cdot {\frac {2(n-1)\sigma ^{4}}{n^{2}}}={\frac {2\sigma ^{4}}{n-1}}$ .

On the other hand, since ${\begin{aligned}{\mathcal {I}}(\sigma )&=-\mathbb {E} \left[{\frac {\partial ^{2}}{\partial \sigma ^{2}}}\left(-{\frac {1}{2}}(\ln 2\pi \sigma ^{2})-{\frac {(X-\mu )^{2}}{2\sigma ^{2}}}\right)\right]\\&=-\mathbb {E} \left[{\frac {\partial }{\partial \sigma }}\left(-{\frac {4\pi \sigma }{2(2\pi \sigma ^{2})}}+{\frac {(X-\mu )^{2}}{\sigma ^{3}}}\right)\right]\\&=-\mathbb {E} \left[{\frac {\partial }{\partial \sigma }}\left(-{\frac {1}{\sigma }}+{\frac {(X-\mu )^{2}}{\sigma ^{3}}}\right)\right]\\&=-\mathbb {E} \left[{\frac {1}{\sigma ^{2}}}-{\frac {3(X-\mu )^{2}}{\sigma ^{4}}}\right]\\&=-{\frac {1}{\sigma ^{2}}}+{\frac {3}{\sigma ^{4}}}\mathbb {E} [(X-\mu )^{2}]\\&=-{\frac {1}{\sigma ^{2}}}+{\frac {3}{\sigma ^{4}}}\cdot \sigma ^{2}\\&={\frac {2}{\sigma ^{2}}},\\\end{aligned}}$ and, with $\tau (\sigma )=\sigma ^{2}$ , $\left({\frac {d}{d\sigma }}\sigma ^{2}\right)^{2}=(2\sigma )^{2}=4\sigma ^{2}$ , the CRLB of $\sigma ^{2}$ is ${\frac {4\sigma ^{2}}{2n/\sigma ^{2}}}={\frac {2\sigma ^{4}}{n}}.$

Remark.

${\frac {n}{n-1}}\cdot S^{2}$ is an unbiased estimator of $\sigma ^{2}$ , since $\mathbb {E} \left[{\frac {n}{n-1}}\cdot S^{2}\right]={\frac {n}{n-1}}\left({\frac {n-1}{n}}\sigma ^{2}\right)=\sigma ^{2}$ .
We can observe that $\operatorname {Var} \left({\frac {n}{n-1}}\cdot S^{2}\right)$ is greater than the CRLB. But does this mean ${\frac {n}{n-1}}\cdot S^{2}$ is not the UMVUE of $\sigma ^{2}$ ? We do not know, since we are not sure whether there is another unbiased estimator with variance less than $\operatorname {Var} \left({\frac {n}{n-1}}\cdot S^{2}\right)$ ; it is possible that the CRLB is not attainable.
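To see the size of the gap between the two quantities computed in this example, the following sketch evaluates ${\frac {2\sigma ^{4}}{n-1}}$ and ${\frac {2\sigma ^{4}}{n}}$ with exact rational arithmetic. The function names and the chosen values of $\sigma ^{2}$ and $n$ are assumptions made purely for illustration.

```python
from fractions import Fraction

def var_unbiased_sample_variance(sigma2, n):
    """Var(n/(n-1) * S^2) = 2 sigma^4 / (n - 1), as derived in the example."""
    return Fraction(2) * sigma2**2 / (n - 1)

def crlb_sigma2(sigma2, n):
    """CRLB of sigma^2 = 2 sigma^4 / n, as derived in the example."""
    return Fraction(2) * sigma2**2 / n

sigma2 = 4  # hypothetical value of sigma^2
for n in (2, 10, 100):
    var = var_unbiased_sample_variance(sigma2, n)
    bound = crlb_sigma2(sigma2, n)
    # The ratio var / bound equals n / (n - 1), which tends to 1 as n grows.
    print(n, var, bound, var / bound)
```

The ratio of the variance to the CRLB is ${\frac {n}{n-1}}$ , so the variance always exceeds the bound, but the relative gap shrinks as $n$ grows.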

Since the CRLB is sometimes attainable and sometimes not, it is natural to ask when the CRLB can be attained. In other words, we would like to know the attainment conditions for the CRLB, which are stated in the following corollary.

Corollary. (Attainable condition for the CRLB) Let $X_{1},\dotsc ,X_{n}$ be a random sample from a distribution, and let $W$ be an unbiased estimator of $\tau (\theta )$ . Suppose the regularity conditions in the Cramer-Rao inequality are satisfied. Then, the CRLB can be attained, i.e., there exists some $W$ such that $\operatorname {Var} (W)={\frac {(\tau '(\theta ))^{2}}{{\mathcal {I}}_{n}(\theta )}}$ , if and only if $k(W-\tau (\theta ))=S(\theta ;\mathbf {X} )$ where $S(\theta ;\mathbf {X} )={\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {X} )}{\partial \theta }}$ is the score function, and $k$ is a constant.

Proof. Considering the proof for Cramer-Rao inequality, we have $\operatorname {Var} (W)={\frac {(\tau '(\theta ))^{2}}{{\mathcal {I}}_{n}(\theta )}}\iff (\operatorname {Cov} (W,S(\theta ;\mathbf {X} )))^{2}=\operatorname {Var} (W)\operatorname {Var} (S(\theta ;\mathbf {X} ))$ We can write $\operatorname {Cov} (W,S(\theta ;\mathbf {X} ))$ as $\operatorname {Cov} (W-\underbrace {\tau (\theta )} _{\text{constant}},S(\theta ;\mathbf {X} ))$ (by result about covariance). Also, $\operatorname {Var} (W)=\operatorname {Var} (W-\underbrace {\tau (\theta )} _{\text{constant}})$ (by result about variance). Thus, we have ${\begin{aligned}&&{\big (}\operatorname {Cov} (W-\tau (\theta ),S(\theta ;\mathbf {X} )){\big )}^{2}&=\operatorname {Var} (W-\tau (\theta ))\operatorname {Var} (S(\theta ;\mathbf {X} ))\\&\Leftrightarrow &{\frac {{\big (}\operatorname {Cov} (W-\tau (\theta ),S(\theta ;\mathbf {X} )){\big )}^{2}}{\operatorname {Var} (W-\tau (\theta ))\operatorname {Var} (S(\theta ;\mathbf {X} ))}}&=1\\&\Leftrightarrow &{\frac {{\big (}\operatorname {Cov} (S(\theta ;\mathbf {X} ),W-\tau (\theta )){\big )}^{2}}{\operatorname {Var} (W-\tau (\theta ))\operatorname {Var} (S(\theta ;\mathbf {X} ))}}&=1\\&\Leftrightarrow &{\big (}\rho (S(\theta ;\mathbf {X} ),W-\tau (\theta )){\big )}^{2}&=1\\&\Leftrightarrow &\rho (S(\theta ;\mathbf {X} ),W-\tau (\theta ))&=\pm 1\end{aligned}}$ where $\rho (\cdot ,\cdot )$ is the correlation coefficient between two random variables. This means $S(\theta ;\mathbf {X} )$ increases or decreases linearly with $W-\tau (\theta )$ , i.e., $S(\theta ;\mathbf {X} )=k(W-\tau (\theta ))+c$ for some constants $c,k$ . Now, it suffices to show that the constant $c$ is actually zero.

We know that $\mathbb {E} [W]=\tau (\theta )$ (since $W$ is an unbiased estimator of $\tau (\theta )$ ), and $\mathbb {E} [S(\theta ;\mathbf {X} )]=0$ (from the remark about Fisher information). Thus, taking expectations on both sides gives $\mathbb {E} [S(\theta ;\mathbf {X} )]=k\mathbb {E} [W-\tau (\theta )]+c\iff \mathbb {E} [S(\theta ;\mathbf {X} )]=k(\underbrace {\mathbb {E} [W]-\tau (\theta )} _{=0})+c\iff 0=0+c\iff c=0.$ Then, the result follows.

$\Box$

Remark.

Considering the proof, we know that if the attainable condition is satisfied for an unbiased estimator $W$ , then the variance of $W$ equals the CRLB of $\tau (\theta )$ , i.e., that estimator is the UMVUE of $\tau (\theta )$ .

Example. We have shown that the log-likelihood function of a random sample $X_{1},\dotsc ,X_{n}$ from the normal distribution ${\mathcal {N}}(\mu ,\sigma ^{2})$ is $\ln {\mathcal {L}}(\mu ,\sigma ^{2})=-{\frac {n}{2}}\ln(2\pi \sigma ^{2})-\sum _{i=1}^{n}{\frac {(x_{i}-\mu )^{2}}{2\sigma ^{2}}}$ . Prove that the CRLB of $\mu$ is attainable using the attainable conditions for the CRLB.

Proof. The score function is $S(\mu )={\frac {\partial \ln {\mathcal {L}}(\mu ,\sigma ^{2})}{\partial \mu }}={\frac {\partial }{\partial \mu }}\left(-{\frac {n}{2}}\ln(2\pi \sigma ^{2})-\sum _{i=1}^{n}{\frac {({\color {darkgreen}X}_{i}-\mu )^{2}}{2\sigma ^{2}}}\right)=-\sum _{i=1}^{n}{\frac {-2{\color {darkgreen}X}_{i}+2\mu }{2\sigma ^{2}}}=\sum _{i=1}^{n}{\frac {{\color {darkgreen}X}_{i}-\mu }{\sigma ^{2}}}={\frac {1}{\sigma ^{2}}}\left(\sum _{i=1}^{n}X_{i}-\sum _{i=1}^{n}\mu \right)={\frac {n}{\sigma ^{2}}}({\overline {X}}-\mu ).$ Since we have $\tau (\mu )=\mu$ and ${\hat {\mu }}={\overline {X}}$ (which is an unbiased estimator of $\mu$ ), the attainable conditions for the CRLB are satisfied (the constant " $k$ " is ${\frac {n}{\sigma ^{2}}}$ in this case), and hence the CRLB of $\mu$ is attainable.

$\Box$
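The closed form $S(\mu )={\frac {n}{\sigma ^{2}}}({\overline {X}}-\mu )$ can be compared with a numerical derivative of the log-likelihood. The sketch below is illustrative only: the data values, the parameter values and the step size `h` are arbitrary assumptions, and a central finite difference is used in place of symbolic differentiation.

```python
import math

def log_likelihood(mu, sigma2, xs):
    """Log-likelihood of a normal random sample, as given in the example."""
    n = len(xs)
    return (-n / 2 * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def score_mu_numeric(mu, sigma2, xs, h=1e-5):
    """Central finite-difference approximation of d/dmu ln L."""
    return (log_likelihood(mu + h, sigma2, xs)
            - log_likelihood(mu - h, sigma2, xs)) / (2 * h)

def score_mu_closed_form(mu, sigma2, xs):
    """Closed form n / sigma^2 * (xbar - mu) derived in the proof."""
    n = len(xs)
    xbar = sum(xs) / n
    return n / sigma2 * (xbar - mu)

xs = [2.1, 3.4, 1.9, 2.8, 3.0]  # hypothetical realization of the sample
mu, sigma2 = 2.0, 1.5           # hypothetical parameter values
print(score_mu_numeric(mu, sigma2, xs))
print(score_mu_closed_form(mu, sigma2, xs))
```

Since the log-likelihood is quadratic in $\mu$ , the central difference agrees with the closed form up to rounding error.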

Remark.

Indeed, we know that the CRLB of $\mu$ is attainable before this proof since we have already found an unbiased estimator of $\mu$ , namely ${\overline {X}}$ , whose variance is exactly equal to the CRLB previously.

Example. Continue from the previous example. Prove that the CRLB of $\sigma ^{2}$ is not attainable using the attainable conditions for the CRLB.

Proof. The score function in this case is $S(\sigma )={\frac {\partial \ln {\mathcal {L}}(\mu ,\sigma ^{2})}{\partial \sigma }}={\frac {\partial }{\partial \sigma }}\left(-{\frac {n}{2}}\ln(2\pi \sigma ^{2})-\sum _{i=1}^{n}{\frac {({\color {darkgreen}X}_{i}-\mu )^{2}}{2\sigma ^{2}}}\right)=-{\frac {n}{2}}{\frac {4\pi \sigma }{2\pi \sigma ^{2}}}+{\frac {2}{2\sigma ^{3}}}\sum _{i=1}^{n}({\color {darkgreen}X_{i}}-\mu )^{2}=-{\frac {n}{\sigma }}+{\frac {1}{\sigma ^{3}}}\sum _{i=1}^{n}({\color {darkgreen}X_{i}}-\mu )^{2}=\underbrace {\frac {n}{\sigma ^{3}}} _{\text{constant}}\left(\sum _{i=1}^{n}{\frac {({\color {darkgreen}X_{i}}-\mu )^{2}}{n}}-\sigma ^{2}\right).$ Taking the constant $k={\frac {n}{\sigma ^{3}}}$ , the only candidate for an unbiased estimator of $\sigma ^{2}$ that achieves the CRLB is $\sum _{i=1}^{n}{\frac {({\color {darkgreen}X_{i}}-\mu )^{2}}{n}}$ . However, $\sum _{i=1}^{n}{\frac {({\color {darkgreen}X_{i}}-\mu )^{2}}{n}}$ is not a statistic here: it cannot be calculated from the sample alone since $\mu$ is unknown. It follows that there does not exist a statistic $W$ such that $S(\sigma )=k(W-\tau (\sigma ))$ , where $k$ is some constant and $\tau (\sigma )=\sigma ^{2}$ .

$\Box$

Remark.

Even if we know that the CRLB of $\sigma ^{2}$ is not attainable, we still do not know whether ${\frac {n}{n-1}}\cdot S^{2}$ is the UMVUE, since it is possible that there exists some unbiased estimator with smaller variance (which still does not achieve the CRLB).

We have discussed MLE previously, and MLE is actually a "best choice" asymptotically (i.e., as the sample size $n\to \infty$ ) according to the following theorem.

Theorem. Suppose ${\hat {\theta }}$ is the MLE of an unknown parameter $\theta$ from a distribution. Then, under some regularity conditions, as $n\to \infty$ , ${\frac {{\hat {\theta }}-\theta }{\sqrt {1/{\mathcal {I}}_{n}(\theta )}}}\;{\overset {d}{\to }}\;{\mathcal {N}}(0,1).$

Proof. Partial proof: we consider the Taylor series of order 2 for ${\frac {d}{d\theta }}\ln {\mathcal {L}}(\theta )$ , and we will get ${\frac {d}{d\theta }}\ln {\mathcal {L}}({\hat {\theta }})={\frac {d}{d\theta }}\ln {\mathcal {L}}(\theta )+({\hat {\theta }}-\theta ){\frac {d^{2}}{d\theta ^{2}}}\ln {\mathcal {L}}(\theta )+{\frac {1}{2}}({\hat {\theta }}-\theta )^{2}{\frac {d^{3}}{d\theta ^{3}}}\ln {\mathcal {L}}(\theta ){\bigg \vert }_{\theta =\theta ^{*}}$ where $\theta ^{*}$ is between $\theta$ and ${\hat {\theta }}$ . Since ${\hat {\theta }}$ is the MLE of $\theta$ , from the derivative test, we know that ${\frac {d}{d\theta }}\ln {\mathcal {L}}({\hat {\theta }})=0$ (we apply regularity condition to ensure the existence of this derivative). Hence, we have ${\begin{aligned}&&{\frac {d}{d\theta }}\ln {\mathcal {L}}(\theta )+({\hat {\theta }}-\theta ){\frac {d^{2}}{d\theta ^{2}}}\ln {\mathcal {L}}(\theta )+{\frac {1}{2}}({\hat {\theta }}-\theta )^{2}{\frac {d^{3}}{d\theta ^{3}}}\ln {\mathcal {L}}(\theta ){\bigg \vert }_{\theta =\theta ^{*}}&=0\\&\Rightarrow &-{\sqrt {n}}({\hat {\theta }}-\theta ){\frac {d^{2}}{d\theta ^{2}}}\ln {\mathcal {L}}(\theta )-{\frac {\sqrt {n}}{2}}({\hat {\theta }}-\theta )^{2}{\frac {d^{3}}{d\theta ^{3}}}\ln {\mathcal {L}}(\theta ){\bigg \vert }_{\theta =\theta ^{*}}={\sqrt {n}}{\frac {d}{d\theta }}\ln {\mathcal {L}}(\theta )\\&\Rightarrow &{\sqrt {n}}({\hat {\theta }}-\theta )={\frac {{\frac {d}{d\theta }}\ln {\mathcal {L}}(\theta )/{\sqrt {n}}}{-n^{-1}{\frac {d^{2}}{d\theta ^{2}}}\ln {\mathcal {L}}(\theta )-(2n)^{-1}({\hat {\theta }}-\theta ){\frac {d^{3}}{d\theta ^{3}}}\ln {\mathcal {L}}(\theta ){\bigg \vert }_{\theta =\theta ^{*}}}}.\end{aligned}}$ Since $\operatorname {Var} \left(\sum _{i=1}^{n}{\frac {\partial \ln f(X_{i};\theta )}{\partial \theta }}\right)=\sum _{i=1}^{n}\operatorname {Var} \left({\frac {\partial \ln f(X_{i};\theta )}{\partial \theta }}\right)=\sum _{i=1}^{n}\mathbb {E} \left[\left({\frac {\partial \ln 
f(X_{i};\theta )}{\partial \theta }}\right)^{2}\right]=n{\mathcal {I}}(\theta )\qquad (1),$ by the central limit theorem, ${\frac {{\frac {d}{d\theta }}\ln {\mathcal {L}}(\theta )}{\sqrt {n}}}={\frac {1}{\sqrt {n}}}\sum _{i=1}^{n}{\frac {\partial \ln f(X_{i};\theta )}{\partial \theta }}\;{\overset {d}{\to }}\;{\mathcal {N}}(0,(1/n)n{\mathcal {I}}(\theta ))\equiv {\mathcal {N}}(0,{\mathcal {I}}(\theta )).$ Furthermore, we apply the weak law of large numbers to show that $-n^{-1}{\frac {d^{2}}{d\theta ^{2}}}\ln {\mathcal {L}}(\theta )=-{\frac {1}{n}}\sum _{i=1}^{n}{\frac {\partial ^{2}\ln f(X_{i};\theta )}{\partial \theta ^{2}}}\;{\overset {p}{\to }}\;-\mathbb {E} \left[{\frac {\partial ^{2}\ln f(X_{i};\theta )}{\partial \theta ^{2}}}\right]={\mathcal {I}}(\theta )\qquad (2).$ It can be shown in a quite complicated way (and using regularity conditions) that $-(2n)^{-1}({\hat {\theta }}-\theta ){\frac {d^{3}}{d\theta ^{3}}}\ln {\mathcal {L}}(\theta ){\bigg \vert }_{\theta =\theta ^{*}}\;{\overset {p}{\to }}\;0\qquad (3).$ Considering $(2)$ and $(3)$ , using properties of convergence in probability, we have $-n^{-1}{\frac {d^{2}}{d\theta ^{2}}}\ln {\mathcal {L}}(\theta )-(2n)^{-1}({\hat {\theta }}-\theta ){\frac {d^{3}}{d\theta ^{3}}}\ln {\mathcal {L}}(\theta ){\bigg \vert }_{\theta =\theta ^{*}}\;{\overset {p}{\to }}\;{\mathcal {I}}(\theta )+0={\mathcal {I}}(\theta )\qquad (4).$ Considering the convergence in distribution obtained from $(1)$ together with $(4)$ , and using Slutsky's theorem, we have ${\sqrt {n}}({\hat {\theta }}-\theta )={\frac {{\frac {d}{d\theta }}\ln {\mathcal {L}}(\theta )/{\sqrt {n}}}{-n^{-1}{\frac {d^{2}}{d\theta ^{2}}}\ln {\mathcal {L}}(\theta )-(2n)^{-1}({\hat {\theta }}-\theta ){\frac {d^{3}}{d\theta ^{3}}}\ln {\mathcal {L}}(\theta ){\bigg \vert }_{\theta =\theta ^{*}}}}\;{\overset {d}{\to }}\;{\frac {Y}{{\mathcal {I}}(\theta )}}$ where $Y\sim {\mathcal {N}}(0,{\mathcal {I}}(\theta ))$ , and hence ${\frac {Y}{{\mathcal {I}}(\theta )}}\sim {\mathcal {N}}\left(0,{\frac {{\mathcal {I}}(\theta )}{[{\mathcal {I}}(\theta 
)]^{2}}}\right)\equiv {\mathcal {N}}(0,1/{\mathcal {I}}(\theta ))$ . It follows that ${\sqrt {n}}({\hat {\theta }}-\theta )\;{\overset {d}{\to }}\;{\mathcal {N}}(0,1/{\mathcal {I}}(\theta )).$ This means ${\hat {\theta }}-\theta \;{\overset {d}{\to }}\;{\mathcal {N}}(0,1/(n{\mathcal {I}}(\theta )))\equiv {\mathcal {N}}(0,1/{\mathcal {I}}_{n}(\theta )),$ and thus ${\frac {{\hat {\theta }}-\theta }{\sqrt {1/{\mathcal {I}}_{n}(\theta )}}}\;{\overset {d}{\to }}\;{\mathcal {N}}{\Bigg (}0,{\frac {1/(n{\mathcal {I}}(\theta ))}{1/\underbrace {{\mathcal {I}}_{n}(\theta )} _{=n{\mathcal {I}}(\theta )}}}{\Bigg )}\equiv {\mathcal {N}}(0,1)$ as desired.

$\Box$

Remark.

Equivalently, we can write ${\hat {\theta }}\;{\overset {d}{\to }}\;{\mathcal {N}}(\theta ,1/{\mathcal {I}}_{n}(\theta ))$ . Thus, the variance of MLE of $\theta$ achieves the CRLB of $\theta$ asymptotically. This means the MLE of $\theta$ is the UMVUE of $\theta$ asymptotically.
The regularity conditions are basically similar to the regularity conditions mentioned in Cramer-Rao inequality.
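The theorem can be illustrated by simulation. The sketch below assumes, purely as an example, an exponential distribution with rate $\lambda$ , for which the MLE is $1/{\overline {X}}$ and ${\mathcal {I}}_{n}(\lambda )=n/\lambda ^{2}$ ; the sample size, number of replications and function name are arbitrary choices. If the theorem applies, the standardized values should have mean close to 0 and variance close to 1.

```python
import math
import random
import statistics

def standardized_mle_draws(rate, n, reps, seed=0):
    """Simulate (MLE - rate) / sqrt(1 / I_n(rate)) for exponential samples."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    fisher_n = n / rate**2     # I_n(rate) = n / rate^2 for the exponential model
    draws = []
    for _ in range(reps):
        sample = [rng.expovariate(rate) for _ in range(n)]
        mle = 1 / statistics.fmean(sample)  # MLE of the rate is 1 / sample mean
        draws.append((mle - rate) * math.sqrt(fisher_n))
    return draws

draws = standardized_mle_draws(rate=2.0, n=200, reps=2000)
print("mean:", statistics.fmean(draws))
print("variance:", statistics.pvariance(draws))
```

For a finite $n$ the mean is not exactly 0 (the MLE $1/{\overline {X}}$ is slightly biased), which is consistent with the result being asymptotic only.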

Since we are not able to use the CRLB to find UMVUE in some situations, we will introduce another method to find UMVUE in the following, which uses the concepts of sufficiency and completeness.

Sufficiency

Intuitively, a sufficient statistic $T(X_{1},\dotsc ,X_{n})$ , which is a function of a given random sample $X_{1},\dotsc ,X_{n}$ , contains all information needed for estimating the unknown parameter (vector) $\theta$ . Thus, the statistic $T(X_{1},\dotsc ,X_{n})$ itself is "sufficient" for estimating the unknown parameter (vector) $\theta$ .

Formally, we can define and describe sufficient statistic as follows:

Definition. (Sufficient statistic) A statistic $T=T(X_{1},\dotsc ,X_{n})$ is a sufficient statistic for the unknown parameter (vector) $\theta$ if the conditional distribution of the random sample $X_{1},\dotsc ,X_{n}$ given $T$ does not depend on $\theta$ .

Remark.

The definition may be expressed as $f(x_{1},\dotsc ,x_{n}|T;\theta )=f(x_{1},\dotsc ,x_{n}|T)$ where $f$ is the joint pdf or pmf of $X_{1},\dotsc ,X_{n}$ .

The equation means that the joint conditional pmf or pdf of $X_{1},\dotsc ,X_{n}$ given (the value of) $T$ , computed with the parameter value $\theta$ (left-hand side), is the same as the joint conditional pmf or pdf of $X_{1},\dotsc ,X_{n}$ given (the value of) $T$ alone (right-hand side).

This means the pmf or pdf is not changed even if the parameter value $\theta$ is provided, which in turn means the joint conditional pmf or pdf of $X_{1},\dotsc ,X_{n}$ , given the value of $T$ , actually does not depend on $\theta$ .

$f(x_{1},\dotsc ,x_{n}|T)$ refers to $f_{X_{1},\dotsc ,X_{n}|T}(x_{1},\dotsc ,x_{n}|t)$ before the realization $T=t$ , and it is a random variable (the randomness comes from $T$ ).
After the realization $T=t$ , the equation still holds ( $T$ is modified to $T=t$ ).
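The definition can be checked directly in a small discrete case. The sketch below assumes, for illustration only, a Bernoulli random sample of size 4 with $T=\sum _{i=1}^{n}X_{i}$ (a model and statistic not taken from the text above), and computes the conditional pmf of the sample given $T$ by enumerating all outcomes. If $T$ is sufficient, the printed value should be the same for every $p$ .

```python
from itertools import product

def joint_pmf(xs, p):
    """Joint pmf of a Bernoulli(p) random sample at the point xs."""
    t = sum(xs)
    return p**t * (1 - p) ** (len(xs) - t)

def conditional_pmf_given_sum(xs, p):
    """P(X = xs | T = sum(xs); p), obtained by enumerating all 0/1 vectors."""
    n, t = len(xs), sum(xs)
    pmf_t = sum(joint_pmf(ys, p)
                for ys in product((0, 1), repeat=n) if sum(ys) == t)
    return joint_pmf(xs, p) / pmf_t

xs = (1, 0, 1, 0)  # hypothetical realization with T = 2
for p in (0.2, 0.5, 0.9):
    print(p, conditional_pmf_given_sum(xs, p))
```

Here there are $\binom {4}{2}=6$ outcomes with $T=2$ , all equally likely given $T$ , so each conditional probability is approximately $1/6$ whatever the value of $p$ .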

Example. Consider a random sample $X_{1},\dotsc ,X_{n}$ from ${\mathcal {N}}(\mu ,\sigma ^{2})$ . It can be shown that ${\overline {X}}$ is a sufficient statistic for $\mu$ , but not a sufficient statistic for $\sigma ^{2}$ .

This can be shown by applying the definition. However, we will later give an alternative and often more convenient method to check the sufficiency of a statistic, and find sufficient statistics. We will explain informally why it is true here.

${\overline {X}}$ contains the information of the central tendency of the distribution, which should be the information needed to estimate the mean $\mu$ . Hence, it is a sufficient statistic for $\mu$ .
However, ${\overline {X}}$ does not contain the information of the dispersion of the distribution (it only tells us the "central location", but for a particular central location, the dispersion can be very different), which should be the information needed to estimate the variance $\sigma ^{2}$ . Hence, it is not a sufficient statistic for $\sigma ^{2}$ .

Remark.

From here, we can also expect that the sufficient statistic is not unique, since, for example, $2{\overline {X}}$ should also contain the information of the central tendency (since we can divide it by 2 to get the value of ${\overline {X}}$ , and thus get the information).
Indeed, in general, given $T$ is a sufficient statistic for $\theta$ , then $v(T)$ is also a sufficient statistic for $\theta$ , provided that $v$ is a bijective function (also known as invertible function, one-to-one correspondence, or bijection), so that its inverse exists.

Let us state the above remark about transformation of sufficient statistic formally below.

Proposition. Let $T$ be a sufficient statistic of the unknown parameter (vector) $\theta$ . Then, $v(T)$ is also a sufficient statistic of $\theta$ for each bijective function $v$ .

Now, we discuss a theorem that helps us to check the sufficiency of a statistic, namely the (Fisher-Neyman) factorization theorem.

Theorem. (Factorization theorem) Let $f(x_{1},\dotsc ,x_{n};\theta )$ be the joint pdf or pmf of a random sample $X_{1},\dotsc ,X_{n}$ . A statistic $T=T(X_{1},\dotsc ,X_{n})$ is a sufficient statistic of $\theta$ if and only if there exist functions $g$ and $h$ such that $f(x_{1},\dotsc ,x_{n};\theta )=g(T(x_{1},\dotsc ,x_{n});\theta )h(x_{1},\dotsc ,x_{n})$ where $g$ depends on $x_{1},\dotsc ,x_{n}$ only through $T(x_{1},\dotsc ,x_{n})$ , and $h$ does not depend on $\theta$ .

Proof. Since the proof for the continuous case is quite complicated, we will only give a proof for the discrete case. For simplicity of presentation, let $\mathbf {X} =(X_{1},\dotsc ,X_{n})$ , $T=T(X_{1},\dotsc ,X_{n})$ , $\mathbf {x} =(x_{1},\dotsc ,x_{n})$ , and $t=T(x_{1},\dotsc ,x_{n})$ ; the different pmfs below are distinguished by subscripts accordingly. By definition, $T$ is a sufficient statistic exactly when $f_{\mathbf {X} |T}(\mathbf {x} |t;\theta )=f_{\mathbf {X} |T}(\mathbf {x} |t)$ . Also, since $t=T(\mathbf {x} )$ , we have $\{\mathbf {X} =\mathbf {x} \}=\{\mathbf {X} =\mathbf {x} \}\cap \{T(\mathbf {X} )=T(\mathbf {x} )\}=\{\mathbf {X} =\mathbf {x} \}\cap \{T=t\}$ . Thus, we can write $f_{\mathbf {X} ,T}(\mathbf {x} ,t;\theta )=f_{\mathbf {X} }(\mathbf {x} ;\theta )\quad (*)$ .

"only if" ( $\Rightarrow$ ) direction: Assume $T$ is a sufficient statistic. Then, we choose $g(t;\theta )=f_{T}(t;\theta )$ and $h(\mathbf {x} )=f_{\mathbf {X} |T}(\mathbf {x} |t)$ , which does not depend on $\theta$ by the definition of sufficient statistic. It remains to verify that the equation actually holds for this choice.

Indeed, $f_{\mathbf {X} }(\mathbf {x} ;\theta ){\overset {\text{ (*) }}{=}}f_{\mathbf {X} ,T}(\mathbf {x} ,t;\theta ){\overset {\text{ def }}{=}}f_{\mathbf {X} |T}(\mathbf {x} |t;\theta )f_{T}(t;\theta ){\overset {\text{ sufficiency }}{=}}f_{\mathbf {X} |T}(\mathbf {x} |t)f_{T}(t;\theta )=h(\mathbf {x} )g(t;\theta ).$

"if" ( $\Leftarrow$ ) direction: Assume we can write $f_{\mathbf {X} }(\mathbf {x} ;\theta )=g(t;\theta )h(\mathbf {x} )$ . Then, $f_{T}(t;\theta ){\overset {\text{ marginal pmf }}{=}}\sum _{\mathbf {y} :T(\mathbf {y} )=t}f_{\mathbf {X} ,T}(\mathbf {y} ,t;\theta ){\overset {\text{ (*) }}{=}}\sum _{\mathbf {y} :T(\mathbf {y} )=t}f_{\mathbf {X} }(\mathbf {y} ;\theta ){\overset {\text{ assumption }}{=}}\sum _{\mathbf {y} :T(\mathbf {y} )=t}g(t;\theta )h(\mathbf {y} )=\underbrace {g(t;\theta )} _{{\text{independent from }}\mathbf {y} }\sum _{\mathbf {y} :T(\mathbf {y} )=t}h(\mathbf {y} ),$ where each sum runs over the sample values $\mathbf {y}$ with $T(\mathbf {y} )=t$ only (the joint pmf $f_{\mathbf {X} ,T}(\mathbf {y} ,t;\theta )$ is zero for every other $\mathbf {y}$ ). Now, we aim to show that $f_{\mathbf {X} |T}(\mathbf {x} |t)$ does not depend on $\theta$ , which means $T$ is a sufficient statistic for $\theta$ . We have $f_{\mathbf {X} |T}(\mathbf {x} |t){\overset {\text{ def }}{=}}{\frac {f_{\mathbf {X} ,T}(\mathbf {x} ,t;\theta )}{f_{T}(t;\theta )}}{\overset {\text{ (*) }}{=}}{\frac {f_{\mathbf {X} }(\mathbf {x} ;\theta )}{f_{T}(t;\theta )}}={\frac {\overbrace {g(t;\theta )h(\mathbf {x} )} ^{\text{assumption}}}{\underbrace {g(t;\theta )\sum _{\mathbf {y} :T(\mathbf {y} )=t}h(\mathbf {y} )} _{\text{above}}}}={\frac {h(\mathbf {x} )}{\sum _{\mathbf {y} :T(\mathbf {y} )=t}h(\mathbf {y} )}},$ which does not depend on $\theta$ , as desired.

$\Box$

Remark.

$h(x_{1},\dotsc ,x_{n})$ can also be a constant, which clearly does not depend on $\theta$ .

Example. Consider a random sample $X_{1},\dotsc ,X_{n}$ from ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Find a sufficient statistic of $\theta =(\mu ,\sigma ^{2})$ .

Solution: The joint pdf of $X_{1},\dotsc ,X_{n}$ is ${\begin{aligned}f(x_{1},\dotsc ,x_{n};\theta )&=\prod _{i=1}^{n}{\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {(x_{i}-\mu )^{2}}{2\sigma ^{2}}}\right)\\&=(2\pi \sigma ^{2})^{-n/2}\exp \left(-\sum _{i=1}^{n}{\frac {(x_{i}-\mu )^{2}}{2\sigma ^{2}}}\right)\\&=(2\pi \sigma ^{2})^{-n/2}\exp \left(-\sum _{i=1}^{n}{\frac {(x_{i}{\color {darkgreen}-{\overline {x}}+{\overline {x}}}-\mu )^{2}}{2\sigma ^{2}}}\right)\\&=(2\pi \sigma ^{2})^{-n/2}\exp \left(-\sum _{i=1}^{n}{\frac {(x_{i}{\color {darkgreen}-{\overline {x}}})^{2}+2(x_{i}-{\overline {x}})({\overline {x}}-\mu )+({\color {darkgreen}{\overline {x}}}-\mu )^{2}}{2\sigma ^{2}}}\right)\\&=(2\pi \sigma ^{2})^{-n/2}\exp \left(-\sum _{i=1}^{n}{\frac {(x_{i}{\color {darkgreen}-{\overline {x}}})^{2}+({\color {darkgreen}{\overline {x}}}-\mu )^{2}}{2\sigma ^{2}}}\right)&\left(\sum _{i=1}^{n}(x_{i}-{\overline {x}})({\overline {x}}-\mu )=({\overline {x}}-\mu )\sum _{i=1}^{n}(x_{i}-{\overline {x}})=({\overline {x}}-\mu )\left(\sum _{i=1}^{n}x_{i}-\sum _{i=1}^{n}{\overline {x}}\right)=({\overline {x}}-\mu )(n{\overline {x}}-n{\overline {x}})=0\right)\\&=(2\pi \sigma ^{2})^{-n/2}\exp \left(-{\frac {1}{2\sigma ^{2}}}\left(\sum _{i=1}^{n}(x_{i}{\color {darkgreen}-{\overline {x}}})^{2}+\sum _{i=1}^{n}({\color {darkgreen}{\overline {x}}}-\mu )^{2}\right)\right)\\&=\underbrace {(2\pi )^{-n/2}} _{h(x_{1},\dotsc ,x_{n})}\underbrace {\sigma ^{-n}\exp \left(-{\frac {1}{2\sigma ^{2}}}\left(ns^{2}+n({\overline {x}}-\mu )^{2}\right)\right)} _{g(T(x_{1},\dotsc ,x_{n});\theta )}&\left(({\overline {x}}-\mu )^{2}{\text{ is independent from }}i\right),\\\end{aligned}}$ where $s^{2}={\frac {1}{n}}\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}$ . Notice that the function $g$ depends on $x_{1},\dotsc ,x_{n}$ only through $T(x_{1},\dotsc ,x_{n})=({\overline {x}},s^{2})$ , so we can conclude by the factorization theorem that $T(X_{1},\dotsc ,X_{n})=({\overline {X}},S^{2})$ is a sufficient statistic for $\theta$ .

Remark.

We can also write $({\overline {X}},S^{2})$ as $(S^{2},{\overline {X}})$ , which is also a sufficient statistic for $\theta$ .

Intuitively, this is because the latter one also contains the same statistics, and thus contains the same information.
Alternatively, we can define the function $v$ as $(z_{1},z_{2})\mapsto (z_{2},z_{1})$ , which is a bijective function, so $v({\overline {X}},S^{2})=(S^{2},{\overline {X}})$ is also a sufficient statistic for $\theta$ .

We need to separate " $\sigma ^{-n}$ " out from " $(2\pi \sigma ^{2})^{-n/2}$ ", since for the function $h(x_{1},\dotsc ,x_{n})$ , it cannot depend on $\theta =(\mu ,\sigma ^{2})$ . Hence, we cannot include " $\sigma ^{-n}$ " in the definition of the $h(x_{1},\dotsc ,x_{n})$ function.
There are many ways to define the $g$ and $h$ functions in this case.
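A consequence of the factorization in this example is that two realizations with the same value of $({\overline {x}},s^{2})$ must have the same joint pdf for every $\theta$ . The sketch below checks this numerically; the two data sets and the parameter values are hypothetical choices made for illustration, and $s^{2}$ uses the divisor $n$ as in the text.

```python
import math

def joint_normal_pdf(xs, mu, sigma2):
    """Joint pdf of a normal random sample evaluated at the point xs."""
    return math.prod(
        math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
        for x in xs
    )

def mean_and_s2(xs):
    """Return (xbar, s^2), where s^2 uses the divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    return xbar, s2

a = [0.0, 3.0, 3.0]  # hypothetical sample
b = [1.0, 1.0, 4.0]  # different sample with the same mean and the same s^2
print(mean_and_s2(a), mean_and_s2(b))
for mu, sigma2 in [(0.0, 1.0), (2.0, 0.5), (-1.0, 4.0)]:
    print(joint_normal_pdf(a, mu, sigma2), joint_normal_pdf(b, mu, sigma2))
```

Although the two samples differ, each printed pair of joint pdf values should agree up to rounding, since the data enter the joint pdf only through $({\overline {x}},s^{2})$ .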

For some "nice" distributions, which belong to the exponential family, sufficient statistics can be found more conveniently using an alternative method. This method works because of the "nice" form of the pdf or pmf of those distributions, which can be characterized as follows:

Definition. (Exponential family) The distribution of a random variable $X$ belongs to the exponential family if the pdf or pmf of $X$ has the form of $f(x;\theta )=h(x)g(\theta )\exp \left(\sum _{i=1}^{\color {darkgreen}s}\eta _{i}(\theta )T_{i}(x)\right)$ where $\theta =(\theta _{1},\dotsc ,\theta _{\color {darkgreen}s})\in \Theta \subseteq \mathbb {R} ^{\color {darkgreen}s}$ , for some functions $h,g,\eta _{i},T_{i}$ ( $i=1,2,\dotsc ,s$ ).

Remark.

The value of $s$ depends on the number of unknown parameters.

Notice that $s$ can be 1, and in this case the " $\theta$ " is just a single parameter.

The exponential family includes many common distributions, e.g. normal, exponential, gamma, chi-squared, beta, Bernoulli, Poisson, geometric, etc.

However, some common distributions do not belong to the exponential family, e.g. Student's $t$ -distribution, $F$ -distribution, Cauchy distribution and hypergeometric distribution.

Example. The normal distribution belongs to the exponential family, where $\theta =(\mu ,\sigma ^{2})\in \mathbb {R} ^{2}$ (so the " $s$ " is 2 in this case), since its pdf can be expressed as $f(x;\theta )={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right)={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {x^{2}-2\mu x+\mu ^{2}}{2\sigma ^{2}}}\right)=\left[{\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {\mu ^{2}}{2\sigma ^{2}}}\right)\right]\exp \left(-{\frac {x^{2}-2\mu x}{2\sigma ^{2}}}\right)=\underbrace {(1)} _{h(x)}\underbrace {\left[{\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {\mu ^{2}}{2\sigma ^{2}}}\right)\right]} _{g(\theta )}\exp {\Bigg [}\underbrace {\frac {\mu }{\sigma ^{2}}} _{\eta _{1}(\theta )}\cdot \underbrace {x} _{T_{1}(x)}+\underbrace {\left(-{\frac {1}{2\sigma ^{2}}}\right)} _{\eta _{2}(\theta )}\cdot \underbrace {x^{2}} _{T_{2}(x)}{\Bigg ]}$
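The decomposition above can be checked numerically. The following Python sketch compares the usual normal pdf with the product $h(x)g(\theta )\exp(\eta _{1}(\theta )T_{1}(x)+\eta _{2}(\theta )T_{2}(x))$ ; the function names and the test point are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Usual form of the N(mu, sigma2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_pdf_expfam(x, mu, sigma2):
    """Same density written as h(x) * g(theta) * exp(eta1*T1(x) + eta2*T2(x))."""
    h = 1.0
    g = math.exp(-mu ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    eta1, eta2 = mu / sigma2, -1.0 / (2 * sigma2)
    t1, t2 = x, x ** 2
    return h * g * math.exp(eta1 * t1 + eta2 * t2)

x, mu, sigma2 = 0.7, 1.2, 2.5
print(normal_pdf(x, mu, sigma2), normal_pdf_expfam(x, mu, sigma2))
```

The two printed values agree up to floating-point rounding, for any $x$ , $\mu$ and $\sigma ^{2}>0$ .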

Theorem. (Sufficient statistic for exponential family) Let $X_{1},\dotsc ,X_{n}$ be a random sample from a distribution belonging to the exponential family, with pdf or pmf $f(x;\theta )$ where $\theta \in \mathbb {R} ^{s}$ . Then, a sufficient statistic for $\theta$ is $T(X_{1},\dotsc ,X_{n})=\left(\sum _{j=1}^{n}T_{1}(X_{j}),\dotsc ,\sum _{j=1}^{n}T_{s}(X_{j})\right).$

Proof. Since the distribution belongs to the exponential family, the joint pdf or pmf of $X_{1},\dotsc ,X_{n}$ can be expressed as ${\begin{aligned}f(x_{1},\dotsc ,x_{n};\theta )&=\prod _{{\color {blue}j}=1}^{n}\left[h(x_{\color {blue}j})g(\theta )\exp \left(\sum _{i=1}^{\color {darkgreen}s}\eta _{i}(\theta )T_{i}(x_{\color {blue}j})\right)\right]\\&=\left[\prod _{j=1}^{n}h(x_{j})\right](g(\theta ))^{n}\exp \left(\sum _{{\color {blue}j}=1}^{n}\sum _{i=1}^{s}\eta _{i}(\theta )T_{i}(x_{\color {blue}j})\right)\\&=\left[\prod _{j=1}^{n}h(x_{j})\right](g(\theta ))^{n}\exp \left(\sum _{i=1}^{s}\sum _{{\color {blue}j}=1}^{n}\eta _{i}(\theta )T_{i}(x_{\color {blue}j})\right)&({\text{changing summation order, where the upper bounds are constants}})\\&=\left[\prod _{j=1}^{n}h(x_{j})\right](g(\theta ))^{n}\exp \left(\sum _{i=1}^{s}\underbrace {\eta _{i}(\theta )} _{{\text{independent from }}j}\sum _{{\color {blue}j}=1}^{n}T_{i}(x_{\color {blue}j})\right)\\&={\color {purple}\left[\prod _{j=1}^{n}h(x_{j})\right]}{\color {red}(g(\theta ))^{n}\exp \left(\eta _{1}(\theta )\sum _{{\color {blue}j}=1}^{n}T_{1}(x_{\color {blue}j})+\dotsb +\eta _{s}(\theta )\sum _{{\color {blue}j}=1}^{n}T_{s}(x_{\color {blue}j})\right)}.\\\end{aligned}}$ From here, for applying the factorization theorem, we can identify the purple part of the function as " $h(x_{1},\dotsc ,x_{n})$ ", and the red part of the function as " $g(T(x_{1},\dotsc ,x_{n});\theta )$ ". We can notice that the red part of the function depends on $x_{1},\dotsc ,x_{n}$ only through $\left(\sum _{j=1}^{n}T_{1}(x_{j}),\dotsc ,\sum _{j=1}^{n}T_{s}(x_{j})\right)$ . The result follows.

$\Box$

Example. Consider a random sample $X_{1},\dotsc ,X_{n}$ from ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Show that a sufficient statistic of $\theta =(\mu ,\sigma ^{2})$ is $\left({\overline {X}},S^{2}\right)$ , using the result for finding sufficient statistic for exponential family.

Proof. In the previous example, we have shown that the normal distribution belongs to the exponential family with $T_{1}(x)=x$ and $T_{2}(x)=x^{2}$ . Hence, by the theorem about sufficient statistic for exponential family, a sufficient statistic of $\theta$ is $T=\left(\sum _{j=1}^{n}X_{j},\sum _{j=1}^{n}X_{j}^{2}\right)=\left(n{\overline {X}},n{\overline {X^{2}}}\right)$ .

Since $S^{2}={\frac {1}{n}}\sum _{j=1}^{n}(X_{j}-{\overline {X}})^{2}={\frac {1}{n}}\sum _{j=1}^{n}\left(X_{j}^{2}-2X_{j}{\overline {X}}+({\overline {X}})^{2}\right)={\frac {\sum _{j=1}^{n}X_{j}^{2}}{n}}-{\frac {2{\overline {X}}}{n}}\sum _{j=1}^{n}X_{j}+({\overline {X}})^{2}={\overline {X^{2}}}-2({\overline {X}})^{2}+({\overline {X}})^{2}={\overline {X^{2}}}-({\overline {X}})^{2}$ , we can define function $v$ as $(z_{1},z_{2})\mapsto \left(z_{1}/n,z_{2}/n-(z_{1}/n)^{2}\right),$ which is a bijective function (its inverse is $(y_{1},y_{2})\mapsto \left(ny_{1},n(y_{2}+y_{1}^{2})\right)$ ).

Thus, $v(T)=\left({\overline {X}},S^{2}\right)$ is also a sufficient statistic of $\theta$ .

$\Box$
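The two steps of this proof (summing each $T_{i}$ over the sample, then applying the bijection $v$ ) can be sketched in Python as follows. The helper names (`expfam_sufficient_statistic`, `v`, `v_inverse`) and the sample are hypothetical and chosen only for illustration; here $n$ is passed explicitly to $v$ since it is fixed by the sample.

```python
import statistics

def expfam_sufficient_statistic(sample, t_funcs):
    """Return (sum_j T_1(x_j), ..., sum_j T_s(x_j)) for the given functions T_i."""
    return tuple(sum(t(x) for x in sample) for t in t_funcs)

def v(z1, z2, n):
    """Map (sum x_j, sum x_j^2) to (x_bar, s2), with s2 using the 1/n convention."""
    return z1 / n, z2 / n - (z1 / n) ** 2

def v_inverse(x_bar, s2, n):
    """Inverse of v: recover (sum x_j, sum x_j^2) from (x_bar, s2)."""
    return n * x_bar, n * (s2 + x_bar ** 2)

sample = [2.0, 4.0, 4.0, 6.0]
n = len(sample)

# T_1(x) = x and T_2(x) = x^2 for the normal distribution
T = expfam_sufficient_statistic(sample, [lambda x: x, lambda x: x ** 2])
x_bar, s2 = v(*T, n)

print(T)                      # (16.0, 72.0)
print(x_bar, s2)              # 4.0 2.0
print(v_inverse(x_bar, s2, n))  # (16.0, 72.0)
print(statistics.fmean(sample), statistics.pvariance(sample))  # 4.0 2.0
```

Since `v_inverse` recovers `T` from `(x_bar, s2)`, both pairs carry exactly the same information about the sample.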

Now, we will start discussing how sufficient statistics are related to the UMVUE. We begin our discussion with the Rao-Blackwell theorem.

Theorem. (Rao-Blackwell theorem) Let $W$ be an arbitrary unbiased estimator of $\tau (\theta )$ , and $T$ be a sufficient statistic of $\theta$ . Define $\varphi (T)=\mathbb {E} [W|T]$ . Then, $\varphi (T)$ is an unbiased estimator of $\tau (\theta )$ and $\operatorname {Var} (\varphi (T))\leq \operatorname {Var} (W)$ .

Proof. Assume $W$ is an arbitrary unbiased estimator of $\tau (\theta )$ , and $T$ is a sufficient statistic of $\theta$ .

First, we prove that $\varphi (T)$ is an unbiased estimator of $\tau (\theta )$ . Before proving the unbiasedness, we should ensure that $\varphi (T)$ is actually an estimator, i.e., it is a statistic, which is a function of the random sample and does not depend on $\theta$ (so that it is calculable). Since $W$ is a function of the random sample and $T$ is a sufficient statistic, the conditional distribution of $W$ given $T$ does not depend on $\theta$ , and hence neither does $\varphi (T)=\mathbb {E} [W|T]$ . Also, $\varphi (T)$ is a function of $T$ , which is itself a function of the random sample, and thus $\varphi (T)$ is also a function of the random sample.

Now, we prove that $\varphi (T)$ is an unbiased estimator of $\tau (\theta )$ : since $\mathbb {E} [\varphi (T)]=\mathbb {E} [\mathbb {E} [W|T]]{\overset {\text{ law of total expectation }}{=}}\mathbb {E} [W]{\overset {\text{ unbiasedness }}{=}}\tau (\theta )$ , $\varphi (T)$ is an unbiased estimator of $\tau (\theta )$ .

Next, we prove that $\operatorname {Var} (\varphi (T))\leq \operatorname {Var} (W)$ : by law of total variance, we have $\operatorname {Var} (W)=\operatorname {Var} (\mathbb {E} [W|T])+\mathbb {E} [\operatorname {Var} (W|T)]{\overset {\text{ def }}{=}}\operatorname {Var} (\varphi (T))+\overbrace {\mathbb {E} [\underbrace {\operatorname {Var} (W|T)} _{\geq 0}]} ^{\geq 0}\geq \operatorname {Var} (\varphi (T)),$ as desired.

$\Box$

Remark.

The random variable $\varphi (T)=\mathbb {E} [W|T]$ is determined by first finding $\varphi (t)=\mathbb {E} [W|T=t]$ , and then replacing $t$ by $T$ . Here, $\varphi (t)$ is the realization of $\varphi (T)$ .
From Rao-Blackwell theorem, we know that $\varphi (T)=\mathbb {E} [W|T]$ is a better (or at least "same quality") estimator than $W$ in efficiency sense. Notice that the theorem does not state that $\varphi (T)$ is the best estimator in efficiency sense (i.e., UMVUE). Instead, it only states that $\varphi (T)$ is better than $W$ in efficiency sense.
Applying the theorem again with the same sufficient statistic $T$ gives no further improvement: since $\varphi (T)$ is already a function of $T$ , we have $\mathbb {E} [\varphi (T)|T]=\varphi (T)$ . In particular, the theorem alone does not guarantee that $\varphi (T)$ is an UMVUE.

We can interpret the theorem as "improving" the unbiased estimator $W$ by conditioning on a sufficient statistic: every unbiased estimator can be replaced by a function of $T$ that is at least as good (in efficiency sense).
If an UMVUE $W$ exists, then $\mathbb {E} [W|T]$ is unbiased with variance no larger than that of $W$ , so it is also an UMVUE. Since UMVUE is unique, $W=\mathbb {E} [W|T]$ (with probability 1), i.e., the UMVUE must be a function of the sufficient statistic $T$ .
Thus, we can now narrow the candidates for UMVUE to functions of the sufficient statistic $T$ .
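To make the theorem concrete, here is a minimal Python sketch that carries out the conditioning exactly (with rational arithmetic) by enumerating all outcomes. It assumes an i.i.d. $\operatorname {Ber} (p)$ sample, for which $T=\sum _{j=1}^{n}X_{j}$ is a sufficient statistic, and takes the crude estimator $W=X_{1}X_{2}$ of $\tau (p)=p^{2}$ . The function name `rao_blackwell` and the chosen $n$ and $p$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def rao_blackwell(w, n, p):
    """Exact Rao-Blackwellization for an i.i.d. Bernoulli(p) sample of size n.

    w maps an outcome tuple in {0,1}^n to a number; T is the sample sum.
    Returns (phi, mean_w, var_w, var_phi), where phi[t] = E[W | T = t].
    """
    prob_t = {}   # P(T = t)
    sum_w_t = {}  # E[W * 1{T = t}]
    mean_w = Fraction(0)
    mean_w2 = Fraction(0)
    for xs in product((0, 1), repeat=n):
        t = sum(xs)
        pr = p ** t * (1 - p) ** (n - t)
        wx = Fraction(w(xs))
        prob_t[t] = prob_t.get(t, Fraction(0)) + pr
        sum_w_t[t] = sum_w_t.get(t, Fraction(0)) + wx * pr
        mean_w += wx * pr
        mean_w2 += wx ** 2 * pr
    phi = {t: sum_w_t[t] / prob_t[t] for t in prob_t}
    mean_phi = sum(phi[t] * prob_t[t] for t in prob_t)
    var_w = mean_w2 - mean_w ** 2
    var_phi = sum(phi[t] ** 2 * prob_t[t] for t in prob_t) - mean_phi ** 2
    return phi, mean_w, var_w, var_phi

n, p = 4, Fraction(1, 3)
phi, mean_w, var_w, var_phi = rao_blackwell(lambda xs: xs[0] * xs[1], n, p)
phi_other, *_ = rao_blackwell(lambda xs: xs[0] * xs[1], n, Fraction(3, 4))

print(phi)             # E[W | T = t] for t = 0, ..., n
print(phi == phi_other)  # the conditional expectation does not depend on p
print(mean_w, var_w, var_phi)
```

In this sketch $\varphi (t)=t(t-1)/(n(n-1))$ , the same function for both values of $p$ (as sufficiency requires), and its variance is strictly smaller than that of $W$ .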

To actually determine the UMVUE, we need another theorem, called Lehmann-Scheffé theorem, which is based on Rao-Blackwell theorem, and requires the concept of completeness.

Completeness

Definition. (Complete statistic) Let $X_{1},\dotsc ,X_{n}$ be a random sample from a distribution with a parameter (vector) $\theta$ lying in the parameter space $\Theta$ . A statistic $T$ is a complete statistic if, for every function $g$ , $\mathbb {E} [g(T)]=0$ for each $\theta \in \Theta$ implies $\mathbb {P} (g(T)=0)=1$ for each $\theta \in \Theta$ .
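To see what the definition rules out, the following Python sketch looks at a statistic that is not complete. It assumes two i.i.d. $\operatorname {Ber} (p)$ observations and takes $T=(X_{1},X_{2})$ with $g(T)=X_{1}-X_{2}$ ; this example and the function name `expectation_and_zero_prob` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def expectation_and_zero_prob(g, p):
    """Return E[g(X1, X2)] and P(g(X1, X2) = 0) for i.i.d. Bernoulli(p) X1, X2."""
    mean = Fraction(0)
    zero_prob = Fraction(0)
    for x1, x2 in product((0, 1), repeat=2):
        pr = p ** (x1 + x2) * (1 - p) ** (2 - x1 - x2)
        mean += g(x1, x2) * pr
        if g(x1, x2) == 0:
            zero_prob += pr
    return mean, zero_prob

def g(x1, x2):
    return x1 - x2

for p in (Fraction(1, 3), Fraction(1, 2), Fraction(4, 5)):
    print(p, expectation_and_zero_prob(g, p))
```

Here $\mathbb {E} [g(T)]=0$ for each $p$ , yet $\mathbb {P} (g(T)=0)=p^{2}+(1-p)^{2}<1$ for $0<p<1$ , so $T=(X_{1},X_{2})$ fails the definition.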

When a random sample $X_{1},\dotsc ,X_{n}$ is from a distribution in the exponential family, a complete statistic can also be found easily, similar to the case for sufficient statistic.

Theorem. (Complete statistic for exponential family) If $X_{1},\dotsc ,X_{n}$ is a random sample from a distribution in the exponential family where the unknown parameter (vector) $\theta \in \Theta \subseteq \mathbb {R} ^{\color {darkgreen}s}$ , then $T(X_{1},\dotsc ,X_{n})=\left(\sum _{j=1}^{n}T_{1}(X_{j}),\sum _{j=1}^{n}T_{2}(X_{j}),\dotsc ,\sum _{j=1}^{n}T_{\color {darkgreen}s}(X_{j})\right)$ is a complete statistic, given that the parameter space $\Theta$ contains an open set in $\mathbb {R} ^{\color {darkgreen}s}$ .

Proof. Omitted.

$\Box$

Remark.

Open sets are a generalization of open intervals. Indeed, every open interval is an open set in $\mathbb {R}$ , and every open set in $\mathbb {R}$ is a union of open intervals.
Intuitively, an open set is a set that, for each point in the set, contains all points that are sufficiently near to that point.
For example, a line in $\mathbb {R} ^{2}$ (which can be interpreted as a set) is not an open set since for each point in the line, the line does not contain all points that are sufficiently near to that point (there are some points "above" and "below" the line that are not contained in the set).
Also, a disk (a region in a plane bounded by a circle) in $\mathbb {R} ^{3}$ is not an open set since for each point in the disk, the disk does not contain all points that are sufficiently near to that point (there are some points "above" and "below" the disk that are not contained in the disk).
From the previous theorem about sufficient statistic for exponential family, we know that $T(X_{1},\dotsc ,X_{n})$ is also a sufficient statistic of $\theta$ under such conditions.

When a statistic is sufficient for a parameter (vector) $\theta$ and is also a complete statistic, we call it a complete and sufficient statistic for $\theta$ .

Theorem. (Lehmann-Scheffé theorem) If $T$ is a complete and sufficient statistic for $\theta$ and $\mathbb {E} [\varphi (T)]=\tau (\theta )$ , then $\varphi (T)$ is the unique UMVUE of $\tau (\theta )$ (with probability 1).

Proof. Assume $T$ is a complete and sufficient statistic for $\theta$ and $\mathbb {E} [\varphi (T)]=\tau (\theta )$ .

Since $T$ is a sufficient statistic for $\theta$ , we can apply the Rao-Blackwell theorem. From Rao-Blackwell theorem, if $W$ is an arbitrary unbiased estimator of $\tau (\theta )$ , then $\varphi (T)$ is another unbiased estimator where $\operatorname {Var} (\varphi (T))\leq \operatorname {Var} (W)$ .

To prove that $\varphi (T)$ is the unique UMVUE of $\tau (\theta )$ , we proceed to show that regardless of the choice of the unbiased estimator $W$ of $\tau (\theta )$ , we get the same $\varphi (T)$ from the Rao-Blackwell theorem (with probability 1). Then, we will have $\operatorname {Var} (\varphi (T))\leq \operatorname {Var} (W)$ for every possible unbiased estimator $W$ of $\tau (\theta )$ ^[8], which means $\varphi (T)$ is the UMVUE, and is also the unique UMVUE since we always get the same $\varphi (T)$ ^[9].

Let $W'$ be another arbitrary unbiased estimator of $\tau (\theta )$ . By Rao-Blackwell theorem again, $\psi (T)=\mathbb {E} [W'|T]$ is an unbiased estimator of $\tau (\theta )$ (a priori possibly different from $\varphi (T)$ ) where $\operatorname {Var} (\psi (T))\leq \operatorname {Var} (W')$ . Since both $\varphi (T)$ and $\psi (T)$ are unbiased estimators of $\tau (\theta )$ , we have for each $\theta \in \Theta$ , $\mathbb {E} [\varphi (T)]=\mathbb {E} [\psi (T)]\implies \mathbb {E} [\varphi (T)-\psi (T)]=0.$ Since $T$ is a complete statistic, we have $\mathbb {P} (\varphi (T)-\psi (T)=0)=1\implies \mathbb {P} (\varphi (T)=\psi (T))=1,$ which means $\varphi (T)=\psi (T)$ (with probability 1), i.e., we get the same $\varphi (T)$ from the Rao-Blackwell theorem in this case (with probability 1). In particular, $\operatorname {Var} (\varphi (T))=\operatorname {Var} (\psi (T))\leq \operatorname {Var} (W')$ .

$\Box$

Remark.

The " $\varphi (T)$ " in this theorem is a function of $T$ , and we know from the proof and Rao-Blackwell theorem that it is actually $\mathbb {E} [W|T]$ where $W$ is an arbitrary unbiased estimator of $\tau (\theta )$ .

Thus, when we apply this theorem, as long as we can find a function of $T$ , $\phi (T)$ , (perhaps by some inspections) such that $\mathbb {E} [\phi (T)]=\tau (\theta )$ , we know that $\phi (T)$ is the unique UMVUE of $\tau (\theta )$ . Also, due to the uniqueness of UMVUE, the $\phi (T)$ is actually $\varphi (T)=\mathbb {E} [W|T]$ where $W$ is an arbitrary unbiased estimator of $\tau (\theta )$ .
In simple cases, we can find $\varphi (T)$ by inspection, as above. However, in more complicated cases, it may not be immediately clear what the explicit form of $\varphi (T)$ with $\mathbb {E} [\varphi (T)]=\tau (\theta )$ should be. In such cases, we need to find an unbiased estimator $W$ of $\tau (\theta )$ and evaluate $\mathbb {E} [W|T]$ to get the explicit form of $\varphi (T)$ .

Example. Consider a random sample $X_{1},\dotsc ,X_{n}$ from ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Let the unknown parameter vector $\theta =(\mu ,\sigma ^{2})$ .

(a) Show that the sufficient statistic for $\theta$ , namely $\left({\overline {X}},S^{2}\right)$ , is also a complete statistic.

(b) Hence, show that ${\overline {X}}$ and ${\frac {n}{n-1}}\cdot S^{2}$ are the UMVUEs of $\mu$ and $\sigma ^{2}$ respectively.

Solution:

(a)

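Proof. Since the normal distribution belongs to the exponential family and $\left({\overline {X}},S^{2}\right)$ is a bijective function of $\left(\sum _{j=1}^{n}X_{j},\sum _{j=1}^{n}X_{j}^{2}\right)$ , it suffices to show that the parameter space $\Theta =\{(\mu ,\sigma ^{2}):\mu \in \mathbb {R} ,\sigma ^{2}>0\}$ contains an open set in $\mathbb {R} ^{2}$ . This is true since, if we represent $\Theta$ using the Cartesian coordinate system (with $\mu$ on the horizontal axis and $\sigma ^{2}$ on the vertical axis), it is the whole region strictly above the horizontal axis, which is itself an open set.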

$\Box$

(b)

Proof. Since $\mathbb {E} [{\overline {X}}]=\mu$ and $\mathbb {E} \left[{\frac {n}{n-1}}\cdot S^{2}\right]=\sigma ^{2}$ (we have shown these before), and ${\overline {X}}$ and ${\frac {n}{n-1}}\cdot S^{2}$ are both functions of the complete and sufficient statistic $\left({\overline {X}},S^{2}\right)$ for $\theta =(\mu ,\sigma ^{2})$ , by Lehmann-Scheffé theorem, we have the desired result.

$\Box$

Remark.

We have shown that ${\frac {n}{n-1}}\cdot S^{2}$ does not attain the CRLB of $\sigma ^{2}$ , and the CRLB of $\sigma ^{2}$ is actually unattainable. Thus, we were not able to determine whether ${\frac {n}{n-1}}\cdot S^{2}$ is the UMVUE of $\sigma ^{2}$ before. Now, we know that ${\frac {n}{n-1}}\cdot S^{2}$ is actually the UMVUE of $\sigma ^{2}$ with the help of Lehmann-Scheffé theorem.
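A quick simulation can illustrate the unbiasedness used in part (b); it does not, of course, verify the minimum-variance property, which comes from the Lehmann-Scheffé theorem. This is a minimal sketch: the seed, the parameter values and the number of replications are arbitrary assumptions.

```python
import random

random.seed(0)
mu, sigma, n, reps = 1.0, 2.0, 5, 20000

s2_values = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(xs) / n
    # S^2 with the 1/n convention used in this chapter
    s2_values.append(sum((x - x_bar) ** 2 for x in xs) / n)

mean_s2 = sum(s2_values) / reps          # estimates E[S^2] = (n-1)/n * sigma^2
mean_corrected = n / (n - 1) * mean_s2   # estimates E[n/(n-1) * S^2] = sigma^2

print(mean_s2, mean_corrected, sigma ** 2)
```

With these settings, the average of $S^{2}$ should be close to ${\frac {n-1}{n}}\sigma ^{2}=3.2$ , while the average of ${\frac {n}{n-1}}\cdot S^{2}$ should be close to $\sigma ^{2}=4$ .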

Example. Consider a random sample $X_{1},\dotsc ,X_{n}$ from the Bernoulli distribution with success probability $p$ , i.e., $\operatorname {Ber} (p)$ , with pmf $f(x;p)=p^{x}(1-p)^{1-x},\quad x=0,1$ .

(a) Find a complete and sufficient statistic $T$ for $p$ .

(b) Hence, find the UMVUE of $p$ .

(c) Show that $\mathbf {1} \{X_{1}=1\}$ is an unbiased estimator of $p$ , and $\mathbb {E} [\mathbf {1} \{X_{1}=1\}|T]$ is the UMVUE of $p$ .

Solution:

(a) For $0<p<1$ , the pmf $f(x;p)=p^{x}(1-p)^{1-x}=(1-p)\left({\frac {p}{1-p}}\right)^{x}=\underbrace {(1)} _{h(x)}\underbrace {(1-p)} _{g(p)}\exp \left(\underbrace {x} _{T(x)}\underbrace {\ln \left({\frac {p}{1-p}}\right)} _{\eta (p)}\right)$ . This means the Bernoulli distribution belongs to the exponential family. Also, the parameter space $\Theta =\{p:0\leq p\leq 1\}$ contains an open set in $\mathbb {R}$ (e.g., the open interval $(0,1)$ ). Hence, $T=\sum _{j=1}^{n}X_{j}$ is a complete and sufficient statistic for $p$ .

(b) Notice that $\mathbb {E} [T/n]=\mathbb {E} [{\overline {X}}]={\frac {np}{n}}=p$ . Hence, ${\overline {X}}$ (which is a function of $T$ ) is the UMVUE of $p$ .

(c)

Proof. Since $\mathbb {E} [\mathbf {1} \{X_{1}=1\}]=(1)\mathbb {P} (X_{1}=1)=p$ , $\mathbf {1} \{X_{1}=1\}$ is an unbiased estimator of $p$ .

Now, we consider $\mathbb {E} [\mathbf {1} \{X_{1}=1\}|T]=\mathbb {E} \left[\mathbf {1} \{X_{1}=1\}|\sum _{j=1}^{n}X_{j}\right]$ . We denote $\sum _{j=1}^{n}X_{j}$ by $S_{n}$ . Then, this expectation becomes $\mathbb {E} [\mathbf {1} \{X_{1}=1\}|S_{n}]$ . In the following, we evaluate $\mathbb {E} [\mathbf {1} \{X_{1}=1\}|S_{n}=s_{n}]$ . ${\begin{aligned}\mathbb {E} \left[\mathbf {1} \{X_{1}=1\}|\sum _{j=1}^{n}X_{j}=s_{n}\right]&=(1)\mathbb {P} \left(\mathbf {1} \{X_{1}=1\}=1|\sum _{j=1}^{n}X_{j}=s_{n}\right)&({\text{definition}})\\&=\mathbb {P} \left(X_{1}=1|\sum _{j=1}^{n}X_{j}=s_{n}\right)\\&={\frac {\mathbb {P} \left(\sum _{j=1}^{n}X_{j}=s_{n}|X_{1}=1\right)\mathbb {P} (X_{1}=1)}{\mathbb {P} \left(\sum _{j=1}^{n}X_{j}=s_{n}\right)}}&({\text{Bayes' theorem}})\\&={\frac {\mathbb {P} \left(\sum _{j=2}^{n}X_{j}=s_{n}-1\right)\cdot p}{\mathbb {P} \left(\sum _{j=1}^{n}X_{j}=s_{n}\right)}}\\\end{aligned}}$ Notice that $\sum _{j=1}^{n}X_{j}$ follows the binomial distribution with $n$ trials with success probability $p$ , i.e., $\operatorname {Binom} (n,p)$ , and $\sum _{j=2}^{n}X_{j}\sim \operatorname {Binom} (n-1,p)$ . Hence, ${\begin{aligned}{\frac {\mathbb {P} \left(\sum _{j=2}^{n}X_{j}=s_{n}-1\right)\cdot p}{\mathbb {P} \left(\sum _{j=1}^{n}X_{j}=s_{n}\right)}}&={\frac {{\binom {n-1}{s_{n}-1}}p^{s_{n}-1}(1-p)^{n-1-s_{n}+1}\cdot p}{{\binom {n}{s_{n}}}p^{s_{n}}(1-p)^{n-s_{n}}}}&({\text{binomial distribution pmf's}})\\&={\frac {\frac {(n-1)!}{(s_{n}-1)!(n-s_{n})!}}{\frac {n!}{s_{n}!(n-s_{n})!}}}\\&={\frac {(n-1)!s_{n}(s_{n}-1)!}{n(n-1)!(s_{n}-1)!}}&(s_{n}!=s_{n}(s_{n}-1)!{\text{ and }}n!=n(n-1)!)\\&={\frac {s_{n}}{n}}.\end{aligned}}$ Now, replacing $s_{n}$ by $S_{n}=\sum _{j=1}^{n}X_{j}$ gives $\mathbb {E} \left[\mathbf {1} \{X_{1}=1\}|\sum _{j=1}^{n}X_{j}\right]={\frac {\sum _{j=1}^{n}X_{j}}{n}}={\overline {X}},$ which is the UMVUE of $p$ , as desired.

$\Box$
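The Bayes' theorem computation in part (c) can be reproduced with exact rational arithmetic. The following Python sketch evaluates the ratio of binomial pmf's for every possible $s_{n}$ ; the function names (`binom_pmf`, `cond_expectation`) and the chosen $n$ and $p$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binom(n, p); zero outside the support."""
    if k < 0 or k > n:
        return Fraction(0)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def cond_expectation(s, n, p):
    """E[1{X_1 = 1} | sum_j X_j = s], computed via Bayes' theorem."""
    return binom_pmf(s - 1, n - 1, p) * p / binom_pmf(s, n, p)

n, p = 6, Fraction(2, 5)
values = [cond_expectation(s, n, p) for s in range(n + 1)]
print(values)  # equals s/n for s = 0, ..., n
```

Every entry equals $s_{n}/n$ and is free of $p$ , in agreement with the derivation.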

Exercise. Can we find the UMVUE of $p$ using the CRLB of $p$ ? If yes, find it using this way. If no, explain why.

Solution

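Yes. The differentiation involved in the Fisher information is with respect to the parameter $p$ (not $x$ ), and $\ln f(x;p)=x\ln p+(1-x)\ln(1-p)$ is differentiable with respect to $p$ for $0<p<1$ . Since ${\frac {\partial ^{2}}{\partial p^{2}}}\ln f(x;p)=-{\frac {x}{p^{2}}}-{\frac {1-x}{(1-p)^{2}}}$ , the Fisher information of one observation is $\mathbb {E} \left[{\frac {X}{p^{2}}}+{\frac {1-X}{(1-p)^{2}}}\right]={\frac {1}{p}}+{\frac {1}{1-p}}={\frac {1}{p(1-p)}}$ , so the CRLB of $p$ is ${\frac {p(1-p)}{n}}$ . Since ${\overline {X}}$ is an unbiased estimator of $p$ and $\operatorname {Var} ({\overline {X}})={\frac {p(1-p)}{n}}$ attains the CRLB, ${\overline {X}}$ is the UMVUE of $p$ .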

Exercise. Consider a random sample $X_{1},\dotsc ,X_{n}$ from the Poisson distribution with rate parameter $\lambda$ , with pmf $f(x;\lambda )={\frac {e^{-\lambda }\lambda ^{x}}{x!}}$ .

(a) Find a complete and sufficient statistic for $\lambda$ .

(b) Find the UMVUE of $\lambda /n$ .

Solution

(a) The pmf is $f(x;\lambda )={\frac {e^{-\lambda }\lambda ^{x}}{x!}}={\frac {e^{-\lambda }}{x!}}\underbrace {\exp(x\ln \lambda )} _{=\lambda ^{x}}=\underbrace {\frac {1}{x!}} _{h(x)}\cdot \underbrace {e^{-\lambda }} _{g(\lambda )}\exp(\underbrace {x} _{T(x)}\underbrace {\ln \lambda } _{\eta (\lambda )}).$ Hence, the Poisson distribution belongs to the exponential family. Since the parameter space $\Theta =\{\lambda :\lambda >0\}$ contains an open set in $\mathbb {R}$ , a complete and sufficient statistic for $\lambda$ is $T=\sum _{j=1}^{n}X_{j}$ .

(b) Take $\tau (\lambda )=\lambda /n$ . Since $\mathbb {E} [T]=\mathbb {E} \left[\sum _{j=1}^{n}X_{j}\right]=n\lambda$ , we have $\mathbb {E} [T/n^{2}]=\lambda /n=\tau (\lambda ).$ Thus, the UMVUE of $\tau (\lambda )=\lambda /n$ is ${\frac {\sum _{j=1}^{n}X_{j}}{n^{2}}}={\frac {\overline {X}}{n}}$ (which is a function of $T$ ).

Consistency

In the previous sections, we have discussed unbiasedness and efficiency. In this section, we will discuss another property called consistency.

Definition. (Consistent estimator) ${\hat {\theta }}$ is a consistent estimator of the unknown parameter $\theta$ if ${\hat {\theta }}\;{\overset {p}{\to }}\;\theta$ .

Remark.

By the definition of convergence in probability, ${\hat {\theta }}\;{\overset {p}{\to }}\;\theta$ means $\mathbb {P} (|{\hat {\theta }}-\theta |>\varepsilon )\to 0$ as $n\to \infty$ , for each $\varepsilon >0$ .

Proposition. If ${\hat {\theta }}$ is an (asymptotically) unbiased estimator of an unknown parameter $\theta$ and $\operatorname {Var} ({\hat {\theta }})\to 0$ as $n\to \infty$ , then ${\hat {\theta }}$ is a consistent estimator of $\theta$ .

Proof. Assume ${\hat {\theta }}$ is an (asymptotically) unbiased estimator of an unknown parameter $\theta$ and $\operatorname {Var} ({\hat {\theta }})\to 0$ as $n\to \infty$ . Since ${\hat {\theta }}$ is an (asymptotically) unbiased estimator of $\theta$ , we have $\lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0$ (this is true for both asymptotically unbiased estimator and unbiased estimator of $\theta$ ). In addition to this, we have by assumption that $\lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})=0$ . Since the mean squared error equals the variance plus the squared bias, these imply that $\lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})=0\Rightarrow \lim _{n\to \infty }\mathbb {E} [({\hat {\theta }}-\theta )^{2}]=0$ . Thus, as $n\to \infty$ , we have by Chebyshev's inequality (notice that $\operatorname {MSE} ({\hat {\theta }})=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]$ exists from above), for each $\varepsilon >0$ , $\mathbb {P} (|{\hat {\theta }}-\theta |>\varepsilon )\leq {\frac {\mathbb {E} [({\hat {\theta }}-\theta )^{2}]}{\varepsilon ^{2}}}\to {\frac {0}{\varepsilon ^{2}}}=0.$ Since probability is nonnegative ( $\geq 0$ ), and this probability is less than or equal to an expression that tends to 0 as $n\to \infty$ , we conclude that this probability tends to 0 as $n\to \infty$ . That is, ${\hat {\theta }}$ is a consistent estimator of $\theta$ .

$\Box$
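The bound used in this proof can be tabulated for a concrete estimator. The following Python sketch assumes a random sample from ${\mathcal {N}}(\mu ,\sigma ^{2})$ and the estimator ${\overline {X}}$ , for which $\operatorname {MSE} ({\overline {X}})=\sigma ^{2}/n$ and ${\overline {X}}\sim {\mathcal {N}}(\mu ,\sigma ^{2}/n)$ ; the function names and the values of $\sigma$ and $\varepsilon$ are illustrative assumptions.

```python
import math

def exact_tail(n, sigma, eps):
    """P(|X_bar - mu| > eps) when X_bar ~ N(mu, sigma^2 / n)."""
    return math.erfc(eps * math.sqrt(n) / (sigma * math.sqrt(2)))

def chebyshev_bound(n, sigma, eps):
    """MSE / eps^2 with MSE(X_bar) = sigma^2 / n, capped at 1."""
    return min(1.0, sigma ** 2 / (n * eps ** 2))

sigma, eps = 2.0, 0.5
for n in (1, 10, 100, 1000, 10000):
    print(n, exact_tail(n, sigma, eps), chebyshev_bound(n, sigma, eps))
```

Both columns tend to 0 as $n$ grows, and the exact probability never exceeds the bound. The bound is usually far from tight, but it is enough to establish consistency.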

Remark.

Unbiasedness alone does not imply consistency.

Example. Let $X_{1},\dotsc ,X_{n}$ be a random sample from ${\mathcal {N}}(\mu ,\sigma ^{2})$ . Then, $X_{1}$ is an unbiased estimator of $\mu$ since $\mathbb {E} [X_{1}]=\mu$ . However, $X_{1}$ is not a consistent estimator of $\mu$ : since the distribution of $|X_{1}-\mu |$ does not depend on $n$ , the probability $\mathbb {P} (|X_{1}-\mu |>\varepsilon )$ is the same constant for every $n$ , and this constant is positive for each $\varepsilon >0$ (a normal random variable deviates from its mean by more than $\varepsilon$ with positive probability). Hence, $\mathbb {P} (|X_{1}-\mu |>\varepsilon )\nrightarrow 0$ as $n\to \infty$ .

Exercise.

(a) Propose a consistent estimator of $\mu$ , and show that it is actually a consistent estimator of $\mu$ (Hint: consider the weak law of large numbers).

(b) Propose a consistent estimator of the coefficient of variation (or relative standard deviation) ${\frac {\sigma }{\mu }}$ (assuming $\mu \neq 0$ so that it is defined), and show that it is actually a consistent estimator of ${\frac {\sigma }{\mu }}$ (Hint: consider the weak law of large numbers, and properties for convergence in probability. You can use the fact that normal distribution has a finite fourth moment).

Solution

(a) ${\overline {X}}$ is a consistent estimator of $\mu$ .

Proof. By the weak law of large numbers (notice that mean $\mu$ and variance $\sigma ^{2}$ are finite for normal distribution), ${\overline {X}}\;{\overset {p}{\to }}\;\mu$ as desired.

$\Box$

(b) ${\frac {\sqrt {S^{2}}}{\overline {X}}}$ is a consistent estimator of ${\frac {\sigma }{\mu }}$ .

Proof. By the weak law of large numbers (variance is finite, and fourth moment is finite), ${\overline {X^{2}}}\;{\overset {p}{\to }}\;\mathbb {E} [X^{2}]$ . Also, since ${\overline {X}}\;{\overset {p}{\to }}\;\mu$ by the weak law of large numbers, we have by continuous mapping theorem that $({\overline {X}})^{2}\;{\overset {p}{\to }}\;\mu ^{2}$ . Thus, by properties about convergence in probability and result about sample variance, $S^{2}={\overline {X^{2}}}-({\overline {X}})^{2}\;{\overset {p}{\to }}\;\mathbb {E} [X^{2}]-\mu ^{2}=\sigma ^{2}.$ By continuous mapping theorem again, ${\sqrt {S^{2}}}\;{\overset {p}{\to }}\;{\sqrt {\sigma ^{2}}}=\sigma$ (since $\sigma >0$ ). Hence, by properties about convergence in probability (we assume that $\mu \neq 0$ ) again, ${\frac {\sqrt {S^{2}}}{\overline {X}}}\;{\overset {p}{\to }}\;{\frac {\sigma }{\mu }}$ as desired.

$\Box$
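The convergence in part (b) can be observed in a small simulation. This is only a sketch under assumed values ( $\mu =5$ , $\sigma =2$ , so that ${\frac {\sigma }{\mu }}=0.4$ ); the function name `cv_estimate`, the seed and the sample sizes are arbitrary choices.

```python
import math
import random

def cv_estimate(xs):
    """sqrt(S^2) / X_bar, with S^2 = mean of squares minus squared mean (1/n convention)."""
    n = len(xs)
    x_bar = sum(xs) / n
    s2 = sum(x * x for x in xs) / n - x_bar ** 2
    return math.sqrt(s2) / x_bar

random.seed(1)
mu, sigma = 5.0, 2.0
for n in (10, 100, 1000, 10000):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    print(n, cv_estimate(xs), sigma / mu)
```

As $n$ increases, the printed estimates should settle near $0.4$ , although a single simulated path only illustrates (and does not prove) convergence in probability.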

[1] For the parameter vector, it contains all parameters governing the distribution.

[2] We will simply use " $\theta$ " when we do not know whether it is parameter vector or just a single parameter. We may use $\theta$ instead if we know it is indeed a parameter vector.

[3] We will discuss some criterion for "good" in the #Properties of estimator section.

[4] $\beta -\beta '={\big (}\max\{x_{1},\dotsc ,x_{n}\}+\beta -\max\{x_{1},\dotsc ,x_{n}\}{\big )}-\left(\max\{x_{1},\dotsc ,x_{n}\}+{\frac {\beta -\max\{x_{1},\dotsc ,x_{n}\}}{2}}\right)={\frac {\beta -\max\{x_{1},\dotsc ,x_{n}\}}{2}}>0$ . Thus, $\beta '<\beta$ .

[5] For each positive integer $r$ , $m_{r}$ always exist, unlike $\mu _{r}$ .

[6] "uniformly" means that the variance is minimum compared to other unbiased estimators, over the parameter space $\Theta$ (i.e., for each possible value of $\theta \in \Theta$ ). That is, the variance is not just minimum for a particular value of $\theta$ , but all possible values of $\theta$ .

[7] This is different from the minimum value. For lower bound, it only needs to be smaller than all variances involved, and there may not be any variance that actually achieve this lower bound. However, for the minimum value, it has to be one of the values of the variance.

[8] Notice that this is a stronger result than the result in the Rao-Blackwell theorem, where the latter only states that $\operatorname {Var} (\varphi (T))\leq \operatorname {Var} (W)$ , for the $W$ corresponding to $\varphi (T)$

[9] Indeed, we know that UMVUE must be unique from previous proposition. However, in this argument, when we show that $\varphi (T)$ is UMVUE, we also automatically show that it is unique.
