# Statistics/Point Estimation

## Introduction

Usually, a random variable ${\displaystyle X}$ resulting from a random experiment is assumed to follow a certain distribution with an unknown (but fixed) parameter (vector) ${\displaystyle \theta \in \mathbb {R} ^{k}}$ (${\displaystyle k}$ is a positive integer whose value depends on the distribution), taking values in a set ${\displaystyle \Theta }$ called the parameter space.

Remark.

• In the context of frequentist statistics (the context here), parameters are regarded as fixed.
• On the other hand, in the context of Bayesian statistics, parameters are regarded as random variables.

For example, suppose the random variable ${\displaystyle X}$ is assumed to follow a normal distribution ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$. In this case, the parameter vector ${\displaystyle \theta =(\mu ,\sigma )\in \Theta }$ is unknown, and the parameter space is ${\displaystyle \Theta =\{(\mu ,\sigma ):\mu \in \mathbb {R} ,\sigma >0\}}$. It is often useful to estimate these unknown parameters in some way, so as to "understand" the random variable ${\displaystyle X}$ better. We would like the estimation to be "good" enough, so that the resulting understanding is more accurate.

Intuitively, the (realization of the) random sample ${\displaystyle X_{1},\dotsc ,X_{n}}$ should be useful. Indeed, the estimators introduced in this chapter are all based on the random sample in some sense, and this is the idea behind point estimates. To be more precise, let us define point estimation and point estimates.

Definition. (Point estimation) Point estimation is a process of using the value of a statistic to give a single value estimate (which can be interpreted as a point) of an unknown parameter.

Remark.

• Recall that statistics are functions of a random sample.
• We call the unknown parameter a population parameter (since the underlying distribution corresponding to the parameter is called a population).
• The statistic used is called a point estimator, and its realization is called a point estimate.
• Point estimators are commonly denoted with a hat, e.g. ${\displaystyle {\hat {\theta }}}$.
• Point estimation will be contrasted with interval estimation, which uses the value of a statistic to estimate an interval of plausible values of the unknown parameter.

Example. Suppose ${\displaystyle X_{1},\dotsc ,X_{n}}$ is a random sample of size ${\displaystyle n}$ from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$.

• Intuitively, we may use the statistic ${\displaystyle {\overline {X}}={\frac {X_{1}+\dotsb +X_{n}}{n}}}$ to estimate ${\displaystyle \mu }$; then ${\displaystyle {\overline {X}}}$ is called the point estimator, and its realization ${\displaystyle {\overline {x}}}$ is called the point estimate.
• Alternatively, we may simply use the statistic ${\displaystyle X_{1}}$ (although it does not involve ${\displaystyle X_{2},\dotsc ,X_{n}}$, it can still be regarded as a function of ${\displaystyle X_{1},\dotsc ,X_{n}}$) to estimate ${\displaystyle \mu }$. That is, we use the value of the first random sample from the normal distribution as the point estimate of the mean of the distribution! Intuitively, such an estimator seems quite "bad".
• Such an estimator, which just takes one random sample directly, is called a single-observation estimator.
• We will later discuss how to evaluate how "good" a point estimator is.
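The contrast between these two estimators can be illustrated by a small simulation. The following is a minimal Python sketch; the "true" values of ${\displaystyle \mu }$ and ${\displaystyle \sigma }$, the sample size, and the seed are all assumed only for illustration:

```python
import random

# Minimal simulation sketch: compare the sample-mean estimator with the
# single-observation estimator on data drawn from N(mu, sigma^2).
# The "true" mu and sigma below are assumed purely for illustration.
random.seed(0)
mu, sigma, n = 5.0, 2.0, 100
sample = [random.gauss(mu, sigma) for _ in range(n)]

x_bar = sum(sample) / n   # realization of the point estimator X-bar
x_single = sample[0]      # realization of the single-observation estimator

print(abs(x_bar - mu), abs(x_single - mu))
```

Typically the sample mean lands much closer to the assumed ${\displaystyle \mu }$ than a single observation does, matching the intuition that the single-observation estimator is quite "bad".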

In the following, we will introduce two well-known point estimators, which are actually quite "good": the maximum likelihood estimator and the method of moments estimator.

## Maximum likelihood estimator (MLE)

As suggested by its name, this estimator maximizes some kind of "likelihood". Now, we would like to know what "likelihood" we should maximize to estimate the unknown parameter(s) (in a "good" way). Also, as mentioned in the introduction section, the estimator is based on the random sample in some sense. Hence, this "likelihood" should also be based on the random sample in some sense.

To motivate the definition of maximum likelihood estimator, consider the following example.

Example. In a random experiment, a (fair or unfair) coin is tossed once. Let the random variable ${\displaystyle X=1}$ if a head comes up, and ${\displaystyle 0}$ otherwise. Then, the pmf of ${\displaystyle X}$ is ${\displaystyle f(x;p)=p^{x}(1-p)^{1-x},\quad x\in \{0,1\}}$, in which the unknown parameter ${\displaystyle p}$ represents the probability that a head comes up, and ${\displaystyle p\in \Theta =\{p:p\in (0,1)\}}$.

Now, suppose you get a random sample ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$ by tossing that coin ${\displaystyle n}$ independent times (such a random sample is called an independent random sample, since the random variables involved are independent), with corresponding realizations ${\displaystyle x_{1},x_{2},\dotsc ,x_{n}}$. Then, the probability that ${\displaystyle X_{1}=x_{1},X_{2}=x_{2},\dotsc ,{\text{ and }}X_{n}=x_{n}}$, i.e., that the random sample takes exactly these realizations, is {\displaystyle {\begin{aligned}\mathbb {P} (X_{1}=x_{1}\cap X_{2}=x_{2}\cap \dotsb \cap X_{n}=x_{n})&=\mathbb {P} (X_{1}=x_{1})\mathbb {P} (X_{2}=x_{2})\dotsb \mathbb {P} (X_{n}=x_{n})&{\text{by independence}}\\&=f(x_{1};p)f(x_{2};p)\dotsb f(x_{n};p)\\&=p^{x_{1}}(1-p)^{1-x_{1}}p^{x_{2}}(1-p)^{1-x_{2}}\dotsb p^{x_{n}}(1-p)^{1-x_{n}}\\&=p^{x_{1}+x_{2}+\dotsb +x_{n}}(1-p)^{n-x_{1}-x_{2}-\dotsb -x_{n}}.\end{aligned}}}

Remark.

• Remark on notation: You may observe that there is an additional "${\displaystyle ;p}$" in the pmf of ${\displaystyle X}$. Such notation means the pmf is with the parameter value ${\displaystyle p}$. It is included to emphasize the parameter value we are referring to.
• In general, we write ${\displaystyle f(\cdot ;\theta )}$ for pmf/pdf with the parameter value ${\displaystyle \theta }$ (${\displaystyle \theta }$ may be a vector).
• There are some alternative notations with the same meaning: ${\displaystyle f(\cdot |\theta ),f_{\theta }(\cdot ),\dotsc }$.
• Similarly, we have similar notations, e.g. ${\displaystyle \mathbb {P} _{\theta }(A),\mathbb {P} (A|\theta ),\mathbb {P} (A;\theta ),\dotsc }$, to mean the probability for event ${\displaystyle A}$ to happen, with the parameter value ${\displaystyle \theta }$. (It is more common to use the first notation: ${\displaystyle \mathbb {P} _{\theta }(A)}$.)
• We also have similar notations for mean, variance, covariance, etc., like ${\displaystyle \mathbb {E} _{\theta }[\cdot ],\operatorname {Var} _{\theta }(\cdot ),\operatorname {Cov} _{\theta }(\cdot ),\dotsc }$
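As a small numerical sanity check of the joint-pmf computation above, the following Python sketch (with hypothetical toss realizations and an arbitrary value of ${\displaystyle p}$) verifies that the product of the individual Bernoulli pmfs collapses to the closed form ${\displaystyle p^{x_{1}+\dotsb +x_{n}}(1-p)^{n-x_{1}-\dotsb -x_{n}}}$:

```python
import math

# Check that the product of Bernoulli pmfs equals the closed form
# p^(sum x) * (1 - p)^(n - sum x); the realizations and p are hypothetical.
x = [1, 0, 1, 1, 0, 1]
p = 0.3
n, s = len(x), sum(x)

product = math.prod(p**xi * (1 - p)**(1 - xi) for xi in x)
closed_form = p**s * (1 - p)**(n - s)
print(product, closed_form)
```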

Intuitively, with these particular realizations (fixed), we would like to find a value of ${\displaystyle p}$ that maximizes this probability, i.e., makes the realizations obtained the ones that are "most probable" or "with maximum likelihood". Now, let us formally define the terms related to the MLE.

Definition. (Likelihood function) Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample with a joint pmf or pdf ${\displaystyle f}$, and the parameter (vector) ${\displaystyle \theta \in \Theta }$ (${\displaystyle \Theta }$ is the parameter space). Suppose ${\displaystyle x_{1},\dotsc ,x_{n}}$ are the corresponding realizations of the random sample ${\displaystyle X_{1},\dotsc ,X_{n}}$. Then, the likelihood function, denoted by ${\displaystyle {\mathcal {L}}(\theta ;x_{1},\dotsc ,x_{n})}$, is the function ${\displaystyle \theta \mapsto f(x_{1},\dotsc ,x_{n};\theta )}$ (${\displaystyle \theta }$ is a variable, and ${\displaystyle x_{1},\dotsc ,x_{n}}$ are fixed).

Remark.

• For simplicity, we may use the notation ${\displaystyle {\mathcal {L}}(\theta ;\mathbf {x} )}$ instead of ${\displaystyle {\mathcal {L}}(\theta ;x_{1},\dotsc ,x_{n})}$. Sometimes, we may also just write "${\displaystyle {\mathcal {L}}(\theta )}$" for convenience.
• When we replace ${\displaystyle x_{1},\dotsc ,x_{n}}$ by ${\displaystyle X_{1},\dotsc ,X_{n}}$, then the resulting "likelihood function" becomes a random variable, and we denote it by ${\displaystyle {\mathcal {L}}(\theta ;X_{1},\dotsc ,X_{n})}$ or ${\displaystyle {\mathcal {L}}(\theta ;\mathbf {X} )}$.
• The likelihood function is in contrast with the joint pmf or pdf itself, where ${\displaystyle \theta }$ is fixed and ${\displaystyle x_{1},\dotsc ,x_{n}}$ are variables.
• When the random sample comes from a discrete distribution, the value of the likelihood function is the probability ${\displaystyle \mathbb {P} (X_{1}=x_{1}\cap \dotsb \cap X_{n}=x_{n})}$ at the parameter vector ${\displaystyle \theta }$, that is, the probability of getting this specific realization exactly.
• When the random sample comes from a continuous distribution, the value of the likelihood function is not a probability. Instead, it is only the value of the joint pdf at ${\displaystyle (x_{1},\dotsc ,x_{n})}$ (which can be greater than one). However, the value can still be used to "reflect" the probability of getting "very close to" this specific realization, where that probability can be obtained by integrating the joint pdf over a "very small" region around ${\displaystyle (x_{1},\dotsc ,x_{n})}$.
• The natural logarithm of the likelihood function, ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ (or ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {X} )}$ sometimes), is called the log-likelihood function.
• Notice that the "expression" of the likelihood function is actually the same as that of the joint pdf/pmf; only the roles of the inputs differ. So, one may still integrate/sum the likelihood function with respect to ${\displaystyle x_{1},\dotsc ,x_{n}}$ (treating it as the joint pdf/pmf in such a context) to get probabilities.

Definition. (Maximum likelihood estimate) Given a likelihood function ${\displaystyle {\mathcal {L}}(\theta ;\mathbf {x} )}$, a maximum likelihood estimate of the parameter ${\displaystyle \theta }$ is a value ${\displaystyle {\hat {\theta }}(\mathbf {x} )}$ at which ${\displaystyle {\mathcal {L}}(\theta ;\mathbf {x} )}$ is maximized.

Remark.

• The maximum likelihood estimator (MLE) of ${\displaystyle \theta }$ is ${\displaystyle {\hat {\theta }}(\mathbf {X} )}$ (obtained by replacing "${\displaystyle x}$" in ${\displaystyle {\hat {\theta }}(\mathbf {x} )}$ by "${\displaystyle X}$").
• In some other places, the abbreviation MLE can also mean maximum likelihood estimate depending on the context. However, we will just use the abbreviation MLE when we are talking about maximum likelihood estimator here.
• Since ${\displaystyle {\frac {d}{dy}}\ln y={\frac {1}{y}}>0}$ (the domain of natural logarithm function is the set of all positive real numbers), the natural logarithm function is strictly increasing, i.e., the output is larger when the input is larger. Thus, when we find a value at which ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ is maximized, ${\displaystyle {\mathcal {L}}(\theta ;\mathbf {x} )}$ is also maximized at the same value.

Now, let us find the MLE of the unknown parameter ${\displaystyle p}$ in the previous coin flipping example.

Example. (Motivating example revisited) Recall that we use a coin flipping example to motivate maximum likelihood estimation. ${\displaystyle X}$ follows the Bernoulli distribution with success probability ${\displaystyle p}$. The pmf of ${\displaystyle X}$ is ${\displaystyle f(x;p)=p^{x}(1-p)^{1-x}}$. ${\displaystyle X_{1},\dotsc ,X_{n}}$ is a random sample from the distribution.

• The likelihood function ${\displaystyle {\mathcal {L}}(p)}$ is the joint pmf of ${\displaystyle X_{1},\dotsc ,X_{n}}$,

{\displaystyle {\begin{aligned}{\mathcal {L}}(p)=\mathbb {P} (X_{1}=x_{1}\cap \dotsb \cap X_{n}=x_{n})&=\prod _{i=1}^{n}f(x_{i};p)&{\text{by independence}}\\&=\prod _{i=1}^{n}p^{x_{i}}(1-p)^{1-x_{i}}.\\\end{aligned}}}

• The log-likelihood function ${\displaystyle \ln {\mathcal {L}}(p)}$ is thus

{\displaystyle {\begin{aligned}\ln {\mathcal {L}}(p)&=\sum _{i=1}^{n}\ln(p^{x_{i}}(1-p)^{1-x_{i}})\\&=\sum _{i=1}^{n}(\ln(p^{x_{i}})+\ln((1-p)^{1-x_{i}}))\\&=\sum _{i=1}^{n}(x_{i}\ln(p)+(1-x_{i})\ln(1-p))\\&=\sum _{i=1}^{n}(x_{i}\ln(p))+\sum _{i=1}^{n}((1-x_{i})\ln(1-p))\\&=\ln(p)\sum _{i=1}^{n}(x_{i})+\ln(1-p)\sum _{i=1}^{n}(1-x_{i})\\&=\ln(p)\sum _{i=1}^{n}(x_{i})+\ln(1-p)\left(n-\sum _{i=1}^{n}(x_{i})\right)\\\end{aligned}}}

• To find the maximum of the log-likelihood function, we may use the derivative tests from calculus. Differentiating ${\displaystyle \ln {\mathcal {L}}(p)}$ with respect to ${\displaystyle p}$ gives

{\displaystyle {\begin{aligned}{\frac {d\ln {\mathcal {L}}(p)}{dp}}&={\frac {1}{\color {blue}p}}\underbrace {\sum _{i=1}^{n}x_{i}} _{{\text{constant wrt }}p}-{\frac {1}{\color {red}1-p}}\underbrace {\left(n-\sum _{i=1}^{n}x_{i}\right)} _{{\text{constant wrt }}p}\\&={\frac {{\color {red}(1-p)}\sum _{i=1}^{n}x_{i}-n{\color {blue}p}+{\color {blue}p}\sum _{i=1}^{n}x_{i}}{{\color {blue}p}{\color {red}(1-p)}}}\\&={\frac {(1-p)n{\overline {x}}-np+pn{\overline {x}}}{p(1-p)}}&\left(\sum _{i=1}^{n}x_{i}=n{\overline {x}}=n\cdot {\frac {\sum _{i=1}^{n}x_{i}}{n}}\right)\\&={\frac {n({\overline {x}}-p)}{p(1-p)}}\end{aligned}}}

• To find critical point(s) of ${\displaystyle \ln {\mathcal {L}}(p)}$, we set ${\displaystyle {\frac {d\ln {\mathcal {L}}(p)}{dp}}=0\implies {\frac {n({\overline {x}}-p)}{p(1-p)}}=0\implies p={\overline {x}}}$ (we have ${\displaystyle p(1-p)\neq 0}$)
• To verify that ${\displaystyle \ln {\mathcal {L}}(p)}$ actually attains a maximum (instead of a minimum) at ${\displaystyle p={\overline {x}}}$, we need to perform a derivative test. In this case, we use the first derivative test.
• When ${\displaystyle p<{\overline {x}}}$, we have ${\displaystyle {\overline {x}}-p>0}$, and thus ${\displaystyle {\frac {d\ln {\mathcal {L}}(p)}{dp}}>0}$. On the other hand, when ${\displaystyle p>{\overline {x}}}$, we have ${\displaystyle {\overline {x}}-p<0}$, and thus ${\displaystyle {\frac {d\ln {\mathcal {L}}(p)}{dp}}<0}$. As a result, we can conclude that ${\displaystyle \ln {\mathcal {L}}(p)}$ attains its maximum at ${\displaystyle p={\overline {x}}}$. It follows that the MLE of ${\displaystyle p}$ is ${\displaystyle {\overline {X}}}$ (not ${\displaystyle {\overline {x}}}$, which is instead the maximum likelihood estimate!)
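The conclusion can also be checked numerically: evaluating the log-likelihood on a fine grid, its peak should sit at ${\displaystyle p={\overline {x}}}$. The following Python sketch uses hypothetical toss realizations:

```python
import math

# Grid-search sketch: ln L(p) = (sum x) ln p + (n - sum x) ln(1 - p)
# should peak at p = x-bar. The realizations are hypothetical.
x = [1, 1, 0, 1, 0, 0, 1, 1]   # x-bar = 5/8 = 0.625
n, s = len(x), sum(x)

def log_likelihood(p):
    return s * math.log(p) + (n - s) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)
print(p_hat)   # the grid maximizer coincides with x-bar = 0.625
```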

Exercise. Use second derivative test to verify that ${\displaystyle \ln {\mathcal {L}}(p)}$ attains maximum at ${\displaystyle p={\overline {x}}}$.

Solution
• Differentiating ${\displaystyle {\frac {d\ln {\mathcal {L}}(p)}{dp}}={\frac {1}{p}}\sum _{i=1}^{n}x_{i}-{\frac {1}{1-p}}\left(n-\sum _{i=1}^{n}x_{i}\right)}$ once more gives ${\displaystyle {\frac {d^{2}\ln {\mathcal {L}}(p)}{dp^{2}}}=-{\frac {1}{p^{2}}}\sum _{i=1}^{n}x_{i}-{\frac {1}{(1-p)^{2}}}\left(n-\sum _{i=1}^{n}x_{i}\right)}$, which is negative for every ${\displaystyle p\in (0,1)}$ (both terms are nonpositive since ${\displaystyle 0\leq \sum _{i=1}^{n}x_{i}\leq n}$, and they cannot both be zero). By the second derivative test, this means ${\displaystyle \ln {\mathcal {L}}(p)}$ attains a maximum at ${\displaystyle p={\overline {x}}}$.

Sometimes, a constraint is imposed on the parameter when we are finding its MLE. The MLE of the parameter in this case is called a restricted MLE. We will illustrate this in the following example.

Example. Continue from the previous coin flipping example. Suppose we have a constraint on ${\displaystyle p}$ where ${\displaystyle 0\leq p\leq {\frac {1}{2}}}$. Find the MLE of ${\displaystyle p}$ in this case.

Solution: The steps for deriving the likelihood function and the log-likelihood function are the same in this case. Without the restriction, the MLE of ${\displaystyle p}$ is ${\displaystyle {\overline {X}}}$. Now, with the restriction, the MLE of ${\displaystyle p}$ is ${\displaystyle {\overline {X}}}$ only when ${\displaystyle {\overline {X}}\leq {\frac {1}{2}}}$ (we always have ${\displaystyle {\overline {X}}\geq 0}$ since ${\displaystyle X\geq 0}$).

If ${\displaystyle {\overline {X}}>{\frac {1}{2}}}$ (and thus ${\displaystyle {\overline {x}}>1/2}$), even though ${\displaystyle \ln {\mathcal {L}}(p)}$ is maximized at ${\displaystyle p={\overline {x}}}$, we cannot set the MLE to be ${\displaystyle {\overline {X}}}$ due to the restriction ${\displaystyle 0\leq p\leq {\frac {1}{2}}}$. In this case, ${\displaystyle {\frac {d\ln {\mathcal {L}}(p)}{dp}}>0}$ when ${\displaystyle p\leq {\frac {1}{2}}<{\overline {x}}}$ (from the previous example, ${\displaystyle {\frac {d\ln {\mathcal {L}}(p)}{dp}}>0}$ when ${\displaystyle p<{\overline {x}}}$), i.e., ${\displaystyle \ln {\mathcal {L}}(p)}$ is strictly increasing when ${\displaystyle p\leq {\frac {1}{2}}}$. Thus, under the restriction, ${\displaystyle \ln {\mathcal {L}}(p)}$ is maximized at ${\displaystyle p={\frac {1}{2}}}$. As a result, the MLE of ${\displaystyle p}$ is ${\displaystyle {\frac {1}{2}}}$ in this case (the MLE can be a constant, which can still be regarded as a function of ${\displaystyle X_{1},\dotsc ,X_{n}}$).

Therefore, the MLE of ${\displaystyle p}$ can be written as a piecewise function: ${\displaystyle {\hat {p}}={\begin{cases}{\overline {X}},&{\overline {X}}\leq {\frac {1}{2}}\\{\frac {1}{2}},&{\overline {X}}>{\frac {1}{2}}\end{cases}}}$, or more compactly as ${\displaystyle {\hat {p}}=\min \left\{{\overline {X}},{\frac {1}{2}}\right\}}$.
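The restricted estimate ${\displaystyle \min \left\{{\overline {x}},{\frac {1}{2}}\right\}}$ is straightforward to compute; the following is a minimal Python sketch with hypothetical realizations:

```python
# Restricted-MLE sketch for the constraint 0 <= p <= 1/2:
# the estimate is min(x-bar, 1/2). The realizations are hypothetical.
def restricted_mle(x):
    x_bar = sum(x) / len(x)
    return min(x_bar, 0.5)

print(restricted_mle([1, 0, 0, 0]))  # x-bar = 0.25, within the constraint
print(restricted_mle([1, 1, 1, 0]))  # x-bar = 0.75, clipped to 0.5
```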

Exercise. Find the MLE of ${\displaystyle p}$ when ${\displaystyle {\frac {1}{2}}\leq p\leq 1}$.

Solution
• When ${\displaystyle {\overline {X}}<{\frac {1}{2}}}$, we cannot set the MLE to be ${\displaystyle {\overline {X}}}$ due to the restriction. In this case, we know that ${\displaystyle {\frac {d\ln {\mathcal {L}}(p)}{dp}}<0}$ when ${\displaystyle p\geq {\frac {1}{2}}>{\overline {x}}}$, i.e., ${\displaystyle \ln {\mathcal {L}}(p)}$ is strictly decreasing when ${\displaystyle {\frac {1}{2}}\leq p\leq 1}$. Thus, ${\displaystyle \ln {\mathcal {L}}(p)}$ is maximized at ${\displaystyle p={\frac {1}{2}}}$, and so the MLE of ${\displaystyle p}$ is ${\displaystyle {\frac {1}{2}}}$.
• When ${\displaystyle {\overline {X}}\geq {\frac {1}{2}}}$, the restriction does not exclude ${\displaystyle p={\overline {x}}}$, at which ${\displaystyle \ln {\mathcal {L}}(p)}$ is maximized, and so ${\displaystyle {\overline {X}}}$ is the MLE of ${\displaystyle p}$ in this case.
• Therefore, the MLE of ${\displaystyle p}$ is ${\displaystyle {\hat {p}}=\max \left\{{\overline {X}},{\frac {1}{2}}\right\}}$.

To find the MLE, we sometimes use methods other than derivative tests, and we do not always need the log-likelihood function. Let us illustrate this in the following example.

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the uniform distribution ${\displaystyle {\mathcal {U}}[0,\beta ]}$. Find the MLE of ${\displaystyle \beta }$.

Solution: The pdf of the uniform distribution is ${\displaystyle f(x;\beta )={\frac {1}{\beta }}\mathbf {1} \{0\leq x\leq \beta \}}$. Thus, the likelihood function is ${\displaystyle {\mathcal {L}}(\beta )=\prod _{i=1}^{n}{\frac {1}{\beta }}\mathbf {1} \{0\leq x_{i}\leq \beta \}={\frac {1}{\beta ^{n}}}\prod _{i=1}^{n}\mathbf {1} \{0\leq x_{i}\leq \beta \}}$.

In order for ${\displaystyle {\mathcal {L}}(\beta )}$ to attain its maximum, first, we need ${\displaystyle 0\leq x_{i}\leq \beta }$ for each ${\displaystyle i\in \{1,\dotsc ,n\}}$, so that the product of the indicator functions in the likelihood function is nonzero (its value is then exactly one). Apart from that, since ${\displaystyle \beta \mapsto {\frac {1}{\beta ^{n}}}}$ is a strictly decreasing function of ${\displaystyle \beta }$ (because ${\displaystyle {\frac {d}{d\beta }}\left({\frac {1}{\beta ^{n}}}\right)={\frac {-n}{\beta ^{n+1}}}<0}$, as ${\displaystyle n,\beta >0}$), we should pick a ${\displaystyle \beta }$ that is as small as possible so that ${\displaystyle {\frac {1}{\beta ^{n}}}}$, and hence ${\displaystyle {\mathcal {L}}(\beta )}$, is as large as possible.

As a result, we should choose a ${\displaystyle \beta }$ that is as small as possible, subject to the constraint that ${\displaystyle 0\leq x_{i}\leq \beta }$ for each ${\displaystyle i\in \{1,\dotsc ,n\}}$, which means that ${\displaystyle \beta \geq x_{i}}$ (it is always the case that ${\displaystyle x_{i}\geq 0}$, regardless of the choice of ${\displaystyle \beta }$) for each ${\displaystyle i\in \{1,\dotsc ,n\}}$. It follows that ${\displaystyle {\mathcal {L}}(\beta )}$ attains maximum when ${\displaystyle \beta }$ is the maximum of ${\displaystyle x_{1},\dotsc ,x_{n}}$. Hence, the MLE of ${\displaystyle \beta }$ is ${\displaystyle {\hat {\beta }}=\max\{X_{1},\dotsc ,X_{n}\}}$.
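A short Python sketch (with an assumed true ${\displaystyle \beta }$ used only to simulate data) illustrates the two forces at work: any ${\displaystyle \beta }$ below the sample maximum zeroes out an indicator, while any larger ${\displaystyle \beta }$ strictly shrinks ${\displaystyle \beta ^{-n}}$:

```python
import random

# Sketch: for U[0, beta], L(beta) = beta^(-n) if beta >= max(x), else 0.
# The true beta is assumed only to generate data.
random.seed(1)
beta_true, n = 4.0, 50
x = [random.uniform(0, beta_true) for _ in range(n)]

def likelihood(beta):
    return beta ** (-n) if beta >= max(x) else 0.0

beta_hat = max(x)   # the MLE derived above
assert likelihood(beta_hat) > likelihood(beta_hat + 0.1)   # larger beta is worse
assert likelihood(beta_hat - 1e-9) == 0.0                  # smaller beta is impossible
print(beta_hat)
```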

Exercise. Show that the MLE of ${\displaystyle \beta }$ does not exist if the uniform distribution becomes ${\displaystyle {\mathcal {U}}[0,\beta )}$.

Solution

Proof. In this case, the constraint from the indicator functions becomes ${\displaystyle 0\leq x_{i}<\beta }$ for each ${\displaystyle i\in \{1,\dotsc ,n\}}$. By a similar argument, for the MLE of ${\displaystyle \beta }$, we should choose a ${\displaystyle \beta }$ that is as small as possible subject to this constraint, which means ${\displaystyle \beta >x_{i}}$ for each ${\displaystyle i\in \{1,\dotsc ,n\}}$. However, in this case, we cannot set ${\displaystyle \beta }$ to be the maximum of ${\displaystyle x_{1},\dotsc ,x_{n}}$, or else the constraint is violated and the likelihood function becomes zero due to the indicator functions. Instead, we should set ${\displaystyle \beta }$ to be slightly greater than the maximum of ${\displaystyle x_{1},\dotsc ,x_{n}}$, so that the constraint is still satisfied while ${\displaystyle \beta }$ remains quite small. However, for each such ${\displaystyle \beta >\max\{x_{1},\dotsc ,x_{n}\}}$, we can always choose a smaller ${\displaystyle \beta '}$ that still satisfies the constraint: for example, ${\displaystyle \beta '=\max\{x_{1},\dotsc ,x_{n}\}+{\frac {\beta -\max\{x_{1},\dotsc ,x_{n}\}}{2}}>\max\{x_{1},\dotsc ,x_{n}\}}$, with ${\displaystyle \beta '<\beta }$. Hence, there is no minimum value of ${\displaystyle \beta }$ subject to this constraint. Thus, ${\displaystyle {\mathcal {L}}(\beta )}$ has no maximum point, and hence the MLE does not exist.

${\displaystyle \Box }$

In the following example, we will find the MLE of a parameter vector.

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the normal distribution with mean ${\displaystyle \theta _{1}}$ and variance ${\displaystyle \theta _{2}}$, ${\displaystyle {\mathcal {N}}(\theta _{1},\theta _{2})}$. Find the MLE of ${\displaystyle (\theta _{1},\theta _{2})}$.

Solution: Let ${\displaystyle \theta =(\theta _{1},\theta _{2})}$. The likelihood function is ${\displaystyle {\mathcal {L}}(\theta ;\mathbf {x} )=\prod _{i=1}^{n}{\frac {1}{\sqrt {2\pi \theta _{2}}}}\exp \left(-{\frac {(x_{i}-\theta _{1})^{2}}{2\theta _{2}}}\right)=(2\pi \theta _{2})^{-n/2}\exp \left(-\sum _{i=1}^{n}{\frac {(x_{i}-\theta _{1})^{2}}{2\theta _{2}}}\right)}$, and hence the log-likelihood function is ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )=-{\frac {n}{2}}\ln(2\pi \theta _{2})-\sum _{i=1}^{n}{\frac {(x_{i}-\theta _{1})^{2}}{2\theta _{2}}}}$. Since this function is multivariate, we may use the second partial derivative test from multivariable calculus to find its maximum point(s). However, in this case, we actually do not need such a test. Instead, we fix the variables one at a time to make the function univariate, so that we can use the derivative tests for univariate functions (with the other variable fixed).

We have ${\displaystyle {\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{1}}}={\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})}$ and ${\displaystyle {\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}}}=-{\frac {2n\pi }{4\pi \theta _{2}}}+{\frac {1}{2\theta _{2}^{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}=-{\frac {n}{2\theta _{2}}}+{\frac {1}{2\theta _{2}^{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}$.

Setting the partial derivatives to zero, ${\displaystyle {\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{1}}}=0\implies \sum _{i=1}^{n}(x_{i}-\theta _{1})=0\implies -n\theta _{1}+\sum _{i=1}^{n}x_{i}=0\implies \theta _{1}={\frac {\sum _{i=1}^{n}x_{i}}{n}}={\overline {x}}}$, which is independent of ${\displaystyle \theta _{2}}$ (this is important for this method to work), and ${\displaystyle {\frac {\partial \ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}}}=0\implies {\frac {n}{2\theta _{2}}}={\frac {1}{2\theta _{2}^{2}}}\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)\implies n={\frac {1}{\theta _{2}}}\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)\implies \theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}{n}}}$.

Since ${\displaystyle {\frac {\partial ^{2}\ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{1}^{2}}}={\frac {\partial }{\partial \theta _{1}}}\left({\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})\right)={\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(-1)={\frac {-n}{\theta _{2}}}<0}$, by the second derivative test (for univariate function), ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ is maximized at ${\displaystyle \theta _{1}={\overline {x}}}$, given any fixed ${\displaystyle \theta _{2}}$.

On the other hand, we have ${\displaystyle {\frac {\partial ^{2}\ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}^{2}}}={\frac {n}{2\theta _{2}^{2}}}-{\frac {1}{\theta _{2}^{3}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}$, and thus ${\displaystyle \left.{\frac {\partial ^{2}\ln {\mathcal {L}}(\theta ;\mathbf {x} )}{\partial \theta _{2}^{2}}}\right\vert _{\theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}{n}}}={\frac {n^{3}}{2\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)^{2}}}-{\frac {n^{3}}{\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)^{2}}}=-{\frac {n^{3}}{2\left(\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right)^{2}}}<0}$.

Thus, by the second derivative test, ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ is maximized at ${\displaystyle \theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}}{n}}}$, given any fixed ${\displaystyle \theta _{1}}$.

So, now we fix ${\displaystyle \theta _{1}={\overline {x}}}$, and thus ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ is maximized at ${\displaystyle \theta _{2}={\frac {\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}}{n}}=s^{2}}$, where ${\displaystyle s^{2}}$ is the realization of the sample variance ${\displaystyle S^{2}}$ (defined here with divisor ${\displaystyle n}$). Now, fix ${\displaystyle \theta _{2}=s^{2}}$; we know that ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ attains its maximum at ${\displaystyle \theta _{1}={\overline {x}}}$ for each fixed ${\displaystyle \theta _{2}}$, including ${\displaystyle \theta _{2}=s^{2}}$. As a result, ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ is maximized at ${\displaystyle (\theta _{1},\theta _{2})=({\overline {x}},s^{2})}$. Hence, the MLE of ${\displaystyle (\theta _{1},\theta _{2})}$ is ${\displaystyle ({\overline {X}},S^{2})}$.
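As a numerical cross-check, the closed-form maximizer ${\displaystyle ({\overline {x}},s^{2})}$ should give a log-likelihood at least as large as nearby parameter values. The following Python sketch simulates data from assumed true parameter values (used only for illustration):

```python
import math
import random

# Sketch: the closed-form MLE (x-bar, (1/n) sum (x_i - x-bar)^2) should beat
# nearby parameter values on the log-likelihood surface.
# The true mean/variance are assumed only to simulate data.
random.seed(2)
n = 200
x = [random.gauss(3.0, 1.5) for _ in range(n)]

x_bar = sum(x) / n
s2 = sum((xi - x_bar) ** 2 for xi in x) / n   # divisor n, as in the derivation

def log_lik(t1, t2):
    return -n / 2 * math.log(2 * math.pi * t2) - sum((xi - t1) ** 2 for xi in x) / (2 * t2)

best = log_lik(x_bar, s2)
for d1 in (-0.1, 0.0, 0.1):
    for d2 in (-0.1, 0.0, 0.1):
        assert log_lik(x_bar + d1, s2 + d2) <= best
print(x_bar, s2)
```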

Exercise.

(a) Calculate the determinant of the Hessian matrix of ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ at ${\displaystyle (\theta _{1},\theta _{2})=({\overline {x}},s^{2})}$, which can be expressed as ${\displaystyle {\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{1}^{2}}}({\overline {x}},s^{2}){\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}^{2}}}({\overline {x}},s^{2})-\left({\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}\partial \theta _{1}}}({\overline {x}},s^{2})\right)^{2}}$.

(b) Hence, verify that ${\displaystyle (\theta _{1},\theta _{2})=({\overline {x}},s^{2})}$ is the maximum point of ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ using the second partial derivative test.

Solution

(a) First,

• ${\displaystyle {\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{1}^{2}}}({\overline {x}},s^{2}){\overset {\text{above}}{=}}\left.{\frac {-n}{\theta _{2}}}\right\vert _{\theta _{2}=s^{2}}={\frac {-n}{s^{2}}}}$
• ${\displaystyle {\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}^{2}}}({\overline {x}},s^{2}){\overset {\text{above}}{=}}\left.{\frac {n}{2\theta _{2}^{2}}}-{\frac {1}{\theta _{2}^{3}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})^{2}\right\vert _{(\theta _{1},\theta _{2})=({\overline {x}},s^{2})}={\frac {n}{2(s^{2})^{2}}}-{\frac {1}{(s^{2})^{3}}}\cdot ns^{2}={\frac {n}{2(s^{2})^{2}}}-{\frac {n}{(s^{2})^{2}}}={\frac {-n}{2(s^{2})^{2}}}}$
• ${\displaystyle {\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{2}\partial \theta _{1}}}({\overline {x}},s^{2}){\overset {\text{above}}{=}}\left.{\frac {\partial }{\partial \theta _{2}}}\left({\frac {1}{\theta _{2}}}\sum _{i=1}^{n}(x_{i}-\theta _{1})\right)\right\vert _{(\theta _{1},\theta _{2})=({\overline {x}},s^{2})}=\left.-{\frac {\sum _{i=1}^{n}(x_{i}-\theta _{1})}{\theta _{2}^{2}}}\right\vert _{(\theta _{1},\theta _{2})=({\overline {x}},s^{2})}=-{\frac {\sum _{i=1}^{n}(x_{i}-{\overline {x}})}{(s^{2})^{2}}}=-{\frac {\sum _{i=1}^{n}(x_{i})-n{\overline {x}}}{(s^{2})^{2}}}=-{\frac {n{\overline {x}}-n{\overline {x}}}{(s^{2})^{2}}}=0}$

As a result, the determinant of the Hessian matrix is ${\displaystyle {\frac {-n}{s^{2}}}\cdot {\frac {-n}{2(s^{2})^{2}}}={\frac {n^{2}}{2(s^{2})^{3}}}}$.

(b) From (a), the determinant of the Hessian matrix is positive. Also, ${\displaystyle {\frac {\partial ^{2}\ln {\mathcal {L}}}{\partial \theta _{1}^{2}}}({\overline {x}},s^{2})=-{\frac {n}{s^{2}}}<0}$. Thus, by the second partial derivative test, ${\displaystyle \ln {\mathcal {L}}(\theta ;\mathbf {x} )}$ attains maximum at ${\displaystyle (\theta _{1},\theta _{2})=({\overline {x}},s^{2})}$.

Exercise. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the exponential distribution with rate parameter ${\displaystyle \lambda }$, with pdf ${\displaystyle f(x;\lambda )=\lambda e^{-\lambda x},\quad x\geq 0}$, where ${\displaystyle \lambda >0}$. Show that the MLE of ${\displaystyle \lambda }$ is ${\displaystyle {\frac {1}{\overline {X}}}}$.

Solution

Proof. The likelihood function is ${\displaystyle {\mathcal {L}}(\lambda )=\prod _{i=1}^{n}(\lambda e^{-\lambda x_{i}})=\lambda ^{n}\exp \left(-\lambda \sum _{i=1}^{n}x_{i}\right)}$. Thus, the log-likelihood function is ${\displaystyle \ln {\mathcal {L}}(\lambda )=n\ln \lambda -\lambda \sum _{i=1}^{n}x_{i}}$. Differentiating the log-likelihood function with respect to ${\displaystyle \lambda }$ gives ${\displaystyle {\frac {d}{d\lambda }}\ln {\mathcal {L}}(\lambda )={\frac {n}{\lambda }}-\sum _{i=1}^{n}x_{i}}$. Setting the derivative to be zero, we get ${\displaystyle {\frac {n}{\lambda }}-\sum _{i=1}^{n}x_{i}=0\implies {\frac {n}{\lambda }}-n{\overline {x}}=0\implies {\frac {1}{\lambda }}={\overline {x}}\implies \lambda ={\frac {1}{\overline {x}}}}$. It remains to verify that ${\displaystyle \ln {\mathcal {L}}(\lambda )}$ attains maximum at ${\displaystyle \lambda ={\frac {1}{\overline {x}}}}$. Since ${\displaystyle {\frac {d^{2}}{d\lambda ^{2}}}\ln {\mathcal {L}}(\lambda )=-{\frac {n}{\lambda ^{2}}}<0}$, this is verified. Hence, the MLE of ${\displaystyle \lambda }$ is ${\displaystyle {\frac {1}{\overline {X}}}}$.

${\displaystyle \Box }$
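The result of this exercise can also be sanity-checked numerically; the realizations in the following Python sketch are hypothetical:

```python
import math

# Sketch: ln L(lambda) = n ln(lambda) - lambda * sum x peaks at lambda = 1/x-bar.
# The realizations are hypothetical.
x = [0.5, 1.2, 0.3, 2.0, 0.9]
n, s = len(x), sum(x)

def log_lik(lam):
    return n * math.log(lam) - lam * s

lam_hat = n / s   # 1 / x-bar
for lam in (0.5 * lam_hat, 0.9 * lam_hat, 1.1 * lam_hat, 2.0 * lam_hat):
    assert log_lik(lam) < log_lik(lam_hat)
print(lam_hat)
```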

Example. (Application of maximum likelihood estimation) Suppose you are given a box containing four balls, with an unknown number of red and black balls. Now, you draw three balls out of the box, and find that you get two red balls and one black ball. Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

Solution: Given the colors of the balls drawn, we know that the box contains at least two red balls and at least one black ball. This means the box contains either two or three red balls. Let ${\displaystyle r}$ be the number of red balls inside the box. Then, the number of black balls inside the box is ${\displaystyle 4-r}$, and the possible values of the parameter ${\displaystyle r}$ are 2 and 3.

Now, we compare the probability of getting such result from drawing three balls when ${\displaystyle r=2}$ and ${\displaystyle r=3}$.

• For ${\displaystyle r=2}$, the probability is ${\displaystyle {\frac {{\binom {2}{2}}{\binom {2}{1}}}{\binom {4}{3}}}={\frac {1}{2}}}$ (consider the pmf of hypergeometric distribution).
• For ${\displaystyle r=3}$, the probability is ${\displaystyle {\frac {{\binom {3}{2}}{\binom {1}{1}}}{\binom {4}{3}}}={\frac {3}{4}}}$.

Hence, the maximum likelihood estimate of ${\displaystyle r}$ is 3. Thus, the estimated number of the red balls is 3, and that of the black balls is 1.
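The likelihood comparison above can be reproduced with a few lines of Python, writing the hypergeometric pmf directly with binomial coefficients:

```python
from math import comb

def likelihood(r):
    # P(2 red and 1 black in 3 draws | r red balls among 4): hypergeometric pmf
    return comb(r, 2) * comb(4 - r, 1) / comb(4, 3)

candidates = [2, 3]  # the only values consistent with the observed draw
mle_r = max(candidates, key=likelihood)
```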

Exercise. Suppose the box now contains 100 balls, with an unknown number of red and black balls. Now, you draw 99 balls out of the box, and find out that you get 98 red balls and one black ball. Using maximum likelihood estimation, estimate the number of red and black balls inside the box.

Solution

Similarly, the box contains at least 98 red balls and at least one black ball. We use the same notation as in the above example. Then, the number of black balls is ${\displaystyle 100-r}$, and the possible values of the parameter ${\displaystyle r}$ are 98 and 99.

• For ${\displaystyle r=98}$, the probability is ${\displaystyle {\frac {{\binom {98}{98}}{\binom {2}{1}}}{\binom {100}{99}}}=0.02}$
• For ${\displaystyle r=99}$, the probability is ${\displaystyle {\frac {{\binom {99}{98}}{\binom {1}{1}}}{\binom {100}{99}}}=0.99}$

Thus, the maximum likelihood estimate of ${\displaystyle r}$ is 99, so the estimated number of red balls is 99 and that of black balls is 1.

Remark.

• The difference between the probabilities for the two possible values of ${\displaystyle r}$ is much larger in this case.
• Intuitively, given such a draw result, it seems quite unlikely that the box contains two black balls: that would mean the single ball left undrawn is black, and you somehow drew out all 98 red balls together with only one of the two black balls.

## Method of moments estimator (MME)

For maximum likelihood estimation, we need to utilize the likelihood function, which is found from the joint pmf or pdf of the random sample from a distribution. However, in practice we may not know the pmf or pdf of the distribution exactly. Instead, we may just know some information about the distribution, e.g. its mean, variance, and some moments (the ${\displaystyle r}$th moment of a random variable ${\displaystyle X}$ is ${\displaystyle \mathbb {E} [X^{r}]}$, which we denote by ${\displaystyle \mu _{r}}$ for simplicity). Such moments often contain information about the unknown parameter. For example, for a normal distribution ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$, we know that ${\displaystyle \mu =\mu _{1}}$ and ${\displaystyle \sigma ^{2}=\mu _{2}-(\mu _{1})^{2}}$. Because of this, when we want to estimate the parameters, we can do so by estimating the moments.

Now, we would like to know how to estimate the moments. We let ${\displaystyle m_{r}={\frac {\sum _{i=1}^{n}X_{i}^{r}}{n}}}$ be the ${\displaystyle r}$th sample moment [5], where the ${\displaystyle X_{i}}$'s are independent and identically distributed. By the weak law of large numbers (assuming the conditions are satisfied), we have

• ${\displaystyle {\overline {X}}=m_{1}\;{\overset {p}{\to }}\;\mathbb {E} [X]=\mu _{1}}$
• ${\displaystyle m_{2}\;{\overset {p}{\to }}\;\mathbb {E} [X^{2}]=\mu _{2}}$ (this can be seen from replacing the "${\displaystyle X}$" by "${\displaystyle X^{2}}$" in the weak law of large numbers; the conditions are still satisfied, and so we can still apply the weak law of large numbers)

In general, we have ${\displaystyle m_{r}\;{\overset {p}{\to }}\;\mu _{r}}$, since the conditions are still satisfied after replacing the "${\displaystyle X}$" by "${\displaystyle X^{r}}$" in the weak law of large numbers.

Because of these results, we can estimate the ${\displaystyle r}$th moment ${\displaystyle \mu _{r}}$ using the ${\displaystyle r}$th sample moment ${\displaystyle m_{r}}$, and the estimation is "better" when ${\displaystyle n}$ is large. For example, in the above normal distribution example, we can estimate ${\displaystyle \mu }$ by ${\displaystyle m_{1}}$ and ${\displaystyle \sigma ^{2}}$ by ${\displaystyle m_{2}-(m_{1})^{2}}$; such estimators are called method of moments estimators.

To be more precise, we have the following definition of the method of moments:

Definition. (Method of moments) Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from a distribution with pdf or pmf ${\displaystyle f(x;\theta _{1},\dotsc ,\theta _{k})}$. Write ${\displaystyle k}$ moment(s), e.g. ${\displaystyle \mu _{1},\dotsc ,\mu _{k}}$, as function(s) of ${\displaystyle \theta _{1},\dotsc ,\theta _{k}}$: ${\displaystyle g_{1}(\theta _{1},\dotsc ,\theta _{k}),\dotsc ,g_{k}(\theta _{1},\dotsc ,\theta _{k})}$ respectively. Then, the method of moments estimator (MME) of ${\displaystyle \theta _{1},\dotsc ,\theta _{k}}$, ${\displaystyle {\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k}}$ respectively, is given by the solution (in the form of ${\displaystyle {\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k}}$ in terms of ${\displaystyle m_{1},\dotsc ,m_{k}}$, corresponding to the ${\displaystyle k}$ moments ${\displaystyle \mu _{1},\dotsc ,\mu _{k}}$) to the following system of equations: ${\displaystyle {\begin{cases}m_{1}=g_{1}({\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k})\\\vdots \\m_{k}=g_{k}({\hat {\theta }}_{1},\dotsc ,{\hat {\theta }}_{k})\\\end{cases}}}$

Remark.

• When there are ${\displaystyle k}$ unknown parameters, we need to solve a system of ${\displaystyle k}$ equations, involving ${\displaystyle k}$ sample moments.
• Usually, we select the first ${\displaystyle k}$ moments for the ${\displaystyle k}$ moments, as in the definition. But this is not necessary, and we may choose other moments, including fractional moments (e.g. ${\displaystyle \mathbb {E} [X^{1/2}]}$, and we use ${\displaystyle m_{1/2}}$ in this case).
• Because of this, the method of moments estimator is not unique.

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$. Find the MME of ${\displaystyle \mu }$ and ${\displaystyle \sigma ^{2}}$.

Solution: First, there are two unknown parameters. Thus, we need to solve a system of 2 equations, involving 2 sample moments and 2 moments. Since ${\displaystyle \mu _{1}=\mu }$ and ${\displaystyle \mu _{2}=\sigma ^{2}+(\mu )^{2}}$, consider the following system of equations: ${\displaystyle {\begin{cases}m_{1}={\hat {\mu }}&(1)\\m_{2}={\widehat {\sigma ^{2}}}+({\hat {\mu }})^{2}&(2)\\\end{cases}}}$ Substituting ${\displaystyle (1)}$ into ${\displaystyle (2)}$, we get ${\displaystyle m_{2}={\widehat {\sigma ^{2}}}+(m_{1})^{2}\Leftrightarrow {\widehat {\sigma ^{2}}}=m_{2}-(m_{1})^{2}}$. Hence, the MME of ${\displaystyle \mu }$ is ${\displaystyle {\hat {\mu }}=m_{1}}$ and the MME of ${\displaystyle \sigma ^{2}}$ is ${\displaystyle {\widehat {\sigma ^{2}}}=m_{2}-(m_{1})^{2}}$.
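A small simulation, with illustrative true values ${\displaystyle \mu =3}$ and ${\displaystyle \sigma =2}$ chosen arbitrarily, can be used to check that these MMEs recover the parameters for a large sample:

```python
import random

random.seed(1)
mu, sigma = 3.0, 2.0  # "true" parameters, assumed for the simulation
xs = [random.gauss(mu, sigma) for _ in range(50_000)]
n = len(xs)

m1 = sum(xs) / n                 # first sample moment
m2 = sum(x * x for x in xs) / n  # second sample moment

mu_hat = m1                      # MME of mu
sigma2_hat = m2 - m1 ** 2        # MME of sigma^2
```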

Remark.

• We can see that the process of finding the MME of ${\displaystyle \mu }$ and ${\displaystyle \sigma ^{2}}$ is much easier than finding the MLE of ${\displaystyle \mu }$ and ${\displaystyle \sigma ^{2}}$. This is because the expressions of the first and second moments in terms of the parameters are simple in this case. However, when the expressions are more complicated, finding the MME of the parameters can be quite involved.

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the exponential distribution with rate parameter ${\displaystyle \lambda }$. Find the MME of ${\displaystyle \lambda }$ and compare it to the MLE of ${\displaystyle \lambda }$.

Solution: Since ${\displaystyle \mu _{1}={\frac {1}{\lambda }}}$, consider the following equation: ${\displaystyle m_{1}={\frac {1}{\hat {\lambda }}}}$. We then have ${\displaystyle {\hat {\lambda }}={\frac {1}{m_{1}}}}$. Hence, the MME of ${\displaystyle \lambda }$ is ${\displaystyle {\hat {\lambda }}={\frac {1}{m_{1}}}={\frac {1}{\overline {X}}}}$, which coincides with the MLE of ${\displaystyle \lambda }$.

Exercise. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the uniform distribution ${\displaystyle {\mathcal {U}}[a,b]}$. Show that the MMEs of ${\displaystyle a}$ and ${\displaystyle b}$ are ${\displaystyle {\hat {a}}=m_{1}-{\sqrt {3(m_{2}-m_{1}^{2})}}}$ and ${\displaystyle {\hat {b}}=m_{1}+{\sqrt {3(m_{2}-m_{1}^{2})}}}$ respectively.

Solution

Proof. Since ${\displaystyle \mu _{1}={\frac {a+b}{2}}}$ and ${\displaystyle \mu _{2}={\frac {(b-a)^{2}}{12}}+\left({\frac {a+b}{2}}\right)^{2}={\frac {b^{2}-2ab+a^{2}+3a^{2}+6ab+3b^{2}}{12}}={\frac {4a^{2}+4b^{2}+4ab}{12}}={\frac {a^{2}+b^{2}+ab}{3}}}$, consider the following system of equations: ${\displaystyle {\begin{cases}m_{1}=({\hat {a}}+{\hat {b}})/2&(1)\\m_{2}=({\hat {a}}^{2}+{\hat {a}}{\hat {b}}+{\hat {b}}^{2})/3&(2)\\\end{cases}}}$ From ${\displaystyle (1)}$, we have ${\displaystyle {\hat {b}}=2m_{1}-{\hat {a}}}$. Substituting it into ${\displaystyle (2)}$, we have ${\displaystyle m_{2}={\big (}{\hat {a}}^{2}+{\hat {a}}(2m_{1}-{\hat {a}})+(2m_{1}-{\hat {a}})^{2}{\big )}/3={\big (}{\hat {a}}^{2}-2m_{1}{\hat {a}}+4m_{1}^{2}{\big )}/3\Leftrightarrow {\hat {a}}^{2}-2m_{1}{\hat {a}}+4m_{1}^{2}=3m_{2}\Leftrightarrow {\hat {a}}^{2}-2m_{1}{\hat {a}}+4m_{1}^{2}-3m_{2}=0.}$ Solving this equation by the quadratic formula, we get ${\displaystyle {\hat {a}}={\frac {2m_{1}\pm {\sqrt {4m_{1}^{2}-4(4m_{1}^{2}-3m_{2})}}}{2}}={\frac {2m_{1}\pm {\sqrt {12m_{2}-12m_{1}^{2}}}}{2}}={\frac {2m_{1}\pm 2{\sqrt {3(m_{2}-m_{1}^{2})}}}{2}}=m_{1}\pm {\sqrt {3(m_{2}-m_{1}^{2})}}}$.

When ${\displaystyle {\hat {a}}=m_{1}+{\sqrt {3(m_{2}-m_{1}^{2})}}}$, ${\displaystyle {\hat {b}}=m_{1}-{\sqrt {3(m_{2}-m_{1}^{2})}}<{\hat {a}}}$. However, from the definition of the uniform distribution, we need to have ${\displaystyle {\hat {a}}<{\hat {b}}}$, and thus this case is rejected.

When ${\displaystyle {\hat {a}}=m_{1}-{\sqrt {3(m_{2}-m_{1}^{2})}}}$, ${\displaystyle {\hat {b}}=m_{1}+{\sqrt {3(m_{2}-m_{1}^{2})}}>{\hat {a}}}$, which satisfies the definition of the uniform distribution.

Thus, we have the desired result.

${\displaystyle \Box }$
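As a hedged numerical check of this result (the endpoints ${\displaystyle a=1,b=5}$, the seed, and the sample size below are arbitrary choices), we can compute the two MMEs from a simulated uniform sample:

```python
import math
import random

random.seed(2)
a, b = 1.0, 5.0  # "true" endpoints, assumed for the simulation
xs = [random.uniform(a, b) for _ in range(100_000)]
n = len(xs)

m1 = sum(xs) / n
m2 = sum(x * x for x in xs) / n
spread = math.sqrt(3 * (m2 - m1 ** 2))

a_hat = m1 - spread  # MME of a
b_hat = m1 + spread  # MME of b
```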

## Properties of estimators

In this section, we will introduce some criteria for evaluating how "good" a point estimator is, namely unbiasedness, efficiency and consistency.

### Unbiasedness

For ${\displaystyle {\hat {\theta }}}$ to be a "good" estimator of a parameter ${\displaystyle \theta }$, a desirable property is that its expected value equals the value of the parameter ${\displaystyle \theta }$, or is at least close to that value. Because of this, we introduce a quantity, namely the bias, to measure how close the mean of ${\displaystyle {\hat {\theta }}}$ is to ${\displaystyle \theta }$.

Definition. (Bias) The bias of an estimator ${\displaystyle {\hat {\theta }}}$ is ${\displaystyle \operatorname {Bias} ({\hat {\theta }})=\mathbb {E} [{\hat {\theta }}]-\theta .}$

We will also define some terms related to bias.

Definition. ((Un)biased estimator) An estimator ${\displaystyle {\hat {\theta }}}$ is an unbiased estimator of a parameter ${\displaystyle \theta }$ if ${\displaystyle \operatorname {Bias} ({\hat {\theta }})=0}$. Otherwise, the estimator is called a biased estimator.

Definition. (Asymptotically unbiased estimator) An estimator ${\displaystyle {\hat {\theta }}}$ is an asymptotically unbiased estimator of a parameter ${\displaystyle \theta }$ if ${\displaystyle \lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0}$ where ${\displaystyle n}$ is the sample size.

Remark.

• An unbiased estimator must be an asymptotically unbiased estimator, but the converse is not true, i.e., an asymptotically unbiased estimator may not be an unbiased estimator. Thus, a biased estimator may be an asymptotically unbiased estimator.
• When we discuss the goodness of estimators in terms of unbiasedness, an unbiased estimator is better than an asymptotically unbiased estimator, which is in turn better than a biased estimator that is not even asymptotically unbiased.
• However, there are also other criteria for evaluating the goodness of estimators apart from unbiasedness, so when we also account for other criteria, a biased estimator may be somehow "better" than an unbiased estimator overall.

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the Bernoulli distribution with success probability ${\displaystyle p}$. Show that the MLE of ${\displaystyle p}$, ${\displaystyle {\overline {X}}}$, is an unbiased estimator of ${\displaystyle p}$.

Proof. Since ${\displaystyle \mathbb {E} [{\overline {X}}]={\frac {1}{n}}\cdot \mathbb {E} \left[\sum _{i=1}^{n}X_{i}\right]={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} [X_{i}]={\frac {1}{n}}\cdot \sum _{i=1}^{n}p={\frac {np}{n}}=p}$, the result follows.

${\displaystyle \Box }$
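Unbiasedness of ${\displaystyle {\overline {X}}}$ can be illustrated by averaging the estimate over many independent Bernoulli samples; the values of ${\displaystyle p}$, ${\displaystyle n}$, the seed and the number of replications below are arbitrary illustrative choices:

```python
import random

random.seed(3)
p, n, reps = 0.3, 20, 20_000

# average the estimate p_hat = sample mean over many independent samples;
# unbiasedness says this average should be close to p
total = 0.0
for _ in range(reps):
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    total += sum(sample) / n
mean_of_estimates = total / reps
```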

Exercise. Suppose the Bernoulli distribution is replaced by the binomial distribution with ${\displaystyle n}$ trials and success probability ${\displaystyle p}$ (so ${\displaystyle n}$ denotes both the number of trials and the sample size here). Show that ${\displaystyle {\overline {X}}}$ is a biased estimator of ${\displaystyle p}$. Modify this estimator such that it is an unbiased estimator of ${\displaystyle p}$.

Solution

Proof. Since ${\displaystyle \mathbb {E} [{\overline {X}}]={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} [X_{i}]={\frac {1}{n}}\sum _{i=1}^{n}np=np\neq p}$ (for ${\displaystyle n>1}$ and ${\displaystyle p\neq 0}$), ${\displaystyle {\overline {X}}}$ is a biased estimator of ${\displaystyle p}$.

${\displaystyle \Box }$

We can modify this estimator to ${\displaystyle {\frac {\overline {X}}{n}}}$, and then its mean is ${\displaystyle {\frac {np}{n}}=p}$. Alternatively, we may choose the estimator to be ${\displaystyle {\frac {X_{i}}{n}}}$ (${\displaystyle i\in \{1,\dotsc ,n\}}$), whose mean is also ${\displaystyle p}$ (Other estimators whose mean is ${\displaystyle p}$ are also fine).

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$. Show that the MLE of ${\displaystyle \mu }$, ${\displaystyle {\overline {X}}}$, is an unbiased estimator of ${\displaystyle \mu }$, and the MLE of ${\displaystyle \sigma ^{2}}$, ${\displaystyle S^{2}}$, is an asymptotically unbiased estimator of ${\displaystyle \sigma ^{2}}$.

Proof. First, since ${\displaystyle \mathbb {E} [{\overline {X}}]={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} [X_{i}]={\frac {1}{n}}\sum _{i=1}^{n}\mu =\mu }$, ${\displaystyle {\overline {X}}}$ is an unbiased estimator of ${\displaystyle \mu }$.

On the other hand, {\displaystyle {\begin{aligned}\mathbb {E} [S^{2}]&={\frac {1}{n}}\sum _{i=1}^{n}\mathbb {E} \left[(X_{i}-{\overline {X}})^{2}\right]\\&={\frac {1}{n}}\sum _{i=1}^{n}\operatorname {Var} \left(X_{i}-{\overline {X}}\right)&{\text{since }}\mathbb {E} [X_{i}-{\overline {X}}]=\mathbb {E} [X_{i}]-\mathbb {E} [{\overline {X}}]=\mu -\mu =0\\&={\frac {1}{n}}\sum _{i=1}^{n}\operatorname {Var} \left({\frac {nX_{i}-(X_{1}+\dotsb +X_{n})}{n}}\right)\\&={\frac {1}{n}}\sum _{i=1}^{n}\operatorname {Var} \left({\frac {(n-1)X_{i}}{n}}-{\frac {X_{1}+\dotsb +X_{i-1}+X_{i+1}+\dotsb +X_{n}}{n}}\right)\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[\operatorname {Var} \left({\frac {(n-1)X_{i}}{n}}\right)+\operatorname {Var} \left({\frac {X_{1}+\dotsb +X_{i-1}+X_{i+1}+\dotsb +X_{n}}{n}}\right)\right]&{\text{by independence}}\\&={\frac {1}{n}}\sum _{i=1}^{n}\left[{\frac {(n-1)^{2}}{n^{2}}}\sigma ^{2}+{\frac {n-1}{n^{2}}}\sigma ^{2}\right]&{\text{since the }}X_{i}{\text{'s are iid}}\\&={\frac {1}{n}}\sum _{i=1}^{n}{\frac {\sigma ^{2}}{n^{2}}}(n^{2}-2n+1+n-1)\\&={\frac {1}{n}}\sum _{i=1}^{n}{\frac {(n^{2}-n)\sigma ^{2}}{n^{2}}}\\&={\frac {1}{n}}\cdot n\cdot {\frac {(n-1)\sigma ^{2}}{n}}\\&={\frac {n-1}{n}}\sigma ^{2}.\end{aligned}}} Thus,
${\displaystyle \lim _{n\to \infty }\mathbb {E} [S^{2}]=\lim _{n\to \infty }\left({\frac {n-1}{n}}\sigma ^{2}\right)=\sigma ^{2}\lim _{n\to \infty }\left(1-{\frac {1}{n}}\right)=\sigma ^{2}\left(1-\lim _{n\to \infty }{\frac {1}{n}}\right)=\sigma ^{2}}$, as desired.

${\displaystyle \Box }$

Exercise. Modify the estimator of ${\displaystyle \sigma ^{2}}$ such that it becomes an unbiased estimator.

Solution

The estimator can be modified as ${\displaystyle {\frac {n}{n-1}}S^{2}={\frac {\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}{n-1}}}$.
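The following Python sketch (with illustrative choices ${\displaystyle \sigma ^{2}=4}$ and ${\displaystyle n=5}$, both arbitrary) contrasts the biased estimator ${\displaystyle S^{2}}$ (divisor ${\displaystyle n}$) with the modified estimator (divisor ${\displaystyle n-1}$); on average the former lands near ${\displaystyle {\frac {n-1}{n}}\sigma ^{2}}$ and the latter near ${\displaystyle \sigma ^{2}}$:

```python
import random

random.seed(4)
sigma = 2.0            # true variance sigma^2 = 4, assumed for the simulation
n, reps = 5, 40_000

sum_biased = sum_unbiased = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    sum_biased += ss / n          # S^2 as defined here (divisor n)
    sum_unbiased += ss / (n - 1)  # modified estimator (divisor n - 1)

avg_biased = sum_biased / reps      # expect about (n-1)/n * sigma^2 = 3.2
avg_unbiased = sum_unbiased / reps  # expect about sigma^2 = 4
```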

### Efficiency

We have discussed how to evaluate estimators in terms of unbiasedness. Now, if we are given two unbiased estimators, ${\displaystyle {\hat {\theta }}}$ and ${\displaystyle {\tilde {\theta }}}$, how should we compare their goodness? In terms of unbiasedness alone they are equally good, so we need another criterion in this case. One natural choice is to compare their variances: the estimator with the smaller variance is better, since on average it deviates less from its mean, which (by the definition of an unbiased estimator) is the value of the unknown parameter; the one with smaller variance is thus more accurate in a deviation sense. Indeed, an unbiased estimator can still have a large variance, and thus deviate a lot from its mean; it is unbiased only because the positive and negative deviations cancel out on average. This is the idea of efficiency.

Definition. (Efficiency) Suppose ${\displaystyle {\color {blue}{\hat {\theta }}}}$ and ${\displaystyle {\color {red}{\tilde {\theta }}}}$ are two unbiased estimators of an unknown parameter ${\displaystyle \theta }$. The efficiency of ${\displaystyle {\color {blue}{\hat {\theta }}}}$ relative to ${\displaystyle {\color {red}{\tilde {\theta }}}}$ is ${\displaystyle \operatorname {Eff} ({\color {blue}{\hat {\theta }}},{\color {red}{\tilde {\theta }}})={\frac {\operatorname {Var} ({\color {red}{\tilde {\theta }}})}{\operatorname {Var} ({\color {blue}{\hat {\theta }}})}}}$. If ${\displaystyle \operatorname {Eff} ({\color {blue}{\hat {\theta }}},{\color {red}{\tilde {\theta }}})>1}$, then we say that ${\displaystyle {\color {blue}{\hat {\theta }}}}$ is relatively more efficient than ${\displaystyle {\color {red}{\tilde {\theta }}}}$.

Remark.

• Since ${\displaystyle \operatorname {Eff} ({\color {blue}{\hat {\theta }}},{\color {red}{\tilde {\theta }}})>1\Leftrightarrow \operatorname {Var} ({\color {blue}{\hat {\theta }}})<\operatorname {Var} ({\color {red}{\tilde {\theta }}})}$, the estimator with smaller variance is relatively more efficient than the estimator with larger variance.
• Normally, the variances involved are nonzero, so the efficiency is well-defined in typical cases.
• Sometimes, it is also called relative efficiency, since it describes how many times ${\displaystyle \operatorname {Var} ({\color {blue}{\hat {\theta }}})}$ goes into ${\displaystyle \operatorname {Var} ({\color {red}{\tilde {\theta }}})}$.
• One may ask why the ratio of variances, rather than the difference in variances, is used in the definition. A possible reason is that the ratio of variances is unit-free (the units of the variances, if any, cancel out), while the difference in variances can carry a unit. Also, using the ratio allows us to compare different efficiencies, calculated from different variances, numerically.

Actually, for an unbiased estimator, since its mean is the unknown parameter ${\displaystyle \theta }$, its variance measures the mean of the squared deviation from ${\displaystyle \theta }$. We have a specific term for this kind of deviation, namely the mean squared error (MSE).

Definition. (Mean squared error) Suppose ${\displaystyle {\hat {\theta }}}$ is an estimator of a parameter ${\displaystyle \theta }$. The mean squared error (MSE) of ${\displaystyle {\hat {\theta }}}$ is ${\displaystyle \operatorname {MSE} ({\hat {\theta }})=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]}$.

Remark.

• From this definition, ${\displaystyle \operatorname {MSE} ({\hat {\theta }})}$ is the mean value of the square of the error ${\displaystyle {\hat {\theta }}-\theta }$, and hence the name mean squared error.

Notice that in the definition of MSE, we do not require ${\displaystyle {\hat {\theta }}}$ to be an unbiased estimator. Thus, ${\displaystyle {\hat {\theta }}}$ in the definition may be biased. We have mentioned that when ${\displaystyle {\hat {\theta }}}$ is unbiased, its variance is exactly its MSE. In the following, we will give a more general relationship between ${\displaystyle \operatorname {MSE} ({\hat {\theta }})}$ and ${\displaystyle \operatorname {Var} ({\hat {\theta }})}$, not just for unbiased estimators.

Proposition. (Relationship between mean squared error and variance) If ${\displaystyle \operatorname {Var} ({\hat {\theta }})}$ exists, then ${\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {Var} ({\hat {\theta }})+[\operatorname {Bias} ({\hat {\theta }})]^{2}}$.

Proof. By definition, we have ${\displaystyle \operatorname {MSE} ({\hat {\theta }})=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]}$ and ${\displaystyle \operatorname {Var} ({\hat {\theta }})=\mathbb {E} \left[({\hat {\theta }}-\mathbb {E} [{\hat {\theta }}])^{2}\right]}$. From these, we are motivated to write {\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]\\&=\mathbb {E} \left[{\big (}({\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]})+({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta ){\big )}^{2}\right]\\&=\mathbb {E} [({\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]})^{2}+2({\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]})\underbrace {({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta )} _{\text{constant}}+({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta )^{2}]\\&=\operatorname {Var} ({\hat {\theta }})+2({\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}-\theta )\underbrace {\mathbb {E} [{\hat {\theta }}-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}]} _{=\mathbb {E} [{\hat {\theta }}]-{\color {darkgreen}\mathbb {E} [{\hat {\theta }}]}=0}+[\operatorname {Bias} ({\hat {\theta }})]^{2}\\&=\operatorname {Var} ({\hat {\theta }})+[\operatorname {Bias} ({\hat {\theta }})]^{2},\end{aligned}}} as desired.

${\displaystyle \Box }$
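The decomposition can be checked numerically; the same algebraic identity holds exactly for sample averages, so the simulated MSE matches the simulated variance plus squared bias up to floating-point error. The shrunken estimator ${\displaystyle 0.9{\overline {X}}}$ below is a deliberately biased toy example, with all numbers chosen arbitrarily:

```python
import random

random.seed(5)
mu, n, reps = 2.0, 10, 30_000

# a deliberately biased estimator of mu: shrink the sample mean by 10%
estimates = []
for _ in range(reps):
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    estimates.append(0.9 * sum(xs) / n)

mean_est = sum(estimates) / reps
mse = sum((e - mu) ** 2 for e in estimates) / reps       # empirical MSE
var = sum((e - mean_est) ** 2 for e in estimates) / reps  # empirical variance
bias = mean_est - mu                                      # empirical bias
```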

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ (${\displaystyle n>1}$) be a random sample from ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$.

(a) Show that the single observation estimator ${\displaystyle X_{1}}$ is an unbiased estimator for ${\displaystyle \mu }$.

(b) Calculate the MSE of ${\displaystyle X_{1}}$ and ${\displaystyle {\overline {X}}}$ respectively.

(c) Which of ${\displaystyle X_{1}}$ and ${\displaystyle {\overline {X}}}$ is a better estimator of ${\displaystyle \mu }$ in terms of unbiasedness and efficiency?

Solution:

(a) Since ${\displaystyle \mathbb {E} [X_{1}]=\mu }$, the result follows.

(b) ${\displaystyle \operatorname {MSE} (X_{1})=\operatorname {Var} (X_{1})+0^{2}=\sigma ^{2}}$, and ${\displaystyle \operatorname {MSE} ({\overline {X}})=\operatorname {Var} ({\overline {X}})={\frac {1}{n^{2}}}\sum _{i=1}^{n}\operatorname {Var} (X_{i})={\frac {n\sigma ^{2}}{n^{2}}}={\frac {\sigma ^{2}}{n}}}$.

(c) Since ${\displaystyle \operatorname {MSE} ({\overline {X}})<\operatorname {MSE} (X_{1})\Leftrightarrow \operatorname {Var} ({\overline {X}})<\operatorname {Var} (X_{1})}$, ${\displaystyle {\overline {X}}}$ is relatively more efficient than ${\displaystyle X_{1}}$. Since both ${\displaystyle X_{1}}$ and ${\displaystyle {\overline {X}}}$ are unbiased estimators of ${\displaystyle \mu }$, we conclude that ${\displaystyle {\overline {X}}}$ is a better estimator of ${\displaystyle \mu }$ in terms of unbiasedness and efficiency.
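A short simulation (with arbitrary illustrative values ${\displaystyle \mu =1}$, ${\displaystyle \sigma =3}$, ${\displaystyle n=25}$) confirms that the sample mean has a much smaller MSE than the single-observation estimator:

```python
import random

random.seed(6)
mu, sigma, n = 1.0, 3.0, 25
reps = 20_000

mse_x1 = mse_xbar = 0.0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    mse_x1 += (xs[0] - mu) ** 2          # single-observation estimator X_1
    mse_xbar += (sum(xs) / n - mu) ** 2  # sample mean
mse_x1 /= reps    # expect about sigma^2 = 9
mse_xbar /= reps  # expect about sigma^2 / n = 0.36
```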

Exercise. In addition to the random sample with sample size ${\displaystyle n}$ in the example, suppose we take another random sample with sample size ${\displaystyle m}$. Let ${\displaystyle {\overline {X}}^{(n)}}$ and ${\displaystyle {\overline {X}}^{(m)}}$ denote the sample mean for the sample with sample size ${\displaystyle n}$ and ${\displaystyle m}$ respectively.

(a) Calculate ${\displaystyle \operatorname {Eff} \left({\overline {X}}^{(n)},{\overline {X}}^{(m)}\right)}$.

(b) State the condition on the sample sizes ${\displaystyle m}$ and ${\displaystyle n}$ under which ${\displaystyle {\overline {X}}^{(m)}}$ is relatively more efficient than ${\displaystyle {\overline {X}}^{(n)}}$.

Solution

(a) Since ${\displaystyle \operatorname {Var} \left({\overline {X}}^{(n)}\right)={\frac {\sigma ^{2}}{n}}}$ (from example), and ${\displaystyle \operatorname {Var} \left({\overline {X}}^{(m)}\right)={\frac {\sigma ^{2}}{m}}}$ (by similar arguments as in the example), ${\displaystyle \operatorname {Eff} \left({\overline {X}}^{(n)},{\overline {X}}^{(m)}\right)={\frac {\sigma ^{2}/m}{\sigma ^{2}/n}}={\frac {n}{m}}}$.

(b) Since ${\displaystyle {\frac {n}{m}}>1\Leftrightarrow n>m}$, the condition is ${\displaystyle n>m}$.

Remark.

• This shows that the sample mean with a larger sample size is relatively more efficient than the one with smaller sample size.

Proposition. ${\displaystyle \lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})=0}$ if and only if ${\displaystyle \lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})=0}$ and ${\displaystyle \lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0}$.

Proof.

• The "if" part is simple. Assume ${\displaystyle \lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})=0}$ and ${\displaystyle \lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0}$. Then, ${\displaystyle \lim _{n\to \infty }(\operatorname {Var} ({\hat {\theta }})+(\operatorname {Bias} ({\hat {\theta }}))^{2})=0\Rightarrow \lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})=0}$.
• The "only if" part can be shown by proving the contrapositive, i.e., that if ${\displaystyle \lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})\neq 0}$ or ${\displaystyle \lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})\neq 0}$, then ${\displaystyle \lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})\neq 0}$.
• Case 1: when ${\displaystyle \lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})\neq 0}$, it means ${\displaystyle \lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})>0}$ since the variance is nonnegative. Also, ${\displaystyle \lim _{n\to \infty }(\operatorname {Bias} ({\hat {\theta }}))^{2}\geq 0}$. It follows that ${\displaystyle \lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})>0}$, i.e., the MSE does not equal zero.
• Case 2: when ${\displaystyle \lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})\neq 0}$, it means ${\displaystyle \lim _{n\to \infty }(\operatorname {Bias} ({\hat {\theta }}))^{2}>0}$. Also, ${\displaystyle \lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})\geq 0}$. It follows that ${\displaystyle \lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})>0}$, i.e., the MSE does not equal zero.

${\displaystyle \Box }$

Remark.

• As a result, if we know that ${\displaystyle \lim _{n\to \infty }\operatorname {MSE} ({\hat {\theta }})=0}$, then ${\displaystyle \lim _{n\to \infty }\operatorname {Bias} ({\hat {\theta }})=0}$, i.e., ${\displaystyle {\hat {\theta }}}$ is an asymptotically unbiased estimator (and possibly even an unbiased one), and also ${\displaystyle \lim _{n\to \infty }\operatorname {Var} ({\hat {\theta }})=0}$.

#### Uniformly minimum-variance unbiased estimator

Now, we know that the smaller the variance of an unbiased estimator, the more efficient (and "better") it is. Thus, it is natural that we want to know what is the most efficient (i.e., the "best") unbiased estimator, i.e., the unbiased estimator with the smallest variance. We have a specific name for such unbiased estimator, namely uniformly minimum-variance unbiased estimator (UMVUE) [6]. To be more precise, we have the following definition for UMVUE:

Definition. (Uniformly minimum-variance unbiased estimator) The uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator with the smallest variance among all unbiased estimators.

Indeed, the UMVUE is unique, i.e., there is exactly one unbiased estimator with the smallest variance among all unbiased estimators, and we will prove this in the following.

Proposition. (Uniqueness of UMVUE) If ${\displaystyle W}$ is an UMVUE of a function of a parameter ${\displaystyle \tau (\theta )}$, then ${\displaystyle W}$ is unique.

Proof. Assume that ${\displaystyle W}$ is an UMVUE of ${\displaystyle \tau (\theta )}$, and ${\displaystyle W'}$ is another UMVUE of ${\displaystyle \tau (\theta )}$. Define the estimator ${\displaystyle W^{*}={\frac {1}{2}}(W+W')}$. Since ${\displaystyle \mathbb {E} [W^{*}]={\frac {1}{2}}(\mathbb {E} [W]+\mathbb {E} [W'])={\frac {1}{2}}{\big (}\tau (\theta )+\tau (\theta ){\big )}=\tau (\theta )}$, ${\displaystyle W^{*}}$ is an unbiased estimator of ${\displaystyle \tau (\theta )}$.

Now, we consider the variance of ${\displaystyle W^{*}}$. {\displaystyle {\begin{aligned}\operatorname {Var} (W^{*})&={\frac {1}{4}}\operatorname {Var} (W+W')\\&={\frac {1}{4}}\left[\operatorname {Var} (W)+\operatorname {Var} (W')+2\operatorname {Cov} (W,W')\right]\\&\leq {\frac {1}{4}}\operatorname {Var} (W)+{\frac {1}{4}}\operatorname {Var} (W')+{\frac {1}{2}}{\sqrt {\operatorname {Var} (W)\operatorname {Var} (W')}}&({\text{covariance inequality}})\\&={\frac {1}{4}}\operatorname {Var} (W)+{\frac {1}{4}}\operatorname {Var} (W)+{\frac {1}{2}}{\sqrt {(\operatorname {Var} (W))^{2}}}&(\operatorname {Var} (W)=\operatorname {Var} (W'){\text{ since }}W{\text{ and }}W'{\text{ are both UMVUE}})\\&={\frac {1}{2}}\operatorname {Var} (W)+{\frac {1}{2}}\operatorname {Var} (W)&(\operatorname {Var} (W)>0)\\&=\operatorname {Var} (W).\end{aligned}}} Thus, we now have either ${\displaystyle \operatorname {Var} (W^{*})<\operatorname {Var} (W)}$ or ${\displaystyle \operatorname {Var} (W^{*})=\operatorname {Var} (W)}$. If the former is true, then ${\displaystyle W}$ is not an UMVUE of ${\displaystyle \tau (\theta )}$ by definition, since we can find another unbiased estimator, namely ${\displaystyle W^{*}}$, with smaller variance than it. Hence, we must have the latter, i.e., ${\displaystyle \operatorname {Var} (W^{*})=\operatorname {Var} (W).}$ This implies when we apply the covariance inequality, the equality holds, i.e., ${\displaystyle \operatorname {Cov} (W,W')={\sqrt {\operatorname {Var} (W)\operatorname {Var} (W')}}\iff \rho (W',W)=1,}$ which means ${\displaystyle W'}$ is increasing linearly with ${\displaystyle W}$, i.e., we can write ${\displaystyle W'=aW+b}$ for some constants ${\displaystyle a>0}$ and ${\displaystyle b}$.

Now, we consider the covariance ${\displaystyle \operatorname {Cov} (W,W')}$. ${\displaystyle \operatorname {Cov} (W,W'){\overset {\text{ above }}{=}}\operatorname {Cov} (W,aW+b){\overset {\text{ properties }}{=}}a\operatorname {Cov} (W,W){\overset {\text{ property }}{=}}a\operatorname {Var} (W).}$ On the other hand, since the equality holds in the covariance inequality, and ${\displaystyle \operatorname {Var} (W)=\operatorname {Var} (W')}$ (since they are both UMVUE), ${\displaystyle \operatorname {Cov} (W,W')={\sqrt {\operatorname {Var} (W)\operatorname {Var} (W')}}={\sqrt {(\operatorname {Var} (W))^{2}}}=\operatorname {Var} (W).}$ Thus, we have ${\displaystyle a=1}$.

It remains to show that ${\displaystyle b=0}$ to prove that ${\displaystyle W=W'}$, and therefore conclude that ${\displaystyle W}$ is unique.

From the above, we have ${\displaystyle W'=W+b\implies \mathbb {E} [W']=\mathbb {E} [W]+b\implies \tau (\theta )=\tau (\theta )+b\implies b=0}$, as desired.

${\displaystyle \Box }$

Remark.

• Thus, when we are able to find an UMVUE, it is the unique one, and the variance of every other unbiased estimator is strictly greater than the variance of the UMVUE.
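The key step in the proof, that averaging two unbiased estimators with equal variance yields an unbiased estimator whose variance is no larger, can be checked numerically. The following is a minimal Monte Carlo sketch; the setup (estimating the mean of a normal distribution with the means of the two halves of the sample) is an illustrative choice of ours. Note that here ${\displaystyle W}$ and ${\displaystyle W'}$ are independent, so the averaging strictly reduces the variance; in the proof, where both are UMVUE, the conclusion is only that the variance cannot increase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: estimate the mean mu of N(mu, 1) with two unbiased
# estimators of equal variance,
#   W  = mean of the first  n/2 observations,
#   W' = mean of the second n/2 observations,
# and compare them with their average W* = (W + W')/2.
mu, n, reps = 3.0, 20, 200_000
samples = rng.normal(mu, 1.0, size=(reps, n))
w = samples[:, : n // 2].mean(axis=1)
w_prime = samples[:, n // 2 :].mean(axis=1)
w_star = (w + w_prime) / 2

# All three empirical means are close to mu (unbiasedness), while the
# average has a smaller empirical variance, matching Var(W*) <= Var(W).
print(w.mean(), w_prime.mean(), w_star.mean())
print(w.var(), w_star.var())
```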
##### Cramer-Rao lower bound

Without using some results, it is quite difficult to determine the UMVUE, since there are many (perhaps even infinitely many) possible unbiased estimators, so it is quite hard to verify that one particular unbiased estimator is more efficient than every other unbiased estimator.

Therefore, we will introduce some approaches that help us to find the UMVUE. For the first approach, we find a lower bound [7] on the variances of all possible unbiased estimators. After obtaining such a lower bound, if we can find an unbiased estimator whose variance is exactly equal to the lower bound, then the lower bound is the minimum value of the variances, and hence that unbiased estimator is an UMVUE by definition.

Remark.

• There are many possible lower bounds, but the greater the lower bound, the closer it is to the actual minimum value of the variances, and hence the "better" it is.
• An unbiased estimator can still be an UMVUE even if its variance does not achieve the lower bound.

A common way to find such a lower bound is to use the Cramer-Rao lower bound (CRLB), which we get through the Cramer-Rao inequality. Before stating the inequality, let us define some related terms.

Definition. (Fisher information) The Fisher information about a parameter ${\displaystyle \theta }$ with sample size ${\displaystyle n}$ is ${\displaystyle {\mathcal {I}}_{n}(\theta )=\mathbb {E} \left[\left({\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}\right)^{2}\right]}$ where ${\displaystyle \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )=\ln {\mathcal {L}}({\boldsymbol {\theta }};X_{1},\dotsc ,X_{n})}$ is the log-likelihood function (as a random variable).

Remark.

• ${\displaystyle {\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}}$ is called the score function, and is denoted by ${\displaystyle S(\theta ;\mathbf {X} )}$.
• The "${\displaystyle {\boldsymbol {\theta }}}$" may or may not be a parameter vector. If it is just a single parameter (usually the case here), then it is the same as "${\displaystyle \theta }$". We use "${\displaystyle {\boldsymbol {\theta }}}$" instead of "${\displaystyle \theta }$" to emphasize that the "${\displaystyle \theta }$" in ${\displaystyle {\mathcal {I}}_{n}(\theta )}$ and ${\displaystyle S(\theta ;\mathbf {X} )}$ is referring to the "${\displaystyle \theta }$" in "${\displaystyle {\frac {\partial }{\partial \theta }}}$".
• It is possible to define "Fisher information about a parameter vector", but in this case the Fisher information takes the form of a matrix instead of a single number, and it is called Fisher information matrix. However, since it is more complicated, we will not discuss it here.
• Since the expected value of the score function

${\displaystyle \mathbb {E} [S(\theta ;\mathbf {X} )]=\mathbb {E} \left[{\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}\right]=\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}\cdot {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}=\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\frac {\frac {\partial {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}{{\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}}\cdot {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}=\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\frac {\partial {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )}{\partial \theta }}\,dx_{n}\cdots \,dx_{1},}$

and under some regularity conditions which allow interchange of derivative and integral, this equals ${\displaystyle {\frac {\partial }{\partial \theta }}\int _{-\infty }^{\infty }\dotsi \int _{-\infty }^{\infty }{\mathcal {L}}({\boldsymbol {\theta }};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}={\frac {\partial }{\partial \theta }}(1)=0}$, the Fisher information about ${\displaystyle \theta }$ is also the variance of the score function, i.e., ${\displaystyle \operatorname {Var} (S(\theta ;\mathbf {X} ))=\operatorname {Var} \left({\frac {\partial \ln {\mathcal {L}}({\boldsymbol {\theta }};\mathbf {X} )}{\partial \theta }}\right)}$.

The regularity conditions which allow the interchange of derivative and integral include the following:

1. the partial derivatives involved should exist, i.e., the natural log of the functions involved is differentiable;
2. the integrals involved should be differentiable;
3. the support does not depend on the parameter(s) involved.
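To illustrate the definition and the remark above, let us compute the Fisher information with sample size one for the Bernoulli distribution (an illustrative example of ours). Suppose ${\displaystyle X_{1}\sim \operatorname {Ber} (p)}$ with pmf ${\displaystyle f(x;p)=p^{x}(1-p)^{1-x}}$ for ${\displaystyle x\in \{0,1\}}$; notice that the support ${\displaystyle \{0,1\}}$ does not depend on ${\displaystyle p}$. With ${\displaystyle n=1}$, the log-likelihood is ${\displaystyle \ln {\mathcal {L}}(p;X_{1})=X_{1}\ln p+(1-X_{1})\ln(1-p)}$, so the score function is ${\displaystyle S(p;X_{1})={\frac {X_{1}}{p}}-{\frac {1-X_{1}}{1-p}}={\frac {X_{1}-p}{p(1-p)}}.}$ Its mean is ${\displaystyle \mathbb {E} [S(p;X_{1})]={\frac {\mathbb {E} [X_{1}]-p}{p(1-p)}}=0}$, as expected, and hence ${\displaystyle {\mathcal {I}}_{1}(p)=\operatorname {Var} (S(p;X_{1}))={\frac {\operatorname {Var} (X_{1})}{p^{2}(1-p)^{2}}}={\frac {p(1-p)}{p^{2}(1-p)^{2}}}={\frac {1}{p(1-p)}}.}$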

We have some results that help us compute the Fisher information.

Proposition. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$ be a random sample from a distribution with pdf or pmf ${\displaystyle f}$. Also, let ${\displaystyle {\mathcal {I}}(\theta )=\mathbb {E} \left[\left({\frac {\partial \ln f(X;{\boldsymbol {\theta }})}{\partial \theta }}\right)^{2}\right]}$, the Fisher information about ${\displaystyle \theta }$ with sample size one. Then, under some regularity conditions which allow interchange of derivative and integral, ${\displaystyle {\mathcal {I}}_{n}(\theta )=n{\mathcal {I}}(\theta )}$.
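Before the proof, here is a quick Monte Carlo sanity check of this proposition (a sketch of ours), using a Bernoulli(${\displaystyle p}$) sample as an illustrative choice; for this distribution, ${\displaystyle {\mathcal {I}}(p)=1/(p(1-p))}$ is a standard fact.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of I_n(theta) = n * I(theta) for a Bernoulli(p) sample,
# where I(p) = 1/(p(1-p)).
p, n, reps = 0.3, 5, 400_000
x = rng.binomial(1, p, size=(reps, n))

# Score of the whole sample: d/dp ln L = sum_i (x_i/p - (1 - x_i)/(1 - p))
score = (x / p - (1 - x) / (1 - p)).sum(axis=1)

fisher_mc = (score ** 2).mean()     # Monte Carlo estimate of E[S^2] = I_n(p)
fisher_theory = n / (p * (1 - p))   # n * I(p)
print(fisher_mc, fisher_theory)     # the two values should be close
```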

Proof.