Usually, a random variable $X$ resulting from a random experiment is assumed to follow a certain distribution with an unknown (but fixed) parameter (vector) $\theta\in\mathbb{R}^k$ ($k$ is a positive integer, and its value depends on the distribution), taking values in a set $\Theta$, called the parameter space.
In the context of frequentist statistics (the context here), parameters are regarded as fixed.
On the other hand, in the context of Bayesian statistics, parameters are regarded as random variables.
For example, suppose the random variable $X$ is assumed to follow a normal distribution $\mathcal{N}(\mu,\sigma^2)$. Then, in this case, the parameter vector $\theta=(\mu,\sigma^2)$ is unknown, and the parameter space is $\Theta=\{(\mu,\sigma^2):-\infty<\mu<\infty,\ \sigma^2>0\}$.
It is often useful to estimate these unknown parameters in order to "understand" the random variable better.
We would like the estimation to be "good" enough, so that this understanding is more accurate.
Intuitively, the (realization of) random sample should be useful.
Indeed, the estimators introduced in this chapter are all based on the random sample in some sense, and this is what point estimates mean.
To be more precise, let us define point estimation and point estimates.
Point estimation is a process of using the value of a statistic to give a single value estimate (which can be interpreted as a point) of an unknown parameter.
Recall that statistics are functions of a random sample.
We call the unknown parameter a population parameter (since the underlying distribution corresponding to the parameter is called a population).
The statistic is called a point estimator, and its realization is called a point estimate.
The notation for a point estimator commonly carries a hat, e.g. $\hat{\theta}$.
Point estimation will be contrasted with interval estimation, which uses the value of a statistic to estimate an interval of plausible values of the unknown parameter.
Suppose $X_1,\dots,X_n$ is a random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$.
We may intuitively use the statistic $\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i$ to estimate $\mu$. Then $\bar{X}$ is called the point estimator, and its realization $\bar{x}$ is called the point estimate.
Alternatively, we may simply use the statistic $X_1$ (although it does not involve $X_2,\dots,X_n$, it can still be regarded as a function of $X_1,\dots,X_n$) to estimate $\mu$. That is, we use the value of the first observation from the normal distribution as the point estimate of the mean of the distribution! Intuitively, such an estimator seems quite "bad".
Such an estimator, which just takes one observation directly, is called a single observation estimator.
We will later discuss how to evaluate how "good" a point estimator is.
In the following, we will introduce two well-known point estimators, which are actually quite "good", namely the maximum likelihood estimator and the method of moments estimator.
As suggested by its name, the maximum likelihood estimator is the estimator that maximizes some kind of "likelihood".
Now, we would like to know which "likelihood" we should maximize to estimate the unknown parameter(s) (in a "good" way).
Also, as mentioned in the introduction section, the estimator is based on the random sample in some sense. Hence,
this "likelihood" should be also based on the random sample in some sense.
To motivate the definition of maximum likelihood estimator, consider the following example.
In a random experiment, a (fair or unfair) coin is tossed once. Let the random variable $X=1$ if head comes up, and $X=0$ otherwise.
Then, the pmf of $X$ is $f(x;p)=p^x(1-p)^{1-x}$ for $x\in\{0,1\}$, in which the unknown parameter $p$ represents
the probability that head comes up, and $0<p<1$.
Now, suppose you get a random sample $X_1,\dots,X_n$ by tossing that coin $n$ independent times (such a random sample is called an independent random sample, since the random variables involved are independent), and the corresponding realizations are $x_1,\dots,x_n$.
Then, the probability that $X_1=x_1,\dots,X_n=x_n$, i.e., that the random sample has exactly these realizations, is
$$f(x_1;p)\cdots f(x_n;p)=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}=p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}.$$
Remark on notation: You may observe that there is an additional "$;p$" in the pmf of $X$. Such notation means the pmf is taken with the parameter value $p$. It is included to emphasize the parameter value we are referring to.
In general, we write $f(x;\theta)$ for the pmf/pdf with the parameter value $\theta$ ($\theta$ may be a vector).
There are some alternative notations with the same meaning: $f_\theta(x)$ and $f(x\mid\theta)$.
Similarly, we have notations such as $P_\theta(A)$ and $P(A;\theta)$ to mean the probability that the event $A$ happens, with the parameter value $\theta$. (It is more common to use the first notation: $P_\theta(A)$.)
We also have similar notations for mean, variance, covariance, etc., like $E_\theta[X]$, $\operatorname{Var}_\theta(X)$ and $\operatorname{Cov}_\theta(X,Y)$.
Intuitively, with these particular realizations (fixed), we would like to find a value of $p$ that maximizes this probability, i.e., makes the realizations obtained the ones that are "most probable" or "with maximum likelihood".
Now, let us formally define the terms related to MLE.
Let $X_1,\dots,X_n$ be a random sample with a joint pmf or pdf $f(x_1,\dots,x_n;\theta)$, where the parameter (vector) $\theta\in\Theta$ ($\Theta$ is the parameter space).
Suppose $x_1,\dots,x_n$ are the corresponding realizations of the random sample $X_1,\dots,X_n$.
Then, the likelihood function, denoted by $L(\theta;x_1,\dots,x_n)$, is the function
$$\theta\mapsto f(x_1,\dots,x_n;\theta)$$
($\theta$ is a variable, and $x_1,\dots,x_n$ are fixed).
For simplicity, we may use the notation $L(\theta;\mathbf{x})$ instead of $L(\theta;x_1,\dots,x_n)$. Sometimes, we may also just write "$L(\theta)$" for convenience.
When we replace $x_1,\dots,x_n$ by $X_1,\dots,X_n$, the resulting "likelihood function" becomes a random variable, and we denote it by $L(\theta;X_1,\dots,X_n)$ or $L(\theta;\mathbf{X})$.
The likelihood function is in contrast with the joint pmf or pdf itself, where $\theta$ is fixed and $x_1,\dots,x_n$ are variables.
When the random sample comes from a discrete distribution, the value of the likelihood function is the probability $P_\theta(X_1=x_1,\dots,X_n=x_n)$ at the parameter vector $\theta$, that is, the probability of getting this specific realization exactly.
When the random sample comes from a continuous distribution, the value of the likelihood function is not a probability. Instead, it is only the value of the joint pdf at $(x_1,\dots,x_n)$ (which can be greater than one). However, the value can still be used to "reflect" the probability of getting "very close to" this specific realization, where the probability can be obtained by integrating the joint pdf over a "very small" region around $(x_1,\dots,x_n)$.
The natural logarithm of the likelihood function, $\ln L(\theta;\mathbf{x})$ (or $\ell(\theta;\mathbf{x})$ sometimes), is called the log-likelihood function.
Notice that the "expression" of the likelihood function is actually the same as that of the joint pdf, and just the inputs are different. So, one may still integrate/sum the likelihood function with respect to $x_1,\dots,x_n$ (which, in such a context, changes the likelihood function back to the joint pdf/pmf in some sense) as if it were the joint pdf/pmf to get probabilities.
(Maximum likelihood estimate)
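Given a likelihood function $L(\theta;x_1,\dots,x_n)$,
a maximum likelihood estimate of the parameter $\theta$ is a value $\hat{\theta}(x_1,\dots,x_n)$ at which $L(\theta;x_1,\dots,x_n)$ is maximized.
The maximum likelihood estimator (MLE) of $\theta$ is $\hat{\theta}(X_1,\dots,X_n)$ (obtained by replacing "$x_1,\dots,x_n$" in $\hat{\theta}(x_1,\dots,x_n)$ by "$X_1,\dots,X_n$").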
In some other places, the abbreviation MLE can also mean maximum likelihood estimate depending on the context. However, we will just use the abbreviation MLE when we are talking about maximum likelihood estimator here.
Since $\frac{d}{dx}\ln x=\frac{1}{x}>0$ for every $x>0$ (the domain of the natural logarithm function is the set of all positive real numbers), the natural logarithm function is strictly increasing, i.e., the output is larger when the input is larger. Thus, when we find a value at which $\ln L(\theta)$ is maximized, $L(\theta)$ is also maximized at the same value.
Now, let us find the MLE of the unknown parameter in the previous coin flipping example.
(Motivating example revisited)
Recall that we use a coin flipping example to motivate maximum likelihood estimation.
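$X$ follows the Bernoulli distribution with success probability $p$. The pmf of $X$ is $f(x;p)=p^x(1-p)^{1-x}$ for $x\in\{0,1\}$. $X_1,\dots,X_n$ is a random sample from the distribution.
The likelihood function is the joint pmf of $X_1,\dots,X_n$,
$$L(p;\mathbf{x})=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}=p^{\sum x_i}(1-p)^{n-\sum x_i}.$$
The log-likelihood function is thus
$$\ln L(p;\mathbf{x})=\Big(\sum_{i=1}^n x_i\Big)\ln p+\Big(n-\sum_{i=1}^n x_i\Big)\ln(1-p).$$
To find the maximum of the log-likelihood function, we may use the derivative test from calculus. Differentiating $\ln L(p;\mathbf{x})$ with respect to $p$ gives
$$\frac{d\ln L(p;\mathbf{x})}{dp}=\frac{\sum x_i}{p}-\frac{n-\sum x_i}{1-p}=\frac{\sum x_i-np}{p(1-p)}.$$
To find critical point(s) of $\ln L(p;\mathbf{x})$, we set $\frac{d\ln L(p;\mathbf{x})}{dp}=0$ (we have $0<p<1$):
$$\sum x_i-np=0\iff p=\frac{1}{n}\sum_{i=1}^n x_i=\bar{x}.$$
To verify that $\ln L(p;\mathbf{x})$ actually attains its maximum (instead of its minimum) at $p=\bar{x}$, we need to perform a derivative test. In this case, we use the first derivative test.
We can see that when $p<\bar{x}$, we have $np<\sum x_i$, which makes the numerator $\sum x_i-np>0$, and thus $\frac{d\ln L(p;\mathbf{x})}{dp}>0$. On the other hand, when $p>\bar{x}$, this makes $\sum x_i-np<0$, and thus $\frac{d\ln L(p;\mathbf{x})}{dp}<0$. As a result, we can conclude that $\ln L(p;\mathbf{x})$ attains its maximum at $p=\bar{x}$. It follows that the MLE of $p$ is $\bar{X}$ (not $\bar{x}$, which is instead the maximum likelihood estimate!)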
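Use the second derivative test to verify that $\ln L(p;\mathbf{x})$ attains its maximum at $p=\bar{x}$.
We have $\frac{d^2\ln L(p;\mathbf{x})}{dp^2}=-\frac{\sum x_i}{p^2}-\frac{n-\sum x_i}{(1-p)^2}$, which at $p=\bar{x}$ (assuming $0<\bar{x}<1$) equals $-\frac{n}{\bar{x}(1-\bar{x})}$, in which the numerator $-n$ is negative and the denominator is positive. Thus, $\frac{d^2\ln L(p;\mathbf{x})}{dp^2}\Big|_{p=\bar{x}}<0$. By the second derivative test, this means $\ln L(p;\mathbf{x})$ attains its maximum at $p=\bar{x}$.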
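The calculus argument above can be checked numerically. The following is only a minimal sketch, assuming the Bernoulli log-likelihood $\ln L(p)=(\sum x_i)\ln p+(n-\sum x_i)\ln(1-p)$ and a made-up set of ten realizations; the function name `bernoulli_log_likelihood` and the grid of candidate values are illustrative choices, not part of the text.

```python
import math

def bernoulli_log_likelihood(p, xs):
    """ln L(p; x) = (sum x_i) ln p + (n - sum x_i) ln(1 - p), for 0 < p < 1."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Hypothetical realizations of ten tosses (1 = head, 0 = tail).
xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
x_bar = sum(xs) / len(xs)

# Evaluate ln L on a grid strictly inside (0, 1) and keep the maximizer.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))

print("sample mean:", x_bar)
print("grid maximizer of ln L:", p_grid)
```

For these data the grid maximizer coincides with the sample mean, in line with the derivation. A grid search is used here only because it needs no extra libraries; it is not how one would maximize a likelihood in general.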
Sometimes, a constraint is imposed on the parameter when we are finding its MLE. The MLE of the parameter in this case is called a restricted MLE. We will illustrate this in the following example.
Continue from the previous coin flipping example. Suppose we have a constraint on where .
Find the MLE of in this case.
The steps for deriving the likelihood function and the log-likelihood function are the same in this case.
Without the restriction, the MLE of is . Now, with the restriction, the MLE of is only when (we always have since ).
If (and thus ), even though is maximized at , we cannot set the MLE to be due to the restriction on : .
Under this case, this means when (we have when from previous example), i.e., is strictly increasing when . Thus, is maximized when with the restriction. As a result, the MLE of is (the MLE can be a constant, which can still be regarded as a function of ).
Therefore, the MLE of can be written as a case defined function: ,
or it can be written as
Find the MLE of when .
When , we cannot set the MLE to be due to the restriction. In this case, we know that when , i.e., is strictly decreasing when . Thus, is maximized at , and so the MLE of is .
When , we can set the MLE to be at which is maximized, and so is the MLE of in this case.
Therefore, the MLE of is .
To find the MLE, we sometimes use methods other than derivative test, and we do not need to find the log-likelihood function. Let us illustrate this in the following example.
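Let $X_1,\dots,X_n$ be a random sample from the uniform distribution $\mathcal{U}[0,\beta]$. Find the MLE of $\beta$.
The pdf of the uniform distribution $\mathcal{U}[0,\beta]$ is $f(x;\beta)=\frac{1}{\beta}\mathbf{1}\{0\le x\le\beta\}$.
Thus, the likelihood function is
$$L(\beta;\mathbf{x})=\prod_{i=1}^n\frac{1}{\beta}\mathbf{1}\{0\le x_i\le\beta\}=\frac{1}{\beta^n}\prod_{i=1}^n\mathbf{1}\{0\le x_i\le\beta\}.$$
In order for $L(\beta;\mathbf{x})$ to attain its maximum, first, we need to ensure that $0\le x_i\le\beta$ for each $i\in\{1,\dots,n\}$, so that the product of the indicator functions in the likelihood function is nonzero (the value is actually one in this case).
Apart from that, since $\frac{1}{\beta^n}$ is a strictly decreasing function of $\beta$ (because $\frac{d}{d\beta}\beta^{-n}=-n\beta^{-n-1}<0$ (we have $\beta>0$)), we should pick a $\beta$ that is as small as possible so that $\frac{1}{\beta^n}$, and hence $L(\beta;\mathbf{x})$, is as large as possible.
As a result, we should choose a $\beta$ that is as small as possible, subject to the constraint that $0\le x_i\le\beta$ for each $i$, which means that $\beta\ge x_i$ (it is always the case that $x_i\ge 0$, regardless of the choice of $\beta$) for each $i$.
It follows that $L(\beta;\mathbf{x})$ attains its maximum when $\beta$ is the maximum of $x_1,\dots,x_n$. Hence, the MLE of $\beta$ is $\max\{X_1,\dots,X_n\}$.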
Show that the MLE of $\beta$ does not exist if the uniform distribution becomes $\mathcal{U}[0,\beta)$.
In this case, the constraint from the indicator functions becomes $0\le x_i<\beta$ for each $i$.
With a similar argument, for the MLE of $\beta$, we should choose a $\beta$ that is as small as possible subject to this constraint, which means $\beta>x_i$ for each $i$.
However, in this case, we cannot set $\beta$ to be the maximum of $x_1,\dots,x_n$, or else the constraint will not be satisfied and the likelihood function becomes zero due to the indicator function.
Instead, we should set $\beta$ to be slightly greater than the maximum of $x_1,\dots,x_n$, so that the constraint can still be satisfied and $\beta$ is quite small. However, for each such $\beta$, we can always choose a smaller $\beta$ that still satisfies the constraint. For example, for each $\beta>\max\{x_1,\dots,x_n\}$, the smaller value $\beta'$ can be selected as $\beta'=\frac{1}{2}\left(\beta+\max\{x_1,\dots,x_n\}\right)$. Hence, we cannot find a minimum value of $\beta$ subject to this constraint. Thus, there is no maximum point for $L(\beta;\mathbf{x})$, and hence the MLE does not exist.
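The behaviour of the likelihood for the closed-interval uniform distribution on $[0,\beta]$ can be illustrated with a few evaluations. This is a rough sketch with made-up realizations; `uniform_likelihood` is a hypothetical helper that returns $\beta^{-n}$ when every $x_i$ lies in $[0,\beta]$ and zero otherwise.

```python
def uniform_likelihood(beta, xs):
    """Likelihood for U[0, beta]: beta**(-n) if all 0 <= x_i <= beta, else 0."""
    if beta <= 0 or any(x < 0 or x > beta for x in xs):
        return 0.0
    return beta ** (-len(xs))

xs = [0.8, 2.5, 1.7, 3.1, 0.4]   # hypothetical realizations
beta_hat = max(xs)               # candidate maximum likelihood estimate

# The likelihood is zero below max(xs) and strictly decreasing above it.
for beta in (3.0, beta_hat, 3.2, 4.0):
    print(beta, uniform_likelihood(beta, xs))
```

The printed values are zero for the candidate below the sample maximum, largest at the sample maximum itself, and smaller for every larger candidate, which is the shape used in the argument above.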
In the following example, we will find the MLE of a parameter vector.
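Let $X_1,\dots,X_n$ be a random sample from the normal distribution with mean $\mu$ and variance $\sigma^2$, $\mathcal{N}(\mu,\sigma^2)$.
Find the MLE of $(\mu,\sigma^2)$.
Let $\theta=(\mu,\sigma^2)$. The likelihood function is
$$L(\theta;\mathbf{x})=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)=(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right),$$
and hence the log-likelihood function is
$$\ln L(\theta;\mathbf{x})=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2.$$
Since this function is multivariate, we may use the second partial derivative test from multivariable calculus to find maximum point(s).
However, in this case, we actually do not need to use such a test.
Instead, we fix the variables one by one to make the function univariate, so that we can use the derivative test for univariate functions to find the maximum point (with the other variable fixed).
Since $\frac{\partial\ln L}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu)=0\iff\mu=\bar{x}$, which does not depend on $\sigma^2$ (this is important for us to use this kind of method), and
$\frac{\partial^2\ln L}{\partial\mu^2}=-\frac{n}{\sigma^2}<0$, by the second derivative test (for univariate functions), $\ln L$ is maximized at $\mu=\bar{x}$, given any fixed $\sigma^2$.
On the other hand, since
$$\frac{\partial\ln L}{\partial\sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^n(x_i-\mu)^2=0\iff\sigma^2=\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2,$$
and at this (only) critical point
$$\frac{\partial^2\ln L}{\partial(\sigma^2)^2}=\frac{n}{2\sigma^4}-\frac{1}{\sigma^6}\sum_{i=1}^n(x_i-\mu)^2=\frac{n}{2\sigma^4}-\frac{n}{\sigma^4}=-\frac{n}{2\sigma^4}<0.$$
Thus, by the second derivative test, $\ln L$ is maximized at $\sigma^2=\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2$, given any fixed $\mu$.
So, now we fix $\mu=\bar{x}$, and thus we have that $\ln L(\bar{x},\sigma^2)$ is maximized at $\sigma^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2=\frac{(n-1)s^2}{n}$, where $s^2$ is the realization of the sample variance $S^2$.
Now, fix $\sigma^2$ to be $\frac{(n-1)s^2}{n}$, and we know that $\ln L$ attains its maximum at $\mu=\bar{x}$ for each fixed $\sigma^2$, including this fixed $\sigma^2$.
As a result, $\ln L(\mu,\sigma^2)$ is maximized at $(\mu,\sigma^2)=\left(\bar{x},\frac{(n-1)s^2}{n}\right)$.
Hence, the MLE of $(\mu,\sigma^2)$ is $\left(\bar{X},\frac{(n-1)S^2}{n}\right)$.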
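(a) Calculate the determinant of the Hessian matrix of $\ln L$ at $(\mu,\sigma^2)=(\bar{x},\hat{\sigma}^2)$, where $\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$, which can be expressed as $\frac{\partial^2\ln L}{\partial\mu^2}\cdot\frac{\partial^2\ln L}{\partial(\sigma^2)^2}-\left(\frac{\partial^2\ln L}{\partial\mu\,\partial\sigma^2}\right)^2$.
(b) Hence, verify that $(\bar{x},\hat{\sigma}^2)$ is the maximum point of $\ln L$ using the second partial derivative test.
(a) At $(\bar{x},\hat{\sigma}^2)$, we have $\frac{\partial^2\ln L}{\partial\mu^2}=-\frac{n}{\hat{\sigma}^2}$, $\frac{\partial^2\ln L}{\partial(\sigma^2)^2}=\frac{n}{2\hat{\sigma}^4}-\frac{n\hat{\sigma}^2}{\hat{\sigma}^6}=-\frac{n}{2\hat{\sigma}^4}$, and $\frac{\partial^2\ln L}{\partial\mu\,\partial\sigma^2}=-\frac{1}{\hat{\sigma}^4}\sum_{i=1}^n(x_i-\bar{x})=0$. As a result, the determinant of the Hessian matrix is $\left(-\frac{n}{\hat{\sigma}^2}\right)\left(-\frac{n}{2\hat{\sigma}^4}\right)-0^2=\frac{n^2}{2\hat{\sigma}^6}$.
(b) From (a), the determinant of the Hessian matrix is positive. Also, $\frac{\partial^2\ln L}{\partial\mu^2}=-\frac{n}{\hat{\sigma}^2}<0$. Thus, by the second partial derivative test, $\ln L$ attains its maximum at $(\bar{x},\hat{\sigma}^2)$.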
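As a sanity check on the normal example, the sketch below computes $\bar{x}$ and $\frac{1}{n}\sum(x_i-\bar{x})^2$ for a small made-up data set and compares the log-likelihood there with its value at a few nearby points. The data and the helper name `normal_log_likelihood` are assumptions for illustration; `statistics.pvariance` is the standard-library population variance (divisor $n$) and `statistics.variance` is the sample variance (divisor $n-1$).

```python
import math
import statistics

def normal_log_likelihood(mu, sigma2, xs):
    """ln L = -(n/2) ln(2 pi sigma^2) - sum (x_i - mu)^2 / (2 sigma^2)."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

xs = [4.9, 5.6, 4.4, 6.1, 5.0, 5.3]      # hypothetical realizations
n = len(xs)

mu_hat = statistics.fmean(xs)            # x_bar
sigma2_hat = statistics.pvariance(xs)    # (1/n) * sum (x_i - x_bar)^2
s2 = statistics.variance(xs)             # sample variance, divisor n - 1

best = normal_log_likelihood(mu_hat, sigma2_hat, xs)
# Log-likelihood at points around (mu_hat, sigma2_hat).
nearby = [
    normal_log_likelihood(mu_hat + d_mu, sigma2_hat * factor, xs)
    for d_mu in (-0.1, 0.0, 0.1)
    for factor in (0.9, 1.0, 1.1)
]

print(mu_hat, sigma2_hat, (n - 1) * s2 / n)
print(best, max(nearby))
```

None of the nearby points gives a larger log-likelihood, and the variance estimate agrees with $(n-1)s^2/n$. This checks only a handful of points, so it illustrates the result rather than proving it.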
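Let $X_1,\dots,X_n$ be a random sample from the exponential distribution with rate parameter $\lambda$, with pdf $f(x;\lambda)=\lambda e^{-\lambda x}$ for $x>0$, where $\lambda>0$.
Show that the MLE of $\lambda$ is $1/\bar{X}$.
The likelihood function is $L(\lambda;\mathbf{x})=\prod_{i=1}^n\lambda e^{-\lambda x_i}=\lambda^n e^{-\lambda\sum_{i=1}^n x_i}$.
Thus, the log-likelihood function is $\ln L(\lambda;\mathbf{x})=n\ln\lambda-\lambda\sum_{i=1}^n x_i$.
Differentiating the log-likelihood function with respect to $\lambda$ gives $\frac{d\ln L(\lambda;\mathbf{x})}{d\lambda}=\frac{n}{\lambda}-\sum_{i=1}^n x_i$.
Setting the derivative to be zero, we get
$$\frac{n}{\lambda}=\sum_{i=1}^n x_i\iff\lambda=\frac{n}{\sum_{i=1}^n x_i}=\frac{1}{\bar{x}}.$$
It remains to verify that $\ln L(\lambda;\mathbf{x})$ attains its maximum at $\lambda=1/\bar{x}$.
Since $\frac{d^2\ln L(\lambda;\mathbf{x})}{d\lambda^2}=-\frac{n}{\lambda^2}<0$, this is verified.
Hence, the MLE of $\lambda$ is $1/\bar{X}$.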
(Application of maximum likelihood estimation)
Suppose you are given a box which contains four balls, with unknown number of red and black balls.
Now, you draw three balls out of the box, and find out that you get two red balls and one black ball.
Using maximum likelihood estimation, estimate the number of red and black balls inside the box.
Given the color of the balls drawn, we know that the box contains at least two red balls and at least one black ball.
This means the box contains either two red balls or three red balls.
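Let $\theta$ be the number of red balls inside the box. Then, the number of black balls inside the box is $4-\theta$.
The possible values of the parameter $\theta$ are 2 and 3.
Now, we compare the probability of getting such a result from drawing three balls when $\theta=2$ and $\theta=3$.
For $\theta=2$, the probability is $\binom{2}{2}\binom{2}{1}\big/\binom{4}{3}=\frac{2}{4}=\frac{1}{2}$ (consider the pmf of the hypergeometric distribution).
For $\theta=3$, the probability is $\binom{3}{2}\binom{1}{1}\big/\binom{4}{3}=\frac{3}{4}$.
Hence, since $\frac{3}{4}>\frac{1}{2}$, the maximum likelihood estimate of $\theta$ is 3.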
Thus, the estimated number of the red balls is 3, and that of the black balls is 1.
Suppose the box now contains 100 balls, with unknown number of red and black balls.
Now, you draw 99 balls out of the box, and find out that you get 98 red balls and one black ball.
Using maximum likelihood estimation, estimate the number of red and black balls inside the box.
Similarly, the box contains at least 98 red balls and one black ball.
We use the same notation as in the above example.
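Then, the number of black balls is $100-\theta$, where $\theta$ is the number of red balls inside the box, and the possible values of the parameter $\theta$ are 98 and 99.
For $\theta=98$, the probability is $\binom{98}{98}\binom{2}{1}\big/\binom{100}{99}=\frac{2}{100}=0.02$.
For $\theta=99$, the probability is $\binom{99}{98}\binom{1}{1}\big/\binom{100}{99}=\frac{99}{100}=0.99$.
Thus, the maximum likelihood estimate of $\theta$ is 99, so the estimated number of red balls is 99 and that of black balls is 1.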
The difference of the probabilities between two possible values of becomes much larger in this case.
Intuitively, when you have such draw result, you will think that it is quite unlikely that the box has two black balls inside, i.e., the ball that is not drawn is actually black, and somehow you draw all red balls out, but not the black ball.
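Both box examples amount to comparing hypergeometric probabilities over the admissible numbers of red balls. The sketch below does this with the standard-library `math.comb`; the function names `draw_probability` and `mle_red` are made up for illustration.

```python
from math import comb

def draw_probability(red_in_box, total, drawn, red_drawn):
    """Hypergeometric probability of drawing `red_drawn` red balls in `drawn` draws."""
    black_in_box = total - red_in_box
    black_drawn = drawn - red_drawn
    return (comb(red_in_box, red_drawn) * comb(black_in_box, black_drawn)
            / comb(total, drawn))

def mle_red(total, drawn, red_drawn):
    """Number of red balls in the box that maximizes the probability of the draw."""
    black_drawn = drawn - red_drawn
    # The box holds at least red_drawn red balls and at least black_drawn black balls.
    candidates = range(red_drawn, total - black_drawn + 1)
    return max(candidates, key=lambda r: draw_probability(r, total, drawn, red_drawn))

print(draw_probability(2, 4, 3, 2), draw_probability(3, 4, 3, 2))          # 0.5 0.75
print(mle_red(4, 3, 2))                                                    # 3
print(draw_probability(98, 100, 99, 98), draw_probability(99, 100, 99, 98))  # 0.02 0.99
print(mle_red(100, 99, 98))                                                # 99
```

Because the parameter takes only finitely many values here, the maximum is found by direct comparison rather than by a derivative test.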
For maximum likelihood estimation, we need to utilize the likelihood function, which is found from the joint pmf or pdf of the random sample from a distribution.
However, we may not know exactly the pmf or pdf of the distribution in practice. Instead, we may just know some information about the distribution,
e.g. mean, variance, and some moments (the $r$th moment of a random variable $X$ is $E[X^r]$; we denote it by $\mu_r$ for simplicity).
Such moments often contain information about the unknown parameter.
For example, for a normal distribution $\mathcal{N}(\mu,\sigma^2)$, we know that $E[X]=\mu$ and $E[X^2]=\operatorname{Var}(X)+(E[X])^2=\sigma^2+\mu^2$.
Because of this, when we want to estimate the parameters, we can do this through estimating the moments.
Now, we would like to know how to estimate the moments.
We let $m_r=\frac{1}{n}\sum_{i=1}^n X_i^r$ be the $r$th sample moment, where the $X_i$'s are independent and identically distributed, and write $\mu_r=E[X^r]$ for the $r$th moment.
By the weak law of large numbers (assuming the conditions are satisfied), we have
$$m_1=\bar{X}\xrightarrow{p}\mu_1\quad\text{and}\quad m_2=\frac{1}{n}\sum_{i=1}^n X_i^2\xrightarrow{p}\mu_2$$
(the second result can be seen by replacing the "$X$" by "$X^2$" in the weak law of large numbers: the conditions are still satisfied, and so we can still apply the weak law of large numbers).
In general, we have $m_r\xrightarrow{p}\mu_r$, since the conditions are still satisfied after replacing the "$X$" by "$X^r$" in the weak law of large numbers.
Because of these results, we can estimate the $r$-th moment $\mu_r$ using the $r$-th sample moment $m_r$, and the estimation is "better" when $n$ is large.
For example, in the above normal distribution example, we can estimate $\mu$ by $m_1$ and $\sigma^2$ by $m_2-m_1^2$,
and these estimators are actually called the method of moments estimators.
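To be more precise, we have the following definition of the method of moments:
(Method of moments)
Let $X_1,\dots,X_n$ be a random sample from a distribution with pdf or pmf $f(x;\theta_1,\dots,\theta_k)$.
Write $k$ moment(s), e.g. $\mu_1,\dots,\mu_k$ (where $\mu_r=E[X^r]$), as function(s) of $\theta_1,\dots,\theta_k$: $g_1(\theta_1,\dots,\theta_k),\dots,g_k(\theta_1,\dots,\theta_k)$ respectively.
Then, the method of moments estimator (MME) of $\theta_1,\dots,\theta_k$, respectively, is given by the solution (in the form of $\theta_1,\dots,\theta_k$ in terms of the sample moments $m_1,\dots,m_k$, corresponding to the moments $\mu_1,\dots,\mu_k$, where $m_r=\frac{1}{n}\sum_{i=1}^n X_i^r$) to the following system of equations:
$$m_j=g_j(\theta_1,\dots,\theta_k),\qquad j=1,\dots,k.$$
When there are $k$ unknown parameters, we need to solve a system of $k$ equations, involving $k$ sample moments.
Usually, we select the first $k$ moments for the $k$ moments, as in the definition. But this is not necessary, and we may choose other moments, including fractional moments (e.g. $\mu_{1/2}=E[X^{1/2}]$, and we use $m_{1/2}=\frac{1}{n}\sum_{i=1}^n X_i^{1/2}$ in this case).
Because of this, the method of moments estimator is not unique.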
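Let $X_1,\dots,X_n$ be a random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$. Find the MME of $\mu$ and $\sigma^2$.
First, there are two unknown parameters. Thus, we need to solve a system of 2 equations, involving 2 sample moments and 2 moments.
Since $E[X]=\mu$ and $E[X^2]=\sigma^2+\mu^2$, consider the following system of equations, where $m_r=\frac{1}{n}\sum_{i=1}^n X_i^r$:
$$m_1=\mu\quad(1),\qquad m_2=\sigma^2+\mu^2\quad(2).$$
Substituting (1) into (2), we get $\sigma^2=m_2-m_1^2$.
Hence, the MME of $\mu$ is $m_1=\bar{X}$ and the MME of $\sigma^2$ is $m_2-m_1^2=\frac{1}{n}\sum_{i=1}^n X_i^2-\bar{X}^2$.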
We can see that the process of finding the MME of and is much easier than finding the MLE of and . This is because the expression of the first and second moment in terms of parameters is simple in this case. However, when the expression is more complicated, finding the MME of the parameters can be quite complicated.
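The two-equation system for the normal distribution is simple enough to evaluate directly. Here is a small sketch, assuming the first two sample moments $m_1=\frac{1}{n}\sum x_i$ and $m_2=\frac{1}{n}\sum x_i^2$ and a made-up data set; `normal_mme` is a hypothetical helper name.

```python
import statistics

def normal_mme(xs):
    """Method of moments estimates (mu, sigma^2) = (m1, m2 - m1^2)."""
    n = len(xs)
    m1 = sum(xs) / n                    # first sample moment
    m2 = sum(x * x for x in xs) / n     # second sample moment
    return m1, m2 - m1 ** 2

xs = [4.9, 5.6, 4.4, 6.1, 5.0, 5.3]    # hypothetical realizations
mu_mme, sigma2_mme = normal_mme(xs)

print(mu_mme, sigma2_mme)
print(statistics.fmean(xs), statistics.pvariance(xs))
```

The two printed lines agree up to rounding, since $m_2-m_1^2=\frac{1}{n}\sum(x_i-\bar{x})^2$. So for the normal distribution these method of moments estimates coincide numerically with the maximum likelihood estimates.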
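Let $X_1,\dots,X_n$ be a random sample from the exponential distribution with rate parameter $\lambda$. Find the MME of $\lambda$ and compare it to the MLE of $\lambda$.
Since $E[X]=\frac{1}{\lambda}$, consider the following equation:
$$m_1=\bar{X}=\frac{1}{\lambda}.$$
We then have $\lambda=1/\bar{X}$. Hence, the MME of $\lambda$ is $1/\bar{X}$,
which is the same as the MLE of $\lambda$.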
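A quick simulation can illustrate that $1/\bar{x}$ recovers the rate of an exponential distribution when the sample is large. This is only a sketch: the true rate, the sample size and the seed are arbitrary assumptions, and `random.expovariate` from the standard library is used to generate the sample.

```python
import random

rng = random.Random(0)      # fixed seed so the run is reproducible
true_rate = 2.0             # hypothetical true value of lambda
n = 20000                   # hypothetical sample size

xs = [rng.expovariate(true_rate) for _ in range(n)]
x_bar = sum(xs) / n
rate_hat = 1 / x_bar        # estimate of lambda from the sample mean

print("estimate of lambda:", rate_hat)
```

With a sample this large, the estimate typically lands close to the true rate, though any single run still carries sampling error.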
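Let $X_1,\dots,X_n$ be a random sample from the uniform distribution $\mathcal{U}[a,b]$. Show that the MMEs of $a$ and $b$ are $m_1-\sqrt{3(m_2-m_1^2)}$ and $m_1+\sqrt{3(m_2-m_1^2)}$ respectively, where $m_r=\frac{1}{n}\sum_{i=1}^n X_i^r$.
Since $E[X]=\frac{a+b}{2}$ and $E[X^2]=\operatorname{Var}(X)+(E[X])^2=\frac{(b-a)^2}{12}+\frac{(a+b)^2}{4}=\frac{a^2+ab+b^2}{3}$,
consider the following system of equations:
$$m_1=\frac{a+b}{2}\quad(1),\qquad m_2=\frac{a^2+ab+b^2}{3}\quad(2).$$
From (1), we have $b=2m_1-a$. Substituting it into (2), we have
$$3m_2=a^2+a(2m_1-a)+(2m_1-a)^2=a^2-2m_1a+4m_1^2\iff a^2-2m_1a+4m_1^2-3m_2=0.$$
Solving this equation by the quadratic formula, we get $a=m_1\pm\sqrt{3(m_2-m_1^2)}$.
When $a=m_1+\sqrt{3(m_2-m_1^2)}$, $b=2m_1-a=m_1-\sqrt{3(m_2-m_1^2)}$. However, from the definition of the uniform distribution, we need to have $a<b$, and thus this case is rejected.
When $a=m_1-\sqrt{3(m_2-m_1^2)}$, $b=2m_1-a=m_1+\sqrt{3(m_2-m_1^2)}$, which satisfies the definition of the uniform distribution.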
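The closed-form solution for a uniform distribution on an interval $[a,b]$, namely $m_1\mp\sqrt{3(m_2-m_1^2)}$, can be checked by plugging the estimates back into the two moment equations. The data below are made up, and `uniform_mme` is a hypothetical helper name.

```python
import math

def uniform_mme(xs):
    """Method of moments estimates (a, b) = m1 -/+ sqrt(3 * (m2 - m1^2))."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    half_width = math.sqrt(3 * (m2 - m1 ** 2))
    return m1 - half_width, m1 + half_width

xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical realizations
a_hat, b_hat = uniform_mme(xs)

# Plug the estimates back into the two moment equations.
first_moment = (a_hat + b_hat) / 2
second_moment = (a_hat ** 2 + a_hat * b_hat + b_hat ** 2) / 3
print(a_hat, b_hat)
print(first_moment, second_moment)      # should reproduce m1 = 3 and m2 = 11
```

Note that, for these data, the estimated interval does not contain every observation (the estimate of $a$ is above the smallest value is not the case here, but nothing in the method prevents it in general), which is one way the MME can differ from the MLE for uniform distributions.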
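For $\hat{\theta}$ to be a "good" estimator of a parameter $\theta$, a desirable property of $\hat{\theta}$ is that its expected value equals the value of the parameter $\theta$, or is at least close to that value.
Because of this, we introduce a value, namely bias, to measure how close the mean of $\hat{\theta}$ is to $\theta$.
The bias of an estimator $\hat{\theta}$ is
$$\operatorname{Bias}(\hat{\theta})=E[\hat{\theta}]-\theta.$$
We will also define some terms related to bias.
An estimator $\hat{\theta}$ is an unbiased estimator of a parameter $\theta$ if $\operatorname{Bias}(\hat{\theta})=0$.
Otherwise, the estimator is called a biased estimator.
(Asymptotically unbiased estimator)
An estimator $\hat{\theta}$ is an asymptotically unbiased estimator of a parameter $\theta$ if
$$\lim_{n\to\infty}\operatorname{Bias}(\hat{\theta})=0,$$
where $n$ is the sample size.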
An unbiased estimator must be an asymptotically unbiased estimator, but the converse is not true, i.e., an asymptotically unbiased estimator may not be an unbiased estimator. Thus, a biased estimator may be an asymptotically unbiased estimator.
When we discuss the goodness of estimators in terms of unbiasedness, an unbiased estimator is better than an asymptotically unbiased estimator, which in turn is better than a biased estimator that is not even asymptotically unbiased.
However, there are also other criteria for evaluating the goodness of estimators apart from unbiasedness, so when we also account for other criteria, a biased estimator may be somehow "better" than an unbiased estimator overall.
Let $X_1,\dots,X_n$ be a random sample from the Bernoulli distribution with success probability $p$. Show that the MLE of $p$, $\bar{X}$, is an unbiased estimator of $p$.
Since $E[\bar{X}]=\frac{1}{n}\sum_{i=1}^n E[X_i]=\frac{1}{n}(np)=p$, the result follows.
Suppose the Bernoulli distribution is replaced by the binomial distribution with $m$ trials ($m\ge 2$) and success probability $p$. Show that $\bar{X}$ is a biased estimator of $p$. Modify this estimator such that it is an unbiased estimator of $p$.
Since $E[\bar{X}]=\frac{1}{n}\sum_{i=1}^n E[X_i]=mp\ne p$, $\bar{X}$ is a biased estimator of $p$.
We can modify this estimator to $\bar{X}/m$, and then its mean is $mp/m=p$. Alternatively, we may choose the estimator to be $X_i/m$ (for any fixed $i\in\{1,\dots,n\}$), whose mean is also $p$ (other estimators whose mean is $p$ are also fine).
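Let $X_1,\dots,X_n$ be a random sample from the normal distribution $\mathcal{N}(\mu,\sigma^2)$. Show that the MLE of $\mu$, $\bar{X}$, is an unbiased estimator of $\mu$, and the MLE of $\sigma^2$, $\frac{(n-1)S^2}{n}$, is an asymptotically unbiased estimator of $\sigma^2$.
First, since $E[\bar{X}]=\frac{1}{n}\sum_{i=1}^n E[X_i]=\mu$, $\bar{X}$ is an unbiased estimator of $\mu$.
On the other hand, since $E[S^2]=\sigma^2$,
$$E\left[\frac{(n-1)S^2}{n}\right]=\frac{n-1}{n}E[S^2]=\frac{(n-1)\sigma^2}{n},\quad\text{so}\quad\operatorname{Bias}\left(\frac{(n-1)S^2}{n}\right)=\frac{(n-1)\sigma^2}{n}-\sigma^2=-\frac{\sigma^2}{n}.$$
Thus, $\lim_{n\to\infty}\operatorname{Bias}\left(\frac{(n-1)S^2}{n}\right)=\lim_{n\to\infty}\left(-\frac{\sigma^2}{n}\right)=0$, as desired.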
Modify the estimator of such that it becomes an unbiased estimator.
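The bias of the variance MLE can also be seen in a simulation. The sketch below assumes the standard facts that $\frac{1}{n}\sum(X_i-\bar{X})^2$ has mean $\frac{(n-1)\sigma^2}{n}$ while the sample variance $S^2$ (divisor $n-1$) has mean $\sigma^2$. The true parameter values, sample size, number of repetitions and seed are all arbitrary choices.

```python
import random

def variance_mle(xs):
    """(1/n) * sum (x_i - x_bar)^2, the MLE of sigma^2 under normality."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)

rng = random.Random(1)          # fixed seed so the run is reproducible
mu, sigma = 0.0, 2.0            # hypothetical true values, so sigma^2 = 4
n, reps = 5, 20000              # small samples make the bias visible

mle_total = 0.0
for _ in range(reps):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    mle_total += variance_mle(xs)

mean_mle = mle_total / reps               # approximates (n - 1) / n * sigma^2 = 3.2
mean_s2 = mean_mle * n / (n - 1)          # average of S^2, approximates sigma^2 = 4
print(mean_mle, mean_s2)
```

The average of the MLE sits noticeably below $\sigma^2$ for $n=5$, while rescaling by $\frac{n}{n-1}$ brings the average close to $\sigma^2$. Repeating the run with a larger $n$ shrinks the gap, in line with asymptotic unbiasedness.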
We have discussed how to evaluate the unbiasedness of estimators.
Now, if we are given two unbiased estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$, how should we compare their goodness?
Their goodness is the same if we are only comparing them in terms of unbiasedness.
Therefore, we need another criterion in this case. One possible way is to compare their variances,
and the one with smaller variance is better, since on average, the estimator is less deviated from its mean, which is the value of the unknown parameter by the definition of unbiased estimator, and thus the one with smaller variance is more accurate in some deviation sense.
Indeed, an unbiased estimator can still have a large variance, and thus deviate a lot from its mean. Such estimator is unbiased since the positive deviations and negative deviations somehow cancel out each other.
This is the idea of efficiency.
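Suppose $\hat{\theta}_1$ and $\hat{\theta}_2$ are two unbiased estimators of an unknown parameter $\theta$.
The efficiency of $\hat{\theta}_1$ relative to $\hat{\theta}_2$ is $\operatorname{Eff}(\hat{\theta}_1,\hat{\theta}_2)=\frac{\operatorname{Var}(\hat{\theta}_2)}{\operatorname{Var}(\hat{\theta}_1)}$. If $\operatorname{Eff}(\hat{\theta}_1,\hat{\theta}_2)>1$, then we say that $\hat{\theta}_1$ is relatively more efficient than $\hat{\theta}_2$.
Since $\operatorname{Eff}(\hat{\theta}_1,\hat{\theta}_2)>1\iff\operatorname{Var}(\hat{\theta}_1)<\operatorname{Var}(\hat{\theta}_2)$, the estimator with smaller variance is relatively more efficient than the estimator with larger variance.
Normally, the variance $\operatorname{Var}(\hat{\theta}_1)$ is nonzero, and thus the efficiency is defined in normal cases.
Sometimes, it is also called relative efficiency, because the efficiency describes "how many" multiples of $\operatorname{Var}(\hat{\theta}_1)$ the variance $\operatorname{Var}(\hat{\theta}_2)$ equals.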
One may ask why the ratio of variances, instead of the difference in variances, is used in the definition to compare variances. A possible reason is that the ratio of variances does not have any unit (the units of the variances, if they exist, cancel each other out), but the difference in variances can have a unit. Also, using the ratio of variances allows us to compare different efficiencies numerically, even when they are calculated from different variances.
Actually, for the variance of an unbiased estimator, since the mean of the unbiased estimator is the unknown parameter $\theta$, it measures
the mean of the squared deviation from $\theta$, and we have a specific term for this quantity, namely mean squared error (MSE).
(Mean squared error)
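Suppose $\hat{\theta}$ is an estimator of a parameter $\theta$. The mean squared error (MSE) of $\hat{\theta}$ is
$$\operatorname{MSE}(\hat{\theta})=E\left[(\hat{\theta}-\theta)^2\right].$$
From this definition, $\operatorname{MSE}(\hat{\theta})$ is the mean value of the square of the error $\hat{\theta}-\theta$, and hence the name mean squared error.
Notice that in the definition of MSE, we do not require $\hat{\theta}$ to be an unbiased estimator. Thus, $\hat{\theta}$ in the definition may be biased.
We have mentioned that when $\hat{\theta}$ is unbiased, its variance is actually its MSE.
In the following, we will give a more general relationship between $\operatorname{MSE}(\hat{\theta})$ and $\operatorname{Var}(\hat{\theta})$, not just for unbiased estimators.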
(Relationship between mean squared error and variance)
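If $\operatorname{Var}(\hat{\theta})$ exists, then $\operatorname{MSE}(\hat{\theta})=\operatorname{Var}(\hat{\theta})+\left[\operatorname{Bias}(\hat{\theta})\right]^2$.
By definition, we have $\operatorname{MSE}(\hat{\theta})=E[(\hat{\theta}-\theta)^2]$ and $\operatorname{Var}(\hat{\theta})=E[(\hat{\theta}-E[\hat{\theta}])^2]$.
From these, we are motivated to write
$$\operatorname{MSE}(\hat{\theta})=E\left[\left(\hat{\theta}-E[\hat{\theta}]+E[\hat{\theta}]-\theta\right)^2\right]=E\left[(\hat{\theta}-E[\hat{\theta}])^2\right]+2\left(E[\hat{\theta}]-\theta\right)\underbrace{E\left[\hat{\theta}-E[\hat{\theta}]\right]}_{=0}+\left(E[\hat{\theta}]-\theta\right)^2=\operatorname{Var}(\hat{\theta})+\left[\operatorname{Bias}(\hat{\theta})\right]^2.$$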
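The decomposition of the mean squared error into variance plus squared bias can be verified exactly in a small discrete case. The sketch below assumes a Bernoulli sample of size $n=4$ with $p=0.3$ and uses a deliberately biased estimator $(s+1)/(n+2)$ of $p$, where $s$ is the number of successes; both the estimator and the parameter values are made up for illustration.

```python
from math import comb

n, p = 4, 0.3   # hypothetical sample size and true success probability

def estimator(s):
    """A deliberately biased estimator of p based on s = x_1 + ... + x_n."""
    return (s + 1) / (n + 2)

# S = X_1 + ... + X_n follows the binomial distribution with n trials.
pmf = {s: comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(n + 1)}

mean = sum(estimator(s) * w for s, w in pmf.items())
variance = sum((estimator(s) - mean) ** 2 * w for s, w in pmf.items())
bias = mean - p
mse = sum((estimator(s) - p) ** 2 * w for s, w in pmf.items())

print("bias:", bias)
print("MSE:", mse)
print("Var + Bias^2:", variance + bias ** 2)
```

The last two printed values agree up to floating-point rounding even though the estimator is biased, which is exactly what the general relationship states.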
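Let $X_1,\dots,X_n$ ($n>1$) be a random sample from $\mathcal{N}(\mu,\sigma^2)$.
(a) Show that the single observation estimator $X_1$ is an unbiased estimator for $\mu$.
(b) Calculate the MSE of $X_1$ and $\bar{X}$ respectively.
(c) Which of $X_1$ and $\bar{X}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency?
(a) Since $E[X_1]=\mu$, the result follows.
(b) Since both estimators are unbiased, $\operatorname{MSE}(X_1)=\operatorname{Var}(X_1)=\sigma^2$, and $\operatorname{MSE}(\bar{X})=\operatorname{Var}(\bar{X})=\frac{\sigma^2}{n}$.
(c) Since $\operatorname{Eff}(\bar{X},X_1)=\frac{\operatorname{Var}(X_1)}{\operatorname{Var}(\bar{X})}=n>1$, $\bar{X}$ is relatively more efficient than $X_1$. Since both $X_1$ and $\bar{X}$ are unbiased estimators of $\mu$, we conclude that $\bar{X}$ is a better estimator of $\mu$ in terms of unbiasedness and efficiency.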
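In addition to the random sample with sample size $n$ in the example, suppose we take another random sample with sample size $m$.
Let $\bar{X}_n$ and $\bar{X}_m$ denote the sample means for the samples with sample sizes $n$ and $m$ respectively.
(a) Calculate $\operatorname{Eff}(\bar{X}_n,\bar{X}_m)=\frac{\operatorname{Var}(\bar{X}_m)}{\operatorname{Var}(\bar{X}_n)}$.
(b) State the condition on the sample sizes $n$ and $m$ under which $\bar{X}_n$ is relatively more efficient than $\bar{X}_m$.
(a) Since $\operatorname{Var}(\bar{X}_n)=\frac{\sigma^2}{n}$ (from the example), and $\operatorname{Var}(\bar{X}_m)=\frac{\sigma^2}{m}$ (by similar arguments as in the example),
$$\operatorname{Eff}(\bar{X}_n,\bar{X}_m)=\frac{\sigma^2/m}{\sigma^2/n}=\frac{n}{m}.$$
(b) Since $\operatorname{Eff}(\bar{X}_n,\bar{X}_m)>1\iff\frac{n}{m}>1$, the condition is $n>m$.
This shows that the sample mean with a larger sample size is relatively more efficient than the one with a smaller sample size.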
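The efficiency of the sample mean relative to the single observation estimator can be approximated by simulation. This is a rough sketch under assumed values (normal data, $n=10$, a fixed seed): it estimates both variances from repeated samples and takes their ratio, which should be near $n$.

```python
import random
import statistics

rng = random.Random(2)          # fixed seed so the run is reproducible
mu, sigma = 1.0, 3.0            # hypothetical true mean and standard deviation
n, reps = 10, 20000             # sample size and number of simulated samples

first_obs, sample_means = [], []
for _ in range(reps):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    first_obs.append(xs[0])              # single observation estimator X_1
    sample_means.append(sum(xs) / n)     # sample mean

var_first = statistics.pvariance(first_obs)
var_mean = statistics.pvariance(sample_means)
efficiency = var_first / var_mean        # Var(X_1) / Var(sample mean), about n

print(var_first, var_mean, efficiency)
```

The ratio is only an approximation of $n$ because both variances are themselves estimated from a finite number of simulated samples.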
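Proposition. $\operatorname{MSE}(\hat{\theta})\to 0$ as $n\to\infty$ if and only if $\operatorname{Var}(\hat{\theta})\to 0$ and $\operatorname{Bias}(\hat{\theta})\to 0$ as $n\to\infty$.
"if" part is simple. Assume $\operatorname{Var}(\hat{\theta})\to 0$ and $\operatorname{Bias}(\hat{\theta})\to 0$. Then, $\operatorname{MSE}(\hat{\theta})=\operatorname{Var}(\hat{\theta})+[\operatorname{Bias}(\hat{\theta})]^2\to 0+0^2=0$.
"only if" part: we can use proof by contrapositive, i.e., proving that if $\operatorname{Var}(\hat{\theta})\not\to 0$ or $\operatorname{Bias}(\hat{\theta})\not\to 0$, then $\operatorname{MSE}(\hat{\theta})\not\to 0$.
Case 1: when $\operatorname{Var}(\hat{\theta})\not\to 0$, recall that the variance is nonnegative. Also, $[\operatorname{Bias}(\hat{\theta})]^2\ge 0$. It follows that $\operatorname{MSE}(\hat{\theta})\ge\operatorname{Var}(\hat{\theta})\ge 0$, so if the MSE converged to zero, the variance would converge to zero as well. Hence, the MSE does not converge to zero.
Case 2: when $\operatorname{Bias}(\hat{\theta})\not\to 0$, it means $[\operatorname{Bias}(\hat{\theta})]^2\not\to 0$. Also, $\operatorname{Var}(\hat{\theta})\ge 0$. It follows that $\operatorname{MSE}(\hat{\theta})\ge[\operatorname{Bias}(\hat{\theta})]^2\ge 0$, so, by the same argument, the MSE does not converge to zero.
As a result, if we know that $\operatorname{MSE}(\hat{\theta})\to 0$, then we know that $\operatorname{Bias}(\hat{\theta})\to 0$, i.e., $\hat{\theta}$ is an asymptotically unbiased estimator (in addition to $\operatorname{Var}(\hat{\theta})\to 0$); it may even be an unbiased estimator.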
Now, we know that the smaller the variance of an unbiased estimator, the more efficient (and "better") it is.
Thus, it is natural that we want to know what is the most efficient (i.e., the "best") unbiased estimator, i.e., the unbiased estimator with the smallest variance.
We have a specific name for such unbiased estimator, namely uniformly minimum-variance unbiased estimator (UMVUE).
To be more precise, we have the following definition for UMVUE:
(Uniformly minimum-variance unbiased estimator)
The uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator with the smallest variance among all unbiased estimators.
Indeed, UMVUE is unique, i.e., there is exactly one unbiased estimator with the smallest variance among all unbiased estimators,
and we will prove it in the following.
(Uniqueness of UMVUE)
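If $W$ is a UMVUE of a function $\tau(\theta)$ of a parameter $\theta$, then $W$ is unique.
Assume that $W$ is a UMVUE of $\tau(\theta)$, and $W'$ is another UMVUE of $\tau(\theta)$.
Define the estimator $W^*=\frac{1}{2}(W+W')$.
Since $E[W^*]=\frac{1}{2}\left(E[W]+E[W']\right)=\frac{1}{2}\left(\tau(\theta)+\tau(\theta)\right)=\tau(\theta)$, $W^*$ is an unbiased estimator of $\tau(\theta)$.
Now, we consider the variance of $W^*$. By the covariance inequality, and since $\operatorname{Var}(W)=\operatorname{Var}(W')$ (both are UMVUEs),
$$\operatorname{Var}(W^*)=\frac{1}{4}\operatorname{Var}(W)+\frac{1}{4}\operatorname{Var}(W')+\frac{1}{2}\operatorname{Cov}(W,W')\le\frac{1}{4}\operatorname{Var}(W)+\frac{1}{4}\operatorname{Var}(W')+\frac{1}{2}\sqrt{\operatorname{Var}(W)\operatorname{Var}(W')}=\operatorname{Var}(W).$$
Thus, we now have either $\operatorname{Var}(W^*)<\operatorname{Var}(W)$ or $\operatorname{Var}(W^*)=\operatorname{Var}(W)$.
If the former is true, then $W$ is not a UMVUE of $\tau(\theta)$ by definition, since we can find another unbiased estimator, namely $W^*$, with smaller variance than it.
Hence, we must have the latter, i.e.,
$$\operatorname{Var}(W^*)=\operatorname{Var}(W).$$
This implies that when we apply the covariance inequality, the equality holds, i.e.,
$$\operatorname{Cov}(W,W')=\sqrt{\operatorname{Var}(W)\operatorname{Var}(W')},$$
which means $W'$ is increasing linearly with $W$, i.e., we can write
$$W'=aW+b$$
for some constants $a>0$ and $b$.
Now, we consider the covariance $\operatorname{Cov}(W,W')=\operatorname{Cov}(W,aW+b)=a\operatorname{Var}(W)$.
On the other hand, since the equality holds in the covariance inequality, and $\operatorname{Var}(W)=\operatorname{Var}(W')$ (since they are both UMVUEs), $\operatorname{Cov}(W,W')=\sqrt{\operatorname{Var}(W)\operatorname{Var}(W')}=\operatorname{Var}(W)$.
Thus, we have $a\operatorname{Var}(W)=\operatorname{Var}(W)$, and so $a=1$ (assuming $\operatorname{Var}(W)>0$).
It remains to show that $b=0$ to prove that $W'=W$, and therefore conclude that $W$ is unique.
From above, we currently have $W'=W+b$. Taking expectations on both sides gives $\tau(\theta)=\tau(\theta)+b$, and so $b=0$, as desired.
Thus, when we are able to find a UMVUE, it is the unique one, and the variance of every other possible unbiased estimator is strictly greater than the variance of the UMVUE.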
Without using some results, it is quite difficult to determine the UMVUE, since there are many (perhaps even infinitely many) possible unbiased estimators, so it is quite hard to ensure that one particular unbiased estimator is relatively more efficient than every other possible unbiased estimator.
Therefore, we will introduce some approaches that help us to find the UMVUE.
For the first approach, we find a lower bound on the variances of all possible unbiased estimators.
After getting such lower bound, if we can find an unbiased estimator with variance to be exactly equal to the lower bound, then the lower bound is the minimum value of the variances, and hence such unbiased estimator is an UMVUE by definition.
There are many possible lower bounds, but when the lower bound is greater, it is closer to the actual minimum value of the variances, and hence "better".
An unbiased estimator can still be an UMVUE even if its variance does not achieve the lower bound.
A common way to find such lower bound is to use the Cramer-Rao lower bound (CRLB), and we get the CRLB through Cramer-Rao inequality.
Before stating the inequality, let us define some related terms.
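The Fisher information about a parameter $\theta$ with sample size $n$ is
$$I_n(\theta)=E\left[\left(\frac{\partial\ln L(\theta;\mathbf{X})}{\partial\theta}\right)^2\right],$$
where $\ln L(\theta;\mathbf{X})$ is the log-likelihood function (as a random variable).
$\frac{\partial\ln L(\theta;\mathbf{X})}{\partial\theta}$ is called the score function, and is denoted by $S(\theta;\mathbf{X})$.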
The "" may or may not be a parameter vector. If it is just a single parameter (usually the case here), then it is the same as "". We use "" instead of "" to emphasize that the "" in and is referring to the "" in ""
It is possible to define "Fisher information about a parameter vector", but in this case the Fisher information takes the form of a matrix instead of a single number, and it is called Fisher information matrix. However, since it is more complicated, we will not discuss it here.
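Since the expected value of the score function is (for a continuous distribution; replace the integrals by sums for a discrete one)
$$E\left[\frac{\partial\ln L(\theta;\mathbf{X})}{\partial\theta}\right]=\int\frac{\partial\ln L(\theta;\mathbf{x})}{\partial\theta}L(\theta;\mathbf{x})\,d\mathbf{x}=\int\frac{\partial L(\theta;\mathbf{x})}{\partial\theta}\,d\mathbf{x},$$
and under some regularity conditions which allow interchange of derivative and integral, this equals $\frac{\partial}{\partial\theta}\int L(\theta;\mathbf{x})\,d\mathbf{x}=\frac{\partial}{\partial\theta}1=0$, the Fisher information about $\theta$ is also the variance of the score function, i.e., $E\left[\left(\frac{\partial\ln L(\theta;\mathbf{X})}{\partial\theta}\right)^2\right]=\operatorname{Var}\left(\frac{\partial\ln L(\theta;\mathbf{X})}{\partial\theta}\right)$.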
For the regularity conditions which allow interchange of derivative and integral, they include
the partial derivatives involved should exist, i.e., the (natural log) of the functions involved is differentiable
the integrals involved should be differentiable
the support does not depend on the parameter(s) involved
We have some results that assist us to compute the Fisher information.
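Let $X_1,\dots,X_n$ be a random sample from a distribution with pdf or pmf $f(x;\theta)$.
Also, let $I(\theta)=E\left[\left(\frac{\partial\ln f(X;\theta)}{\partial\theta}\right)^2\right]$, the Fisher information about $\theta$ with sample size one.
Then, under some regularity conditions which allow interchange of derivative and integral, $I_n(\theta)=nI(\theta)$, where $I_n(\theta)$ is the Fisher information about $\theta$ with sample size $n$.
Under some regularity conditions which allow interchange of derivative and integral, $I(\theta)=-E\left[\frac{\partial^2\ln f(X;\theta)}{\partial\theta^2}\right]$.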
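For a concrete case, the Fisher information of a single Bernoulli observation can be computed exactly in two ways: as the expected squared score, and as minus the expected second derivative of the log pmf. The sketch below assumes the pmf $f(x;p)=p^x(1-p)^{1-x}$ with a hypothetical value $p=0.3$; the function names are illustrative, and the closed form $\frac{1}{p(1-p)}$ is the standard result for this distribution.

```python
def score(x, p):
    """d/dp ln f(x; p) for the Bernoulli pmf f(x; p) = p^x (1 - p)^(1 - x)."""
    return x / p - (1 - x) / (1 - p)

def second_derivative(x, p):
    """d^2/dp^2 ln f(x; p) for the same pmf."""
    return -x / p ** 2 - (1 - x) / (1 - p) ** 2

def expectation(g, p):
    """E[g(X, p)] for X following the Bernoulli distribution with parameter p."""
    return (1 - p) * g(0, p) + p * g(1, p)

p, n = 0.3, 25                 # hypothetical parameter value and sample size

mean_score = expectation(score, p)                               # should be 0
info_from_score = expectation(lambda x, q: score(x, q) ** 2, p)  # E[score^2]
info_from_curvature = -expectation(second_derivative, p)         # -E[second derivative]
closed_form = 1 / (p * (1 - p))
info_n = n * info_from_score   # Fisher information for a sample of size n

print(mean_score)
print(info_from_score, info_from_curvature, closed_form)
print(info_n)
```

Both computations agree with the closed form, and the score has mean zero, which is why the expected squared score is also the variance of the score.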