# Statistics/Hypothesis Testing

## Introduction

In previous chapters, we have discussed two methods for estimating unknown parameters, namely point estimation and interval estimation. Estimating unknown parameters is an important area in statistical inference, and in this chapter we will discuss another important area, namely hypothesis testing, which is related to decision making. Indeed, the concepts of confidence intervals and hypothesis testing are closely related, as we will demonstrate.

## Basic concepts and terminologies

Before discussing how to conduct a hypothesis test and how to evaluate its "goodness", let us first introduce some basic concepts and terminology related to hypothesis testing.

Definition. (Hypothesis) A (statistical) hypothesis is a statement about population parameter(s).

There are two terms that classify hypotheses:

Definition. (Simple and composite hypothesis) A hypothesis is a simple hypothesis if it completely specifies the distribution of the population (that is, the distribution is completely known, without any unknown parameters involved), and is a composite hypothesis otherwise.

Sometimes, it is not immediately clear whether a hypothesis is simple or composite. To understand the classification of hypotheses more clearly, let us consider the following example.

Example. Consider a distribution with parameter ${\displaystyle \theta }$ , taking values in the parameter space ${\displaystyle \Theta =[0,\infty )}$ . Determine whether each of the following hypotheses is simple or composite.

(a) ${\displaystyle \theta =1}$ .

(b) ${\displaystyle \theta =\theta _{0}}$  where ${\displaystyle \theta _{0}\in \Theta }$  is known.

(c) ${\displaystyle \theta >1}$ .

(d) ${\displaystyle \theta >\theta _{0}}$  where ${\displaystyle \theta _{0}\in \Theta }$  is known.

(e) ${\displaystyle \theta \leq \theta _{0}}$  where ${\displaystyle \theta _{0}\in \Theta }$  is known.

(f) ${\displaystyle \theta \in \Theta _{0}}$  where ${\displaystyle \Theta _{0}}$  is a nonempty subset of ${\displaystyle \Theta }$ . [1]

Solution.

• (a) and (b) are simple hypotheses, since each of them completely specifies the distribution.
• (c), (d) and (e) are composite hypotheses, since the parameter ${\displaystyle \theta }$  is not completely specified, and hence neither is the distribution.
• (f) may be a simple or a composite hypothesis, depending on ${\displaystyle \Theta _{0}}$ . If ${\displaystyle \Theta _{0}}$  contains exactly one element, it is a simple hypothesis; otherwise, it is a composite hypothesis.

In hypothesis tests, we consider two hypotheses:

Definition. (Null hypothesis and alternative hypothesis) In hypothesis testing, the hypothesis being tested is the null hypothesis (denoted by ${\displaystyle H_{0}}$ ) and another complementary hypothesis (to ${\displaystyle H_{0}}$ ) is the alternative hypothesis (denoted by ${\displaystyle H_{1}}$ ).

Remark.

• ${\displaystyle H_{1}}$  is complementary to ${\displaystyle H_{0}}$  in the sense that if ${\displaystyle H_{0}}$  is true (false), then ${\displaystyle H_{1}}$  is false (true) (exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true). Because of this, we usually say ${\displaystyle H_{0}}$  is tested against ${\displaystyle H_{1}}$  (so we often write ${\displaystyle H_{0}\quad {\text{vs.}}\quad H_{1}}$ ).
• Usually, ${\displaystyle H_{0}}$  corresponds to the status quo ("no effect"), and ${\displaystyle H_{1}}$  corresponds to some interesting "research findings" (so ${\displaystyle H_{1}}$  is sometimes also called the research hypothesis).
• Since ${\displaystyle H_{0}}$  often corresponds to the status quo, we usually assume ${\displaystyle H_{0}}$  is true unless there is sufficient evidence against it.
• This is somewhat analogous to the legal principle of presumption of innocence, which states that every person accused of a crime is considered innocent (${\displaystyle H_{0}}$  is assumed to be true) until proven guilty (there is sufficient evidence against ${\displaystyle H_{0}}$ ).

A general form of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is ${\displaystyle H_{0}:\theta \in \Theta _{0}}$  and ${\displaystyle H_{1}:\theta \in \Theta _{1}}$  where ${\displaystyle \Theta _{1}=\Theta _{0}^{c}}$ , which is the complement of ${\displaystyle \Theta _{0}}$  (with respect to ${\displaystyle \Theta }$ ), i.e., ${\displaystyle \Theta _{0}^{c}=\Theta \setminus \Theta _{0}}$  (${\displaystyle \Theta }$  is the parameter space, containing all possible values of ${\displaystyle \theta }$ ). The reason for choosing the complement of ${\displaystyle \Theta _{0}}$  in ${\displaystyle H_{1}}$  is that ${\displaystyle H_{1}}$  is the complementary hypothesis to ${\displaystyle H_{0}}$ , as suggested in the above definition.

Remark.

• In some books, it is only required that ${\displaystyle \Theta _{0}}$  and ${\displaystyle \Theta _{1}}$  be disjoint (nonempty) subsets of the parameter space ${\displaystyle \Theta }$ , and it is not necessary that ${\displaystyle \Theta _{0}\cup \Theta _{1}=\Theta }$ .
• However, it is usually still assumed that exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true, which means ${\displaystyle \theta }$  is not supposed to take values outside the set ${\displaystyle \Theta _{0}\cup \Theta _{1}}$  (otherwise, neither ${\displaystyle H_{0}}$  nor ${\displaystyle H_{1}}$  would be true).
• Thus, in this case, we may regard ${\displaystyle \Theta _{0}\cup \Theta _{1}}$  as the effective parameter space (since ${\displaystyle \theta }$  is assumed to take values in this union), and with respect to this parameter space, ${\displaystyle \Theta _{1}}$  is the complement of ${\displaystyle \Theta _{0}}$ .
• Alternatively, some may view the parameter space as "linked" with a distribution, so that for a given distribution the parameter space is fixed to be the one suggested by the distribution itself. In this case, ${\displaystyle \Theta _{1}}$  need not be the complement of ${\displaystyle \Theta _{0}}$  (with respect to the parameter space).
• Despite these different definitions of ${\displaystyle \Theta _{0}}$  and ${\displaystyle \Theta _{1}}$ , a common feature is that we assume exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true.

Example. Suppose your friend gives you a coin for tossing, and you do not know whether it is fair or not. However, since the coin is given by your friend, you believe that the coin is fair unless there is sufficient evidence suggesting otherwise. What are the null and alternative hypotheses in this context (suppose the coin never lands on its edge)?

Solution. Let ${\displaystyle p}$  be the probability for landing on heads after tossing the coin. The null hypothesis is ${\displaystyle H_{0}:p={\frac {1}{2}}}$ . The alternative hypothesis is ${\displaystyle H_{1}:p\neq {\frac {1}{2}}}$ .

Exercise. Suppose we replace "coin" with "six-sided die" in the above question. What are the null and alternative hypotheses? (Hint: You may let ${\displaystyle p_{1},p_{2},\dotsc ,p_{6}}$  be the probabilities for "1", "2", ..., "6" coming up after rolling the die respectively.)

Solution

Let ${\displaystyle p_{1},p_{2},\dotsc ,p_{6}}$  be the probabilities for "1", "2", ..., "6" coming up after rolling the die respectively. The null hypothesis is ${\displaystyle H_{0}:p_{1}=p_{2}=\dotsb =p_{6}={\frac {1}{6}}}$ , and the alternative hypothesis is ${\displaystyle H_{1}:{\text{at least one of }}p_{1},\dotsc ,p_{6}\neq {\frac {1}{6}}}$ . (In fact, when one of ${\displaystyle p_{1},\dotsc ,p_{6}}$  differs from ${\displaystyle {\frac {1}{6}}}$ , at least one other probability must also differ from ${\displaystyle {\frac {1}{6}}}$ , since the probabilities sum to 1.)

We have mentioned that exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is assumed to be true. To make a decision, we need to decide which hypothesis should be regarded as true. Of course, as one may expect, this decision is not perfect, and errors will be involved. So, we cannot say we "prove" that a particular hypothesis is true (that is, we cannot be certain that a particular hypothesis is true). Despite this, we may "regard" (or "accept") a particular hypothesis as true (but not prove it as true) when we have sufficient evidence leading us to make this decision (ideally, with small error probabilities [2]).

Remark.

• Philosophically, "not rejecting ${\displaystyle H_{0}}$ " is different from "accepting ${\displaystyle H_{0}}$ ": the former can mean that we do not actually regard ${\displaystyle H_{0}}$  as true but merely lack sufficient evidence to reject it, whereas "accepting ${\displaystyle H_{0}}$ " means that we regard ${\displaystyle H_{0}}$  as true.
• In spite of this, we will not dwell on these philosophical issues. We will simply assume that whenever there is insufficient evidence to reject ${\displaystyle H_{0}}$  (i.e., we do not reject ${\displaystyle H_{0}}$ ), we act as if ${\displaystyle H_{0}}$  is true, that is, we still accept ${\displaystyle H_{0}}$ , even if we may not actually "believe" in it.
• Of course, in some other places, the phrase "accepting the null hypothesis" is avoided because of these philosophical issues.

Now, we face two questions. First, what evidence should we consider? Second, what is meant by "sufficient"? For the first question, a natural answer is that we should consider the observed samples: we are making hypotheses about the population, and the samples are taken from, and thus closely related to, the population, so they should help us make the decision.

To answer the second question, we need the concepts of hypothesis testing. In particular, we will construct a so-called rejection region or critical region to help us determine whether we should reject the null hypothesis (i.e., regard ${\displaystyle H_{0}}$  as false), and hence (naturally) regard ${\displaystyle H_{1}}$  as true ("accept" ${\displaystyle H_{1}}$ ) (we have assumed that exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true, so when we regard one of them as false, we should regard the other as true). Likewise, when we do not reject ${\displaystyle H_{0}}$ , we will act as if, or accept, ${\displaystyle H_{0}}$  as true (and thus also reject ${\displaystyle H_{1}}$ , since exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true).

Let us formally define the terms related to hypothesis testing in the following.

Definition. (Hypothesis test) A hypothesis test is a rule that specifies for which observed sample values we (do not reject and) accept ${\displaystyle H_{0}}$  as true (and thus reject ${\displaystyle H_{1}}$ ), and for which observed sample values we reject ${\displaystyle H_{0}}$  and accept ${\displaystyle H_{1}}$ .

Remark.

• A hypothesis test is sometimes simply called a "test" for simplicity. We also sometimes use the Greek letters "${\displaystyle \varphi }$ ", "${\displaystyle \psi }$ ", etc. to denote tests.

Definition. (Rejection and acceptance regions) Let ${\displaystyle S}$  be the set containing all possible observations ${\displaystyle \mathbf {x} =(x_{1},\dotsc ,x_{n})}$  of a random sample ${\displaystyle X_{1},\dotsc ,X_{n}}$ . The rejection region (denoted by ${\displaystyle R}$ ) is the subset of ${\displaystyle S}$  for which ${\displaystyle H_{0}}$  is rejected. The complement of the rejection region with respect to ${\displaystyle S}$  (${\displaystyle R^{c}}$ ) is the acceptance region (it is thus the subset of ${\displaystyle S}$  for which ${\displaystyle H_{0}}$  is accepted).

Remark.

• Graphically, it looks like
    S
*------------*
|///|........|
|///\........|
|////\.......|
|/////\......|
*------------*

*--*
|//|: R
*--*

*--*
|..|: R^c
*--*


Typically, we use a test statistic (a statistic for conducting a hypothesis test) to specify the rejection region. For instance, if the random sample is ${\displaystyle X_{1},\dotsc ,X_{n}}$  and the test statistic is ${\displaystyle {\overline {X}}}$ , the rejection region may be, say, ${\displaystyle R=\{\mathbf {x} :{\overline {x}}<2\}}$  (where ${\displaystyle x_{1},\dotsc ,x_{n}}$  and ${\displaystyle {\overline {x}}}$  are the observed values of ${\displaystyle X_{1},\dotsc ,X_{n}}$  and ${\displaystyle {\overline {X}}}$  respectively). Through this, we can directly construct a hypothesis test: when ${\displaystyle \mathbf {x} \in R}$ , we reject ${\displaystyle H_{0}}$  and accept ${\displaystyle H_{1}}$ ; otherwise, if ${\displaystyle \mathbf {x} \in R^{c}}$ , we accept ${\displaystyle H_{0}}$ . So, in general, to specify the rule in a hypothesis test, we just need a rejection region. After that, we apply the test to test ${\displaystyle H_{0}}$  against ${\displaystyle H_{1}}$ . There are some terminologies related to hypothesis tests constructed in this way:

Definition. (Left-, right- and two-tailed tests) Let ${\displaystyle T(\mathbf {x} )=T(x_{1},\dotsc ,x_{n})}$  be the observed test statistic for a hypothesis test, where ${\displaystyle x_{1},\dotsc ,x_{n}}$  are the realizations of the random sample.

• If the rejection region is in the form of ${\displaystyle \{\mathbf {x} :T(\mathbf {x} )\leq k_{1}\}}$ , then the hypothesis test is called a left-tailed test (or lower-tailed test).
• If the rejection region is in the form of ${\displaystyle \{\mathbf {x} :T(\mathbf {x} )\geq k_{2}\}}$ , then the hypothesis test is called a right-tailed test (or upper-tailed test).
• If the rejection region is in the form of ${\displaystyle \{\mathbf {x} :T(\mathbf {x} )\leq k_{3}{\text{ or }}T(\mathbf {x} )\geq k_{4}\}}$ , then the hypothesis test is called a two-tailed test.

Remark.

• The inequality signs can be strict, i.e., the above inequality signs can be replaced by "${\displaystyle <}$ " and "${\displaystyle >}$ ".
• We use the terminology "tail" since the rejection region includes the values that are located at the "extreme portions" (i.e., very left (with small values) or very right (with large values) portions) (called tails) of distributions.
• When ${\displaystyle k_{3}=-k_{4}}$ , we may say the two-tailed test is equal-tailed. In this case, we can also express the rejection region as ${\displaystyle \{\mathbf {x} :|T(\mathbf {x} )|\geq k_{4}\}}$ .
• We sometimes also call upper-tailed and lower-tailed tests as one-sided tests, and two-tailed tests as two-sided tests.

Example. Suppose the rejection region is ${\displaystyle R=\{(x_{1},x_{2},x_{3}):x_{1}+x_{2}+x_{3}>6\}}$ , and it is observed that ${\displaystyle x_{1}=1,x_{2}=2,x_{3}=3}$ . Which hypothesis, ${\displaystyle H_{0}}$  or ${\displaystyle H_{1}}$ , should we accept?

Solution. Since ${\displaystyle (x_{1},x_{2},x_{3})\in R^{c}}$ , we should (not reject and) accept ${\displaystyle H_{0}}$ .

Exercise. What is the type of this hypothesis test?

Solution

Right-tailed test.
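As a sketch (the function name is illustrative, not from the text), the rule "reject ${\displaystyle H_{0}}$  if and only if the observation falls in ${\displaystyle R}$ " can be encoded directly as a predicate on the observed sample:

```python
# Hypothetical sketch: the rejection region R = {(x1, x2, x3) : x1 + x2 + x3 > 6}
# expressed as a Python predicate on the observed sample.

def in_rejection_region(x):
    """Return True if the observed sample x falls in R (i.e., reject H0)."""
    return sum(x) > 6

observed = (1, 2, 3)                  # the realizations from the example
print(in_rejection_region(observed))  # False: we do not reject (and accept) H0
```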

As we have mentioned, the decisions made by a hypothesis test are not perfect, and errors occur. Indeed, on closer thought, there are two types of errors, as follows:

Definition. (Type I and II errors) A type I error is the rejection of ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is true. A type II error is the acceptance of ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is false.

We can illustrate these two types of errors more clearly using the following table.

Type I and II errors

| | Accept ${\displaystyle H_{0}}$ | Reject ${\displaystyle H_{0}}$ |
|---|---|---|
| ${\displaystyle H_{0}}$  is true | Correct decision | Type I error |
| ${\displaystyle H_{0}}$  is false | Type II error | Correct decision |

We can express ${\displaystyle H_{0}:\theta \in \Theta _{0}}$  and ${\displaystyle H_{1}:\theta \in \Theta _{0}^{c}}$ . Also, assume the rejection region is ${\displaystyle R=R(\mathbf {X} )}$  (i.e., the rejection region with "${\displaystyle x}$ " replaced by "${\displaystyle X}$ "). In general, when "${\displaystyle R}$ " is put together with "${\displaystyle X}$ ", we assume ${\displaystyle R=R(\mathbf {X} )}$ .

Then we have some notations and expressions for probabilities of making type I and II errors: (let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample and ${\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{n})}$ )

• The probability of making a type I error, denoted by ${\displaystyle \alpha (\theta )}$ , is ${\displaystyle \mathbb {P} _{\theta }(\mathbf {X} \in R)}$  if ${\displaystyle \theta \in \Theta _{0}}$ .
• The probability of making a type II error, denoted by ${\displaystyle \beta (\theta )}$ , is ${\displaystyle \mathbb {P} _{\theta }(\mathbf {X} \in R^{c})=1-\mathbb {P} _{\theta }(\mathbf {X} \in R)}$  if ${\displaystyle \theta \in \Theta _{0}^{c}}$ .

Remark.

• Remark on notations: In some other places, ${\displaystyle \alpha (\theta )}$  may be expressed as "${\displaystyle \mathbb {P} (\mathbf {X} \in R|\theta \in \Theta _{0})}$ ", "${\displaystyle \mathbb {P} (\mathbf {X} \in R|H_{0})}$ " or "${\displaystyle \mathbb {P} (\mathbf {X} \in R|H_{0}{\text{ is true}})}$ ". We should be careful that these notations are not supposed to be interpreted as conditional probabilities [3]; they are just notations. This applies similarly to ${\displaystyle \beta (\theta )}$ .
• When ${\displaystyle \Theta _{0}}$  contains a single value only, we simply denote the type I error probability by ${\displaystyle \alpha }$ . Similarly, when ${\displaystyle \Theta _{1}}$  contains a single value only, we simply denote the type II error probability by ${\displaystyle \beta }$ .

Notice that we have a common expression in both ${\displaystyle \alpha (\theta )}$  and ${\displaystyle \beta (\theta )}$ , which is "${\displaystyle \mathbb {P} _{\theta }((X_{1},\dotsc ,X_{n})\in R)}$ ". Indeed, we can also write this expression as

${\displaystyle \mathbb {P} _{\theta }((X_{1},\dotsc ,X_{n})\in R)={\begin{cases}\alpha (\theta ),&\theta \in \Theta _{0};\\1-\beta (\theta ),&\theta \in \Theta _{0}^{c}.\end{cases}}}$

Through this, we can observe that this expression contains all the information about the probabilities of making errors for a hypothesis test with rejection region ${\displaystyle R}$ . Hence, we will give it a special name:

Definition. (Power function) Let ${\displaystyle R}$  be a rejection region of a hypothesis test, and ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample. Then, the power function of the hypothesis test is

${\displaystyle \pi (\theta )=\mathbb {P} _{\theta }((X_{1},\dotsc ,X_{n})\in R)}$

where ${\displaystyle \theta \in \Theta }$ .

Remark.

• "${\displaystyle \pi }$ " can be thought of as the Greek letter "p". We choose ${\displaystyle \pi }$  instead of ${\displaystyle p}$  since "${\displaystyle p}$ " is sometimes used to denote probability (mass or density) functions.
• The power function will be our basis in evaluating the goodness of a test or comparing two different tests.

Example. Suppose we toss a (fair or unfair) coin 5 times (suppose the coin never land on edge), and we have the following hypotheses:

${\displaystyle H_{0}:p\leq {\frac {1}{2}}\quad {\text{vs.}}\quad H_{1}:p>{\frac {1}{2}}}$

where ${\displaystyle p}$  is the probability for landing on heads after tossing the coin. Let ${\displaystyle X_{1},\dotsc ,X_{5}}$  be the random sample for the 5 times of coin tossing, and ${\displaystyle x_{1},\dotsc ,x_{5}}$  be the corresponding realizations. Also, the value of a random sample is 1 if heads come up and 0 otherwise. Suppose we will reject ${\displaystyle H_{0}}$  if and only if heads come up in all 5 coin tosses.

(a) Determine the rejection region ${\displaystyle R}$ .

(b) What is the power function ${\displaystyle \pi (p)}$  (express in terms of ${\displaystyle p}$ )?

(c) Calculate ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$ .

Solution.

(a) The rejection region ${\displaystyle R=\{(x_{1},\dotsc ,x_{5}):x_{1}+\dotsb +x_{5}=5\}}$ .

(b) For every ${\displaystyle p\in [0,1]}$ , the power function is ${\displaystyle \pi (p)=\mathbb {P} _{p}((X_{1},\dotsc ,X_{5})\in R)=\mathbb {P} _{p}(X_{1}+\dotsb +X_{5}=5)=p^{5}.}$

(c) We have ${\displaystyle \alpha (1/2)=\left({\frac {1}{2}}\right)^{5}=0.03125}$  and ${\displaystyle \beta (2/3)=1-\left({\frac {2}{3}}\right)^{5}\approx 0.8683}$ . (Notice that although the probability of type I error can be low, the probability of type II error can be quite high. This is because, intuitively, it is quite "hard" to reject ${\displaystyle H_{0}}$  due to the strict requirement. So, even if ${\displaystyle H_{0}}$  is false, it may not be rejected, causing a type II error.)
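The numbers above can be reproduced with a short sketch (the function name is illustrative): for the test "reject ${\displaystyle H_{0}}$  iff all 5 tosses are heads", the rejection probability is ${\displaystyle \mathbb {P} _{p}(X_{1}+\dotsb +X_{5}=5)=p^{5}}$ , from which both error probabilities follow directly.

```python
# Hypothetical sketch: rejection probability of the "all 5 heads" test.

def reject_prob(p):
    """Probability of rejecting H0 when P(heads) = p, i.e., p**5."""
    return p ** 5

alpha = reject_prob(1 / 2)     # type I error probability at p = 1/2
beta = 1 - reject_prob(2 / 3)  # type II error probability at p = 2/3
print(round(alpha, 5))         # 0.03125
print(round(beta, 4))          # 0.8683
```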

Exercise. Does ${\displaystyle \max _{p\leq {\frac {1}{2}}}\pi (p)}$  exist? If yes, calculate it.

Solution

${\displaystyle \max _{p\leq {\frac {1}{2}}}\pi (p)}$  exists, and ${\displaystyle \max _{p\leq {\frac {1}{2}}}\pi (p)=\max _{p\leq {\frac {1}{2}}}p^{5}=\left({\frac {1}{2}}\right)^{5}={\frac {1}{32}}}$  (notice that ${\displaystyle y=x^{5}}$  is a strictly increasing function).

You notice that the type II error probability of this hypothesis test can be quite large, so you want to revise the test to lower it.

(a) What is ${\displaystyle \beta (p)}$  in the above hypothesis test?

(b) Suppose the rejection region is modified to ${\displaystyle \{(x_{1},\dotsc ,x_{5}):x_{1}+\dotsb +x_{5}\geq 3\}}$ . Calculate ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$ . (Hint: consider binomial distribution.)

(c) Suppose the rejection region is modified to ${\displaystyle \{(x_{1},\dotsc ,x_{5}):x_{1}+\dotsb +x_{5}\geq 2\}}$ . Calculate ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$ .

(d) ${\displaystyle \alpha (1/2)+\beta (2/3)}$  is minimized at which hypothesis test: the original one, the one in (b), or the one in (c)?

Solution

(a) ${\displaystyle \beta (p)=1-\pi (p)=1-p^{5}}$  if ${\displaystyle p>{\frac {1}{2}}}$ .

(b) In this case, we have ${\displaystyle \alpha (1/2)={\binom {5}{3}}\left({\frac {1}{2}}\right)^{3}\left({\frac {1}{2}}\right)^{2}+{\binom {5}{4}}\left({\frac {1}{2}}\right)^{4}\left({\frac {1}{2}}\right)+\left({\frac {1}{2}}\right)^{5}=0.5}$ , and ${\displaystyle \beta (2/3)=1-\left[{\binom {5}{3}}\left({\frac {2}{3}}\right)^{3}\left({\frac {1}{3}}\right)^{2}+{\binom {5}{4}}\left({\frac {2}{3}}\right)^{4}\left({\frac {1}{3}}\right)+\left({\frac {2}{3}}\right)^{5}\right]\approx 0.2099}$ .

(c) In this case, we have ${\displaystyle \alpha (1/2)=0.5+{\binom {5}{2}}\left({\frac {1}{2}}\right)^{2}\left({\frac {1}{2}}\right)^{3}=0.8125}$  and ${\displaystyle \beta (2/3)\approx 0.2099-{\binom {5}{2}}\left({\frac {2}{3}}\right)^{2}\left({\frac {1}{3}}\right)^{3}\approx 0.0453}$ .

(d) At the original one, ${\displaystyle \alpha (1/2)+\beta (2/3)\approx 0.89955}$ , at the one in (b), ${\displaystyle \alpha (1/2)+\beta (2/3)\approx 0.7099}$ , and at the one in (c), ${\displaystyle \alpha (1/2)+\beta (2/3)\approx 0.8578}$ . So, ${\displaystyle \alpha (1/2)+\beta (2/3)}$  is minimized at the one in (b).
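The comparison in (d) can be reproduced with a sketch using only the standard library (the helper names are illustrative): each rejection region has the form ${\displaystyle \{\mathbf {x} :x_{1}+\dotsb +x_{5}\geq c\}}$  with ${\displaystyle c=5,3,2}$ , so ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$  are Binomial(5, p) tail sums.

```python
# Sketch of the comparison in (d): rejection regions {x : x1+...+x5 >= c}
# for c = 5 (original), 3 and 2; alpha and beta are binomial tail sums.
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def alpha_beta(c):
    """alpha(1/2) and beta(2/3) for the region {x : x1 + ... + x5 >= c}."""
    alpha = sum(binom_pmf(5, k, 1 / 2) for k in range(c, 6))
    beta = sum(binom_pmf(5, k, 2 / 3) for k in range(0, c))
    return alpha, beta

for c in (5, 3, 2):
    a, b = alpha_beta(c)
    print(c, round(a + b, 4))  # the sum alpha + beta is smallest at c = 3
```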

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$  where ${\displaystyle \sigma ^{2}}$  is known. Consider the following hypotheses:

${\displaystyle H_{0}:\mu \leq \mu _{0}\quad {\text{vs.}}\quad H_{1}:\mu >\mu _{0}}$

where ${\displaystyle \mu _{0}}$  is a constant. We use the test statistic ${\displaystyle T={\frac {{\overline {X}}-\mu _{0}}{\sigma /{\sqrt {n}}}}}$  (which follows the ${\displaystyle {\mathcal {N}}(0,1)}$  distribution when ${\displaystyle \mu =\mu _{0}}$ ) for the hypothesis test, and we reject ${\displaystyle H_{0}}$  if and only if ${\displaystyle T\geq k}$ .

Find the power function ${\displaystyle \pi (\mu )}$ , ${\displaystyle \lim _{\mu \to -\infty }\pi (\mu )}$ , and ${\displaystyle \lim _{\mu \to \infty }\pi (\mu )}$ .

Solution. The power function is

{\displaystyle {\begin{aligned}\pi (\mu )&=\mathbb {P} _{\mu }(T\geq k)\\&=\mathbb {P} _{\mu }\left({\frac {{\overline {X}}-\mu _{0}}{\sigma /{\sqrt {n}}}}\geq k\right)\\&=\mathbb {P} _{\mu }\left({\frac {{\overline {X}}-\mu +\mu -\mu _{0}}{\sigma /{\sqrt {n}}}}\geq k\right)\\&=\mathbb {P} _{\mu }\left({\frac {{\overline {X}}-\mu }{\sigma /{\sqrt {n}}}}\geq k+{\frac {\mu _{0}-\mu }{\sigma /{\sqrt {n}}}}\right)\\&=\mathbb {P} \left(Z\geq k+{\frac {\mu _{0}-\mu }{\sigma /{\sqrt {n}}}}\right)&&(Z\sim {\mathcal {N}}(0,1){\text{, whose distribution does not depend on }}\mu {\text{, so we can drop the subscript }}\mu {\text{ from }}\mathbb {P} )\end{aligned}}}

Thus, ${\displaystyle \lim _{\mu \to -\infty }\pi (\mu )=\mathbb {P} (Z\geq \infty )=0}$  and ${\displaystyle \lim _{\mu \to \infty }\pi (\mu )=\mathbb {P} (Z\geq -\infty )=1}$  (with some abuse of notation), by the properties of the standard normal cdf. (Indeed, ${\displaystyle \pi (\mu )}$  is a strictly increasing function of ${\displaystyle \mu }$ .)

Exercise. Show that ${\displaystyle \pi (\mu _{0})=\alpha }$  if ${\displaystyle \mathbb {P} (Z\geq k)=\alpha }$ .

Solution

Proof. Assume ${\displaystyle \mathbb {P} (Z\geq k)=\alpha }$ . Then, ${\displaystyle \pi (\mu _{0})=\mathbb {P} (Z\geq k+0)=\mathbb {P} (Z\geq k)=\alpha }$ .

${\displaystyle \Box }$
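The behaviour of ${\displaystyle \pi (\mu )}$  derived above can be checked numerically with a sketch (the parameter values ${\displaystyle \mu _{0}=0}$ , ${\displaystyle \sigma =1}$ , ${\displaystyle n=25}$  and ${\displaystyle k=1.645}$  are illustrative assumptions, not from the text):

```python
# Hypothetical sketch of the power function
#   pi(mu) = P(Z >= k + (mu0 - mu) / (sigma / sqrt(n))),
# with illustrative values mu0 = 0, sigma = 1, n = 25, k = 1.645.
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi(z), built from the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu, mu0=0.0, sigma=1.0, n=25, k=1.645):
    return 1 - std_normal_cdf(k + (mu0 - mu) / (sigma / sqrt(n)))

# pi is strictly increasing in mu, tends to 0 and 1 at the extremes,
# and pi(mu0) = P(Z >= k) = alpha.
print(round(power(-10.0), 4))  # 0.0
print(round(power(0.0), 3))    # 0.05
print(round(power(10.0), 4))   # 1.0
```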

Ideally, we want to make both ${\displaystyle \alpha (\theta )}$  and ${\displaystyle \beta (\theta )}$  arbitrarily small. But this is generally impossible. To understand this, we can consider the following extreme examples:

• Set the rejection region ${\displaystyle R}$  to be ${\displaystyle S=\{\mathbf {x} \}}$ , which is the set of all possible observations of random samples. Then, ${\displaystyle \pi (\theta )=1}$  for each ${\displaystyle \theta \in \Theta }$ . From this, of course we have ${\displaystyle \beta (\theta )=0}$ , which is nice. But the serious problem is that ${\displaystyle \alpha (\theta )=1}$  due to the mindless rejection.
• Another extreme is setting the rejection region ${\displaystyle R}$  to be the empty set ${\displaystyle \varnothing }$ . Then, ${\displaystyle \pi (\theta )=0}$  for each ${\displaystyle \theta \in \Theta }$ . From this, we have ${\displaystyle \alpha (\theta )=0}$ , which is nice. But, again the serious problem is that ${\displaystyle \beta (\theta )=1}$  due to the mindless acceptance.

We can observe that making ${\displaystyle \alpha (\theta )}$  (respectively ${\displaystyle \beta (\theta )}$ ) very small inevitably causes ${\displaystyle \beta (\theta )}$  (respectively ${\displaystyle \alpha (\theta )}$ ) to increase, due to accepting (rejecting) "too much". As a result, we can only try to minimize the probability of making one type of error while keeping the probability of making the other type controlled.

Now, we are interested in which type of error should be controlled. To motivate the choice, we can again consider the analogy with the legal principle of presumption of innocence. In this analogy, a type I error means convicting an innocent person, and a type II error means acquitting a guilty person. Then, as suggested by Blackstone's ratio, the type I error is more serious and important than the type II error. This motivates us to control the probability of type I error, i.e., ${\displaystyle \alpha (\theta )}$ , at a specified small value ${\displaystyle \alpha ^{*}}$ , so that we control the probability of making this more serious error. After that, among the tests that control the type I error probability at this level, the one with the smallest ${\displaystyle \beta (\theta )}$  is the "best" one (in the sense of error probabilities).

To describe "control the type I error probability at this level" in a more precise way, let us define the following term.

Definition. (Size of a test) A test with power function ${\displaystyle \pi (\theta )}$  is a size ${\displaystyle \alpha }$  test if

${\displaystyle \sup _{\theta \in \Theta _{0}}\pi (\theta )=\alpha }$

where ${\displaystyle 0\leq \alpha \leq 1}$ .

Remark.

• Supremum is similar to maximum, and in "nice" situations (you may assume the situations here are "nice") the supremum is the same as the maximum. Hence, choosing the supremum of ${\displaystyle \pi (\theta )}$  over ${\displaystyle \theta \in \Theta _{0}}$  as the size of a test means that the size gives its maximum probability of type I error (rejecting ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is true), considering all possible values of ${\displaystyle \theta }$  that make ${\displaystyle H_{0}}$  true.
• Intuitively, we choose the maximum probability of type I error as the size so that the size tells us how probable a type I error is in the worst situation, i.e., how "well" the test can control the type I error [4].
• Special case: if ${\displaystyle \Theta _{0}}$  contains a single parameter only, say (a known value) ${\displaystyle \theta _{0}}$  (i.e., ${\displaystyle H_{0}}$  is a simple hypothesis which states that ${\displaystyle \theta =\theta _{0}}$ ), then ${\displaystyle \alpha =\pi (\theta _{0})}$ .
• ${\displaystyle \alpha }$  is also called the level of significance or significance level (these terms are related to the concept of statistical (in)significance, which is in turn related to the concept of ${\displaystyle p}$ -value. We will discuss these later.).
• The "${\displaystyle \alpha }$ " here and the "${\displaystyle \alpha }$ " in the confidence coefficient can actually be interpreted as the "same", by connecting confidence intervals with hypothesis testing. We will discuss these later.
• Because of this definition, the null hypothesis conventionally contains an equality (i.e. in the form of ${\displaystyle \theta =\theta _{0},\theta \geq \theta _{0}}$  or ${\displaystyle \theta \leq \theta _{0}}$ ), since the size of the test can be calculated more conveniently if this is the case.

So, using this definition, controlling the type I error probability at a particular level ${\displaystyle \alpha }$  means that the size of the test should not exceed ${\displaystyle \alpha }$ , i.e., ${\displaystyle \sup _{\theta \in \Theta _{0}}\pi (\theta )\leq \alpha }$  (in some other places, such a test is called a level ${\displaystyle \alpha }$  test).

Example. Consider the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$  (with the parameter space: ${\displaystyle \Theta =\{\mu :\mu =20{\text{ or }}21\}}$ ) , and the hypotheses

${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$

Let ${\displaystyle X_{1},\dotsc ,X_{10}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$ , and the corresponding realizations are ${\displaystyle x_{1},\dotsc ,x_{10}}$ . Suppose the rejection region is ${\displaystyle \{(x_{1},\dotsc ,x_{10}):{\overline {x}}\geq k\}}$ .

(a) Find ${\displaystyle k}$  such that the significance level of the test is ${\displaystyle \alpha =0.05}$ .

(b) Calculate the type II error probability ${\displaystyle \beta }$ . To have the type II error probability ${\displaystyle \beta \leq 0.05}$ , what is the minimum sample size (with the same rejection region)?

Solution.

(a) In order for the significance level to be 0.05, we need to have

${\displaystyle \sup _{\mu \in \Theta _{0}}\pi (\mu )=0.05.}$

But ${\displaystyle \Theta _{0}=\{20\}}$ . So, this means
${\displaystyle 0.05=\pi (20)=\mathbb {P} _{\mu =20}({\overline {X}}\geq k)=\mathbb {P} \left({\frac {{\overline {X}}-20}{1/{\sqrt {10}}}}\geq {\frac {k-20}{1/{\sqrt {10}}}}\right)=\mathbb {P} (Z\geq {\sqrt {10}}(k-20))}$

where ${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ . We then have
${\displaystyle {\sqrt {10}}(k-20)=z_{0.05}\approx 1.64\implies k\approx 20.51861.}$

(b) The type II error probability is

${\displaystyle \beta \approx 1-\mathbb {P} _{\mu =21}({\overline {X}}\geq 20.51861)=1-\mathbb {P} \left({\frac {{\overline {X}}-21}{1/{\sqrt {10}}}}\geq {\frac {20.51861-21}{1/{\sqrt {10}}}}\right)\approx 1-\mathbb {P} (Z\geq -1.522)=\mathbb {P} (Z<-1.522)\approx 0.06426.}$

(${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ ) With the sample size ${\displaystyle n}$ , the type II error probability is
${\displaystyle \beta \approx \mathbb {P} \left(Z<{\sqrt {n}}(20.51861-21)\right)}$

When the sample size ${\displaystyle n}$  increases, ${\displaystyle {\sqrt {n}}(20.51861-21)}$  will become more negative, and hence the type II error probability decreases. It follows that
${\displaystyle \mathbb {P} \left(Z<{\sqrt {n}}(20.51861-21)\right)\leq 0.05\implies {\sqrt {n}}(20.51861-21)\leq -1.64\implies n\geq 11.606.}$

Hence, the minimum sample size is 12.
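These computations can be verified numerically; here is a minimal check using only Python's standard library. Note that `NormalDist.inv_cdf` keeps full precision for ${\displaystyle z_{0.05}}$  (≈ 1.6449), so `k` differs slightly from the text's rounded value 20.51861.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal N(0, 1)

# Critical value k for a size-0.05 test with n = 10:
# P(Z >= sqrt(10) * (k - 20)) = 0.05  =>  sqrt(10) * (k - 20) = z_0.05
z = Z.inv_cdf(0.95)        # z_0.05 ≈ 1.6449 (the text rounds this to 1.64)
k = 20 + z / sqrt(10)      # ≈ 20.520 (text: 20.51861)

# Type II error probability at mu = 21 with n = 10
beta_10 = Z.cdf(sqrt(10) * (k - 21))   # ≈ 0.065

# Smallest n with beta <= 0.05, keeping the same rejection region x-bar >= k
n = 1
while Z.cdf(sqrt(n) * (k - 21)) > 0.05:
    n += 1
# n == 12, agreeing with the text
```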

Exercise. Calculate the type I error probability and type II error probability when the sample size is 12 (the rejection region remains unchanged).

Solution

The type II error probability is

${\displaystyle \mathbb {P} (Z<{\sqrt {12}}(20.51861-21))\approx \mathbb {P} (Z<-1.668)\approx 0.04746.}$

The type I error probability is
${\displaystyle \mathbb {P} (Z\geq {\sqrt {12}}(20.51861-20))\approx \mathbb {P} (Z\geq 1.797)\approx 0.0359.}$

So, with the same rejection region and different sample size, the significance level (type I error probability in this case) of the test changes.
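Assuming the text's rounded cutoff 20.51861, the two probabilities for ${\displaystyle n=12}$  can be recomputed directly (small differences from the text's normal-table values are rounding artifacts):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()      # standard normal
k = 20.51861          # rejection region: sample mean >= k (unchanged)

# Type II error probability at mu = 21, n = 12
type2 = Z.cdf(sqrt(12) * (k - 21))       # ≈ 0.048

# Type I error probability at mu = 20, n = 12
type1 = 1 - Z.cdf(sqrt(12) * (k - 20))   # ≈ 0.036
```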

So far, we have focused on using the rejection region to conduct hypothesis tests. But this is not the only way. Alternatively, we can make use of the ${\displaystyle p}$ -value.

Definition. (${\displaystyle p}$ -value) Let ${\displaystyle T(\mathbf {x} )}$  be an observed value of a test statistic ${\displaystyle T(\mathbf {X} )=T(X_{1},\dotsc ,X_{n})}$  in a hypothesis test.

• Case 1: The test is left-tailed. Then, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} ))}$ .
• Case 2: The test is right-tailed. Then, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\geq T(\mathbf {x} ))}$ .
• Case 3: The test is two-tailed.
• Subcase 1: The distribution of ${\displaystyle T}$  is symmetric about zero (when ${\displaystyle H_{0}}$  is true). Then, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(|T(\mathbf {X} )|\geq |T(\mathbf {x} )|)}$ .
• Subcase 2: The distribution of ${\displaystyle T}$  is not symmetric about zero (when ${\displaystyle H_{0}}$  is true). Then, the ${\displaystyle p}$ -value is ${\displaystyle 2\min {\bigg \{}\sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} )),\sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\geq T(\mathbf {x} )){\bigg \}}}$ .
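For a simple null hypothesis (so the supremum is taken over the single point ${\displaystyle \theta _{0}}$ ), these cases can be sketched as a small helper function. The names `null_cdf` (the cdf of ${\displaystyle T(\mathbf {X} )}$  under ${\displaystyle H_{0}}$ , assumed continuous) and the `tail` labels are assumptions of this sketch; for a continuous distribution symmetric about zero, the subcase 2 formula reduces to the subcase 1 formula.

```python
def p_value(t_obs, null_cdf, tail):
    """p-value for an observed statistic t_obs, given the cdf of T(X)
    under a simple null hypothesis. tail is 'left', 'right', or 'two'."""
    left = null_cdf(t_obs)        # P(T(X) <= t_obs)
    right = 1 - null_cdf(t_obs)   # P(T(X) >= t_obs), continuous T assumed
    if tail == "left":
        return left
    if tail == "right":
        return right
    # two-tailed (case 3 subcase 2): double the smaller tail probability
    return 2 * min(left, right)
```

For example, with a standard normal `null_cdf`, `p_value(1.581, NormalDist().cdf, "right")` gives approximately 0.057, matching the right-tailed example later in this section.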

Remark.

• The ${\displaystyle p}$ -value can be interpreted as the probability, when ${\displaystyle H_{0}}$  is true, that the test statistic is at least as "extreme" as its observed value. Here, "extreme" is in favour of ${\displaystyle H_{1}}$ , i.e., the "direction of extreme" is the "direction of tail" for the test (when the test statistic lies further in the tail direction, it is more likely to fall in the rejection region, leading us to reject ${\displaystyle H_{0}}$  and accept ${\displaystyle H_{1}}$ ).
• So, when the ${\displaystyle p}$ -value is small, the observed value of the test statistic is already very "extreme", and it is unlikely for the test statistic to be even more "extreme" than the observed value.
• In general, it can be quite difficult to compute ${\displaystyle p}$ -values manually. Thus, ${\displaystyle p}$ -values are often computed using software, e.g. R.
• For case 3 subcase 1, consider the following diagram:
(Diagram: a pdf of ${\displaystyle T(\mathbf {X} )}$  symmetric about zero, with both tails shaded. If ${\displaystyle T(\mathbf {x} )<0}$ , the "more extreme" regions are ${\displaystyle T(\mathbf {X} )\leq T(\mathbf {x} )}$  and ${\displaystyle T(\mathbf {X} )\geq -T(\mathbf {x} )}$ , which together give ${\displaystyle |T(\mathbf {X} )|\geq |T(\mathbf {x} )|}$  since ${\displaystyle T(\mathbf {x} )=-|T(\mathbf {x} )|}$  and ${\displaystyle -T(\mathbf {x} )=|T(\mathbf {x} )|}$ . If ${\displaystyle T(\mathbf {x} )>0}$ , the regions are ${\displaystyle T(\mathbf {X} )\leq -T(\mathbf {x} )}$  and ${\displaystyle T(\mathbf {X} )\geq T(\mathbf {x} )}$ , again giving ${\displaystyle |T(\mathbf {X} )|\geq |T(\mathbf {x} )|}$ .)

• For case 3 subcase 2, consider the following diagram:
(Diagram: a pdf of ${\displaystyle T}$  that is not symmetric about zero. The observed value ${\displaystyle t}$  may lie in the left tail, where ${\displaystyle \mathbb {P} (T(\mathbf {X} )\leq t)}$  is the smaller of the two tail probabilities, or in the right tail, where ${\displaystyle \mathbb {P} (T(\mathbf {X} )\geq t)}$  is the smaller.)
We can observe that the observed value ${\displaystyle t}$  may lie in the left tail or the right tail. In either case, for ${\displaystyle T}$  to be "more extreme", the resulting inequality corresponds to the tail with the smaller probability; thus, we take the "${\displaystyle \min }$ ". But we also need to account for "extremeness" in the other tail: it is intuitive that when ${\displaystyle T}$  is beyond the corresponding point in the other tail, it should also be counted as "more extreme". Thus, there is a factor of "${\displaystyle 2\times }$ ".

The following theorem allows us to use ${\displaystyle p}$ -value for hypothesis testing.

Theorem. Let ${\displaystyle T(\mathbf {x} )=T(x_{1},\dotsc ,x_{n})}$  be an observed value of a test statistic ${\displaystyle T(\mathbf {X} )=T(X_{1},\dotsc ,X_{n})}$  in a hypothesis test. The null hypothesis ${\displaystyle H_{0}}$  is rejected at the significance level ${\displaystyle \alpha }$  if and only if the ${\displaystyle p}$ -value is less than or equal to ${\displaystyle \alpha }$ .

Proof. (Partial) We can prove the "if" and "only if" directions at once. Let us first consider case 1 in the definition of ${\displaystyle p}$ -value. Define the cutoff ${\displaystyle T^{*}(\mathbf {x} )}$  such that ${\displaystyle T(\mathbf {X} )\leq T^{*}(\mathbf {x} )\iff (X_{1},\dotsc ,X_{n})\in R}$ . By definition, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} ))}$  and ${\displaystyle \alpha =\sup _{\theta \in \Theta _{0}}\pi (\theta )=\sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T^{*}(\mathbf {x} ))}$ . Then, we have

{\displaystyle {\begin{aligned}p{\text{-value}}\leq \alpha &\iff \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} ))\leq \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T^{*}(\mathbf {x} ))\\&\iff T(\mathbf {x} )\leq T^{*}(\mathbf {x} )&({\text{by some omitted arguments and the monotonicity of cdf}})\\&\iff (x_{1},\dotsc ,x_{n})\in \{(y_{1},\dotsc ,y_{n}):T(y_{1},\dotsc ,y_{n})\leq T^{*}(\mathbf {x} )\}&(x_{1},\dotsc ,x_{n}{\text{ are realizations of }}X_{1},\dotsc ,X_{n}{\text{ respectively}})\\&\iff (x_{1},\dotsc ,x_{n})\in R&({\text{defined above}})\\&\iff H_{0}{\text{ is rejected at significance level }}\alpha .&({\text{the test with power function }}\pi (\theta ){\text{ is size }}\alpha {\text{ test}})\end{aligned}}}

For other cases, the idea is similar (just the directions of inequalities for ${\displaystyle T}$  are different).

${\displaystyle \Box }$

Remark.

• From this, we can observe that ${\displaystyle p}$ -value can be used to report the test result in a more "continuous" scale, instead of just a single decision "accept ${\displaystyle H_{0}}$ " or "reject ${\displaystyle H_{0}}$ ", in the sense that if ${\displaystyle p}$ -value is "much smaller" than the significance level ${\displaystyle \alpha }$ , then we have a "stronger" evidence for rejecting ${\displaystyle H_{0}}$  (stronger in the sense that even if the significance level is very low (a very strict requirement on type I error), ${\displaystyle H_{0}}$  can still be rejected).
• Also, reporting a ${\displaystyle p}$ -value allows readers to choose an appropriate significance level ${\displaystyle \alpha }$  themselves, compare the ${\displaystyle p}$ -value with ${\displaystyle \alpha }$ , and therefore make their own decisions, which are not necessarily the same as the decision made in the test report (since readers may choose a different significance level from that of the report).
• Here, let us also mention the concept of statistical significance. An observation has statistical significance if it is "unlikely" to happen (i.e., the observed value is quite "extreme") when the null hypothesis is true. To be more precise, in terms of the ${\displaystyle p}$ -value, an observed value of a test statistic is statistically significant if the ${\displaystyle p}$ -value is less than or equal to ${\displaystyle \alpha }$ , and statistically insignificant otherwise. Thus, ${\displaystyle \alpha }$  can be interpreted as the benchmark of "significance" or "extremeness", hence the name significance level.

Example. Recall the setting of a previous example: consider the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$  (with the parameter space for ${\displaystyle \mu }$ : ${\displaystyle \Theta =\{20,21\}}$ ) , and the hypotheses

${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$

Let ${\displaystyle X_{1},\dotsc ,X_{10}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$ , and the corresponding realizations are ${\displaystyle x_{1},\dotsc ,x_{10}}$ .

At the significance level ${\displaystyle \alpha =0.05}$ , we have determined that the rejection region is ${\displaystyle R=\{(y_{1},\dotsc ,y_{10}):{\overline {y}}\geq 20.51861\}}$ . Suppose it is observed that ${\displaystyle {\overline {x}}=20.5}$ .

(a) Use the rejection region to determine whether we should reject ${\displaystyle H_{0}}$ .

(b) Use ${\displaystyle p}$ -value to determine whether we should reject ${\displaystyle H_{0}}$ .

Solution.

(a) Since ${\displaystyle {\overline {x}}=20.5<20.51861}$ , we have ${\displaystyle (x_{1},\dotsc ,x_{10})\in R^{c}}$ . Thus, we should not reject ${\displaystyle H_{0}}$ .

(b) Since the test is right-tailed, the ${\displaystyle p}$ -value is

${\displaystyle \sup _{\mu \in \{20\}}\mathbb {P} _{\mu }({\overline {X}}\geq {\overline {x}})=\mathbb {P} _{\mu =20}({\overline {X}}\geq 20.5)=\mathbb {P} \left({\frac {{\overline {X}}-20}{1/{\sqrt {10}}}}\geq {\frac {20.5-20}{1/{\sqrt {10}}}}\right)\approx \mathbb {P} (Z\geq 1.581)\approx 0.05705>\alpha =0.05}$

where ${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ . Thus, ${\displaystyle H_{0}}$  should not be rejected.
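A quick numerical check of this ${\displaystyle p}$ -value, using only the standard library:

```python
from math import sqrt
from statistics import NormalDist

# p-value = P(Z >= sqrt(10) * (20.5 - 20)) under H0: mu = 20
p = 1 - NormalDist().cdf(sqrt(10) * (20.5 - 20))   # ≈ 0.0569
# p > 0.05, so H0 is not rejected at significance level 0.05
```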

Exercise.

Choose the significance level(s) for which ${\displaystyle H_{0}}$  is rejected based on the observation.

0.01, 0.04, 0.06, 0.08, 0.1

Remark.

• From this, we can notice that one can "manipulate" the decision by changing the significance level. In fact, if one sets the significance level to be 1, then ${\displaystyle H_{0}}$  must be rejected (since the ${\displaystyle p}$ -value is a probability, which must be less than or equal to 1). But such a significance level is meaningless, since it means that the type I error probability can be as high as 1, so the test has a large error and the result is not reliable anyway.
• On the other hand, if one sets the significance level to be 0, then ${\displaystyle H_{0}}$  must not be rejected (unless the ${\displaystyle p}$ -value is exactly zero, which is very unlikely: a zero ${\displaystyle p}$ -value means the observation is the most extreme one possible, so the test statistic is (almost) never at least as extreme as the observation).

## Evaluating a hypothesis test

After discussing some basic concepts and terminologies, let us now study some ways to evaluate the goodness of a hypothesis test. As we have previously mentioned, we want the probabilities of making type I and type II errors to be small, but it is generally impossible to make both probabilities arbitrarily small. Hence, we have suggested controlling the type I error through the size of a test, and the "best" test should be the one with the smallest probability of making a type II error, after controlling the type I error.

These ideas lead us to the following definitions.

Definition. (Power of a test) The power of a test is the probability of rejecting ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is false. That is, the power is ${\displaystyle 1-\beta }$ , if the probability of making type II error is ${\displaystyle \beta }$ .

Using this definition, instead of saying "best" test (test with the smallest type II error probability), we can say "a test with the most power", or in other words, the "most powerful test".

Definition. (Uniformly most powerful test) A test ${\displaystyle \varphi }$  with rejection region ${\displaystyle R}$  is a uniformly most powerful (UMP) test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta \in \Theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta \in \Theta _{1}}$  (${\displaystyle \Theta _{1}=\Theta _{0}^{c}}$ ) if

• (size of ${\displaystyle \varphi }$ ) ${\displaystyle \sup _{\theta \in \Theta _{0}}\pi _{\varphi }(\theta )=\alpha }$ , and
• (UMP) ${\displaystyle \pi _{\varphi }(\theta _{1})\geq \pi _{\psi }(\theta _{1})}$  for each ${\displaystyle \theta _{1}\in \Theta _{1}}$  and for each test ${\displaystyle \psi }$  with rejection region ${\displaystyle R^{*}\neq R}$  and ${\displaystyle \sup _{\theta \in \Theta _{0}}\pi _{\psi }(\theta )\leq \alpha }$ .

(${\displaystyle \pi _{\varphi }(\cdot )}$  and ${\displaystyle \pi _{\psi }(\cdot )}$  are the power functions of the tests ${\displaystyle \varphi }$  and ${\displaystyle \psi }$  respectively.)

Remark.

• The rejection region ${\displaystyle R}$  is sometimes called a best rejection region of size ${\displaystyle \alpha }$ .
• In other words, a test is UMP with size ${\displaystyle \alpha }$  if it has size ${\displaystyle \alpha }$  and its power is the largest among all other tests with size less than or equal to ${\displaystyle \alpha }$ , for each ${\displaystyle \theta \in \Theta _{1}}$ . The adverb "uniformly" emphasizes that this is true for each ${\displaystyle \theta \in \Theta _{1}}$ .
• Since the power is largest for each value of ${\displaystyle \theta \in \Theta _{1}}$ , the rejection region ${\displaystyle R}$  of the UMP test does not depend on the choice of ${\displaystyle \theta \in \Theta _{1}}$ , that is, regardless of the chosen value of ${\displaystyle \theta \in \Theta _{1}}$ , the rejection region is the same. This is expected since the rejection region ${\displaystyle R}$  is not supposed to be changed when the choice of ${\displaystyle \theta \in \Theta _{1}}$  is different. The rejection region ${\displaystyle R}$  (fixed) should always be the best, for each ${\displaystyle \theta \in \Theta _{1}}$ .
• If ${\displaystyle H_{1}}$  is simple, we may simply call the UMP test the most powerful (MP) test.

## Constructing a hypothesis test

There are many ways of constructing a hypothesis test, but of course not all are good (i.e., "powerful"). In the following, we will provide some common approaches to construct hypothesis tests. In particular, the following lemma is very useful for constructing a MP test with size ${\displaystyle \alpha }$ .

### Neyman-Pearson lemma

Lemma. (Neyman-Pearson lemma) Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample from a population with pdf or pmf ${\displaystyle f(x;\theta )}$  (${\displaystyle \theta }$  may be a parameter vector, and the parameter space is ${\displaystyle \Theta =\{\theta _{0},\theta _{1}\}}$ ). Let ${\displaystyle {\mathcal {L}}(\cdot )}$  be the likelihood function. Then a test ${\displaystyle \varphi }$  with rejection region

${\displaystyle R=\left\{(x_{1},\dotsc ,x_{n}):{\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}\leq k\right\}}$

and size ${\displaystyle \alpha }$  is the MP test with size ${\displaystyle \alpha }$  for testing
${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta =\theta _{1}}$

where ${\displaystyle k}$  is a value determined by the size ${\displaystyle \alpha }$ .

Proof. Let us first consider the case where the underlying distribution is continuous. With the assumption that the size of ${\displaystyle \varphi }$  is ${\displaystyle \alpha }$ , the "size" requirement for being a MP test is satisfied immediately. So, it suffices to show that ${\displaystyle \varphi }$  satisfies the "MP" requirement.

Notice that in this case, "${\displaystyle \Theta _{1}}$ " is simply ${\displaystyle \{\theta _{1}\}}$ . So, for every test ${\displaystyle \psi }$  with rejection region ${\displaystyle R^{*}\neq R}$  and ${\displaystyle {\color {purple}\pi _{\psi }(\theta _{0})\leq \alpha }}$ , we will proceed to show that ${\displaystyle \pi _{\varphi }(\theta _{1})\geq \pi _{\psi }(\theta _{1})}$ .

Since

{\displaystyle {\begin{aligned}\pi _{\varphi }(\theta _{1})-\pi _{\psi }(\theta _{1})&=\mathbb {P} _{\theta _{1}}((X_{1},\dotsc ,X_{n})\in R)-\mathbb {P} _{\theta _{1}}((X_{1},\dotsc ,X_{n})\in R^{*})\\&=\int \dotsi \int _{R}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\int \dotsi \int _{R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}\\&={\color {blue}\int \dotsi \int _{R}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}-\left({\color {red}\int \dotsi \int _{R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}\right)\\&={\color {blue}\int \dotsi \int _{R\setminus R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}-{\color {red}\int \dotsi \int _{R^{*}\setminus R}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}\\&\geq {\color {blue}{\frac {1}{k}}}\int \dotsi \int _{R\setminus R^{*}}^{}{\color {blue}{\mathcal {L}}(\theta _{0};\mathbf {x} )}\,dx_{n}\cdots \,dx_{1}-{\color {red}{\frac {1}{k}}}\int \dotsi \int _{R^{*}\setminus R}^{}{\color {red}{\mathcal {L}}(\theta _{0};\mathbf {x} )}\,dx_{n}\cdots \,dx_{1}\qquad ({\text{In }}R,{\color {blue}{\mathcal {L}}(\theta _{1};\mathbf {x} )\geq {\frac {1}{k}}{\mathcal {L}}(\theta _{0};\mathbf {x} )}.{\text{ In }}R^{c},{\mathcal {L}}(\theta _{1};\mathbf {x} )<{\frac {1}{k}}{\mathcal {L}}(\theta _{0};\mathbf {x} )\iff {\color {red}-{\mathcal {L}}(\theta _{1};\mathbf {x} )>-{\frac {1}{k}}{\mathcal {L}}(\theta _{0};\mathbf {x} )})\\&={\frac {1}{k}}\int \dotsi \int _{R\setminus R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}+{\frac {1}{k}}\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\left({\frac {1}{k}}\int \dotsi \int _{R^{*}\setminus 
R}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}+{\frac {1}{k}}\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}\right)\\&={\frac {1}{k}}\int \dotsi \int _{R}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-{\frac {1}{k}}\int \dotsi \int _{R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}\\&={\frac {1}{k}}{\bigg (}{\color {brown}\underbrace {\mathbb {P} _{\theta _{0}}((X_{1},\dotsc ,X_{n})\in R)} _{=\alpha }}-{\color {purple}\underbrace {\mathbb {P} _{\theta _{0}}((X_{1},\dotsc ,X_{n})\in R^{*})} _{\leq \alpha }}{\bigg )}\\&\geq {\frac {1}{k}}(\alpha -\alpha )=0,\end{aligned}}}

we have ${\displaystyle \pi _{\varphi }(\theta _{1})\geq \pi _{\psi }(\theta _{1})}$  as desired.

For the case where the underlying distribution is discrete, the proof is very similar (just replace the integrals with sums), and hence omitted.

${\displaystyle \Box }$

Remark.

• Sometimes, we call ${\displaystyle {\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}}$  the likelihood ratio.
• In fact, the MP test constructed by the Neyman-Pearson lemma is a variant of the likelihood-ratio test, which is more general in the sense that a likelihood-ratio test can also be constructed for composite null and alternative hypotheses, not just simple ones. However, the likelihood-ratio test may not be (U)MP. We will discuss the likelihood-ratio test later.
• For a discrete distribution, it may be impossible to determine a ${\displaystyle k}$  for the rejection region ${\displaystyle R}$  for some ${\displaystyle \alpha }$ . In this case, we say that such ${\displaystyle \alpha }$  is not attainable.
• Intuitively, this test means that we should reject ${\displaystyle H_{0}}$  when the "likelihood" of ${\displaystyle H_{0}}$  (${\displaystyle {\mathcal {L}}(\theta _{0};\mathbf {x} )}$ ) is not as large as the "likelihood" of ${\displaystyle H_{1}}$  (${\displaystyle {\mathcal {L}}(\theta _{1};\mathbf {x} )}$ ), that is, ${\displaystyle {\mathcal {L}}(\theta _{0};\mathbf {x} )\leq k{\mathcal {L}}(\theta _{1};\mathbf {x} )}$ , with respect to the observed samples. The meaning of "not as large as" depends on the size ${\displaystyle \alpha }$ .
• Intuitively, we will expect that ${\displaystyle k}$  should be a positive value that is strictly less than 1, so that ${\displaystyle H_{0}}$  is "less likely" than ${\displaystyle H_{1}}$ . This is usually, but not necessarily, the case. Particularly, when the size ${\displaystyle \alpha }$  is large, ${\displaystyle k}$  may be greater than 1.
• Typically, to determine the value of ${\displaystyle k}$ , we need to transform "${\displaystyle {\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}\leq k}$ " into another equivalent inequality whose probability under ${\displaystyle H_{0}}$  is easier to calculate.
• It must be equivalent, so that its probability under ${\displaystyle H_{0}}$  is the same as the probability of "${\displaystyle {\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}\leq k}$ " under ${\displaystyle H_{0}}$ . As a result, during the transformation, it is better to use "${\displaystyle \iff }$ " rather than just "${\displaystyle \implies }$ ", or writing different inequalities line by line.
• If ${\displaystyle \theta }$  is a vector, then ${\displaystyle \theta _{0}}$  and ${\displaystyle \theta _{1}}$  should also be vectors.

Even though the hypotheses involved in the Neyman-Pearson lemma are simple, under some conditions we can use the lemma to construct a UMP test for testing a composite null hypothesis against a composite alternative hypothesis. The details are as follows: for testing ${\displaystyle H_{0}:\theta \leq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta >\theta _{0}}$ ,

1. Find a MP test ${\displaystyle \varphi }$  with size ${\displaystyle \alpha }$ , for testing ${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta =\theta _{1}>\theta _{0}}$  using the Neyman-Pearson lemma, where ${\displaystyle \theta _{1}}$  is an arbitrary value such that ${\displaystyle \theta _{1}>\theta _{0}}$ .
2. If the rejection region ${\displaystyle R}$  does not depend on ${\displaystyle \theta _{1}}$ , then the test ${\displaystyle \varphi }$  has the greatest power for each ${\displaystyle \theta \in \Theta _{1}=\{\vartheta :\vartheta >\theta _{0}\}}$ . So, the test ${\displaystyle \varphi }$  is a UMP test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta >\theta _{0}}$
3. If we can further show that ${\displaystyle \sup _{\theta \leq \theta _{0}}\pi _{\varphi }(\theta )=\alpha =\pi _{\varphi }(\theta _{0})}$ , then the size of the test ${\displaystyle \varphi }$  is still ${\displaystyle \alpha }$ , even if the null hypothesis is changed to ${\displaystyle H_{0}:\theta \leq \theta _{0}}$ . So, after changing ${\displaystyle H_{0}:\theta =\theta _{0}}$  to ${\displaystyle H_{0}:\theta \leq \theta _{0}}$  while keeping ${\displaystyle H_{1}}$  unchanged (and adjusting the parameter space accordingly), the test ${\displaystyle \varphi }$  still satisfies the "MP" requirement (since ${\displaystyle H_{1}}$  is unchanged, the result in step 2 still applies), and it also satisfies the "size" requirement (because of changing ${\displaystyle H_{0}}$  in this way). Hence, the test ${\displaystyle \varphi }$  is a UMP test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta \leq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta >\theta _{0}}$ .

For testing ${\displaystyle H_{0}:\theta \geq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta <\theta _{0}}$ , the steps are similar. But in general, there is no UMP test for testing ${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta \neq \theta _{0}}$ .

Of course, when the condition in step 3 holds but that in step 2 does not hold, the test ${\displaystyle \varphi }$  in step 1 is a UMP test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta \leq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta =\theta _{1}}$  where ${\displaystyle \theta _{1}}$  is a constant (which is larger than ${\displaystyle \theta _{0}}$ , or else ${\displaystyle H_{1}}$  and ${\displaystyle H_{0}}$  are not disjoint). However, the hypotheses are generally not in this form.

Example. Let ${\displaystyle X_{1},\dotsc ,X_{10}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$ .

(a) Construct a MP test ${\displaystyle \varphi }$  with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$ .

(b) Hence, show that the test ${\displaystyle \varphi }$  is also a UMP test with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu >20}$ .

(c) Hence, show that the test ${\displaystyle \varphi }$  is also a UMP test with size 0.05 for testing ${\displaystyle H_{0}:\mu \leq 20\quad {\text{vs.}}\quad H_{1}:\mu >20}$ .

Solution. (a) We can use the Neyman-Pearson lemma. First, consider the likelihood ratio

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(21)}}={\frac {{\cancel {\left({\frac {1}{\sqrt {2\pi (1)}}}\right)^{10}}}\prod _{i=1}^{10}\exp \left(-{\frac {(x_{i}-20)^{2}}{2}}\right)}{{\cancel {\left({\frac {1}{\sqrt {2\pi (1)}}}\right)^{10}}}\prod _{i=1}^{10}\exp \left(-{\frac {(x_{i}-21)^{2}}{2}}\right)}}=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}(x_{i}-20)^{2}-(x_{i}-21)^{2}{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}{\cancel {x_{i}^{2}}}-40x_{i}+400{\cancel {-x_{i}^{2}}}+42x_{i}-441{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}2x_{i}-41{\big ]}\right)=\exp \left(205-\sum _{i=1}^{10}x_{i}\right).}$

Now, we have
${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(21)}}\leq k'\iff \exp \left(205-10{\overline {x}}\right)\leq k'\iff -10{\overline {x}}\leq k''\iff {\overline {x}}\geq k}$

where ${\displaystyle k,k',k''}$  are some constants. To find ${\displaystyle k}$ , consider the size 0.05:
${\displaystyle 0.05=\mathbb {P} _{\mu =20}({\overline {X}}\geq k)=\mathbb {P} _{\mu =20}\left({\frac {{\overline {X}}-20}{1/{\sqrt {10}}}}\geq {\frac {k-20}{1/{\sqrt {10}}}}\right)=\mathbb {P} (Z\geq {\sqrt {10}}(k-20)).}$

(${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ ) Hence, we have ${\displaystyle {\sqrt {10}}(k-20)\approx 1.64\implies k\approx 20.51861}$ . Now, we can construct the rejection region:
${\displaystyle R=\{(x_{1},\dotsc ,x_{n}):{\overline {x}}\geq 20.51861\},}$

and the test ${\displaystyle \varphi }$  with the rejection region ${\displaystyle R}$  is a MP test with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$ .

(b)

Proof. Let ${\displaystyle \mu _{1}}$  be an arbitrary value such that ${\displaystyle \mu _{1}>20}$ . Then, we can show that (see the following exercise)

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}\leq k'\iff {\overline {x}}\geq k}$

where ${\displaystyle k,k'}$  are some constants (may be different from the above constants). Since ${\displaystyle H_{0}}$  here is the same as ${\displaystyle H_{0}}$  in (a), the rejection region constructed is also
${\displaystyle R=\{(x_{1},\dotsc ,x_{n}):{\overline {x}}\geq 20.51861\}.}$

Notice that ${\displaystyle R}$  does not depend on the value of ${\displaystyle \mu _{1}}$ . It follows that the test ${\displaystyle \varphi }$  is a UMP test with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu >20}$ .

${\displaystyle \Box }$
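The size claim in (a) and the power of this test at ${\displaystyle \mu =21}$  can also be checked by simulation. A rough Monte Carlo sketch (sample size 10, 100,000 replications per value of ${\displaystyle \mu }$ ):

```python
import random
from statistics import fmean

random.seed(0)
REJECT = 20.51861   # rejection region: sample mean >= 20.51861
N_SIM = 100_000

def rejection_rate(mu):
    """Estimate P_mu(X-bar >= REJECT) for a sample of size 10 from N(mu, 1)."""
    count = 0
    for _ in range(N_SIM):
        xbar = fmean(random.gauss(mu, 1) for _ in range(10))
        if xbar >= REJECT:
            count += 1
    return count / N_SIM

size = rejection_rate(20)    # ≈ 0.05  (type I error probability)
power = rejection_rate(21)   # ≈ 0.94  (1 - beta from the earlier example)
```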

(c)

Proof. It suffices to show that ${\displaystyle \sup _{\mu \leq 20}\pi _{\varphi }(\mu )=0.05{\overset {\text{(a)}}{=}}\pi _{\varphi }(20)}$ . First let us consider the power function

${\displaystyle \pi _{\varphi }(\mu )=\mathbb {P} _{\mu }({\overline {X}}\geq 20.51861)=\mathbb {P} (Z\geq {\sqrt {10}}(20.51861-\mu ))=1-\Phi ({\sqrt {10}}(20.51861-\mu ))}$

where ${\displaystyle \Phi (\cdot )}$  is the cdf of ${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ . Now, since when ${\displaystyle \mu }$  increases, ${\displaystyle {\sqrt {10}}(20.51861-\mu )}$  decreases and hence ${\displaystyle \Phi ({\sqrt {10}}(20.51861-\mu ))}$  decreases, it follows that the power function ${\displaystyle \pi _{\varphi }(\mu )}$  is a strictly increasing function of ${\displaystyle \mu }$ . Hence,
${\displaystyle \sup _{\mu \leq 20}\pi _{\varphi }(\mu )=\max _{\mu \leq 20}\pi _{\varphi }(\mu )=\pi _{\varphi }(20)=0.05.}$

Then, the result follows.

${\displaystyle \Box }$
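The monotonicity argument is easy to check numerically; a sketch evaluating the power function at a few values of ${\displaystyle \mu }$ , using the text's rounded cutoff 20.51861:

```python
from math import sqrt
from statistics import NormalDist

def power(mu):
    """pi(mu) = P_mu(X-bar >= 20.51861) = 1 - Phi(sqrt(10)*(20.51861 - mu))."""
    return 1 - NormalDist().cdf(sqrt(10) * (20.51861 - mu))

values = [power(mu) for mu in (19.0, 19.5, 20.0, 20.5, 21.0)]
# values is strictly increasing, with power(20.0) ≈ 0.05,
# so the supremum over mu <= 20 is attained at mu = 20
```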

Exercise. Show that

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}\leq k'\iff {\overline {x}}\geq k}$

for every ${\displaystyle \mu _{1}>20}$ .
Solution

Proof. First, consider the likelihood ratio

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}(x_{i}-20)^{2}-(x_{i}-\mu _{1})^{2}{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}{\cancel {x_{i}^{2}}}-40x_{i}+400{\cancel {-x_{i}^{2}}}+2\mu _{1}x_{i}-\mu _{1}^{2}{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}(2\mu _{1}-40)x_{i}+400-\mu _{1}^{2}{\big ]}\right)=\exp \left(5(\mu _{1}^{2}-400)-(\mu _{1}-20)\sum _{i=1}^{10}x_{i}\right).}$

Then, we have
${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}\leq k'\iff \exp \left(5(\mu _{1}^{2}-400)-10(\mu _{1}-20){\overline {x}}\right)\leq k'\iff -10(\mu _{1}-20){\overline {x}}\leq k''\iff {\overline {x}}\geq k.}$

(The last equivalence follows since ${\displaystyle \mu _{1}>20}$ .)

${\displaystyle \Box }$

Remark.

• This rejection region has appeared in a previous example.

Now, let us consider another example where the underlying distribution is discrete.

Example. Let ${\displaystyle X}$  be a discrete random variable. Its pmf is given by

${\displaystyle {\begin{array}{c|ccccccccc}\theta &x&1&2&3&4&5&6&7&8\\\hline 0&f(x;\theta )&0&0.02&0.02&0.02&0.02&0.02&0.02&0.88\\1&f(x;\theta )&0.01&0.02&0.03&0.04&0.05&0&0.06&0.79\\\end{array}}}$

(Notice that the sum of values in each row is 1. The parameter space is ${\displaystyle \Theta =\{0,1\}}$ .) Given a single observation ${\displaystyle x}$ , construct a MP test with size 0.1 for testing ${\displaystyle H_{0}:\theta =0\quad {\text{vs.}}\quad H_{1}:\theta =1}$ .

Solution. We use the Neyman-Pearson lemma. First, we calculate the likelihood ratio ${\displaystyle f(x;0)/f(x;1)}$  for each value of ${\displaystyle x}$ :

${\displaystyle {\begin{array}{ccccccccc}x&1&2&3&4&5&6&7&8\\\hline {\frac {f(x;0)}{f(x;1)}}&0&1&0.667&0.5&0.4&{\text{undefined}}&0.333&1.114\end{array}}}$

For convenience, let us sort the likelihood ratios in ascending order (we put the undefined value last):
${\displaystyle {\begin{array}{ccccccccc}x&1&7&5&4&3&2&8&6\\\hline {\frac {f(x;0)}{f(x;1)}}&0&0.333&0.4&0.5&0.667&1&1.114&{\text{undefined}}\end{array}}}$
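The tabulation and sorting above can be automated. A sketch (the greedy scan below works here because the cumulative ${\displaystyle H_{0}}$ -probabilities reach 0.1 exactly; in general, a given ${\displaystyle \alpha }$  may not be attainable for a discrete distribution, as remarked earlier):

```python
f0 = {1: 0, 2: 0.02, 3: 0.02, 4: 0.02, 5: 0.02, 6: 0.02, 7: 0.02, 8: 0.88}
f1 = {1: 0.01, 2: 0.02, 3: 0.03, 4: 0.04, 5: 0.05, 6: 0, 7: 0.06, 8: 0.79}

# Likelihood ratio f(x;0)/f(x;1); x = 6 has f1 = 0, so its ratio is
# undefined and we place it last by treating it as +infinity
ratio = {x: (f0[x] / f1[x] if f1[x] > 0 else float("inf")) for x in f0}
order = sorted(f0, key=lambda x: ratio[x])   # [1, 7, 5, 4, 3, 2, 8, 6]

# Add points in ascending ratio order while the size stays within 0.1
R, size = [], 0.0
for x in order:
    if size + f0[x] > 0.1 + 1e-9:   # small tolerance for float rounding
        break
    R.append(x)
    size += f0[x]
# R == [1, 7, 5, 4, 3, 2] with size 0.1
```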

By Neyman-Pearson lemma, the MP test with size 0.1 for testing ${\displaystyle H_{0}:\theta =0\quad {\text{vs.}}\quad H_{1}:\theta =1}$  is a test with size 0.1 and rejection region
${\displaystyle R=\left\{x:{\frac {f(x;0)}{f(x;1)}}\leq k\right\}.}$

So, it remains to determine ${\displaystyle R}$ . Since the size is 0.1, we have
${\displaystyle 0.1=\pi (0)=\mathbb {P} _{\theta =0}(X\in R).}$

Notice that
${\displaystyle \mathbb {P} _{\theta =0}(X=1)+\mathbb {P} _{\theta =0}(X=7)+\mathbb {P} _{\theta =0}(X=5)+\mathbb {P} _{\theta =0}(X=4)+\mathbb {P} _{\theta =0}(X=3)+\mathbb {P} _{\theta =0}(X=2)=0+0.02+0.02+0.02+0.02+0.02=0.1.}$