# Statistics/Hypothesis Testing

## Introduction

In previous chapters, we have discussed two methods for estimating unknown parameters, namely point estimation and interval estimation. Estimating unknown parameters is an important area in statistical inference, and in this chapter we will discuss another important area, namely hypothesis testing, which is related to decision making. Indeed, the concepts of confidence intervals and hypothesis testing are closely related, as we will demonstrate.

## Basic concepts and terminologies

Before discussing how to conduct a hypothesis test and how to evaluate its "goodness", let us first introduce some basic concepts and terminology related to hypothesis testing.

Definition. (Hypothesis) A (statistical) hypothesis is a statement about population parameter(s).

There are two terms that classify hypotheses:

Definition. (Simple and composite hypothesis) A hypothesis is a simple hypothesis if it completely specifies the distribution of the population (that is, the distribution is completely known, without any unknown parameters involved), and is a composite hypothesis otherwise.

Sometimes, it is not immediately clear whether a hypothesis is simple or composite. To understand the classification of hypotheses more clearly, let us consider the following example.

Example. Consider a distribution with parameter ${\displaystyle \theta }$ , taking values in the parameter space ${\displaystyle \Theta =[0,\infty )}$ . Determine whether each of the following hypotheses is simple or composite.

(a) ${\displaystyle \theta =1}$ .

(b) ${\displaystyle \theta =\theta _{0}}$  where ${\displaystyle \theta _{0}\in \Theta }$  is known.

(c) ${\displaystyle \theta >1}$ .

(d) ${\displaystyle \theta >\theta _{0}}$  where ${\displaystyle \theta _{0}\in \Theta }$  is known.

(e) ${\displaystyle \theta \leq \theta _{0}}$  where ${\displaystyle \theta _{0}\in \Theta }$  is known.

(f) ${\displaystyle \theta \in \Theta _{0}}$  where ${\displaystyle \Theta _{0}}$  is a nonempty subset of ${\displaystyle \Theta }$ . [1]

Solution.

• (a) and (b) are simple hypotheses, since each of them completely specifies the distribution.
• (c), (d) and (e) are composite hypotheses, since the parameter ${\displaystyle \theta }$  is not completely specified, and hence neither is the distribution.
• (f) may be a simple or a composite hypothesis, depending on ${\displaystyle \Theta _{0}}$ . If ${\displaystyle \Theta _{0}}$  contains exactly one element, it is a simple hypothesis; otherwise, it is a composite hypothesis.

In hypothesis tests, we consider two hypotheses:

Definition. (Null hypothesis and alternative hypothesis) In hypothesis testing, the hypothesis being tested is the null hypothesis (denoted by ${\displaystyle H_{0}}$ ) and another complementary hypothesis (to ${\displaystyle H_{0}}$ ) is the alternative hypothesis (denoted by ${\displaystyle H_{1}}$ ).

Remark.

• ${\displaystyle H_{1}}$  is complementary to ${\displaystyle H_{0}}$  in the sense that if ${\displaystyle H_{0}}$  is true (false), then ${\displaystyle H_{1}}$  is false (true) (exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true). Because of this, we usually say ${\displaystyle H_{0}}$  is tested against ${\displaystyle H_{1}}$  (so we often write ${\displaystyle H_{0}\quad {\text{vs.}}\quad H_{1}}$ ).
• Usually, ${\displaystyle H_{0}}$  corresponds to the status quo ("no effect"), and ${\displaystyle H_{1}}$  corresponds to some interesting "research findings" (so ${\displaystyle H_{1}}$  is sometimes also called the research hypothesis).
• Since ${\displaystyle H_{0}}$  often corresponds to the status quo, we usually assume ${\displaystyle H_{0}}$  is true unless there is sufficient evidence against it.
• This is somewhat analogous to the legal principle of presumption of innocence, which states that every person accused of a crime is considered innocent (${\displaystyle H_{0}}$  is assumed to be true) until proven guilty (there is sufficient evidence against ${\displaystyle H_{0}}$ ).

A general form of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is ${\displaystyle H_{0}:\theta \in \Theta _{0}}$  and ${\displaystyle H_{1}:\theta \in \Theta _{1}}$  where ${\displaystyle \Theta _{1}=\Theta _{0}^{c}}$ , which is the complement of ${\displaystyle \Theta _{0}}$  (with respect to ${\displaystyle \Theta }$ ), i.e., ${\displaystyle \Theta _{0}^{c}=\Theta \setminus \Theta _{0}}$  (${\displaystyle \Theta }$  is the parameter space, containing all possible values of ${\displaystyle \theta }$ ). The reason for choosing the complement of ${\displaystyle \Theta _{0}}$  in ${\displaystyle H_{1}}$  is that ${\displaystyle H_{1}}$  is the complementary hypothesis to ${\displaystyle H_{0}}$ , as suggested in the above definition.

Remark.

• In some books, it is only required that ${\displaystyle \Theta _{0}}$  and ${\displaystyle \Theta _{1}}$  be disjoint (nonempty) subsets of the parameter space ${\displaystyle \Theta }$ , and it is not necessary that ${\displaystyle \Theta _{0}\cup \Theta _{1}=\Theta }$ .
• However, it is usually still assumed that exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true, which means ${\displaystyle \theta }$  is not supposed to take values outside the set ${\displaystyle \Theta _{0}\cup \Theta _{1}}$  (otherwise, neither ${\displaystyle H_{0}}$  nor ${\displaystyle H_{1}}$  would be true).
• Thus, in this case, we may regard ${\displaystyle \Theta _{0}\cup \Theta _{1}}$  as the effective parameter space (since ${\displaystyle \theta }$  is assumed to take values in this union), and with respect to this parameter space, ${\displaystyle \Theta _{1}}$  is the complement of ${\displaystyle \Theta _{0}}$ .
• Alternatively, some may view the parameter space as "linked" with a distribution, so that for a given distribution the parameter space is fixed to be the one suggested by the distribution itself. In this case, ${\displaystyle \Theta _{1}}$  need not be the complement of ${\displaystyle \Theta _{0}}$  (with respect to the parameter space).
• Despite these different definitions of ${\displaystyle \Theta _{0}}$  and ${\displaystyle \Theta _{1}}$ , a common feature is that we assume exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true.

Example. Suppose your friend gives you a coin for tossing, and you do not know whether it is fair or not. However, since the coin is given by your friend, you believe that the coin is fair unless there is sufficient evidence suggesting otherwise. What are the null and alternative hypotheses in this context (suppose the coin never lands on its edge)?

Solution. Let ${\displaystyle p}$  be the probability for landing on heads after tossing the coin. The null hypothesis is ${\displaystyle H_{0}:p={\frac {1}{2}}}$ . The alternative hypothesis is ${\displaystyle H_{1}:p\neq {\frac {1}{2}}}$ .

Exercise. Suppose we replace "coin" with "six-sided die" in the above question. What are the null and alternative hypotheses? (Hint: You may let ${\displaystyle p_{1},p_{2},\dotsc ,p_{6}}$  be the probabilities for "1", "2", ..., "6" coming up after rolling the die respectively.)

Solution

Let ${\displaystyle p_{1},p_{2},\dotsc ,p_{6}}$  be the probabilities for "1", "2", ..., "6" coming up after rolling the die respectively. The null hypothesis is ${\displaystyle H_{0}:p_{1}=p_{2}=\dotsb =p_{6}={\frac {1}{6}}}$ , and the alternative hypothesis is ${\displaystyle H_{1}:{\text{at least one of }}p_{1},\dotsc ,p_{6}\neq {\frac {1}{6}}}$ . (In fact, when one of ${\displaystyle p_{1},\dotsc ,p_{6}}$  differs from ${\displaystyle {\frac {1}{6}}}$ , at least one other probability must also differ from ${\displaystyle {\frac {1}{6}}}$ , since the probabilities sum to 1.)

We have mentioned that exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is assumed to be true. To make a decision, we need to decide which hypothesis should be regarded as true. Of course, as one may expect, this decision is not perfect, and errors will be involved. So, we cannot say we "prove" that a particular hypothesis is true (that is, we cannot be certain that a particular hypothesis is true). Despite this, we may "regard" (or "accept") a particular hypothesis as true (but not prove it as true) when we have sufficient evidence leading us to make this decision (ideally, with small error probabilities [2]).

Remark.

• Philosophically, "not rejecting ${\displaystyle H_{0}}$ " is different from "accepting ${\displaystyle H_{0}}$ ": the former can mean that we do not actually regard ${\displaystyle H_{0}}$  as true but merely lack sufficient evidence to reject it, whereas "accepting ${\displaystyle H_{0}}$ " means that we regard ${\displaystyle H_{0}}$  as true.
• In spite of this, we will not dwell on these philosophical issues. We will simply assume that whenever there is insufficient evidence to reject ${\displaystyle H_{0}}$  (i.e., we do not reject ${\displaystyle H_{0}}$ ), we act as if ${\displaystyle H_{0}}$  is true, that is, we still accept ${\displaystyle H_{0}}$ , even if we may not actually "believe" in it.
• Of course, in some other places, the phrase "accepting the null hypothesis" is avoided because of these philosophical issues.

Now, we face two questions. First, what evidence should we consider? Second, what is meant by "sufficient"? For the first question, a natural answer is that we should consider the observed samples: we are making hypotheses about the population, and the samples are taken from, and thus closely related to, the population, so they should help us make the decision.

To answer the second question, we need the concepts of hypothesis testing. In particular, we will construct a so-called rejection region or critical region to help us determine whether we should reject the null hypothesis (i.e., regard ${\displaystyle H_{0}}$  as false), and hence (naturally) regard ${\displaystyle H_{1}}$  as true ("accept" ${\displaystyle H_{1}}$ ) (we have assumed that exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true, so when we regard one of them as false, we should regard the other as true). Likewise, when we do not reject ${\displaystyle H_{0}}$ , we will act as if, or accept, ${\displaystyle H_{0}}$  as true (and thus also reject ${\displaystyle H_{1}}$ , since exactly one of ${\displaystyle H_{0}}$  and ${\displaystyle H_{1}}$  is true).

Let us formally define the terms related to hypothesis testing in the following.

Definition. (Hypothesis test) A hypothesis test is a rule that specifies for which observed sample values we (do not reject and) accept ${\displaystyle H_{0}}$  as true (and thus reject ${\displaystyle H_{1}}$ ), and for which observed sample values we reject ${\displaystyle H_{0}}$  and accept ${\displaystyle H_{1}}$ .

Remark.

• A hypothesis test is sometimes simply called a "test" for simplicity. We also sometimes use the Greek letters "${\displaystyle \varphi }$ ", "${\displaystyle \psi }$ ", etc. to denote tests.

Definition. (Rejection and acceptance regions) Let ${\displaystyle S}$  be the set containing all possible observations ${\displaystyle \mathbf {x} =(x_{1},\dotsc ,x_{n})}$  of a random sample ${\displaystyle X_{1},\dotsc ,X_{n}}$ . The rejection region (denoted by ${\displaystyle R}$ ) is the subset of ${\displaystyle S}$  for which ${\displaystyle H_{0}}$  is rejected. The complement of the rejection region with respect to ${\displaystyle S}$  (${\displaystyle R^{c}}$ ) is the acceptance region (it is thus the subset of ${\displaystyle S}$  for which ${\displaystyle H_{0}}$  is accepted).

Remark.

• Graphically, it looks like
    S
*------------*
|///|........|
|///\........|
|////\.......|
|/////\......|
*------------*

*--*
|//|: R
*--*

*--*
|..|: R^c
*--*


Typically, we use a test statistic (a statistic for conducting a hypothesis test) to specify the rejection region. For instance, if the random sample is ${\displaystyle X_{1},\dotsc ,X_{n}}$  and the test statistic is ${\displaystyle {\overline {X}}}$ , the rejection region may be, say, ${\displaystyle R=\{\mathbf {x} :{\overline {x}}<2\}}$  (where ${\displaystyle x_{1},\dotsc ,x_{n}}$  and ${\displaystyle {\overline {x}}}$  are the observed values of ${\displaystyle X_{1},\dotsc ,X_{n}}$  and ${\displaystyle {\overline {X}}}$  respectively). Through this, we can directly construct a hypothesis test: when ${\displaystyle \mathbf {x} \in R}$ , we reject ${\displaystyle H_{0}}$  and accept ${\displaystyle H_{1}}$ ; otherwise, if ${\displaystyle \mathbf {x} \in R^{c}}$ , we accept ${\displaystyle H_{0}}$ . So, in general, to specify the rule in a hypothesis test, we just need a rejection region. After that, we apply the test to test ${\displaystyle H_{0}}$  against ${\displaystyle H_{1}}$ . There are some terminologies related to hypothesis tests constructed in this way:

Definition. (Left-, right- and two-tailed tests) Let ${\displaystyle T(\mathbf {x} )=T(x_{1},\dotsc ,x_{n})}$  be the observed test statistic for a hypothesis test, where ${\displaystyle x_{1},\dotsc ,x_{n}}$  are the realizations of the random sample.

• If the rejection region is in the form of ${\displaystyle \{\mathbf {x} :T(\mathbf {x} )\leq k_{1}\}}$ , then the hypothesis test is called a left-tailed test (or lower-tailed test).
• If the rejection region is in the form of ${\displaystyle \{\mathbf {x} :T(\mathbf {x} )\geq k_{2}\}}$ , then the hypothesis test is called a right-tailed test (or upper-tailed test).
• If the rejection region is in the form of ${\displaystyle \{\mathbf {x} :T(\mathbf {x} )\leq k_{3}{\text{ or }}T(\mathbf {x} )\geq k_{4}\}}$ , then the hypothesis test is called a two-tailed test.

Remark.

• The inequality signs can be strict, i.e., the above inequality signs can be replaced by "${\displaystyle <}$ " and "${\displaystyle >}$ ".
• We use the terminology "tail" since the rejection region includes the values that are located at the "extreme portions" (i.e., very left (with small values) or very right (with large values) portions) (called tails) of distributions.
• When ${\displaystyle k_{3}=-k_{4}}$ , we may say the two-tailed test is equal-tailed. In this case, we can also express the rejection region as ${\displaystyle \{\mathbf {x} :|T(\mathbf {x} )|\geq k_{4}\}}$ .
• We sometimes also call upper-tailed and lower-tailed tests as one-sided tests, and two-tailed tests as two-sided tests.

Example. Suppose the rejection region is ${\displaystyle R=\{(x_{1},x_{2},x_{3}):x_{1}+x_{2}+x_{3}>6\}}$ , and it is observed that ${\displaystyle x_{1}=1,x_{2}=2,x_{3}=3}$ . Which hypothesis, ${\displaystyle H_{0}}$  or ${\displaystyle H_{1}}$ , should we accept?

Solution. Since ${\displaystyle (x_{1},x_{2},x_{3})\in R^{c}}$ , we should (not reject and) accept ${\displaystyle H_{0}}$ .

Exercise. What is the type of this hypothesis test?

Solution

Right-tailed test.
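As a sketch (the function name is illustrative, not from the text), the rule "reject ${\displaystyle H_{0}}$  if and only if the observation falls in ${\displaystyle R}$ " can be encoded directly as a predicate on the observed sample:

```python
# Hypothetical sketch: the rejection region R = {(x1, x2, x3) : x1 + x2 + x3 > 6}
# expressed as a Python predicate on the observed sample.

def in_rejection_region(x):
    """Return True if the observed sample x falls in R (i.e., reject H0)."""
    return sum(x) > 6

observed = (1, 2, 3)                  # the realizations from the example
print(in_rejection_region(observed))  # False: we do not reject (and accept) H0
```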

As we have mentioned, the decisions made by a hypothesis test are not perfect, and errors occur. Indeed, on closer thought, there are two types of errors, as follows:

Definition. (Type I and II errors) A type I error is the rejection of ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is true. A type II error is the acceptance of ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is false.

We can illustrate these two types of errors more clearly using the following table.

Type I and II errors

| | Accept ${\displaystyle H_{0}}$ | Reject ${\displaystyle H_{0}}$ |
|---|---|---|
| ${\displaystyle H_{0}}$  is true | Correct decision | Type I error |
| ${\displaystyle H_{0}}$  is false | Type II error | Correct decision |

We can express ${\displaystyle H_{0}:\theta \in \Theta _{0}}$  and ${\displaystyle H_{1}:\theta \in \Theta _{0}^{c}}$ . Also, assume the rejection region is ${\displaystyle R=R(\mathbf {X} )}$  (i.e., the rejection region with "${\displaystyle x}$ " replaced by "${\displaystyle X}$ "). In general, when "${\displaystyle R}$ " is put together with "${\displaystyle X}$ ", we assume ${\displaystyle R=R(\mathbf {X} )}$ .

Then we have some notations and expressions for probabilities of making type I and II errors: (let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample and ${\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{n})}$ )

• The probability of making a type I error, denoted by ${\displaystyle \alpha (\theta )}$ , is ${\displaystyle \mathbb {P} _{\theta }(\mathbf {X} \in R)}$  if ${\displaystyle \theta \in \Theta _{0}}$ .
• The probability of making a type II error, denoted by ${\displaystyle \beta (\theta )}$ , is ${\displaystyle \mathbb {P} _{\theta }(\mathbf {X} \in R^{c})=1-\mathbb {P} _{\theta }(\mathbf {X} \in R)}$  if ${\displaystyle \theta \in \Theta _{0}^{c}}$ .

Remark.

• Remark on notations: In some other places, ${\displaystyle \alpha (\theta )}$  may be expressed as "${\displaystyle \mathbb {P} (\mathbf {X} \in R|\theta \in \Theta _{0})}$ ", "${\displaystyle \mathbb {P} (\mathbf {X} \in R|H_{0})}$ " or "${\displaystyle \mathbb {P} (\mathbf {X} \in R|H_{0}{\text{ is true}})}$ ". We should be careful that these notations are not supposed to be interpreted as conditional probabilities [3]; they are just notations. This applies similarly to ${\displaystyle \beta (\theta )}$ .
• When ${\displaystyle \Theta _{0}}$  contains a single value only, we simply denote the type I error probability by ${\displaystyle \alpha }$ . Similarly, when ${\displaystyle \Theta _{1}}$  contains a single value only, we simply denote the type II error probability by ${\displaystyle \beta }$ .

Notice that we have a common expression in both ${\displaystyle \alpha (\theta )}$  and ${\displaystyle \beta (\theta )}$ , which is "${\displaystyle \mathbb {P} _{\theta }((X_{1},\dotsc ,X_{n})\in R)}$ ". Indeed, we can also write this expression as

${\displaystyle \mathbb {P} _{\theta }((X_{1},\dotsc ,X_{n})\in R)={\begin{cases}\alpha (\theta ),&\theta \in \Theta _{0};\\1-\beta (\theta ),&\theta \in \Theta _{0}^{c}.\end{cases}}}$

Through this, we can observe that this expression contains all the information about the probabilities of making errors for a hypothesis test with rejection region ${\displaystyle R}$ . Hence, we will give it a special name:

Definition. (Power function) Let ${\displaystyle R}$  be a rejection region of a hypothesis test, and ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample. Then, the power function of the hypothesis test is

${\displaystyle \pi (\theta )=\mathbb {P} _{\theta }((X_{1},\dotsc ,X_{n})\in R)}$

where ${\displaystyle \theta \in \Theta }$ .

Remark.

• "${\displaystyle \pi }$ " can be thought of as the Greek letter "p". We choose ${\displaystyle \pi }$  instead of ${\displaystyle p}$  since "${\displaystyle p}$ " is sometimes used to denote probability (mass or density) functions.
• The power function will be our basis in evaluating the goodness of a test or comparing two different tests.

Example. Suppose we toss a (fair or unfair) coin 5 times (suppose the coin never land on edge), and we have the following hypotheses:

${\displaystyle H_{0}:p\leq {\frac {1}{2}}\quad {\text{vs.}}\quad H_{1}:p>{\frac {1}{2}}}$

where ${\displaystyle p}$  is the probability for landing on heads after tossing the coin. Let ${\displaystyle X_{1},\dotsc ,X_{5}}$  be the random sample for the 5 times of coin tossing, and ${\displaystyle x_{1},\dotsc ,x_{5}}$  be the corresponding realizations. Also, the value of a random sample is 1 if heads come up and 0 otherwise. Suppose we will reject ${\displaystyle H_{0}}$  if and only if heads come up in all 5 coin tosses.

(a) Determine the rejection region ${\displaystyle R}$ .

(b) What is the power function ${\displaystyle \pi (p)}$  (express in terms of ${\displaystyle p}$ )?

(c) Calculate ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$ .

Solution.

(a) The rejection region ${\displaystyle R=\{(x_{1},\dotsc ,x_{5}):x_{1}+\dotsb +x_{5}=5\}}$ .

(b) For every ${\displaystyle p\in [0,1]}$ , the power function is ${\displaystyle \pi (p)=\mathbb {P} _{p}((X_{1},\dotsc ,X_{5})\in R)=\mathbb {P} _{p}(X_{1}+\dotsb +X_{5}=5)=p^{5}.}$

(c) We have ${\displaystyle \alpha (1/2)=\left({\frac {1}{2}}\right)^{5}=0.03125}$  and ${\displaystyle \beta (2/3)=1-\left({\frac {2}{3}}\right)^{5}\approx 0.8683}$ . (Notice that although the probability of type I error can be low, the probability of type II error can be quite high. This is because, intuitively, it is quite "hard" to reject ${\displaystyle H_{0}}$  due to the strict requirement. So, even if ${\displaystyle H_{0}}$  is false, it may not be rejected, causing a type II error.)
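The numbers above can be reproduced with a short sketch (the function name is illustrative): for the test "reject ${\displaystyle H_{0}}$  iff all 5 tosses are heads", the rejection probability is ${\displaystyle \mathbb {P} _{p}(X_{1}+\dotsb +X_{5}=5)=p^{5}}$ , from which both error probabilities follow directly.

```python
# Hypothetical sketch: rejection probability of the "all 5 heads" test.

def reject_prob(p):
    """Probability of rejecting H0 when P(heads) = p, i.e., p**5."""
    return p ** 5

alpha = reject_prob(1 / 2)     # type I error probability at p = 1/2
beta = 1 - reject_prob(2 / 3)  # type II error probability at p = 2/3
print(round(alpha, 5))         # 0.03125
print(round(beta, 4))          # 0.8683
```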

Exercise. Does ${\displaystyle \max _{p\leq {\frac {1}{2}}}\pi (p)}$  exist? If yes, calculate it.

Solution

${\displaystyle \max _{p\leq {\frac {1}{2}}}\pi (p)}$  exists, and ${\displaystyle \max _{p\leq {\frac {1}{2}}}\pi (p)=\max _{p\leq {\frac {1}{2}}}p^{5}=\left({\frac {1}{2}}\right)^{5}={\frac {1}{32}}}$  (notice that ${\displaystyle y=x^{5}}$  is a strictly increasing function).

You notice that the type II error probability of this hypothesis test can be quite large, so you want to revise the test to lower it.

(a) What is ${\displaystyle \beta (p)}$  in the above hypothesis test?

(b) Suppose the rejection region is modified to ${\displaystyle \{(x_{1},\dotsc ,x_{5}):x_{1}+\dotsb +x_{5}\geq 3\}}$ . Calculate ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$ . (Hint: consider binomial distribution.)

(c) Suppose the rejection region is modified to ${\displaystyle \{(x_{1},\dotsc ,x_{5}):x_{1}+\dotsb +x_{5}\geq 2\}}$ . Calculate ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$ .

(d) ${\displaystyle \alpha (1/2)+\beta (2/3)}$  is minimized at which hypothesis test: the original one, the one in (b), or the one in (c)?

Solution

(a) ${\displaystyle \beta (p)=1-\pi (p)=1-p^{5}}$  if ${\displaystyle p>{\frac {1}{2}}}$ .

(b) In this case, we have ${\displaystyle \alpha (1/2)={\binom {5}{3}}\left({\frac {1}{2}}\right)^{3}\left({\frac {1}{2}}\right)^{2}+{\binom {5}{4}}\left({\frac {1}{2}}\right)^{4}\left({\frac {1}{2}}\right)+\left({\frac {1}{2}}\right)^{5}=0.5}$ , and ${\displaystyle \beta (2/3)=1-\left[{\binom {5}{3}}\left({\frac {2}{3}}\right)^{3}\left({\frac {1}{3}}\right)^{2}+{\binom {5}{4}}\left({\frac {2}{3}}\right)^{4}\left({\frac {1}{3}}\right)+\left({\frac {2}{3}}\right)^{5}\right]\approx 0.2099}$ .

(c) In this case, we have ${\displaystyle \alpha (1/2)=0.5+{\binom {5}{2}}\left({\frac {1}{2}}\right)^{2}\left({\frac {1}{2}}\right)^{3}=0.8125}$  and ${\displaystyle \beta (2/3)\approx 0.2099-{\binom {5}{2}}\left({\frac {2}{3}}\right)^{2}\left({\frac {1}{3}}\right)^{3}\approx 0.0453}$ .

(d) At the original one, ${\displaystyle \alpha (1/2)+\beta (2/3)\approx 0.89955}$ , at the one in (b), ${\displaystyle \alpha (1/2)+\beta (2/3)\approx 0.7099}$ , and at the one in (c), ${\displaystyle \alpha (1/2)+\beta (2/3)\approx 0.8578}$ . So, ${\displaystyle \alpha (1/2)+\beta (2/3)}$  is minimized at the one in (b).
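The comparison in (d) can be reproduced with a sketch using only the standard library (the helper names are illustrative): each rejection region has the form ${\displaystyle \{\mathbf {x} :x_{1}+\dotsb +x_{5}\geq c\}}$  with ${\displaystyle c=5,3,2}$ , so ${\displaystyle \alpha (1/2)}$  and ${\displaystyle \beta (2/3)}$  are Binomial(5, p) tail sums.

```python
# Sketch of the comparison in (d): rejection regions {x : x1+...+x5 >= c}
# for c = 5 (original), 3 and 2; alpha and beta are binomial tail sums.
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def alpha_beta(c):
    """alpha(1/2) and beta(2/3) for the region {x : x1 + ... + x5 >= c}."""
    alpha = sum(binom_pmf(5, k, 1 / 2) for k in range(c, 6))
    beta = sum(binom_pmf(5, k, 2 / 3) for k in range(0, c))
    return alpha, beta

for c in (5, 3, 2):
    a, b = alpha_beta(c)
    print(c, round(a + b, 4))  # the sum alpha + beta is smallest at c = 3
```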

Example. Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$  where ${\displaystyle \sigma ^{2}}$  is known. Consider the following hypotheses:

${\displaystyle H_{0}:\mu \leq \mu _{0}\quad {\text{vs.}}\quad H_{1}:\mu >\mu _{0}}$

where ${\displaystyle \mu _{0}}$  is a constant. We use the test statistic ${\displaystyle T={\frac {{\overline {X}}-\mu _{0}}{\sigma /{\sqrt {n}}}}}$  (which follows the ${\displaystyle {\mathcal {N}}(0,1)}$  distribution when ${\displaystyle \mu =\mu _{0}}$ ) for the hypothesis test, and we reject ${\displaystyle H_{0}}$  if and only if ${\displaystyle T\geq k}$ .

Find the power function ${\displaystyle \pi (\mu )}$ , ${\displaystyle \lim _{\mu \to -\infty }\pi (\mu )}$ , and ${\displaystyle \lim _{\mu \to \infty }\pi (\mu )}$ .

Solution. The power function is

{\displaystyle {\begin{aligned}\pi (\mu )&=\mathbb {P} _{\mu }(T\geq k)\\&=\mathbb {P} _{\mu }\left({\frac {{\overline {X}}-\mu _{0}}{\sigma /{\sqrt {n}}}}\geq k\right)\\&=\mathbb {P} _{\mu }\left({\frac {{\overline {X}}-\mu +\mu -\mu _{0}}{\sigma /{\sqrt {n}}}}\geq k\right)\\&=\mathbb {P} _{\mu }\left({\frac {{\overline {X}}-\mu }{\sigma /{\sqrt {n}}}}\geq k+{\frac {\mu _{0}-\mu }{\sigma /{\sqrt {n}}}}\right)\\&=\mathbb {P} \left(Z\geq k+{\frac {\mu _{0}-\mu }{\sigma /{\sqrt {n}}}}\right)&&(Z\sim {\mathcal {N}}(0,1){\text{, whose distribution does not depend on }}\mu {\text{, so we can drop the subscript }}\mu {\text{ from }}\mathbb {P} )\end{aligned}}}

Thus, ${\displaystyle \lim _{\mu \to -\infty }\pi (\mu )=\mathbb {P} (Z\geq \infty )=0}$  and ${\displaystyle \lim _{\mu \to \infty }\pi (\mu )=\mathbb {P} (Z\geq -\infty )=1}$  (with some abuse of notation), by the properties of the standard normal cdf. (Indeed, ${\displaystyle \pi (\mu )}$  is a strictly increasing function of ${\displaystyle \mu }$ .)

Exercise. Show that ${\displaystyle \pi (\mu _{0})=\alpha }$  if ${\displaystyle \mathbb {P} (Z\geq k)=\alpha }$ .

Solution

Proof. Assume ${\displaystyle \mathbb {P} (Z\geq k)=\alpha }$ . Then, ${\displaystyle \pi (\mu _{0})=\mathbb {P} (Z\geq k+0)=\mathbb {P} (Z\geq k)=\alpha }$ .

${\displaystyle \Box }$
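The behaviour of ${\displaystyle \pi (\mu )}$  derived above can be checked numerically with a sketch (the parameter values ${\displaystyle \mu _{0}=0}$ , ${\displaystyle \sigma =1}$ , ${\displaystyle n=25}$  and ${\displaystyle k=1.645}$  are illustrative assumptions, not from the text):

```python
# Hypothetical sketch of the power function
#   pi(mu) = P(Z >= k + (mu0 - mu) / (sigma / sqrt(n))),
# with illustrative values mu0 = 0, sigma = 1, n = 25, k = 1.645.
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi(z), built from the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu, mu0=0.0, sigma=1.0, n=25, k=1.645):
    return 1 - std_normal_cdf(k + (mu0 - mu) / (sigma / sqrt(n)))

# pi is strictly increasing in mu, tends to 0 and 1 at the extremes,
# and pi(mu0) = P(Z >= k) = alpha.
print(round(power(-10.0), 4))  # 0.0
print(round(power(0.0), 3))    # 0.05
print(round(power(10.0), 4))   # 1.0
```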

Ideally, we want to make both ${\displaystyle \alpha (\theta )}$  and ${\displaystyle \beta (\theta )}$  arbitrarily small. But this is generally impossible. To understand this, we can consider the following extreme examples:

• Set the rejection region ${\displaystyle R}$  to be ${\displaystyle S=\{\mathbf {x} \}}$ , which is the set of all possible observations of random samples. Then, ${\displaystyle \pi (\theta )=1}$  for each ${\displaystyle \theta \in \Theta }$ . From this, of course we have ${\displaystyle \beta (\theta )=0}$ , which is nice. But the serious problem is that ${\displaystyle \alpha (\theta )=1}$  due to the mindless rejection.
• Another extreme is setting the rejection region ${\displaystyle R}$  to be the empty set ${\displaystyle \varnothing }$ . Then, ${\displaystyle \pi (\theta )=0}$  for each ${\displaystyle \theta \in \Theta }$ . From this, we have ${\displaystyle \alpha (\theta )=0}$ , which is nice. But, again the serious problem is that ${\displaystyle \beta (\theta )=1}$  due to the mindless acceptance.

We can observe that making ${\displaystyle \alpha (\theta )}$  (respectively ${\displaystyle \beta (\theta )}$ ) very small inevitably causes ${\displaystyle \beta (\theta )}$  (respectively ${\displaystyle \alpha (\theta )}$ ) to increase, due to accepting (rejecting) "too much". As a result, we can only try to minimize the probability of making one type of error while keeping the probability of making the other type controlled.

Now, we are interested in which type of error should be controlled. To motivate the choice, we can again consider the analogy with the legal principle of presumption of innocence. In this analogy, a type I error means convicting an innocent person, and a type II error means acquitting a guilty person. Then, as suggested by Blackstone's ratio, the type I error is more serious and important than the type II error. This motivates us to control the probability of type I error, i.e., ${\displaystyle \alpha (\theta )}$ , at a specified small value ${\displaystyle \alpha ^{*}}$ , so that we control the probability of making this more serious error. After that, among the tests that control the type I error probability at this level, the one with the smallest ${\displaystyle \beta (\theta )}$  is the "best" one (in the sense of error probabilities).

To describe "control the type I error probability at this level" in a more precise way, let us define the following term.

Definition. (Size of a test) A test with power function ${\displaystyle \pi (\theta )}$  is a size ${\displaystyle \alpha }$  test if

${\displaystyle \sup _{\theta \in \Theta _{0}}\pi (\theta )=\alpha }$

where ${\displaystyle 0\leq \alpha \leq 1}$ .

Remark.

• Supremum is similar to maximum, and in "nice" situations (you may assume the situations here are "nice") the supremum is the same as the maximum. Hence, choosing the supremum of ${\displaystyle \pi (\theta )}$  over ${\displaystyle \theta \in \Theta _{0}}$  as the size of a test means that the size gives its maximum probability of type I error (rejecting ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is true), considering all possible values of ${\displaystyle \theta }$  that make ${\displaystyle H_{0}}$  true.
• Intuitively, we choose the maximum probability of type I error as the size so that the size tells us how probable a type I error is in the worst situation, i.e., how "well" the test can control the type I error [4].
• Special case: if ${\displaystyle \Theta _{0}}$  contains a single parameter only, say (a known value) ${\displaystyle \theta _{0}}$  (i.e., ${\displaystyle H_{0}}$  is a simple hypothesis which states that ${\displaystyle \theta =\theta _{0}}$ ), then ${\displaystyle \alpha =\pi (\theta _{0})}$ .
• ${\displaystyle \alpha }$  is also called the level of significance or significance level (these terms are related to the concept of statistical (in)significance, which is in turn related to the concept of ${\displaystyle p}$ -value. We will discuss these later.).
• The "${\displaystyle \alpha }$ " here and the "${\displaystyle \alpha }$ " in the confidence coefficient can actually be interpreted as the "same", by connecting confidence intervals with hypothesis testing. We will discuss these later.
• Because of this definition, the null hypothesis conventionally contains an equality (i.e. in the form of ${\displaystyle \theta =\theta _{0},\theta \geq \theta _{0}}$  or ${\displaystyle \theta \leq \theta _{0}}$ ), since the size of the test can be calculated more conveniently if this is the case.

So, using this definition, controlling the type I error probability at a particular level ${\displaystyle \alpha }$  means that the size of the test should not exceed ${\displaystyle \alpha }$ , i.e., ${\displaystyle \sup _{\theta \in \Theta _{0}}\pi (\theta )\leq \alpha }$  (in some other places, such a test is called a level ${\displaystyle \alpha }$  test).

Example. Consider the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$  (with the parameter space: ${\displaystyle \Theta =\{\mu :\mu =20{\text{ or }}21\}}$ ) , and the hypotheses

${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$

Let ${\displaystyle X_{1},\dotsc ,X_{10}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$ , and the corresponding realizations are ${\displaystyle x_{1},\dotsc ,x_{10}}$ . Suppose the rejection region is ${\displaystyle \{(x_{1},\dotsc ,x_{10}):{\overline {x}}\geq k\}}$ .

(a) Find ${\displaystyle k}$  such that the significance level of the test is ${\displaystyle \alpha =0.05}$ .

(b) Calculate the type II error probability ${\displaystyle \beta }$ . To have the type II error probability ${\displaystyle \beta \leq 0.05}$ , what is the minimum sample size (with the same rejection region)?

Solution.

(a) In order for the significance level to be 0.05, we need to have

${\displaystyle \sup _{\mu \in \Theta _{0}}\pi (\mu )=0.05.}$

But ${\displaystyle \Theta _{0}=\{20\}}$ . So, this means
${\displaystyle 0.05=\pi (20)=\mathbb {P} _{\mu =20}({\overline {X}}\geq k)=\mathbb {P} \left({\frac {{\overline {X}}-20}{1/{\sqrt {10}}}}\geq {\frac {k-20}{1/{\sqrt {10}}}}\right)=\mathbb {P} (Z\geq {\sqrt {10}}(k-20))}$

where ${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ . We then have
${\displaystyle {\sqrt {10}}(k-20)=z_{0.05}\approx 1.64\implies k\approx 20.51861.}$

(b) The type II error probability is

${\displaystyle \beta \approx 1-\mathbb {P} _{\mu =21}({\overline {X}}\geq 20.51861)=1-\mathbb {P} \left({\frac {{\overline {X}}-21}{1/{\sqrt {10}}}}\geq {\frac {20.51861-21}{1/{\sqrt {10}}}}\right)\approx 1-\mathbb {P} (Z\geq -1.522)=\mathbb {P} (Z<-1.522)\approx 0.06426.}$

(${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ ) With the sample size ${\displaystyle n}$ , the type II error probability is
${\displaystyle \beta \approx \mathbb {P} \left(Z<{\sqrt {n}}(20.51861-21)\right)}$

When the sample size ${\displaystyle n}$  increases, ${\displaystyle {\sqrt {n}}(20.51861-21)}$  will become more negative, and hence the type II error probability decreases. It follows that
${\displaystyle \mathbb {P} \left(Z<{\sqrt {n}}(20.51861-21)\right)\leq 0.05\implies {\sqrt {n}}(20.51861-21)\leq -1.64\implies n\geq 11.606.}$

Hence, the minimum sample size is 12.
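These computations can be verified numerically; here is a minimal check using only Python's standard library. Note that `NormalDist.inv_cdf` keeps full precision for ${\displaystyle z_{0.05}}$  (≈ 1.6449), so `k` differs slightly from the text's rounded value 20.51861.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal N(0, 1)

# Critical value k for a size-0.05 test with n = 10:
# P(Z >= sqrt(10) * (k - 20)) = 0.05  =>  sqrt(10) * (k - 20) = z_0.05
z = Z.inv_cdf(0.95)        # z_0.05 ≈ 1.6449 (the text rounds this to 1.64)
k = 20 + z / sqrt(10)      # ≈ 20.520 (text: 20.51861)

# Type II error probability at mu = 21 with n = 10
beta_10 = Z.cdf(sqrt(10) * (k - 21))   # ≈ 0.065

# Smallest n with beta <= 0.05, keeping the same rejection region x-bar >= k
n = 1
while Z.cdf(sqrt(n) * (k - 21)) > 0.05:
    n += 1
# n == 12, agreeing with the text
```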

Exercise. Calculate the type I error probability and type II error probability when the sample size is 12 (the rejection region remains unchanged).

Solution

The type II error probability is

${\displaystyle \mathbb {P} (Z<{\sqrt {12}}(20.51861-21))\approx \mathbb {P} (Z<-1.668)\approx 0.04746.}$

The type I error probability is
${\displaystyle \mathbb {P} (Z\geq {\sqrt {12}}(20.51861-20))\approx \mathbb {P} (Z\geq 1.797)\approx 0.0359.}$

So, with the same rejection region and different sample size, the significance level (type I error probability in this case) of the test changes.
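Assuming the text's rounded cutoff 20.51861, the two probabilities for ${\displaystyle n=12}$  can be recomputed directly (small differences from the text's normal-table values are rounding artifacts):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()      # standard normal
k = 20.51861          # rejection region: sample mean >= k (unchanged)

# Type II error probability at mu = 21, n = 12
type2 = Z.cdf(sqrt(12) * (k - 21))       # ≈ 0.048

# Type I error probability at mu = 20, n = 12
type1 = 1 - Z.cdf(sqrt(12) * (k - 20))   # ≈ 0.036
```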

So far, we have focused on using the rejection region to conduct hypothesis tests. But this is not the only way. Alternatively, we can make use of the ${\displaystyle p}$ -value.

Definition. (${\displaystyle p}$ -value) Let ${\displaystyle T(\mathbf {x} )}$  be an observed value of a test statistic ${\displaystyle T(\mathbf {X} )=T(X_{1},\dotsc ,X_{n})}$  in a hypothesis test.

• Case 1: The test is left-tailed. Then, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} ))}$ .
• Case 2: The test is right-tailed. Then, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\geq T(\mathbf {x} ))}$ .
• Case 3: The test is two-tailed.
• Subcase 1: The distribution of ${\displaystyle T}$  is symmetric about zero (when ${\displaystyle H_{0}}$  is true). Then, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(|T(\mathbf {X} )|\geq |T(\mathbf {x} )|)}$ .
• Subcase 2: The distribution of ${\displaystyle T}$  is not symmetric about zero (when ${\displaystyle H_{0}}$  is true). Then, the ${\displaystyle p}$ -value is ${\displaystyle 2\min {\bigg \{}\sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} )),\sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\geq T(\mathbf {x} )){\bigg \}}}$ .
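For a simple null hypothesis (so the supremum is taken over the single point ${\displaystyle \theta _{0}}$ ), these cases can be sketched as a small helper function. The names `null_cdf` (the cdf of ${\displaystyle T(\mathbf {X} )}$  under ${\displaystyle H_{0}}$ , assumed continuous) and the `tail` labels are assumptions of this sketch; for a continuous distribution symmetric about zero, the subcase 2 formula reduces to the subcase 1 formula.

```python
def p_value(t_obs, null_cdf, tail):
    """p-value for an observed statistic t_obs, given the cdf of T(X)
    under a simple null hypothesis. tail is 'left', 'right', or 'two'."""
    left = null_cdf(t_obs)        # P(T(X) <= t_obs)
    right = 1 - null_cdf(t_obs)   # P(T(X) >= t_obs), continuous T assumed
    if tail == "left":
        return left
    if tail == "right":
        return right
    # two-tailed (case 3 subcase 2): double the smaller tail probability
    return 2 * min(left, right)
```

For example, with a standard normal `null_cdf`, `p_value(1.581, NormalDist().cdf, "right")` gives approximately 0.057, matching the right-tailed example later in this section.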

Remark.

• The ${\displaystyle p}$ -value can be interpreted as the probability, when ${\displaystyle H_{0}}$  is true, that the test statistic is at least as "extreme" as its observed value. Here, "extreme" is in favour of ${\displaystyle H_{1}}$ , i.e., the "direction of extreme" is the "direction of tail" for the test (when the test statistic lies further in the tail direction, it is more likely to fall in the rejection region, leading us to reject ${\displaystyle H_{0}}$  and accept ${\displaystyle H_{1}}$ ).
• So, when the ${\displaystyle p}$ -value is small, the observed value of the test statistic is already very "extreme", and it is unlikely for the test statistic to be even more "extreme" than the observed value.
• In general, it can be quite difficult to compute ${\displaystyle p}$ -values manually. Thus, ${\displaystyle p}$ -values are often computed using software, e.g. R.
• For case 3 subcase 1, consider the following diagram:
(Diagram: a pdf of ${\displaystyle T(\mathbf {X} )}$  symmetric about zero, with both tails shaded. If ${\displaystyle T(\mathbf {x} )<0}$ , the "more extreme" regions are ${\displaystyle T(\mathbf {X} )\leq T(\mathbf {x} )}$  and ${\displaystyle T(\mathbf {X} )\geq -T(\mathbf {x} )}$ , which together give ${\displaystyle |T(\mathbf {X} )|\geq |T(\mathbf {x} )|}$  since ${\displaystyle T(\mathbf {x} )=-|T(\mathbf {x} )|}$  and ${\displaystyle -T(\mathbf {x} )=|T(\mathbf {x} )|}$ . If ${\displaystyle T(\mathbf {x} )>0}$ , the regions are ${\displaystyle T(\mathbf {X} )\leq -T(\mathbf {x} )}$  and ${\displaystyle T(\mathbf {X} )\geq T(\mathbf {x} )}$ , again giving ${\displaystyle |T(\mathbf {X} )|\geq |T(\mathbf {x} )|}$ .)

• For case 3 subcase 2, consider the following diagram:
(Diagram: a pdf of ${\displaystyle T}$  that is not symmetric about zero. The observed value ${\displaystyle t}$  may lie in the left tail, where ${\displaystyle \mathbb {P} (T(\mathbf {X} )\leq t)}$  is the smaller of the two tail probabilities, or in the right tail, where ${\displaystyle \mathbb {P} (T(\mathbf {X} )\geq t)}$  is the smaller.)
We can observe that the observed value ${\displaystyle t}$  may lie in the left tail or the right tail. In either case, for ${\displaystyle T}$  to be "more extreme", the resulting inequality corresponds to the tail with the smaller probability; thus, we take the "${\displaystyle \min }$ ". But we also need to account for "extremeness" in the other tail: it is intuitive that when ${\displaystyle T}$  is beyond the corresponding point in the other tail, it should also be counted as "more extreme". Thus, there is a factor of "${\displaystyle 2\times }$ ".

The following theorem allows us to use ${\displaystyle p}$ -value for hypothesis testing.

Theorem. Let ${\displaystyle T(\mathbf {x} )=T(x_{1},\dotsc ,x_{n})}$  be an observed value of a test statistic ${\displaystyle T(\mathbf {X} )=T(X_{1},\dotsc ,X_{n})}$  in a hypothesis test. The null hypothesis ${\displaystyle H_{0}}$  is rejected at the significance level ${\displaystyle \alpha }$  if and only if the ${\displaystyle p}$ -value is less than or equal to ${\displaystyle \alpha }$ .

Proof. (Partial) We can prove the "if" and "only if" directions at once. Let us first consider case 1 in the definition of ${\displaystyle p}$ -value. Define the cutoff ${\displaystyle T^{*}(\mathbf {x} )}$  such that ${\displaystyle T(\mathbf {X} )\leq T^{*}(\mathbf {x} )\iff (X_{1},\dotsc ,X_{n})\in R}$ . By definition, the ${\displaystyle p}$ -value is ${\displaystyle \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} ))}$  and ${\displaystyle \alpha =\sup _{\theta \in \Theta _{0}}\pi (\theta )=\sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T^{*}(\mathbf {x} ))}$ . Then, we have

{\displaystyle {\begin{aligned}p{\text{-value}}\leq \alpha &\iff \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T(\mathbf {x} ))\leq \sup _{\theta \in \Theta _{0}}\mathbb {P} _{\theta }(T(\mathbf {X} )\leq T^{*}(\mathbf {x} ))\\&\iff T(\mathbf {x} )\leq T^{*}(\mathbf {x} )&({\text{by some omitted arguments and the monotonicity of cdf}})\\&\iff (x_{1},\dotsc ,x_{n})\in \{(y_{1},\dotsc ,y_{n}):T(y_{1},\dotsc ,y_{n})\leq T^{*}(\mathbf {x} )\}&(x_{1},\dotsc ,x_{n}{\text{ are realizations of }}X_{1},\dotsc ,X_{n}{\text{ respectively}})\\&\iff (x_{1},\dotsc ,x_{n})\in R&({\text{defined above}})\\&\iff H_{0}{\text{ is rejected at significance level }}\alpha .&({\text{the test with power function }}\pi (\theta ){\text{ is size }}\alpha {\text{ test}})\end{aligned}}}

For other cases, the idea is similar (just the directions of inequalities for ${\displaystyle T}$  are different).

${\displaystyle \Box }$

Remark.

• From this, we can observe that ${\displaystyle p}$ -value can be used to report the test result in a more "continuous" scale, instead of just a single decision "accept ${\displaystyle H_{0}}$ " or "reject ${\displaystyle H_{0}}$ ", in the sense that if ${\displaystyle p}$ -value is "much smaller" than the significance level ${\displaystyle \alpha }$ , then we have a "stronger" evidence for rejecting ${\displaystyle H_{0}}$  (stronger in the sense that even if the significance level is very low (a very strict requirement on type I error), ${\displaystyle H_{0}}$  can still be rejected).
• Also, reporting a ${\displaystyle p}$ -value allows readers to choose an appropriate significance level ${\displaystyle \alpha }$  themselves, compare the ${\displaystyle p}$ -value with ${\displaystyle \alpha }$ , and therefore make their own decisions, which are not necessarily the same as the decision made in the test report (since readers may choose a different significance level from that of the report).
• Here, let us also mention the concept of statistical significance. An observation has statistical significance if it is "unlikely" to happen (i.e., the observed value is quite "extreme") when the null hypothesis is true. To be more precise, in terms of the ${\displaystyle p}$ -value, an observed value of a test statistic is statistically significant if the ${\displaystyle p}$ -value is less than or equal to ${\displaystyle \alpha }$ , and statistically insignificant otherwise. Thus, ${\displaystyle \alpha }$  can be interpreted as the benchmark of "significance" or "extremeness", hence the name significance level.

Example. Recall the setting of a previous example: consider the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$  (with the parameter space for ${\displaystyle \mu }$ : ${\displaystyle \Theta =\{20,21\}}$ ) , and the hypotheses

${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$

Let ${\displaystyle X_{1},\dotsc ,X_{10}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$ , and the corresponding realizations are ${\displaystyle x_{1},\dotsc ,x_{10}}$ .

At the significance level ${\displaystyle \alpha =0.05}$ , we have determined that the rejection region is ${\displaystyle R=\{(y_{1},\dotsc ,y_{10}):{\overline {y}}\geq 20.51861\}}$ . Suppose it is observed that ${\displaystyle {\overline {x}}=20.5}$ .

(a) Use the rejection region to determine whether we should reject ${\displaystyle H_{0}}$ .

(b) Use ${\displaystyle p}$ -value to determine whether we should reject ${\displaystyle H_{0}}$ .

Solution.

(a) Since ${\displaystyle {\overline {x}}=20.5<20.51861}$ , we have ${\displaystyle (x_{1},\dotsc ,x_{10})\in R^{c}}$ . Thus, we should not reject ${\displaystyle H_{0}}$ .

(b) Since the test is right-tailed, the ${\displaystyle p}$ -value is

${\displaystyle \sup _{\mu \in \{20\}}\mathbb {P} _{\mu }({\overline {X}}\geq {\overline {x}})=\mathbb {P} _{\mu =20}({\overline {X}}\geq 20.5)=\mathbb {P} \left({\frac {{\overline {X}}-20}{1/{\sqrt {10}}}}\geq {\frac {20.5-20}{1/{\sqrt {10}}}}\right)\approx \mathbb {P} (Z\geq 1.581)\approx 0.05705>\alpha =0.05}$

where ${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ . Thus, ${\displaystyle H_{0}}$  should not be rejected.
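A quick numerical check of this ${\displaystyle p}$ -value, using only the standard library:

```python
from math import sqrt
from statistics import NormalDist

# p-value = P(Z >= sqrt(10) * (20.5 - 20)) under H0: mu = 20
p = 1 - NormalDist().cdf(sqrt(10) * (20.5 - 20))   # ≈ 0.0569
# p > 0.05, so H0 is not rejected at significance level 0.05
```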

Exercise.

Choose the significance level(s) for which ${\displaystyle H_{0}}$  is rejected based on the observation.

0.01, 0.04, 0.06, 0.08, 0.1

Remark.

• From this, we can notice that one can "manipulate" the decision by changing the significance level. In fact, if one sets the significance level to be 1, then ${\displaystyle H_{0}}$  must be rejected (since the ${\displaystyle p}$ -value is a probability, which must be less than or equal to 1). But such a significance level is meaningless, since it means that the type I error probability can be as high as 1, so the test has a large error and the result is not reliable anyway.
• On the other hand, if one sets the significance level to be 0, then ${\displaystyle H_{0}}$  must not be rejected (unless the ${\displaystyle p}$ -value is exactly zero, which is very unlikely: a zero ${\displaystyle p}$ -value means the observation is the most extreme one possible, so the test statistic is (almost) never at least as extreme as the observation).

## Evaluating a hypothesis test

After discussing some basic concepts and terminologies, let us now study some ways to evaluate the goodness of a hypothesis test. As we have previously mentioned, we want the probabilities of making type I and type II errors to be small, but it is generally impossible to make both probabilities arbitrarily small. Hence, we have suggested controlling the type I error through the size of a test, and the "best" test should be the one with the smallest probability of making a type II error, after controlling the type I error.

These ideas lead us to the following definitions.

Definition. (Power of a test) The power of a test is the probability of rejecting ${\displaystyle H_{0}}$  when ${\displaystyle H_{0}}$  is false. That is, the power is ${\displaystyle 1-\beta }$ , if the probability of making type II error is ${\displaystyle \beta }$ .

Using this definition, instead of saying "best" test (test with the smallest type II error probability), we can say "a test with the most power", or in other words, the "most powerful test".

Definition. (Uniformly most powerful test) A test ${\displaystyle \varphi }$  with rejection region ${\displaystyle R}$  is a uniformly most powerful (UMP) test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta \in \Theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta \in \Theta _{1}}$  (${\displaystyle \Theta _{1}=\Theta _{0}^{c}}$ ) if

• (size of ${\displaystyle \varphi }$ ) ${\displaystyle \sup _{\theta \in \Theta _{0}}\pi _{\varphi }(\theta )=\alpha }$ , and
• (UMP) ${\displaystyle \pi _{\varphi }(\theta _{1})\geq \pi _{\psi }(\theta _{1})}$  for each ${\displaystyle \theta _{1}\in \Theta _{1}}$  and for each test ${\displaystyle \psi }$  with rejection region ${\displaystyle R^{*}\neq R}$  and ${\displaystyle \sup _{\theta \in \Theta _{0}}\pi _{\psi }(\theta )\leq \alpha }$ .

(${\displaystyle \pi _{\varphi }(\cdot )}$  and ${\displaystyle \pi _{\psi }(\cdot )}$  are the power functions of the tests ${\displaystyle \varphi }$  and ${\displaystyle \psi }$  respectively.)

Remark.

• The rejection region ${\displaystyle R}$  is sometimes called a best rejection region of size ${\displaystyle \alpha }$ .
• In other words, a test is UMP with size ${\displaystyle \alpha }$  if it has size ${\displaystyle \alpha }$  and its power is the largest among all other tests with size less than or equal to ${\displaystyle \alpha }$ , for each ${\displaystyle \theta \in \Theta _{1}}$ . The adverb "uniformly" emphasizes that this is true for each ${\displaystyle \theta \in \Theta _{1}}$ .
• Since the power is largest for each value of ${\displaystyle \theta \in \Theta _{1}}$ , the rejection region ${\displaystyle R}$  of the UMP test does not depend on the choice of ${\displaystyle \theta \in \Theta _{1}}$ , that is, regardless of the chosen value of ${\displaystyle \theta \in \Theta _{1}}$ , the rejection region is the same. This is expected since the rejection region ${\displaystyle R}$  is not supposed to be changed when the choice of ${\displaystyle \theta \in \Theta _{1}}$  is different. The rejection region ${\displaystyle R}$  (fixed) should always be the best, for each ${\displaystyle \theta \in \Theta _{1}}$ .
• If ${\displaystyle H_{1}}$  is simple, we may simply call the UMP test the most powerful (MP) test.

## Constructing a hypothesis test

There are many ways of constructing a hypothesis test, but of course not all are good (i.e., "powerful"). In the following, we will provide some common approaches to construct hypothesis tests. In particular, the following lemma is very useful for constructing a MP test with size ${\displaystyle \alpha }$ .

### Neyman-Pearson lemma

Lemma. (Neyman-Pearson lemma) Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be a random sample from a population with pdf or pmf ${\displaystyle f(x;\theta )}$  (${\displaystyle \theta }$  may be a parameter vector, and the parameter space is ${\displaystyle \Theta =\{\theta _{0},\theta _{1}\}}$ ). Let ${\displaystyle {\mathcal {L}}(\cdot )}$  be the likelihood function. Then a test ${\displaystyle \varphi }$  with rejection region

${\displaystyle R=\left\{(x_{1},\dotsc ,x_{n}):{\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}\leq k\right\}}$

and size ${\displaystyle \alpha }$  is the MP test with size ${\displaystyle \alpha }$  for testing
${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta =\theta _{1}}$

where ${\displaystyle k}$  is a value determined by the size ${\displaystyle \alpha }$ .

Proof. Let us first consider the case where the underlying distribution is continuous. With the assumption that the size of ${\displaystyle \varphi }$  is ${\displaystyle \alpha }$ , the "size" requirement for being a MP test is satisfied immediately. So, it suffices to show that ${\displaystyle \varphi }$  satisfies the "MP" requirement.

Notice that in this case, "${\displaystyle \Theta _{1}}$ " is simply ${\displaystyle \{\theta _{1}\}}$ . So, for every test ${\displaystyle \psi }$  with rejection region ${\displaystyle R^{*}\neq R}$  and ${\displaystyle {\color {purple}\pi _{\psi }(\theta _{0})\leq \alpha }}$ , we will proceed to show that ${\displaystyle \pi _{\varphi }(\theta _{1})\geq \pi _{\psi }(\theta _{1})}$ .

Since

{\displaystyle {\begin{aligned}\pi _{\varphi }(\theta _{1})-\pi _{\psi }(\theta _{1})&=\mathbb {P} _{\theta _{1}}((X_{1},\dotsc ,X_{n})\in R)-\mathbb {P} _{\theta _{1}}((X_{1},\dotsc ,X_{n})\in R^{*})\\&=\int \dotsi \int _{R}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\int \dotsi \int _{R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}\\&={\color {blue}\int \dotsi \int _{R}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}-\left({\color {red}\int \dotsi \int _{R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}\right)\\&={\color {blue}\int \dotsi \int _{R\setminus R^{*}}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}-{\color {red}\int \dotsi \int _{R^{*}\setminus R}^{}{\mathcal {L}}(\theta _{1};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}}\\&\geq {\color {blue}{\frac {1}{k}}}\int \dotsi \int _{R\setminus R^{*}}^{}{\color {blue}{\mathcal {L}}(\theta _{0};\mathbf {x} )}\,dx_{n}\cdots \,dx_{1}-{\color {red}{\frac {1}{k}}}\int \dotsi \int _{R^{*}\setminus R}^{}{\color {red}{\mathcal {L}}(\theta _{0};\mathbf {x} )}\,dx_{n}\cdots \,dx_{1}\qquad ({\text{In }}R,{\color {blue}{\mathcal {L}}(\theta _{1};\mathbf {x} )\geq {\frac {1}{k}}{\mathcal {L}}(\theta _{0};\mathbf {x} )}.{\text{ In }}R^{c},{\mathcal {L}}(\theta _{1};\mathbf {x} )<{\frac {1}{k}}{\mathcal {L}}(\theta _{0};\mathbf {x} )\iff {\color {red}-{\mathcal {L}}(\theta _{1};\mathbf {x} )>-{\frac {1}{k}}{\mathcal {L}}(\theta _{0};\mathbf {x} )})\\&={\frac {1}{k}}\int \dotsi \int _{R\setminus R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}+{\frac {1}{k}}\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-\left({\frac {1}{k}}\int \dotsi \int _{R^{*}\setminus 
R}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}+{\frac {1}{k}}\int \dotsi \int _{R\cap R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}\right)\\&={\frac {1}{k}}\int \dotsi \int _{R}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}-{\frac {1}{k}}\int \dotsi \int _{R^{*}}^{}{\mathcal {L}}(\theta _{0};\mathbf {x} )\,dx_{n}\cdots \,dx_{1}\\&={\frac {1}{k}}{\bigg (}{\color {brown}\underbrace {\mathbb {P} _{\theta _{0}}((X_{1},\dotsc ,X_{n})\in R)} _{=\alpha }}-{\color {purple}\underbrace {\mathbb {P} _{\theta _{0}}((X_{1},\dotsc ,X_{n})\in R^{*})} _{\leq \alpha }}{\bigg )}\\&\geq {\frac {1}{k}}(\alpha -\alpha )=0,\end{aligned}}}

we have ${\displaystyle \pi _{\varphi }(\theta _{1})\geq \pi _{\psi }(\theta _{1})}$  as desired.

For the case where the underlying distribution is discrete, the proof is very similar (just replace the integrals with sums), and hence omitted.

${\displaystyle \Box }$

Remark.

• Sometimes, we call ${\displaystyle {\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}}$  the likelihood ratio.
• In fact, the MP test constructed by the Neyman-Pearson lemma is a variant of the likelihood-ratio test, which is more general in the sense that a likelihood-ratio test can also be constructed for composite null and alternative hypotheses, not just simple ones. However, the likelihood-ratio test may not be (U)MP. We will discuss the likelihood-ratio test later.
• For a discrete distribution, it may be impossible to determine a ${\displaystyle k}$  for the rejection region ${\displaystyle R}$  for some ${\displaystyle \alpha }$ . In this case, we say that such ${\displaystyle \alpha }$  is not attainable.
• Intuitively, this test means that we should reject ${\displaystyle H_{0}}$  when the "likelihood" of ${\displaystyle H_{0}}$  (${\displaystyle {\mathcal {L}}(\theta _{0};\mathbf {x} )}$ ) is not as large as the "likelihood" of ${\displaystyle H_{1}}$  (${\displaystyle {\mathcal {L}}(\theta _{1};\mathbf {x} )}$ ), that is, ${\displaystyle {\mathcal {L}}(\theta _{0};\mathbf {x} )\leq k{\mathcal {L}}(\theta _{1};\mathbf {x} )}$ , with respect to the observed samples. The meaning of "not as large as" depends on the size ${\displaystyle \alpha }$ .
• Intuitively, we will expect that ${\displaystyle k}$  should be a positive value that is strictly less than 1, so that ${\displaystyle H_{0}}$  is "less likely" than ${\displaystyle H_{1}}$ . This is usually, but not necessarily, the case. Particularly, when the size ${\displaystyle \alpha }$  is large, ${\displaystyle k}$  may be greater than 1.
• Typically, to determine the value of ${\displaystyle k}$ , we need to transform "${\displaystyle {\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}\leq k}$ " into another equivalent inequality whose probability under ${\displaystyle H_{0}}$  is easier to calculate.
• It must be equivalent, so that its probability under ${\displaystyle H_{0}}$  is the same as the probability of "${\displaystyle {\frac {{\mathcal {L}}(\theta _{0};\mathbf {x} )}{{\mathcal {L}}(\theta _{1};\mathbf {x} )}}\leq k}$ " under ${\displaystyle H_{0}}$ . As a result, during the transformation, it is better to use "${\displaystyle \iff }$ " rather than just "${\displaystyle \implies }$ ", or writing different inequalities line by line.
• If ${\displaystyle \theta }$  is a vector, then ${\displaystyle \theta _{0}}$  and ${\displaystyle \theta _{1}}$  should also be vectors.

Even though the hypotheses involved in the Neyman-Pearson lemma are simple, under some conditions we can use the lemma to construct a UMP test for testing a composite null hypothesis against a composite alternative hypothesis. The details are as follows: for testing ${\displaystyle H_{0}:\theta \leq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta >\theta _{0}}$ ,

1. Find a MP test ${\displaystyle \varphi }$  with size ${\displaystyle \alpha }$ , for testing ${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta =\theta _{1}>\theta _{0}}$  using the Neyman-Pearson lemma, where ${\displaystyle \theta _{1}}$  is an arbitrary value such that ${\displaystyle \theta _{1}>\theta _{0}}$ .
2. If the rejection region ${\displaystyle R}$  does not depend on ${\displaystyle \theta _{1}}$ , then the test ${\displaystyle \varphi }$  has the greatest power for each ${\displaystyle \theta \in \Theta _{1}=\{\vartheta :\vartheta >\theta _{0}\}}$ . So, the test ${\displaystyle \varphi }$  is a UMP test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta >\theta _{0}}$
3. If we can further show that ${\displaystyle \sup _{\theta \leq \theta _{0}}\pi _{\varphi }(\theta )=\alpha =\pi _{\varphi }(\theta _{0})}$ , then the size of the test ${\displaystyle \varphi }$  is still ${\displaystyle \alpha }$ , even if the null hypothesis is changed to ${\displaystyle H_{0}:\theta \leq \theta _{0}}$ . So, after changing ${\displaystyle H_{0}:\theta =\theta _{0}}$  to ${\displaystyle H_{0}:\theta \leq \theta _{0}}$  while keeping ${\displaystyle H_{1}}$  unchanged (and adjusting the parameter space accordingly), the test ${\displaystyle \varphi }$  still satisfies the "MP" requirement (since ${\displaystyle H_{1}}$  is unchanged, the result in step 2 still applies), and it also satisfies the "size" requirement (because of changing ${\displaystyle H_{0}}$  in this way). Hence, the test ${\displaystyle \varphi }$  is a UMP test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta \leq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta >\theta _{0}}$ .

For testing ${\displaystyle H_{0}:\theta \geq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta <\theta _{0}}$ , the steps are similar. But in general, there is no UMP test for testing ${\displaystyle H_{0}:\theta =\theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta \neq \theta _{0}}$ .

Of course, when the condition in step 3 holds but that in step 2 does not hold, the test ${\displaystyle \varphi }$  in step 1 is a UMP test with size ${\displaystyle \alpha }$  for testing ${\displaystyle H_{0}:\theta \leq \theta _{0}\quad {\text{vs.}}\quad H_{1}:\theta =\theta _{1}}$  where ${\displaystyle \theta _{1}}$  is a constant (which is larger than ${\displaystyle \theta _{0}}$ , or else ${\displaystyle H_{1}}$  and ${\displaystyle H_{0}}$  are not disjoint). However, the hypotheses are generally not in this form.

Example. Let ${\displaystyle X_{1},\dotsc ,X_{10}}$  be a random sample from the normal distribution ${\displaystyle {\mathcal {N}}(\mu ,1)}$ .

(a) Construct a MP test ${\displaystyle \varphi }$  with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$ .

(b) Hence, show that the test ${\displaystyle \varphi }$  is also a UMP test with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu >20}$ .

(c) Hence, show that the test ${\displaystyle \varphi }$  is also a UMP test with size 0.05 for testing ${\displaystyle H_{0}:\mu \leq 20\quad {\text{vs.}}\quad H_{1}:\mu >20}$ .

Solution. (a) We can use the Neyman-Pearson lemma. First, consider the likelihood ratio

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(21)}}={\frac {{\cancel {\left({\frac {1}{\sqrt {2\pi (1)}}}\right)^{10}}}\prod _{i=1}^{10}\exp \left(-{\frac {(x_{i}-20)^{2}}{2}}\right)}{{\cancel {\left({\frac {1}{\sqrt {2\pi (1)}}}\right)^{10}}}\prod _{i=1}^{10}\exp \left(-{\frac {(x_{i}-21)^{2}}{2}}\right)}}=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}(x_{i}-20)^{2}-(x_{i}-21)^{2}{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}{\cancel {x_{i}^{2}}}-40x_{i}+400{\cancel {-x_{i}^{2}}}+42x_{i}-441{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}2x_{i}-41{\big ]}\right)=\exp \left(205-\sum _{i=1}^{10}x_{i}\right).}$

Now, we have
${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(21)}}\leq k'\iff \exp \left(205-10{\overline {x}}\right)\leq k'\iff -10{\overline {x}}\leq k''\iff {\overline {x}}\geq k}$

where ${\displaystyle k,k',k''}$  are some constants. To find ${\displaystyle k}$ , consider the size 0.05:
${\displaystyle 0.05=\mathbb {P} _{\mu =20}({\overline {X}}\geq k)=\mathbb {P} _{\mu =20}\left({\frac {{\overline {X}}-20}{1/{\sqrt {10}}}}\geq {\frac {k-20}{1/{\sqrt {10}}}}\right)=\mathbb {P} (Z\geq {\sqrt {10}}(k-20)).}$

(${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ ) Hence, we have ${\displaystyle {\sqrt {10}}(k-20)\approx 1.64\implies k\approx 20.51861}$ . Now, we can construct the rejection region:
${\displaystyle R=\{(x_{1},\dotsc ,x_{n}):{\overline {x}}\geq 20.51861\},}$

and the test ${\displaystyle \varphi }$  with the rejection region ${\displaystyle R}$  is a MP test with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu =21}$ .

(b)

Proof. Let ${\displaystyle \mu _{1}}$  be an arbitrary value such that ${\displaystyle \mu _{1}>20}$ . Then, we can show that (see the following exercise)

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}\leq k'\iff {\overline {x}}\geq k}$

where ${\displaystyle k,k'}$  are some constants (may be different from the above constants). Since ${\displaystyle H_{0}}$  here is the same as ${\displaystyle H_{0}}$  in (a), the rejection region constructed is also
${\displaystyle R=\{(x_{1},\dotsc ,x_{n}):{\overline {x}}\geq 20.51861\}.}$

Notice that ${\displaystyle R}$  does not depend on the value of ${\displaystyle \mu _{1}}$ . It follows that the test ${\displaystyle \varphi }$  is a UMP test with size 0.05 for testing ${\displaystyle H_{0}:\mu =20\quad {\text{vs.}}\quad H_{1}:\mu >20}$ .

${\displaystyle \Box }$
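The size claim in (a) and the power of this test at ${\displaystyle \mu =21}$  can also be checked by simulation. A rough Monte Carlo sketch (sample size 10, 100,000 replications per value of ${\displaystyle \mu }$ ):

```python
import random
from statistics import fmean

random.seed(0)
REJECT = 20.51861   # rejection region: sample mean >= 20.51861
N_SIM = 100_000

def rejection_rate(mu):
    """Estimate P_mu(X-bar >= REJECT) for a sample of size 10 from N(mu, 1)."""
    count = 0
    for _ in range(N_SIM):
        xbar = fmean(random.gauss(mu, 1) for _ in range(10))
        if xbar >= REJECT:
            count += 1
    return count / N_SIM

size = rejection_rate(20)    # ≈ 0.05  (type I error probability)
power = rejection_rate(21)   # ≈ 0.94  (1 - beta from the earlier example)
```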

(c)

Proof. It suffices to show that ${\displaystyle \sup _{\mu \leq 20}\pi _{\varphi }(\mu )=0.05{\overset {\text{(a)}}{=}}\pi _{\varphi }(20)}$ . First let us consider the power function

${\displaystyle \pi _{\varphi }(\mu )=\mathbb {P} _{\mu }({\overline {X}}\geq 20.51861)=\mathbb {P} (Z\geq {\sqrt {10}}(20.51861-\mu ))=1-\Phi ({\sqrt {10}}(20.51861-\mu ))}$

where ${\displaystyle \Phi (\cdot )}$  is the cdf of ${\displaystyle Z\sim {\mathcal {N}}(0,1)}$ . Now, since when ${\displaystyle \mu }$  increases, ${\displaystyle {\sqrt {10}}(20.51861-\mu )}$  decreases and hence ${\displaystyle \Phi ({\sqrt {10}}(20.51861-\mu ))}$  decreases, it follows that the power function ${\displaystyle \pi _{\varphi }(\mu )}$  is a strictly increasing function of ${\displaystyle \mu }$ . Hence,
${\displaystyle \sup _{\mu \leq 20}\pi _{\varphi }(\mu )=\max _{\mu \leq 20}\pi _{\varphi }(\mu )=\pi _{\varphi }(20)=0.05.}$

Then, the result follows.

${\displaystyle \Box }$
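The monotonicity argument is easy to check numerically; a sketch evaluating the power function at a few values of ${\displaystyle \mu }$ , using the text's rounded cutoff 20.51861:

```python
from math import sqrt
from statistics import NormalDist

def power(mu):
    """pi(mu) = P_mu(X-bar >= 20.51861) = 1 - Phi(sqrt(10)*(20.51861 - mu))."""
    return 1 - NormalDist().cdf(sqrt(10) * (20.51861 - mu))

values = [power(mu) for mu in (19.0, 19.5, 20.0, 20.5, 21.0)]
# values is strictly increasing, with power(20.0) ≈ 0.05,
# so the supremum over mu <= 20 is attained at mu = 20
```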

Exercise. Show that

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}\leq k'\iff {\overline {x}}\geq k}$

for every ${\displaystyle \mu _{1}>20}$ .
Solution

Proof. First, consider the likelihood ratio

${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}(x_{i}-20)^{2}-(x_{i}-\mu _{1})^{2}{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}{\cancel {x_{i}^{2}}}-40x_{i}+400{\cancel {-x_{i}^{2}}}+2\mu _{1}x_{i}-\mu _{1}^{2}{\big ]}\right)=\exp \left(-{\frac {1}{2}}\sum _{i=1}^{10}{\big [}(2\mu _{1}-40)x_{i}+400-\mu _{1}^{2}{\big ]}\right)=\exp \left(5(\mu _{1}^{2}-400)-(\mu _{1}-20)\sum _{i=1}^{10}x_{i}\right).}$

Then, we have
${\displaystyle {\frac {{\mathcal {L}}(20)}{{\mathcal {L}}(\mu _{1})}}\leq k'\iff \exp \left(5(\mu _{1}^{2}-400)-10(\mu _{1}-20){\overline {x}}\right)\leq k'\iff -10(\mu _{1}-20){\overline {x}}\leq k''\iff {\overline {x}}\geq k.}$

(The last equivalence follows since ${\displaystyle \mu _{1}>20}$ .)

${\displaystyle \Box }$

Remark.

• This rejection region has appeared in a previous example.

Now, let us consider another example where the underlying distribution is discrete.

Example. Let ${\displaystyle X}$  be a discrete random variable. Its pmf is given by

${\displaystyle {\begin{array}{c|ccccccccc}\theta &x&1&2&3&4&5&6&7&8\\\hline 0&f(x;\theta )&0&0.02&0.02&0.02&0.02&0.02&0.02&0.88\\1&f(x;\theta )&0.01&0.02&0.03&0.04&0.05&0&0.06&0.79\\\end{array}}}$

(Notice that the sum of values in each row is 1. The parameter space is ${\displaystyle \Theta =\{0,1\}}$ .) Given a single observation ${\displaystyle x}$ , construct a MP test with size 0.1 for testing ${\displaystyle H_{0}:\theta =0\quad {\text{vs.}}\quad H_{1}:\theta =1}$ .

Solution. We use the Neyman-Pearson lemma. First, we calculate the likelihood ratio ${\displaystyle f(x;0)/f(x;1)}$  for each value of ${\displaystyle x}$ :

${\displaystyle {\begin{array}{ccccccccc}x&1&2&3&4&5&6&7&8\\\hline {\frac {f(x;0)}{f(x;1)}}&0&1&0.667&0.5&0.4&{\text{undefined}}&0.333&1.114\end{array}}}$

For convenience, let us sort the likelihood ratios in ascending order (we put the undefined value last):
${\displaystyle {\begin{array}{ccccccccc}x&1&7&5&4&3&2&8&6\\\hline {\frac {f(x;0)}{f(x;1)}}&0&0.333&0.4&0.5&0.667&1&1.114&{\text{undefined}}\end{array}}}$
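The tabulation and sorting above can be automated. A sketch (the greedy scan below works here because the cumulative ${\displaystyle H_{0}}$ -probabilities reach 0.1 exactly; in general, a given ${\displaystyle \alpha }$  may not be attainable for a discrete distribution, as remarked earlier):

```python
f0 = {1: 0, 2: 0.02, 3: 0.02, 4: 0.02, 5: 0.02, 6: 0.02, 7: 0.02, 8: 0.88}
f1 = {1: 0.01, 2: 0.02, 3: 0.03, 4: 0.04, 5: 0.05, 6: 0, 7: 0.06, 8: 0.79}

# Likelihood ratio f(x;0)/f(x;1); x = 6 has f1 = 0, so its ratio is
# undefined and we place it last by treating it as +infinity
ratio = {x: (f0[x] / f1[x] if f1[x] > 0 else float("inf")) for x in f0}
order = sorted(f0, key=lambda x: ratio[x])   # [1, 7, 5, 4, 3, 2, 8, 6]

# Add points in ascending ratio order while the size stays within 0.1
R, size = [], 0.0
for x in order:
    if size + f0[x] > 0.1 + 1e-9:   # small tolerance for float rounding
        break
    R.append(x)
    size += f0[x]
# R == [1, 7, 5, 4, 3, 2] with size 0.1
```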

By Neyman-Pearson lemma, the MP test with size 0.1 for testing ${\displaystyle H_{0}:\theta =0\quad {\text{vs.}}\quad H_{1}:\theta =1}$  is a test with size 0.1 and rejection region
${\displaystyle R=\left\{x:{\frac {f(x;0)}{f(x;1)}}\leq k\right\}.}$

So, it remains to determine ${\displaystyle R}$ . Since the size is 0.1, we have
${\displaystyle 0.1=\pi (0)=\mathbb {P} _{\theta =0}(X\in R).}$

Notice that
${\displaystyle \mathbb {P} _{\theta =0}(X=1)+\mathbb {P} _{\theta =0}(X=7)+\mathbb {P} _{\theta =0}(X=5)+\mathbb {P} _{\theta =0}(X=4)+\mathbb {P} _{\theta =0}(X=3)+\mathbb {P} _{\theta =0}(X=2)=0+0.02+0.02+0.02+0.02+0.02=0.1.}$