Statistics/Preliminaries


This chapter discusses some preliminary knowledge (related to statistics) for the following chapters in the advanced part.

Empirical distribution

Definition. (Random sample) Suppose $X$ is a random variable resulting from a random experiment, with a certain distribution. After repeating this random experiment $n$ independent times, we obtain $n$ independent and identically distributed (iid) random variables, denoted by $X_{1},X_{2},\dotsc ,X_{n}$ , associated with the $n$ outcomes. They are called a random sample from the distribution with sample size $n$ .

Remark.

We usually refer to the underlying distribution as a population.
Often, a computer is useful for conducting such an experiment and repeating it many times.
In particular, a programming language, called R, is commonly used for computational statistics. You may see the wikibook R Programming for more details about it.
As a result, the content discussed in this section (as well as the section about resampling) is quite relevant to computational statistics.

Since all these $n$ random variables follow the same cdf as $X$ , we may expect their distribution should be somewhat similar to the distribution of $X$ , and indeed, this is true. Before showing how this is true, we need to define "the distribution of these $n$ random variables" more precisely, as follows:

Definition. (Empirical distribution) The cdf of the empirical distribution (the empirical cdf) of a random sample $X_{1},X_{2},\dotsc ,X_{n}$ , denoted by $F_{\color {darkgreen}n}(x)$ , is ${\frac {1}{n}}\sum _{k=1}^{n}\mathbf {1} \{X_{k}\leq x\}$ .

Remark.

$\mathbf {1} \{A\}$ is the indicator function with value 1 if $A$ is true and 0 otherwise.
We can see that $F_{n}(x)$ "assigns" the probability (or "mass") $1/n$ to each of $X_{1},X_{2},\dotsc ,X_{n}$ , and this is indeed a valid cdf.

This is because for each of $X_{1},\dotsc ,X_{n}$ , if it is less than or equal to $x$ , then the corresponding indicator function in the sum is one, and thus a value of " $1/n$ " is contributed to the cdf.
To understand this more clearly, consider the following example.

We can interpret $F_{n}(x)$ as the relative frequency of the event $\{X\leq x\}$ . Recall that the frequentist definition of the probability of an event is the "long-term" relative frequency of the event (i.e. the relative frequency of the event after repeating a random experiment an infinite number of times). As a result, we intuitively expect that $F_{n}(x)\approx F(x)$ when $n$ is large.

Example. A random sample of size 5 is taken from an unknown distribution, and the following numbers are obtained:

-1.4, 2.3, 0.8, 1.9, -1.6

(a) Find the empirical cdf.

(b) Let $Y$ be a (discrete) random variable with cdf exactly the same as the empirical cdf in (a). Prove that the pmf of $Y$ (called empirical pmf) is $f_{Y}(y)=\mathbb {P} (Y=y)={\frac {1}{5}},\quad y=-1.6,-1.4,0.8,1.9{\text{ or }}2.3.$

Solution:

(a) First, we order the sample data ascendingly so that we can find the empirical cdf more conveniently:

-1.6, -1.4, 0.8, 1.9, 2.3

The empirical cdf is given by $F_{5}(x)={\begin{cases}0,&x<-1.6;\\1/5,&-1.6\leq x<-1.4;\\2/5,&-1.4\leq x<0.8;\\3/5,&0.8\leq x<1.9;\\4/5,&1.9\leq x<2.3;\\1,&x\geq 2.3.\\\end{cases}}$

Explanations:

After ordering the sample data, we treat each of the numbers as an observed value of the random sample: $X_{1}=-1.6,X_{2}=-1.4,X_{3}=0.8,X_{4}=1.9,X_{5}=2.3$ .
Then, when $x<-1.6$ , none of $X_{1},\dotsc ,X_{5}$ is less than or equal to $x$ . So, all indicator functions involved are zero, and thus the value of the empirical cdf is zero.
When $-1.6\leq x<-1.4$ , only $X_{1}\leq x$ , and thus only the indicator function $\mathbf {1} \{X_{1}\leq x\}=1$ in this case, and all other indicator functions are zero. As a result, the value is ${\frac {\sum _{k=1}^{5}\mathbf {1} \{X_{k}\leq x\}}{5}}={\frac {\mathbf {1} \{X_{1}\leq x\}+0+0+0+0}{5}}={\frac {1}{5}}$ .
Similarly, when $-1.4\leq x<0.8$ , only $X_{1},X_{2}\leq x$ , and thus only the indicator functions $\mathbf {1} \{X_{1}\leq x\}$ and $\mathbf {1} \{X_{2}\leq x\}$ equal one in this case, and all other indicator functions are zero. As a result, the value is ${\frac {\sum _{k=1}^{5}\mathbf {1} \{X_{k}\leq x\}}{5}}={\frac {\mathbf {1} \{X_{1}\leq x\}+\mathbf {1} \{X_{2}\leq x\}+0+0+0}{5}}={\frac {2}{5}}$ .
...
When $x\geq 2.3$ , all $X_{1},\dotsc ,X_{5}\leq x$ . Hence, all indicator functions are one, and thus the value of the empirical cdf is ${\frac {1+1+1+1+1}{5}}=1$ .

(b)

Proof. First, notice that the cdf of $Y$ is $F_{Y}(y)=\mathbb {P} (Y\leq y)=\mathbb {P} (Y<y)+\mathbb {P} (Y=y)=\mathbb {P} (Y<y)+f_{Y}(y)\implies f_{Y}(y)=\mathbb {P} (Y\leq y)-\mathbb {P} (Y<y)$ .

Then, we observe that when $y=-1.6$ , $\mathbb {P} (Y\leq y)=F_{5}(-1.6)=1/5$ , and $\mathbb {P} (Y<y)=\mathbb {P} (Y<-1.6)=0$ (from the empirical cdf). Hence, $f_{Y}(y)={\frac {1}{5}}$ in this case. Similarly, when $y=-1.4$ , $\mathbb {P} (Y\leq y)=F_{5}(-1.4)=2/5$ , and $\mathbb {P} (Y<y)=\mathbb {P} (Y<-1.4)={\frac {1}{5}}$ . Thus, $f_{Y}(y)={\frac {2}{5}}-{\frac {1}{5}}={\frac {1}{5}}$ also in this case. With similar arguments, we can show that $f_{Y}(y)={\frac {1}{5}}$ also when $y=0.8,1.9,{\text{ or }}2.3$ .

$\Box$

Remark.

Observe from (b) that the support of $Y$ contains exactly the numbers in the sample data, which are the realization of the random sample $X_{1},\dotsc ,X_{5}$ . This shows that the probability $1/5$ is "assigned" to each of $X_{1},\dotsc ,X_{5}$ .
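The example above can also be checked numerically. The following is a minimal sketch in Python (the function names `empirical_cdf` and `empirical_pmf` are illustrative choices, not part of the text); it evaluates $F_{5}(x)$ at a few points and lists the empirical pmf, using exact fractions so that the values can be compared directly with the solution.

```python
from bisect import bisect_right
from collections import Counter
from fractions import Fraction


def empirical_cdf(sample):
    """Return the function x -> (1/n) * #{k : sample[k] <= x}."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_n(x):
        # bisect_right counts how many ordered values are <= x.
        return Fraction(bisect_right(ordered, x), n)

    return F_n


def empirical_pmf(sample):
    """Return a dict mapping each observed value to its relative frequency."""
    n = len(sample)
    return {value: Fraction(count, n) for value, count in Counter(sample).items()}


data = [-1.4, 2.3, 0.8, 1.9, -1.6]
F_5 = empirical_cdf(data)
pmf = empirical_pmf(data)

for x in (-2.0, -1.6, -1.5, 0.8, 2.0, 2.3):
    print(x, F_5(x))
print(pmf)
```

Sorting once and using a binary search is only a convenience; summing the indicator functions directly, as in the definition, gives the same values.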

Theorem. (Glivenko–Cantelli theorem) As $n\to \infty$ , $\sup _{x\in \mathbb {R} }|F_{n}(x)-F(x)|\to 0$ almost surely (a.s.).

Remark.

$\sup$ stands for the supremum of a set (which exists whenever the set of real numbers is nonempty and bounded above), which means the least upper bound of the set, i.e. the smallest number that is greater than or equal to every element of the set.

The meaning of $\sup _{x\in \mathbb {R} }|F_{n}(x)-F(x)|$ is the least upper bound of the set containing the values of $|F_{n}(x)-F(x)|$ over $x\in \mathbb {R}$ .
The supremum is similar to the concept of maximum (indeed, if maximum exists, then maximum is the same as supremum), but a difference between them is that sometimes supremum exists while maximum does not exist.
For instance the supremum of the set (or interval) $[0,1)$ is 1 (intuitively). However, the maximum of the set $[0,1)$ (i.e. the greatest element in the set) does not exist (notice that 1 is not included in this set) ^[1].

The term "almost surely" means that this happens with probability 1. Explaining why this is called "almost surely" rather than "surely" requires some measure theory, and so is omitted here.
Roughly speaking, this theorem tells us that $F_{n}(x)$ is a good estimate of $F(x)$ when $n$ is large, and the estimate gets better (or "closer to" $F(x)$ ) as $n$ increases, uniformly in $x$ : the absolute difference $|F_{n}(x)-F(x)|$ at every single $x$ is at most the least upper bound of all such differences, so when the least upper bound tends to zero, every such absolute difference also tends to zero.
This theorem is sometimes referred to as the fundamental theorem of statistics, indicating its importance in statistics.
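To see the theorem at work, one can simulate samples from a distribution whose cdf is known and compute the supremum exactly. The sketch below assumes the Uniform(0, 1) distribution (so that $F(x)=x$ on $[0,1]$ ) and a fixed seed; both are choices made for illustration only. Since $F_{n}$ is a step function and $F$ is continuous, the supremum is attained just before or at one of the jump points, so it suffices to check both sides of every jump.

```python
import random


def sup_distance_uniform(sample):
    """Exact sup over x of |F_n(x) - F(x)| when F(x) = x on [0, 1]."""
    ordered = sorted(sample)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered, start=1):
        # F_n jumps from (i - 1)/n to i/n at x, so check both sides of the jump.
        worst = max(worst, i / n - x, x - (i - 1) / n)
    return worst


rng = random.Random(0)
distances = {}
for n in (10, 100, 1000, 10000):
    sample = [rng.random() for _ in range(n)]
    distances[n] = sup_distance_uniform(sample)
    print(n, round(distances[n], 4))
```

The printed distances should shrink as $n$ grows, although for a single simulated sample the decrease need not be monotone at every step.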

We have mentioned how we can approximate the cdf, and now we would like to estimate the pdf/pmf. Let us first discuss how to estimate the pmf.

For the discrete random variable $X$ , from the empirical cdf, we know that each $X_{1},\dotsc ,X_{n}$ is "assigned" with the probability $1/n$ . Also, considering the previous example, the empirical pmf is $f_{n}(x)={\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}=x\}}{n}}$ .

Remark.

The empirical pmf $f_{n}(x)$ shows the relative frequency of occurrences of $x$ , and therefore can approximate the probability of occurrences of $x$ , which is the long-term relative frequency of occurrences of $x$ .

To discuss the estimation of the pdf of a continuous random variable, we need to define class intervals first.

Definition. (Class intervals) First, choose an integer $i\geq 1$ , and a sequence of real numbers $c_{0},c_{1},\dotsc ,c_{i}$ such that $c_{0}<c_{1}<\dotsb <c_{i}$ . Then, the class intervals are $(c_{0},c_{1}],(c_{1},c_{2}],\dotsc ,(c_{i-1},c_{i}]$ .

For the continuous random variable $X$ , construct class intervals for $X$ which are a non-overlapped partition of the interval $[X_{\text{min}},X_{\text{max}}]$ , in which $X_{\text{min}}$ and $X_{\text{max}}$ are the minimum and maximum values in the sample. Then, the pdf $f(x)\approx {\frac {F(c_{j})-F(c_{j-1})}{c_{j}-c_{j-1}}},\quad x\in (c_{j-1},c_{j}]{\text{ and }}j=1,2,\dotsc ,i,$ when $c_{j-1}$ and $c_{j}$ are close, i.e. the length of each class interval is small. (Although the union of the above class intervals is $(c_{0},c_{i}]$ and thus the value $c_{0}$ is not included in the interval, it does not matter since the value of the pdf at $c_{0}$ does not affect the calculation of probability.) Here, $c_{0}$ is $X_{\text{min}}$ and $c_{i}$ is $X_{\text{max}}$ .

Since $F(c_{j})-F(c_{j-1})=\mathbb {P} (X\in (c_{j-1},c_{j}])\approx {\color {darkgreen}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{j-1},c_{j}]\}}{n}}}$ is the relative frequency of occurrences of the event $\{X_{k}\in (c_{j-1},c_{j}]\}$ , we can rewrite the above expression as $f(x)\approx h_{n}(x)={\frac {\color {darkgreen}\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{j-1},c_{j}]\}}{{\color {darkgreen}n}(c_{j}-c_{j-1})}},\quad x\in (c_{j-1},c_{j}]{\text{ and }}j=1,2,\dotsc ,i$ in which $h_{n}(x)$ is called the relative frequency histogram.

Since there are many possible ways to construct the class intervals, the value of $h_{n}(x)$ can differ even with the same $n$ and $x$ . When $n$ is large and the length of each class interval is small, we will expect $h_{n}(x)$ to be a good estimate of $f(x)$ (the theoretical pdf).

There are some properties related to the relative frequency histogram, as follows:

Proposition. (Properties of relative frequency histogram)

(i) $h_{n}(x)\geq 0$ ;

(ii) The total area bounded by $h_{n}(x)$ and the $x$ -axis is one, i.e. $\int _{c_{0}}^{c_{i}}h_{n}(x)\,dx=1$ ^[2];

(iii) The probability of an event $A$ that is a union of some class intervals is $\mathbb {P} (A)\approx \int _{A}^{}h_{n}(x)\,dx$ .

Proof.

(i) Since the indicator function is nonnegative (its value is either 0 or 1), $n$ is positive, and $c_{j}>c_{j-1}$ so $c_{j}-c_{j-1}$ is positive, we have $h_{n}(x)\geq 0$ by definition.

(ii) ${\begin{aligned}\int _{c_{0}}^{c_{i}}h_{n}(x)\,dx&=\int _{c_{0}}^{c_{1}}h_{n}(x)\,dx+\int _{c_{1}}^{c_{2}}h_{n}(x)\,dx+\dotsb +\int _{c_{i-1}}^{c_{i}}h_{n}(x)\,dx\\&={\frac {1}{n}}\left(\int _{c_{0}}^{c_{1}}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\}}{c_{1}-c_{0}}}\,dx+\int _{c_{1}}^{c_{2}}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{1},c_{2}]\}}{c_{2}-c_{1}}}\,dx+\dotsb +\int _{c_{i-1}}^{c_{i}}{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{i-1},c_{i}]\}}{c_{i}-c_{i-1}}}\,dx\right)\\&={\frac {1}{n}}\left({\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\}}{c_{1}-c_{0}}}\cdot (c_{1}-c_{0})+{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{1},c_{2}]\}}{c_{2}-c_{1}}}\cdot (c_{2}-c_{1})+\dotsb +{\frac {\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{i-1},c_{i}]\}}{c_{i}-c_{i-1}}}\cdot (c_{i}-c_{i-1})\right)\\&={\frac {1}{n}}\left(\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\}+\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{1},c_{2}]\}+\dotsb +\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{i-1},c_{i}]\}\right)\\&={\frac {1}{n}}\left(\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in (c_{0},c_{1}]\cup (c_{1},c_{2}]\cup \dotsb \cup (c_{i-1},c_{i}]\}\right)\\&={\frac {1}{n}}\left(\sum _{k=1}^{n}\mathbf {1} \{X_{k}\in \underbrace {(c_{0},c_{i}]} _{{\text{sample space of }}X}\}\right)\\&={\frac {1}{n}}\cdot \sum _{k=1}^{n}1\\&={\frac {1}{n}}\cdot n\\&=1.\end{aligned}}$ Here, $c_{0}$ is $X_{\text{min}}$ and $c_{i}$ is $X_{\text{max}}$ .

(iii) We can "split" the integral in a similar way as in (ii), and then eventually the integral equals ${\frac {1}{n}}\cdot \sum _{k=1}^{n}\mathbf {1} \{X_{k}\in A\}$ , which can approximate $\mathbb {P} (A)$ since it is the relative frequency of occurrences of the event $\{X_{k}\in A\}$ .

$\Box$
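Properties (i) and (ii) can be checked numerically. The following sketch makes several assumptions that are not in the text: the sample is simulated from a standard normal distribution, ten class intervals of equal length are used, and $c_{0}$ is placed slightly below the sample minimum so that the half-open intervals $(c_{j-1},c_{j}]$ cover every observation.

```python
import random


def relative_frequency_histogram(sample, edges):
    """Heights of h_n on the class intervals (edges[j-1], edges[j]]."""
    n = len(sample)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for x in sample if left < x <= right)
        heights.append(count / (n * (right - left)))
    return heights


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Assumption: c_0 sits just below the sample minimum so that no observation
# falls outside the union of the half-open class intervals.
lo, hi = min(sample) - 1e-9, max(sample)
num_classes = 10
width = (hi - lo) / num_classes
edges = [lo + j * width for j in range(num_classes)] + [hi]

heights = relative_frequency_histogram(sample, edges)
# Area under h_n: sum of height times interval length over all class intervals.
area = sum(h * (right - left) for h, left, right in zip(heights, edges, edges[1:]))
print(heights)
print(area)
```

Up to floating-point rounding, the area equals one for any choice of class intervals that covers the whole sample, which is exactly what the proof of (ii) shows.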

Expectation

In this section, we will discuss some results about expectation, which involve some sort of inequalities. Let $a$ and $b$ be constants. Also, let $\Omega$ be the sample space of $X$ .

Proposition. Let $X$ be a discrete or continuous random variable. If $\mathbb {P} (a<X\leq b)=1$ , then $a<\mathbb {E} [X]\leq b$ .

Proof. Assume $\mathbb {P} (a<X\leq b)=1$ .

Case 1: $X$ is discrete.

By definition of expectation, $\mathbb {E} [X]=\sum _{x\in \Omega }^{}xf(x)$ . Then, we have $\sum _{x\in \Omega }^{}af(x)<\sum _{x\in \Omega }^{}xf(x)\leq \sum _{x\in \Omega }^{}bf(x)\Rightarrow a\sum _{x\in \Omega }^{}f(x)<\mathbb {E} [X]\leq b\sum _{x\in \Omega }^{}f(x)\Rightarrow a<\mathbb {E} [X]\leq b$ because of the condition $\mathbb {P} (a<X\leq b)=1$ .

Case 2: $X$ is continuous.

We have similarly $\int _{\Omega }^{}af(x)\,dx<\int _{\Omega }^{}xf(x)\,dx\leq \int _{\Omega }^{}bf(x)\,dx\Rightarrow a<\mathbb {E} [X]\leq b$ because of the condition of $\mathbb {P} (a<X\leq b)=1$ .

$\Box$

Remark.

We can interchange " $<$ " and " $\leq$ " without affecting the result. This can be seen from the proof.

Proposition. (Markov's inequality) Let $X$ be a continuous nonnegative random variable such that $\mathbb {E} [X]$ is finite. Then, for each positive number $a$ , $\mathbb {P} (X\geq a)\leq {\frac {\mathbb {E} [X]}{a}}$ .

Proof. ${\frac {\mathbb {E} [X]}{a}}={\frac {1}{a}}\int _{-\infty }^{\infty }\underbrace {xf(x)} _{\color {darkgreen}\geq 0}\,dx{\color {darkgreen}\geq }{\frac {1}{a}}\int _{a}^{\infty }xf(x)\,dx{\color {darkgreen}\geq }{\frac {1}{a}}\int _{a}^{\infty }af(x)\,dx=\int _{a}^{\infty }f(x)\,dx=\mathbb {P} (X\geq a),$ as desired.

$\Box$

Corollary. (Chebyshev's inequality) Suppose $\mathbb {E} [X^{2}]$ is finite. Then, for each positive number $a$ , $\mathbb {P} (|X|\geq a)\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}.$

Proof. First, observe that $X^{2}$ is a nonnegative random variable. Then, by Markov's inequality, for each (positive) $a'=a^{2}$ , we have $\mathbb {P} (X^{2}\geq a')\leq {\frac {\mathbb {E} [X^{2}]}{a'}}\implies \mathbb {P} (X^{2}\geq a^{2})\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}\implies \mathbb {P} \left({\sqrt {X^{2}}}\geq {\sqrt {a^{2}}}\right)\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}\implies \mathbb {P} (|X|\geq a)\leq {\frac {\mathbb {E} [X^{2}]}{a^{2}}}$ , since $a$ is positive.

$\Box$
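As a rough numerical illustration of Markov's and Chebyshev's inequalities, the sketch below simulates a nonnegative random variable and compares the observed tail frequencies with the two bounds. The choice of the Exponential(1) distribution, the seed, and the values of $a$ are assumptions made for illustration; expectations are replaced by sample averages.

```python
import random

rng = random.Random(2)
n = 100_000
# Nonnegative sample with E[X] = 1 and E[X^2] = 2 (Exponential with rate 1).
sample = [rng.expovariate(1.0) for _ in range(n)]

mean_x = sum(sample) / n
mean_x2 = sum(x * x for x in sample) / n

results = []
for a in (0.5, 1.0, 2.0, 4.0):
    # Relative frequency of {X >= a}; since X >= 0 here, |X| = X.
    tail = sum(1 for x in sample if x >= a) / n
    markov_bound = mean_x / a
    chebyshev_bound = mean_x2 / a**2
    results.append((a, tail, markov_bound, chebyshev_bound))
    print(a, round(tail, 4), round(markov_bound, 4), round(chebyshev_bound, 4))
```

Both bounds can exceed one for small $a$ , in which case they carry no information; they become useful only further out in the tail.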

Proposition. (Jensen's inequality) Let $X$ be a continuous random variable. If $g$ is a convex function, then $g\left(\mathbb {E} [X]\right)\leq \mathbb {E} [g(X)]$ .

Proof. Let $L(x)=a+bx$ be the tangent of the function $g(x)$ at $x=\mathbb {E} [X]$ . Then, since $g$ is convex, we have $g(x)\geq L(x)$ for each $x$ (informally, we can observe this graphically). As a result, we have ${\begin{aligned}&&\int _{\Omega }^{}g(x)f(x)\,dx&\geq \int _{\Omega }^{}L(x)f(x)\,dx\\&\Rightarrow &\mathbb {E} [g(X)]&\geq \mathbb {E} [L(X)]\\&&&=\mathbb {E} [a+bX]\\&&&=a+b\mathbb {E} [X]\\&&&=L(\mathbb {E} [X])\\&&&=g(\mathbb {E} [X])&{\text{since }}L(x){\text{ is tangent of }}g(x){\text{ at }}x=\mathbb {E} [X],\end{aligned}}$ as desired.

$\Box$
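Jensen's inequality can be illustrated with a few familiar convex functions. In the sketch below, the standard normal sample, the seed, and the three functions are illustrative assumptions, and expectations are replaced by sample averages.

```python
import math
import random

rng = random.Random(3)
sample = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
n = len(sample)
mean_x = sum(sample) / n

# Three convex functions g.
convex_functions = {
    "x^2": lambda x: x * x,
    "exp(x)": math.exp,
    "|x|": abs,
}

jensen = {}
for name, g in convex_functions.items():
    lhs = g(mean_x)                      # g(E[X]), estimated
    rhs = sum(g(x) for x in sample) / n  # E[g(X)], estimated
    jensen[name] = (lhs, rhs)
    print(name, round(lhs, 4), round(rhs, 4))
```

For $g(x)=x^{2}$ , the gap between the two sides is the variance computed with divisor $n$ , which gives another way to see that a variance is nonnegative.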

Theorem. (Cauchy-Schwarz inequality) Suppose $\mathbb {E} [X^{2}]$ and $\mathbb {E} [Y^{2}]$ are finite. Then, $(\mathbb {E} [XY])^{2}\leq \mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]$ .

Proof. ${\begin{aligned}0&\leq \mathbb {E} [(X\mathbb {E} [Y^{2}]-Y\mathbb {E} [XY])^{2}]\\&={\color {darkgreen}\mathbb {E} [}X^{2}\underbrace {(\mathbb {E} [Y^{2}])^{2}} _{\text{constant}}-2XY\underbrace {\mathbb {E} [Y^{2}]\mathbb {E} [XY]} _{\text{constant}}+Y^{2}\underbrace {(\mathbb {E} [XY])^{2}} _{\text{constant}}{\color {darkgreen}]}\\&=(\mathbb {E} [Y^{2}])^{2}{\color {darkgreen}\mathbb {E} [}X^{2}{\color {darkgreen}]}-2\mathbb {E} [Y^{2}]\mathbb {E} [XY]{\color {darkgreen}\mathbb {E} [}XY{\color {darkgreen}]}+(\mathbb {E} [XY])^{2}{\color {darkgreen}\mathbb {E} [}Y^{2}{\color {darkgreen}]}\\&=\mathbb {E} [Y^{2}]\left(\mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]-2(\mathbb {E} [XY])^{2}+(\mathbb {E} [XY])^{2}\right)\\&=\mathbb {E} [Y^{2}]\left(\mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]-(\mathbb {E} [XY])^{2}\right)\\\end{aligned}}$ If $\mathbb {E} [Y^{2}]>0$ , dividing both sides by $\mathbb {E} [Y^{2}]$ gives $\mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]-(\mathbb {E} [XY])^{2}\geq 0\Leftrightarrow (\mathbb {E} [XY])^{2}\leq \mathbb {E} [X^{2}]\mathbb {E} [Y^{2}]$ . If $\mathbb {E} [Y^{2}]=0$ , then $Y=0$ with probability 1, so $\mathbb {E} [XY]=0$ and both sides of the inequality are zero.

$\Box$

Example. (Covariance inequality) Use the Cauchy-Schwarz inequality for expectations (above theorem) to prove the covariance inequality (it is sometimes simply called the Cauchy-Schwarz inequality): ${\big (}\operatorname {Cov} (X,Y){\big )}^{2}\leq \operatorname {Var} (X)\operatorname {Var} (Y)$ (assuming the existence of the covariance and variances)

Proof. Let $X'=X-\mathbb {E} [X]$ and $Y'=Y-\mathbb {E} [Y]$ . Then, $\mathbb {E} [(X')^{2}]=\operatorname {Var} (X)$ and $\mathbb {E} [(Y')^{2}]=\operatorname {Var} (Y)$ are finite. Hence, by the Cauchy-Schwarz inequality, $(\mathbb {E} [X'Y'])^{2}\leq \mathbb {E} [(X')^{2}]\mathbb {E} [(Y')^{2}]\Leftrightarrow (\mathbb {E} [(X-\mathbb {E} [X])(Y-\mathbb {E} [Y])])^{2}\leq \mathbb {E} [(X-\mathbb {E} [X])^{2}]\mathbb {E} [(Y-\mathbb {E} [Y])^{2}]{\overset {\text{ def }}{\Leftrightarrow }}{\big (}\operatorname {Cov} (X,Y){\big )}^{2}\leq \operatorname {Var} (X)\operatorname {Var} (Y).$

$\Box$
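Both the Cauchy-Schwarz inequality and the covariance inequality can be checked on simulated data. The sketch below assumes a pair $(X,Y)$ constructed as $Y=0.6X+\text{noise}$ with standard normal $X$ and noise; this construction, the seed, and the helper `mean` are illustrative only, and expectations are replaced by sample averages.

```python
import random

rng = random.Random(4)
n = 20_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [0.6 * x + rng.gauss(0.0, 1.0) for x in xs]  # correlated with xs by construction


def mean(values):
    return sum(values) / len(values)


# Cauchy-Schwarz inequality for expectations: (E[XY])^2 <= E[X^2] E[Y^2].
e_xy = mean([x * y for x, y in zip(xs, ys)])
e_x2 = mean([x * x for x in xs])
e_y2 = mean([y * y for y in ys])

# Covariance inequality, using the centred variables X' and Y' from the proof.
mx, my = mean(xs), mean(ys)
cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
var_x = mean([(x - mx) ** 2 for x in xs])
var_y = mean([(y - my) ** 2 for y in ys])

print(e_xy**2, e_x2 * e_y2)
print(cov**2, var_x * var_y)
```

Dividing the second inequality by $\operatorname {Var} (X)\operatorname {Var} (Y)$ shows that the squared correlation coefficient is at most one.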

Convergence

Before discussing convergence, we will define some terms that will be used later.

Definition. (Statistics) Statistics are functions of a random sample.

Remark.

The random sample consists of $n$ ( $n$ is sample size) random variables $X_{1},\dotsc ,X_{n}$ .
Two important statistics are the sample mean ${\overline {X}}={\frac {\sum _{i=1}^{n}X_{i}}{n}}$ and the sample variance $S^{2}={\frac {\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}{n}}$ .

In many other places, $S^{2}$ is used to denote ${\frac {\sum _{i=1}^{n}(X_{i}-{\overline {X}})^{2}}{n-1}}$ , the unbiased sample variance. In fact, $S^{2}$ here is biased (we will discuss what "(un)biased" means in the next chapter). Warning: we should be careful about this difference in definitions.
Both of ${\overline {X}}$ and $S^{2}$ are random variables, since random variables are involved in them.

In a particular sample, say $x_{1},\dotsc ,x_{n}$ , we observe definite values of their sample mean, ${\overline {x}}={\frac {\sum _{i=1}^{n}x_{i}}{n}}$ , and sample variance, $s^{2}={\frac {\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}}{n}}$ . However, each of the values is only one realization of the respective random variables ${\overline {X}}$ and $S^{2}$ . We should notice the difference between these definite values (not random variables) and the statistics (random variables).

To explain the definitions of the sample mean ${\overline {X}}$ and sample variance $S^{2}$ more intuitively, consider the following.

Recall that the empirical cdf $F_{n}(x)$ assigns probability ${\frac {1}{n}}$ to each of the random sample $X_{1},\dotsc ,X_{n}$ . Thus, by the definition of mean and variance, the mean of a random variable, say $Y$ , with this cdf $F_{n}(x)$ (and hence with the corresponding pmf $f_{n}(x)$ ) is $\sum _{i=1}^{n}\left(X_{i}\cdot {\frac {1}{n}}\right)={\overline {X}}$ . Similarly, the variance of $Y$ is $\sum _{i=1}^{n}\left((X_{i}-{\overline {X}})^{2}\cdot {\frac {1}{n}}\right)=S^{2}$ . In other words, the mean and variance of the empirical distribution, which corresponds to the random sample, are the sample mean ${\overline {X}}$ and the sample variance $S^{2}$ respectively, which is quite natural, right?

Remark.

Here, we use " $X_{i}$ " rather than the usual " $x_{i}$ " in the expression, and the mean and variance are also random variables. This is because the support of the empirical distribution consists of the random variables $X_{1},\dotsc ,X_{n}$ , rather than definite values $x_{1},\dotsc ,x_{n}$ .

Also, recall that the empirical cdf $F_{n}(x)$ can well approximate the cdf of $X$ , $F(x)$ when $n$ is large. Since ${\overline {X}}$ and $S^{2}$ are the mean and variance of a random variable with cdf $F_{n}(x)$ it is natural to expect that ${\overline {X}}$ and $S^{2}$ can well approximate the mean and variance of $X$ .
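The difference between the two divisors $n$ and $n-1$ is easy to see on a small data set. The sketch below reuses the five numbers from the earlier example (an arbitrary choice) and compares hand-computed values with Python's standard `statistics` module, where `pvariance` uses the divisor $n$ and `variance` uses the divisor $n-1$ .

```python
import statistics

data = [-1.4, 2.3, 0.8, 1.9, -1.6]
n = len(data)

x_bar = sum(data) / n
# Divisor n: the sample variance S^2 as defined in this chapter.
s2_biased = sum((x - x_bar) ** 2 for x in data) / n
# Divisor n - 1: the unbiased sample variance used in many other places.
s2_unbiased = sum((x - x_bar) ** 2 for x in data) / (n - 1)

print(x_bar, s2_biased, s2_unbiased)
print(statistics.pvariance(data), statistics.variance(data))
```

The two versions differ by the factor $n/(n-1)$ , so the distinction matters mostly for small samples.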

Convergence in probability

Definition. (Convergence in probability) Let $Z_{1},Z_{2},\dotsc$ be a sequence of random variables. The sequence converges in probability to a random variable $Z$ , if for each $\varepsilon >0$ , $\mathbb {P} (|Z_{n}-Z|>\varepsilon )\to 0$ as $n\to \infty$ . If this is the case, we write this as $Z_{n}\;{\overset {p}{\to }}\;Z$ for simplicity.

Remark.

We may compare this definition with the definition of convergence of a deterministic sequence $(a_{n}:n\in \mathbb {N} )$ :

$a_{n}\to a$ as $n\to \infty$ if for each $\varepsilon >0$ , there exists an integer $N>0$ (which is a function of $\varepsilon$ ), such that when $n\geq N$ , $|a_{n}-a|<\varepsilon$ (surely).

For comparison, we may rewrite the above definition as

$Z_{n}\;{\overset {p}{\to }}\;Z$ as $n\to \infty$ if for each $\varepsilon >0$ , there exists an integer $N>0$ (which is a function of $\varepsilon$ ), such that when $n\geq N$ , the probability for $|Z_{n}-Z|<\varepsilon$ is very close to one (but this event does not happen surely).

$\varepsilon$ specifies the accuracy of the convergence. If higher accuracy is desired, then $\varepsilon$ will be set to be a smaller (positive) value. The probability in the definition is very close to zero (we say that the convergence with a certain accuracy (depending on the value of $\varepsilon$ ) is "achieved" in this case) when $n$ is sufficiently large.

The following theorem, namely the weak law of large numbers, is an important theorem related to convergence in probability.

Theorem. (Weak law of large numbers (Weak LLN)) Let $X_{1},X_{2},\dotsc$ be a sequence of independent random variables with the same finite mean $\mu$ and same finite variance $\sigma ^{2}$ . Then, as $n\to \infty$ , ${\overline {X}}\;{\overset {p}{\to }}\;\mu$ .

Proof. We use $S_{n}$ to denote $\sum _{i=1}^{n}X_{i}$ .

By definition, ${\overline {X}}\;{\overset {p}{\to }}\;\mu$ as $n\to \infty$ is equivalent to $\mathbb {P} \left(\left|{\frac {S_{n}}{n}}-\mu \right|>\varepsilon \right)\to 0$ as $n\to \infty$ .

By Chebyshev's inequality (applied to ${\frac {S_{n}}{n}}-\mu$ ), we have ${\begin{aligned}\mathbb {P} \left(\left|{\frac {S_{n}}{n}}-\mu \right|>\varepsilon \right)&\leq {\frac {1}{\varepsilon ^{2}}}\mathbb {E} \left[\left({\frac {S_{n}}{n}}-\mu \right)^{2}\right]\\&={\frac {1}{\varepsilon ^{2}}}\mathbb {E} \left[\left({\frac {S_{n}-n\mu }{\color {darkgreen}n}}\right)^{2}\right]\\&={\frac {1}{{\color {darkgreen}n^{2}}\varepsilon ^{2}}}\mathbb {E} \left[\left(S_{n}-n\mu \right)^{2}\right]\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\mathbb {E} \left[\left(\sum _{i=1}^{n}(X_{i}-\mu )\right)^{2}\right]\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\mathbb {E} \left[\sum _{i=1}^{n}\sum _{j=1}^{n}(X_{i}-\mu )(X_{j}-\mu )\right]\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\left(\mathbb {E} \left[\sum _{i=j=1}^{n}(X_{i}-\mu )^{2}\right]+\mathbb {E} \left[\sum _{i=1}^{n}\sum _{j\neq i,j=1}^{n}(X_{i}-\mu )(X_{j}-\mu )\right]\right)\\\end{aligned}}$

Since $X_{1},X_{2},\dotsc$ are independent (and hence functions of them are also independent) and the expectation is multiplicative under independence, ${\begin{aligned}{\frac {1}{n^{2}\varepsilon ^{2}}}\left(\mathbb {E} \left[\sum _{i=j=1}^{n}(X_{i}-\mu )^{2}\right]+\mathbb {E} \left[\sum _{i=1}^{n}\sum _{j\neq i,j=1}^{n}(X_{i}-\mu )(X_{j}-\mu )\right]\right)&={\frac {1}{n^{2}\varepsilon ^{2}}}\left(\mathbb {E} \left[\sum _{i=j=1}^{n}(X_{i}-\mu )^{2}\right]+\sum _{i=1}^{n}\sum _{j\neq i,j=1}^{n}\underbrace {\mathbb {E} [X_{i}-\mu ]} _{=\mu -\mu =0}\underbrace {\mathbb {E} [X_{j}-\mu ]} _{=\mu -\mu =0}\right)\\&={\frac {1}{n^{2}\varepsilon ^{2}}}\cdot \sum _{i=1}^{n}\underbrace {\mathbb {E} \left[(X_{i}-\mu )^{2}\right]} _{=\sigma ^{2}}\\&={\frac {n\sigma ^{2}}{n^{2}\varepsilon ^{2}}}\\&={\frac {\sigma ^{2}}{n\varepsilon ^{2}}}\\&\to 0&{\text{as }}n\to \infty .\end{aligned}}$ So, the probability $\mathbb {P} \left(\left|{\frac {S_{n}}{n}}-\mu \right|>\varepsilon \right)$ is less than or equal to an expression that tends to 0 as $n\to \infty$ . Since the probability is nonnegative ( $\geq 0$ ), it follows that the probability also tends to 0 as $n\to \infty$ .

$\Box$
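The statement of the weak law, together with the bound ${\frac {\sigma ^{2}}{n\varepsilon ^{2}}}$ obtained in the proof, can be illustrated by simulation. The sketch below assumes Uniform(0, 1) observations (so $\mu =1/2$ and $\sigma ^{2}=1/12$ ), $\varepsilon =0.05$ , and a Monte Carlo estimate based on 2000 replications; all of these are illustrative choices.

```python
import random

rng = random.Random(5)
mu, sigma2 = 0.5, 1.0 / 12.0  # mean and variance of Uniform(0, 1)
eps = 0.05
replications = 2000


def deviation_probability(n):
    """Monte Carlo estimate of P(|X_bar - mu| > eps) for sample size n."""
    exceed = 0
    for _ in range(replications):
        x_bar = sum(rng.random() for _ in range(n)) / n
        if abs(x_bar - mu) > eps:
            exceed += 1
    return exceed / replications


estimates = {}
for n in (10, 100, 1000):
    estimates[n] = deviation_probability(n)
    bound = sigma2 / (n * eps**2)  # the bound from the proof (may exceed 1)
    print(n, estimates[n], round(min(bound, 1.0), 4))
```

The estimated probabilities should decrease towards zero as $n$ grows. The bound from the proof is usually far from tight, which is fine since the proof only needs it to tend to zero.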

Remark.

There is also the strong law of large numbers, which is related to almost sure convergence (which is stronger than convergence in probability, i.e. it implies convergence in probability).

There are also some properties of convergence in probability that help us determine what a complicated expression converges to.

Proposition. (Properties of convergence in probability) If $X_{n}\;{\overset {p}{\to }}\;X$ and $Y_{n}\;{\overset {p}{\to }}\;Y$ , then

(linearity) $aX_{n}+bY_{n}\;{\overset {p}{\to }}\;aX+bY$ where $a,b$ are constants;
(multiplicativity) $X_{n}Y_{n}\;{\overset {p}{\to }}\;XY$ ;
$X_{n}/Y_{n}\;{\overset {p}{\to }}\;X/Y$ given that $Y_{n}\neq 0$ and $Y\neq 0$ ;
(continuous mapping theorem) if $g$ is a continuous function, then $g(X_{n})\;{\overset {p}{\to }}\;g(X)$ (and $g(Y_{n})\;{\overset {p}{\to }}\;g(Y)$ )

Proof. Brief idea: Assume $X_{n}\;{\overset {p}{\to }}\;X$ and $Y_{n}\;{\overset {p}{\to }}\;Y$ . Continuous mapping theorem is first proven so that we can use it in the proof of other properties (the proof is omitted here). Also, it can be shown that $(X_{n},Y_{n})\;{\overset {p}{\to }}\;(X,Y)$ (joint convergence in probability, the definition is similar, except that the random variables become ordered pairs, so the interpretation of " $|Z_{n}-Z|$ " becomes the distance between the two points in Cartesian coordinate system, which are represented by the ordered pairs)

After that, we define $g(z_{1},z_{2})=az_{1}+bz_{2}$ , $g(z_{1},z_{2})=z_{1}z_{2}$ , and $g(z_{1},z_{2})=z_{1}/z_{2}$ respectively, where each of these functions is continuous (the last one wherever $z_{2}\neq 0$ ), and $a,b$ are constants. Then, applying the continuous mapping theorem using each of these functions gives us the first three results.

$\Box$

Convergence in distribution

Definition. (Convergence in distribution) Let $Z_{1},Z_{2},\dotsc$ be a sequence of random variables. The sequence converges in distribution to a random variable $Z$ if as $n\to \infty$ , $G_{n}(x)\to G(x)$ for each $x$ at which $G(x)$ is continuous, where $G_{n}(x)$ and $G(x)$ are the cdf of $Z_{n}$ and $Z$ respectively. If this is the case, we write this as $Z_{n}\;{\overset {d}{\to }}\;Z$ for simplicity.

Remark.

The requirement for $G(x)$ to be continuous is added so that the convergence in distribution still holds even if the convergence of cdf's fails at some points at which $G(x)$ is discontinuous.
We may alternatively express the definition as $\lim _{n\to \infty }G_{n}(x)=G(x)$ which has the same meaning as $G_{n}(x)\to G(x)$ as $n\to \infty$ .
It can be shown that convergence in probability implies convergence in distribution. That is, if $X_{n}\;{\overset {p}{\to }}\;X$ , then $X_{n}\;{\overset {d}{\to }}\;X$ , but the converse is true only when the limiting " $X$ " is a constant, i.e. if $X_{n}\;{\overset {d}{\to }}\;c$ , then $X_{n}\;{\overset {p}{\to }}\;c$ where $c$ is a constant.

A very important theorem in statistics which is related to convergence in distribution is the central limit theorem.

Theorem. (Central limit theorem (CLT)) Let $X_{1},X_{2},\dotsc$ be a sequence of independent and identically distributed random variables with finite mean $\mu$ and finite positive variance $\sigma ^{2}$ . Then, as $n\to \infty$ , ${\frac {{\overline {X}}-\mathbb {E} [{\overline {X}}]}{\sqrt {\operatorname {Var} ({\overline {X}})}}}={\frac {{\sqrt {n}}({\overline {X}}-\mu )}{\sigma }}\;{\overset {d}{\to }}\;Z$ , in which $Z$ follows the standard normal distribution, ${\mathcal {N}}(0,1)$ .

Proof. A (lengthy) proof can be found in Probability/Transformation of Random Variables#Central limit theorem.

$\Box$
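Convergence in distribution means convergence of cdf's, so the CLT can be illustrated by comparing the empirical cdf of simulated standardized sample means with the standard normal cdf. The sketch below assumes Exponential(1) observations (so $\mu =\sigma =1$ ), sample size $n=200$ , and 2000 replications; the helper `standard_normal_cdf` is written with `math.erf`.

```python
import math
import random

rng = random.Random(6)
mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
n, replications = 200, 2000


def standard_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Simulate realizations of sqrt(n) * (X_bar - mu) / sigma.
standardized = []
for _ in range(replications):
    x_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    standardized.append(math.sqrt(n) * (x_bar - mu) / sigma)

comparison = {}
for z in (-1.96, -1.0, 0.0, 1.0, 1.96):
    empirical = sum(1 for t in standardized if t <= z) / replications
    comparison[z] = (empirical, standard_normal_cdf(z))
    print(z, round(empirical, 3), round(standard_normal_cdf(z), 3))
```

The two columns should be close but not identical: part of the gap is Monte Carlo error, and part comes from $n$ being finite while the exponential distribution is skewed.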

There are some properties of convergence in distribution, but they are a bit different from the properties of convergence in probability. These properties are given by Slutsky's theorem, and also continuous mapping theorem.

Theorem. (Continuous mapping theorem) If $X_{n}\;{\overset {d}{\to }}\;X$ , then $g(X_{n})\;{\overset {d}{\to }}\;g(X)$ given that $g$ is a continuous function.

Proof. Omitted.

$\Box$

Theorem. (Slutsky's theorem) If $X_{n}\;{\overset {d}{\to }}\;X$ and $Y_{n}\;{\overset {p}{\to }}\;c$ where $c$ is a constant, then

$X_{n}+Y_{n}\;{\overset {d}{\to }}\;X+c$ ;
$X_{n}Y_{n}\;{\overset {d}{\to }}\;cX$ ;
$X_{n}/Y_{n}\;{\overset {d}{\to }}\;X/c$ given that $c\neq 0$ .

Proof. Brief idea: Assume $X_{n}\;{\overset {d}{\to }}\;X$ and $Y_{n}\;{\overset {p}{\to }}\;c$ . Then, it can be shown that $(X_{n},Y_{n})\;{\overset {d}{\to }}\;(X,c)$ (joint convergence in distribution, whose definition is similar, except that the cdf's become joint cdf's of ordered pairs). After that, we define $g(z_{1},z_{2})=z_{1}+z_{2}$ , $g(z_{1},z_{2})=z_{1}z_{2}$ , and $g(z_{1},z_{2})=z_{1}/z_{2}$ respectively, where each of the functions is continuous (the last one wherever $z_{2}\neq 0$ ), and then applying the continuous mapping theorem using each of these functions gives us the three desired results.

$\Box$

Remark.

Notice that the assumption mentions that $Y_{n}\;{\overset {\color {darkgreen}p}{\to }}\;c$ but not $Y_{n}\;{\overset {\color {darkgreen}d}{\to }}\;c$ .

Resampling

By resampling, we mean creating new samples based on an existing sample. Now, let us consider the following for a general overview of the procedure of resampling.

Suppose $X_{1},\dotsc ,X_{n}$ is a random sample from the distribution of a random variable $X$ with cdf $F(x)$ . Let $x_{1},\dotsc ,x_{n}$ be a corresponding realization of the random sample $X_{1},\dotsc ,X_{n}$ . Based on this realization, we also have a realization of the empirical cdf: ${\frac {1}{n}}\sum _{k=1}^{n}\mathbf {1} \{x_{k}\leq x\}$ ^[3]. Since this is a realization of the empirical cdf, by the Glivenko-Cantelli theorem, it is a good estimate of the cdf $F(x)$ when $n$ is large ^[4]. In other words, if we denote the random variable whose cdf is that realization of the empirical cdf by $X^{*}$ , then $X^{*}$ and $X$ have similar distributions when $n$ is large.

Notice that a realization of empirical cdf is a discrete cdf (since the support $x_{1},\dotsc ,x_{n}$ is countable). We now draw a random sample (called the bootstrap (or resampling) random sample) with sample size $B$ (called the bootstrap sample size) $X_{1}^{*},\dotsc ,X_{B}^{*}$ from the distribution of a random variable $X^{*}$ ( $X^{*}$ comes from sampling from $X$ , so the behaviour of sampling from $X^{*}$ is called resampling).

Then, the relative frequency histogram of $X_{1}^{*},\dotsc ,X_{B}^{*}$ should be close to that of the corresponding realization of the empirical pmf of $X^{*}$ (found from the realization of the empirical cdf of $X^{*}$ ), which is close to the pdf $f(x)$ of $X$ . This means the relative frequency histogram of $X_{1}^{*},\dotsc ,X_{B}^{*}$ is close to the pdf $f(x)$ of $X$ .

In particular, since the cdf of $X^{*}$ (the realization of $F_{n}(x)$ ) assigns probability $1/n$ to each of $x_{1},\dotsc ,x_{n}$ ^[5], the pmf of $X^{*}$ is $\mathbb {P} (X^{*}=x_{i})={\frac {1}{n}},\quad i=1,2,\dotsc ,n$ . Notice that this pmf is quite simple, and therefore it can make the related calculations simpler. For example, in the following, we want to know the distribution of $T^{*}=g(X_{1}^{*},\dotsc ,X_{n}^{*})$ , and this simple pmf can make the resulting distribution also quite simple.

Remark. For the things involved in the bootstrap method ("bootstrapped" things), there is usually an additional "*" in each of their notations.

In the following, we will discuss an application of the bootstrap method (or resampling) mentioned above, namely using bootstrap method to approximate the distribution of a statistic $T=g(X_{1},X_{2},\dotsc ,X_{n})$ (the inputs of the functions are random variables and $g$ is a function). The reason for approximating, rather than finding the distribution exactly, is that the latter is usually infeasible (or may be too complicated).

To do this, consider the "bootstrapped statistic" $T^{*}=g(X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*})$ and the statistic $T=g(X_{1},X_{2},\dotsc ,X_{n})$ . Here $X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*}$ is the bootstrap random sample (with bootstrap sample size $n$ ) from the distribution of $X^{*}$ , and $X_{1},X_{2},\dotsc ,X_{n}$ is the random sample from the distribution of $X$ . When $n$ is large, since the distribution of $X^{*}$ is similar to that of $X$ , the bootstrap random sample $X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*}$ and the random sample $X_{1},X_{2},\dotsc ,X_{n}$ are also similar. It follows that $T^{*}$ and $T$ are similar as well, or to be more precise, the distributions of $T^{*}$ and $T$ are close. As a result, we can use the distribution of $T^{*}$ (which is simpler and easier to find, since the pmf of $X^{*}$ is simple, as noted above) to approximate the distribution of $T$ . A procedure to do this is as follows:

1. Generate a bootstrapped realization $x_{1}^{*},x_{2}^{*},\dotsc ,x_{n}^{*}$ of the bootstrap random sample $X_{1}^{*},X_{2}^{*},\dotsc ,X_{n}^{*}$ , which is from the distribution of $X^{*}$ .
2. Calculate a realization of the bootstrapped statistic $T^{*}$ : $t^{*}=g(x_{1}^{*},x_{2}^{*},\dotsc ,x_{n}^{*})$ .
3. Repeat steps 1 and 2 $j$ times to get a sequence of $j$ realizations of $T^{*}$ : $t_{1}^{*},t_{2}^{*},\dotsc ,t_{j}^{*}$ .
4. Plot the relative frequency histogram of the $j$ realizations $t_{1}^{*},t_{2}^{*},\dotsc ,t_{j}^{*}$ .
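The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the statistic $g$ is taken to be the sample median, the observed data are simulated from Exp(1), and `bootstrap_realizations` and `relative_frequency_histogram` are hypothetical helper names. The histogram is printed as text instead of being plotted.

```python
import math
import random
import statistics
from collections import Counter


def bootstrap_realizations(x, g, j, rng):
    """Return j realizations t*_1, ..., t*_j of T* = g(X*_1, ..., X*_n)."""
    n = len(x)
    t_star = []
    for _ in range(j):  # repeat the first two steps j times
        # Bootstrapped realization x*_1, ..., x*_n: draw from x with replacement.
        x_star = rng.choices(x, k=n)
        # Realization of the bootstrapped statistic.
        t_star.append(g(x_star))
    return t_star


def relative_frequency_histogram(values, bin_width):
    """Map each bin's left edge to the fraction of values falling in that bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {index * bin_width: count / len(values)
            for index, count in sorted(counts.items())}


rng = random.Random(7)
# Hypothetical observed realization x_1, ..., x_n (simulated from Exp(1), n = 50).
x = [rng.expovariate(1.0) for _ in range(50)]
t_star = bootstrap_realizations(x, statistics.median, j=2000, rng=rng)

# Text version of the relative frequency histogram.
for left, freq in relative_frequency_histogram(t_star, bin_width=0.1).items():
    print(f"[{left:.1f}, {left + 0.1:.1f}): {'#' * round(100 * freq)} {freq:.3f}")
```

Any other statistic can be passed as `g` , for example `statistics.fmean` for the sample mean.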

This histogram of the $j$ realizations (which are a realization of a random sample from $T^{*}$ with sample size $j$ ) is close to the pmf of $T^{*}$ ^[6], and thus close to the pmf of $T$ .
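To get a feel for how close the two distributions are, one can compare realizations of $T^{*}$ with realizations of $T$ in a simulation where the population is known. The sketch below assumes the population is Exp(1) and takes $g$ to be the sample mean; both choices, and the values of `n` and `j` , are illustrative assumptions. Since the realizations of $T^{*}$ are centred at the observed sample mean rather than at the population mean, the comparison here is made on the spread.

```python
import math
import random
import statistics

rng = random.Random(2024)
n, j = 400, 2000


def draw_sample():
    """Realization of a random sample of size n from the assumed population Exp(1)."""
    return [rng.expovariate(1.0) for _ in range(n)]


# Realizations of T (sample mean): available only because the population is known here.
t = [statistics.fmean(draw_sample()) for _ in range(j)]

# Realizations of T*: obtained from a single observed realization x_1, ..., x_n.
x = draw_sample()
t_star = [statistics.fmean(rng.choices(x, k=n)) for _ in range(j)]

print("sd of T  realizations:", round(statistics.stdev(t), 4))
print("sd of T* realizations:", round(statistics.stdev(t_star), 4))
print("theoretical sd of T  :", round(1 / math.sqrt(n), 4))
```

In practice only the single realization `x` is available, which is why the bootstrap realizations `t_star` are used in place of `t` .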

[1] Intuitively, given a candidate for the maximum, we can always add "a little bit" to it to get a greater candidate. So, there is no "greatest" element in the set.

[2] This is because $X_{\text{min}}=c_{0}$ and $X_{\text{max}}=c_{i}$ .

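[3] This is different from the empirical cdf ${\frac {1}{n}}\sum _{k=1}^{n}\mathbf {1} \{X_{k}\leq x\}$ , which is random since it depends on the random variables $X_{k}$ rather than on their realized values $x_{k}$ .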

[4] By the Glivenko-Cantelli theorem, the empirical cdf is a good estimate of the cdf $F(x)$ regardless of what the actual values (realization) of the random sample are, i.e. (almost) every realization of the empirical cdf is a good estimate of the cdf $F(x)$ when $n$ is large.

[5] That is, for a realization of random sample $X_{1},X_{2},\dotsc ,X_{n}$ , say $x_{1},x_{2},\dotsc ,x_{n}$ , the probability for $X^{*}$ to equal $x_{1},x_{2},\dotsc ,x_{n}$ (which corresponds to the realization of $X_{1},X_{2},\dotsc ,X_{n}$ respectively), is $1/n$ each.

[6] The reason is similar to the one given above: the histogram should be close to the pmf of $T^{*}$ since the cdf corresponding to the histogram (i.e. the realization of the empirical cdf of the random sample $T_{1}^{*},T_{2}^{*},\dotsc ,T_{j}^{*}$ ) is close to the cdf of $T^{*}$ when $j$ is large.
