Statistics/Preliminaries

This chapter discusses some preliminary knowledge (related to statistics) for the following chapters in the advanced part.

Empirical distribution edit

Definition. (Random sample) Suppose   is a random variable resulting from a random experiment, with a certain distribution. After repeating this random experiment   independent times, we obtain   independent and identically distributed (iid) random variables, denoted by  , associated with the   outcomes. They are called a random sample from the distribution with sample size  .

Remark.

  • We usually refer the underlying distribution as a population.
  • Often, computer is useful for conducting such experiment and repeating it many times.
  • In particular, a programming language, called R, is commonly used for computational statistics. You may see the wikibook R Programming for more details about it.
  • As a result, the content discussed in this section (as well as the section about resampling) is quite relevant to computational statistics.

Since all these   random variables follow the same cdf as  , we may expect their distribution should be somewhat similar to the distribution of  , and indeed, this is true. Before showing how this is true, we need to define "the distribution of these   random variables" more precisely, as follows:

Definition. (Empirical distribution) The cdf of empirical distribution, empirical cdf, of a random sample  , denoted by  , is  .

Remark.

  •   is the indicator function with value 1 if   is true and 0 otherwise.
  • We can see that   "assigns" the probability (or "mass")   to each of  , and this is indeed a valid cdf.
  • This is because for each of  , if it is less than or equal to  , then the corresponding indicator function in the sum is one, and thus a value of " " is contributed to the cdf.
  • To understand this more clearly, consider the following example.
  • We can interpret   as the relative frequency of the event  . Recall that the frequentist definition of probability of an event is the "long-term" relative frequency of the event (i.e. the relative frequency of the event after repeating a random experiment infinite number of times). As a result, we will intuitively expect that   when   is large.

Example. A random sample of size 5 is taken from an unknown distribution, and the following numbers are obtained:

-1.4, 2.3, 0.8, 1.9, -1.6

(a) Find the empirical cdf.

(b) Let   be a (discrete) random variable with cdf exactly the same as the empirical cdf in (a). Prove that the pmf of   (called empirical pmf) is

 
Solution:

(a) First, we order the sample data ascendingly so that we can find the empirical cdf more conveniently:

-1.6, -1.4, 0.8, 1.9, 2.3

The empirical cdf is given by

 
Explanations:
  • After ordering the sample data, we treat each of the number as an observed value of the random sample:  .
  • Then, when  , none of   is less than or equal to  . So, all indicator functions involved are zero, and thus the value of the empirical cdf is zero.
  • When  , only  , and thus only the indicator function   in this case, and all other indicator functions are zero. As a result, the value is  .
  • Similarly, when  , only  , and thus only the indicator function   and   in this case, and all other indicator functions are zero. As a result, the value is  .
  • ...
  • When  , all  . Hence, all indicator functions are one, and thus the value of the empirical cdf is  .

(b)

Proof. First, notice that the cdf of   is  .

Then, we observe that when  ,  , and   (from the empirical cdf). Hence,   in this case. Similarly, when  ,  , and  . Thus,   also in this case. With similar arguments, we can show that   also when  .

 


Remark.

  • Observe from (b) that the support of   contains exactly the numbers in the sample data, which are the realization of the random sample  . This shows that the probability   is "assigned" to each of  .

Theorem. (Glivenko–Cantelli theorem) As  ,   almost surely (a.s.).

Remark.

  •   stands for supremum of a set (with some technical requirements), which means the least upper bound of the set, which is the least element that is greater or equal to each other element in the set.
  • The meaning of   is the least upper bound of the set containing the values of   over  .
  • The supremum is similar to the concept of maximum (indeed, if maximum exists, then maximum is the same as supremum), but a difference between them is that sometimes supremum exists while maximum does not exist.
  • For instance the supremum of the set (or interval)   is 1 (intuitively). However, the maximum of the set   (i.e. the greatest element in the set) does not exist (notice that 1 is not included in this set) [1].
  • The term "almost surely" means that this happens with probability 1. The details for the reason of calling this "almost surely" instead of "surely" involves some understanding of measure theory, and so is omitted here.
  • Roughly speaking, from this theorem, we know that   is a good estimate of  , and an even better estimate of (or "closer to")   when   is large, for every realization   (each of them is real number), since the least upper bound of the absolute difference already tends to zero, and then we will intuitively expect that every such absolute difference also tends to zero.
  • This theorem is sometimes referred as the fundamental theorem of statistics, indicating its importance in statistics.

We have mentioned how we can approximate the cdf, and now we would like to estimate the pdf/pmf. Let us first discuss how to estimate the pmf.

For the discrete random variable  , from the empirical cdf, we know that each   is "assigned" with the probability  . Also, considering the previous example, the empirical pmf is  .

Remark.

  • The empirical pmf   shows the relative frequency of occurrences of  , and therefore can approximate the probability of occurrences of  , which is the long-term relative frequency of occurrences of  .

To discuss the estimation of pdf of continuous random variable, we need to define class intervals first.

Definition. (Class intervals) First, choose an integer  , and a sequence of real numbers   such that  . Then, the class intervals are  .

For the continuous random variable  , construct class intervals for   which are a non-overlapped partition of the interval  , in which   and   are the minimum and maximum values in the sample. Then, the pdf

 
when   and   are close, i.e. the length of each class interval is small. (Although the union of the above class intervals is   and thus the value   is not included in the interval, it does not matter since the value of the pdf at   does not affect the calculation of probability.) Here,   is   and   is  .

Since   is the relative frequency of occurrences of the event  , we can rewrite the above expression as

 
in which   is called the relative frequency histogram.

Since there are many possible ways to construct the class intervals, the value of   can differ even with the same   and  . When   is large and the length of each class interval is small, we will expect   to be a good estimate of   (the theoretical pdf).

There are some properties related to the relative frequency histogram, as follows:

Proposition. (Properties of relative frequency histogram)

(i)  ;

(ii) The total area bounded by   and the  -axis is one, i.e.   [2];

(iii) The probability of an event   that is a union of some class intervals is  .

Proof.

(i) Since the indicator function is nonnegative (its value is either 0 or 1),   is positive, and   so   is positive, we have   by definition.

(ii)

 
Here,   is   and   is  .

(iii) We can "split" the integral in a similar way as in (ii), and then eventually the integral equals  , and it can can approximate   since it is the relative frequency of occurrences of the event  .

 

Expectation edit

In this section, we will discuss some results about expectation, which involve some sort of inequalities. Let   and   be constants. Also, let   be the sample space of  .

Proposition. Let   be a discrete or continuous random variable. If  , then  .

Proof. Assume  .

Case 1:   is discrete.

By definition of expectation,  . Then, we have   because of the condition  .

Case 2:   is continuous.

We have similarly   because of the condition of  .

 

Remark.

  • We can interchange " " and " " without affecting the result. This can be seen from the proof.

Proposition. (Markov's inequality) Suppose   is finite. Let   be a continuous nonnegative random variable. Then, for each positive number  ,  .

Proof.

 
as desired.

 

Corollary. (Chebyshev's inequality) Suppose   is finite. Then, for each positive number  ,  

Proof. First, observe that   is a nonnegative random variable. Then, by Markov's inequality, for each (positive)  , we have

 
, since   is positive.

 

Proposition. (Jensen's inequality) Let   be a continuous random variable. If   is a convex function, then  .

Proof. Let   be the tangent of the function   at  . Then, since   is convex, we have   for each   (informally, we can observe this graphically). As a result, we have

 
as desired.

 

Theorem. (Cauchy-Schwarz inequality) Suppose   and   are finite. Then,

 

Proof.

 
Since  , we must have  .

 

Example. (Covariance inequality) Use the Cauchy-Schwarz inequality for expectations (above theorem) to prove the covariance inequality (it is sometimes simply called the Cauchy-Schwarz inequality):

 
(assuming the existence of the covariance and variances)

Proof. Let   and  . Then,   and   are finite. Hence, by Cauchy-Schwarz inequality,

 

 


Convergence edit

Before discussing convergence, we will define some terms that will be used later.

Definition. (Statistics) Statistics are functions of random sample.

Remark.

  • The random sample consists of   (  is sample size) random variables  .
  • Two important statistics are the sample mean   and the sample variance  .
  • In many other places,   is used to denote  , the unbiased sample variance. In fact,   here is biased (we will discuss what "(un)biased" means in the next chapter). Warning: we should be careful about this difference in definitions.
  • Both of   and   are random variables, since random variables are involved in them.

In a particular sample, say  , we observe definite values of their sample mean,  , and sample variance,  . However, each of the values is only one realization of the respective random variables   and  . We should notice the difference between these definite values (not random variables) and the statistics (random variables).

To explain the definitions of the sample mean   and sample variance   more intuitively, consider the following.

Recall that the empirical cdf   assigns probability   to each of the random sample  . Thus, by the definition of mean and variance, the mean of a random variable, say  , with this cdf   (and hence with the corresponding pmf  ) is  . Similarly, the variance of   is  . In other words, the mean and variance of the empirical distribution, which corresponds to the random sample, is the sample mean   and the sample variance   respectively, which is quite natural, right?

Remark.

  • Here, we use " " rather than the usual " " in the expression, and the mean and variance are also random variables. This is because the sample space of the empirical cdf consists of random variables  , rather than definite values  .

Also, recall that the empirical cdf   can well approximate the cdf of  ,   when   is large. Since   and   are the mean and variance of a random variable with cdf   it is natural to expect that   and   can well approximate the mean and variance of  .

Convergence in probability edit

Definition. (Convergence in probability) Let   be a sequence of random variables. The sequence converges in probability to a random variable  , if for each  ,   as  . If this is the case, we write this as   for simplicity.

Remark.

  • We may compare this definition with the definition of convergence of a deterministic sequence  ):
  as   if for each  , there exists an integer   (which is a function of  ), such that when  ,   (surely).
  • For comparison, we may rewrite the above definition as
  as   if for each  , there exists an integer   (which is a function of  ), such that when   , the probability for   is very close to one (but this event does not happen surely).
  •   specifies the accuracy of the convergence. If higher accuracy is desired, then   will be set to be a smaller (positive) value. The probability in the definition is very close to zero (we say that the convergence with a certain accuracy (depending on the value of  ) is "achieved" in this case) when   is sufficiently large.

The following theorem, namely weak law of large number, is an important theorem which is related to convergence in probability.

Theorem. (Weak law of large number (Weak LLN)) Let   be a sequence of independent random variables with the same finite mean   and same finite variance  . Then, as  ,  .

Proof. We use   to denote  .

By definition,   as   is equivalent to   as  .

By Chebyshov's inequality, we have

 

Since   are independent (and hence functions of them are also independent) and the expectation is multiplicative under independence,

 
So, the probability   is less than or equal to an expression that tends to be 0 as  . Since the probability is nonnegative ( ), it follows that the probability also tends to be 0 as  .

 

Remark.

  • There is also strong law of large number, which is related to almost sure convergence (which is stronger than probability convergence, i.e. implies probability convergence).

There are also some properties of convergence in probability that help us to determine a complex expression converges to what thing.

Proposition. (Properties of convergence in probability) If   and  , then

  • (linearity)   where   are constants;
  • (multiplicativity)  ;
  •   given that   and  ;
  • (continuous mapping theorem) if   is a continuous function, then   (and  )

Proof. Brief idea: Assume   and  . Continuous mapping theorem is first proven so that we can use it in the proof of other properties (the proof is omitted here). Also, it can be shown that   (joint convergence in probability, the definition is similar, except that the random variables become ordered pairs, so the interpretation of " " becomes the distance between the two points in Cartesian coordinate system, which are represented by the ordered pairs)

After that we define  ,  , and   respectively, where each of these functions is continuous, and   are constants. Then, applying the continuous mapping theorem using each of these functions gives us the first three results.

 

Convergence in distribution edit

Definition. (Convergence in distribution) Let   be a sequence of random variables. The sequence converges in distribution to a random variable   if as  ,   for each   at which   is continuous, where   and   are the cdf of   and   respectively. If this is the case, we write this as   for simplicity.

Remark.

  • The requirement for   to be continuous is added so that the convergence in distribution still holds even if the convergence of cdf's fails at some points at which   is discontinuous.
  • We may alternatively express the definition as   which has the same meaning as   as  .
  • It can be shown that convergence in probability implies convergence in distribution. That is, if  , then  , but the converse is true only when the limiting " " is a constant, i.e. if  , then   where   is a constant.

A very important theorem in statistics which is related to convergence in distribution is central limit theorem.

Theorem. (Central limit theorem (CLT)) Let   be a sequence of independent random variables with the same finite mean   and variance  . Then, as  ,  , in which   follows the standard normal distribution,  .

Proof. A (lengthy) proof can be founded in Probability/Transformation of Random Variables#Central limit theorem.

 

There are some properties of convergence in distribution, but they are a bit different from the properties of convergence in probability. These properties are given by Slutsky's theorem, and also continuous mapping theorem.

Theorem. (Continuous mapping theorem) If  , then

 
given that   is a continuous function.

Proof. Omitted.

 

Theorem. (Slutsky's theorem) If   and   where   is a constant, then

  •  ;
  •  ;
  •   given that  .

Proof. Brief idea: Assume   and  . Then, it can be shown that   (joint convergence in distribution, and the definitions of this is similar, except that the cdf's become joint cdf's of ordered pairs). After that, we define  , , and   respectively, where each of the functions is continuous, and then applying the continuous mapping theorem using each of these functions gives us the three desired results.

 

Remark.

  • Notice that the assumption mentions that   but not  .

Resampling edit

By resampling, we mean creating new samples based on an existing sample. Now, let us consider the following for a general overview of the procedure of resampling.

Suppose   is a random sample from a distribution of a random variable   with cdf,  . Let   be a corresponding realization of the random sample  . Based on this realization, we have also a realization of the empirical cdf:   [3]. Since this is a realization of the empirical cdf, by Glivenko-Cantelli theorem, it is a good estimate of the cdf   when   is large [4]. In other words, if we denote the random variable with the same pdf as that realization of the empirical cdf by  ,   and   have similar distributions when   is large.

Notice that a realization of empirical cdf is a discrete cdf (since the support   is countable). We now draw a random sample (called the bootstrap (or resampling) random sample) with sample size   (called the bootstrap sample size)   from the distribution of a random variable   (  comes from sampling from  , so the behaviour of sampling from   is called resampling).

Then, the relative frequency historgram of   should be close to that of the corresponding realization of the empirical pmf of   (found from the realization of the empirical cdf of  ), which is close to pdf   of  . This means the relative frequency historgram of   is close to the pdf   of  .

In particular, since the cdf of  ,  , assigns probability   to each of   [5], the pmf of   is  . Notice that this pmf is quite simple, and therefore it can make the related calculation about it simpler. For example, in the following, we want to know the distribution of  , and this simple pmf can make the resulting distribution also quite simple.

Remark. For the things involved in the bootstrap method ("bootstrapped" things), there is usually an additional "*" in each of their notations.

In the following, we will discuss an application of the bootstrap method (or resampling) mentioned above, namely using bootstrap method to approximate the distribution of a statistic   (the inputs of the functions are random variables and   is a function). The reason for approximating, rather than finding the distribution exactly, is that the latter is usually infeasible (or may be too complicated).

To do this, consider the "bootstrapped statistic"   and the statistic  .   is the bootstrap random sample (with bootstrap sample size  ) from the distribution of   and   is the random sample from the distribution of  . When   is large, since the distribution of   is similar to that of  , the bootstrap random sample   and the random sample   are also similar. It follows that   and   are similar as well, or to be more precise, the distributions of   and   are close. As a result, we can utilize the distribution of   (which is easier to find and simpler, since the pmf of   is simple as in above) to approximate the distribution of  . A procedure to do this is as follows:

  1. Generate a bootstrapped realization   from the bootstrap random sample  , which is from the distribution of  .
  2. Calculate a realization of the bootstrapped statistic  ,  .
  3. Repeat 1. to 2.   times to get a sequence of   realizations of  :  .
  4. Plot the relative frequency historgram of the   realizations  .

This histogram of the   realizations (which are a realization of a random sample from   with sample size  ) is close to the pmf of   [6], and thus close to the pmf of  .

  1. Intuitively, given a candidate for the maximum, we can always add "a little bit" to it to get a greater candidate. So, there is no "greatest" element in the set.
  2. This is because   and  .
  3. This is different from the empirical cdf  .
  4. For Glivenko-Cantelli theorem, the empirical cdf is a good estimate of the cdf  , regardless of what the actual values (realization) of the random sample are, i.e. for each realization of the empirical cdf, it is a good estimate of the cdf  , when   is large.
  5. That is, for a realization of random sample  , say  , the probability for   to equal   (which corresponds to the realization of   respectively), is   each.
  6. The reason is mentioned similarly above: the histogram should be close to the pmf of   since the cdf corresponding to the histogram (i.e. the realization of the empirical cdf of the random sample   ) is close to the cdf of