# Statistics/Preliminaries

This chapter discusses some preliminary knowledge (related to statistics) for the following chapters in the advanced part.

## Empirical distribution

edit **Definition.**
(Random sample)
Suppose is a random variable resulting from a random experiment, with a certain distribution.
After repeating this random experiment *independent* times, we obtain *independent and identically distributed* (iid) random variables, denoted by , associated with the outcomes.
They are called a *random sample* from the distribution with sample size .

**Remark.**

- We usually refer the underlying distribution as a
*population*. - Often,
*computer*is useful for conducting such experiment and repeating it many times. - In particular, a programming language, called R, is commonly used for computational statistics. You may see the wikibook R Programming for more details about it.
- As a result, the content discussed in this section (as well as the section about resampling) is quite relevant to computational statistics.

Since all these random variables follow the same cdf as , we may expect their distribution should be somewhat similar to the distribution of , and indeed, this is true. Before showing how this is true, we need to define "the distribution of these random variables" more precisely, as follows:

**Definition.**
(Empirical distribution)
The cdf of *empirical distribution*, empirical cdf, of a random sample , denoted by , is .

**Remark.**

- is the indicator function with value 1 if is true and 0 otherwise.
- We can see that "assigns" the probability (or "mass") to each of , and this is indeed a valid cdf.

- This is because for each of , if it is less than or equal to , then the corresponding indicator function in the sum is one, and thus a value of " " is contributed to the cdf.
- To understand this more clearly, consider the following example.

- We can interpret as the
*relative frequency*of the event . Recall that the frequentist definition of probability of an event is the "long-term" relative frequency of the event (i.e. the relative frequency of the event after repeating a random experiment*infinite*number of times). As a result, we will intuitively expect that when is large.

**Example.**
A random sample of size 5 is taken from an unknown distribution, and the following numbers are obtained:

(a) Find the empirical cdf.

(b) Let be a (discrete) random variable with cdf exactly the same as the empirical cdf in (a). Prove that the pmf of (called *empirical pmf*) is
*Solution*:

(a) First, we order the sample data ascendingly so that we can find the empirical cdf more conveniently:

The empirical cdf is given by
*Explanations*:

- After ordering the sample data, we treat each of the number as an observed value of the random sample: .
- Then, when , none of is less than or equal to . So, all indicator functions involved are zero, and thus the value of the empirical cdf is zero.
- When , only , and thus only the indicator function in this case, and all other indicator functions are zero. As a result, the value is .
- Similarly, when , only , and thus only the indicator function and in this case, and all other indicator functions are zero. As a result, the value is .
- ...
- When , all . Hence, all indicator functions are one, and thus the value of the empirical cdf is .

(b)

**Proof.**
First, notice that the cdf of is .

Then, we observe that when , , and (from the empirical cdf). Hence, in this case. Similarly, when , , and . Thus, also in this case. With similar arguments, we can show that also when .

**Remark.**

- Observe from (b) that the support of contains exactly the numbers in the sample data, which are the realization of the random sample . This shows that the probability is "assigned" to each of .

**Theorem.**
(Glivenko–Cantelli theorem)
As , *almost surely* (a.s.).

**Remark.**

- stands for supremum of a set (with some technical requirements), which means the least upper bound of the set, which is the least element that is greater or equal to each other element in the set.

- The meaning of is the least upper bound of the set containing the values of over .
- The supremum is similar to the concept of maximum (indeed, if maximum exists, then maximum is the same as supremum), but a difference between them is that sometimes supremum exists while maximum does not exist.
- For instance the supremum of the set (or interval) is 1 (intuitively). However, the maximum of the set (i.e. the greatest element in the set) does not exist (notice that 1 is not included in this set)
^{[1]}.

- The term "almost surely" means that this happens with probability 1. The details for the reason of calling this "almost surely" instead of "surely" involves some understanding of measure theory, and so is omitted here.
- Roughly speaking, from this theorem, we know that is a good estimate of , and an even better estimate of (or "closer to") when is large, for every realization (each of them is real number), since the least upper bound of the absolute difference already tends to zero, and then we will intuitively expect that
*every*such absolute difference also tends to zero. - This theorem is sometimes referred as the
*fundamental theorem of statistics*, indicating its importance in statistics.

We have mentioned how we can approximate the cdf, and now we would like to estimate the *pdf*/*pmf*.
Let us first discuss how to estimate the pmf.

For the discrete random variable , from the empirical cdf, we know that each is "assigned" with the probability . Also, considering the previous example, the empirical pmf is .

**Remark.**

- The empirical pmf shows the relative frequency of occurrences of , and therefore can approximate the probability of occurrences of , which is the long-term relative frequency of occurrences of .

To discuss the estimation of pdf of continuous random variable, we need to define *class intervals* first.

**Definition.**
(Class intervals)
First, choose an integer , and a sequence of real numbers such that .
Then, the *class intervals* are .

For the continuous random variable , construct class intervals for which are a non-overlapped partition of the interval , in which and are the minimum and maximum values in the sample. Then, the pdf when and are close, i.e. the length of each class interval is small. (Although the union of the above class intervals is and thus the value is not included in the interval, it does not matter since the value of the pdf at does not affect the calculation of probability.) Here, is and is .

Since
is the relative frequency of occurrences of the event ,
we can rewrite the above expression as
in which is called the *relative frequency histogram*.

Since there are many possible ways to construct the class intervals, the value of can differ even with the same and .
When is *large* and the length of each class interval is *small*, we will expect to be a good estimate of (the theoretical pdf).

There are some properties related to the relative frequency histogram, as follows:

**Proposition.**
(Properties of relative frequency histogram)

(i) ;

(ii) The total area bounded by and the -axis is one, i.e. ^{[2]};

(iii) The probability of an event that is a union of some class intervals is .

**Proof.**

(i) Since the indicator function is nonnegative (its value is either 0 or 1), is positive, and so is positive, we have by definition.

(ii) Here, is and is .

(iii) We can "split" the integral in a similar way as in (ii), and then eventually the integral equals , and it can can approximate since it is the relative frequency of occurrences of the event .

## Expectation

editIn this section, we will discuss some results about expectation, which involve some sort of inequalities. Let and be constants. Also, let be the sample space of .

**Proposition.**
Let be a discrete or continuous random variable.
If , then .

**Proof.**
Assume .

**Case 1**: is discrete.

By definition of expectation, . Then, we have because of the condition .

**Case 2**: is continuous.

We have similarly because of the condition of .

**Remark.**

- We can interchange " " and " " without affecting the result. This can be seen from the proof.

**Proposition.**
(Markov's inequality)
Suppose is finite.
Let be a continuous nonnegative random variable.
Then, for each positive number , .

**Proof.**
as desired.

**Corollary.**
(Chebyshev's inequality)
Suppose is finite.
Then, for each positive number ,

**Proof.**
First, observe that is a nonnegative random variable. Then, by Markov's inequality, for each (positive) , we have , since is positive.

**Proposition.**
(Jensen's inequality)
Let be a continuous random variable.
If is a convex function, then .

**Proof.**
Let be the tangent of the function at .
Then, since is convex, we have for each (informally, we can observe this graphically).
As a result, we have
as desired.

**Theorem.**
(Cauchy-Schwarz inequality)
Suppose and are finite. Then,

**Proof.**
Since , we must have .

**Example.**
(Covariance inequality)
Use the Cauchy-Schwarz inequality for expectations (above theorem) to prove the *covariance inequality* (it is sometimes simply called the Cauchy-Schwarz inequality):
(assuming the existence of the covariance and variances)

**Proof.**
Let and .
Then, and are finite.
Hence, by Cauchy-Schwarz inequality,

## Convergence

editBefore discussing convergence, we will define some terms that will be used later.

**Definition.**
(Statistics)
*Statistics* are functions of random sample.

**Remark.**

- The random sample consists of ( is sample size) random variables .
- Two
*important statistics*are the*sample mean*and the*sample variance*.

- In many other places, is used to denote , the
*unbiased*sample variance. In fact, here is*biased*(we will discuss what "(un)biased" means in the next chapter). Warning: we should be careful about this difference in definitions. - Both of and are random variables, since random variables are involved in them.

- In many other places, is used to denote , the

In a *particular sample*, say , we observe *definite* values of their sample mean, , and sample variance, .
*However*, each of the values is only *one* realization of the respective random variables and .
We should notice the difference between these definite values (not random variables) and the statistics (random variables).

To explain the definitions of the sample mean and sample variance more intuitively, consider the following.

Recall that the empirical cdf assigns probability to each of the random sample .
Thus, by the definition of mean and variance, the *mean* of a random variable, say , with this cdf (and hence with the corresponding pmf ) is . Similarly, the *variance* of is .
In other words, the *mean* and *variance* of the empirical distribution, which corresponds to the *random sample*, is the *sample mean* and the *sample variance* respectively,
which is quite natural, right?

**Remark.**

- Here, we use " " rather than the usual " " in the expression, and the mean and variance are also random variables. This is because the sample space of the empirical cdf consists of random variables , rather than definite values .

Also, recall that the empirical cdf can well approximate the cdf of , when is large. Since and are the mean and variance of a random variable with cdf it is natural to expect that and can well approximate the mean and variance of .

### Convergence in probability

edit **Definition.**
(Convergence in probability)
Let be a *sequence* of random variables. The sequence *converges in probability* to a random variable , if for each ,
as .
If this is the case, we write this as for simplicity.

**Remark.**

- We may compare this definition with the definition of convergence of a
*deterministic*sequence ):

- as if for each , there exists an integer (which is a function of ), such that when , (surely).

- For comparison, we may rewrite the above definition as

- as if for each , there exists an integer (which is a function of ), such that when , the probability for is very close to one (but this event does not happen
*surely*).

- specifies the
*accuracy*of the convergence. If higher accuracy is desired, then will be set to be a smaller (positive) value. The probability in the definition is very close to zero (we say that the convergence with a certain accuracy (depending on the value of ) is "achieved" in this case) when is sufficiently large.

The following theorem, namely *weak law of large number*, is an important theorem which is related to convergence in probability.

**Theorem.**
(Weak law of large number (Weak LLN))
Let be a *sequence* of *independent* random variables with the same finite mean and same finite variance .
Then, as , .

**Proof.**
We use to denote .

By definition, as is equivalent to as .

By Chebyshov's inequality, we have

Since are *independent* (and hence functions of them are also independent) and the expectation is multiplicative under independence,
So, the probability is *less than or equal to* an expression that tends to be 0 as .
Since the probability is nonnegative ( ), it follows that the probability also tends to be 0 as .

**Remark.**

- There is also
*strong law of large number*, which is related to*almost sure convergence*(which is stronger than probability convergence, i.e. implies probability convergence).

There are also some properties of convergence in probability that help us to determine a complex expression converges to what thing.

**Proposition.**
(Properties of convergence in probability)
If and , then

- (linearity) where are constants;
- (multiplicativity) ;
- given that and ;
- (continuous mapping theorem) if is a continuous function, then (and )

**Proof.**
*Brief idea*:
Assume and .
Continuous mapping theorem is first proven so that we can use it in the proof of other properties (the proof is omitted here).
Also, it can be shown that (joint convergence in probability, the definition is similar, except that the random variables become ordered pairs, so the interpretation of " " becomes the *distance* between the two points in Cartesian coordinate system, which are represented by the ordered pairs)

After that we define , , and respectively, where each of these functions is continuous, and are constants. Then, applying the continuous mapping theorem using each of these functions gives us the first three results.

### Convergence in distribution

edit **Definition.**
(Convergence in distribution)
Let be a *sequence* of random variables. The sequence *converges in distribution* to a random variable if as , for each at which is continuous, where and are the cdf of and respectively.
If this is the case, we write this as for simplicity.

**Remark.**

- The requirement for to be continuous is added so that the convergence in distribution still holds even if the convergence of cdf's fails at some points at which is discontinuous.
- We may alternatively express the definition as which has the same meaning as as .
- It can be shown that convergence in probability implies convergence in distribution. That is, if , then , but the converse is true only when the limiting " " is a constant, i.e. if , then where is a constant.

A very important theorem in statistics which is related to convergence in distribution is *central limit theorem*.

**Theorem.**
(Central limit theorem (CLT))
Let be a sequence of independent random variables with the same finite mean and variance .
Then, as , , in which follows the standard normal distribution, .

**Proof.**
A (lengthy) proof can be founded in Probability/Transformation of Random Variables#Central limit theorem.

There are some properties of convergence in distribution, but they are a bit different from the properties of convergence in probability.
These properties are given by *Slutsky's theorem*, and also continuous mapping theorem.

**Theorem.**
(Continuous mapping theorem)
If , then
given that is a continuous function.

**Proof.**
Omitted.

**Theorem.**
(Slutsky's theorem)
If and where is a constant, then

- ;
- ;
- given that .

**Proof.**
*Brief idea*:
Assume and .
Then, it can be shown that (joint convergence in distribution, and the definitions of this is similar, except that the cdf's become joint cdf's of ordered pairs).
After that, we define , , and respectively, where each of the functions is continuous,
and then applying the continuous mapping theorem using each of these functions gives us the three desired results.

**Remark.**

- Notice that the assumption mentions that but not .

## Resampling

editBy *resampling*, we mean creating new samples based on an existing sample. Now, let us consider the following for a general overview of the procedure of resampling.

Suppose is a *random sample* from a distribution of a random variable with cdf, .
Let be a corresponding *realization* of the random sample .
Based on this realization, we have also a *realization* of the empirical cdf:
^{[3]}.
Since this is a realization of the empirical cdf, by Glivenko-Cantelli theorem, it is a good estimate of the cdf when is large ^{[4]}. In other words, if we denote the random variable with the same pdf as that *realization* of the empirical cdf by , and have similar distributions when is large.

Notice that a realization of empirical cdf is a *discrete* cdf (since the support is countable).
We now draw a *random sample* (called the bootstrap (or resampling) random sample) with sample size (called the *bootstrap sample size*) from the distribution of a random variable
( comes from *sampling* from , so the behaviour of sampling from is called *resampling*).

Then, the relative frequency historgram of should be close to that of the corresponding *realization* of the empirical pmf of (found from the realization of the empirical cdf of ),
which is close to pdf of .
This means the relative frequency historgram of is close to the pdf of .

In particular, since the cdf of , , assigns probability to each of ^{[5]},
the pmf of is .
Notice that this pmf is quite simple, and therefore it can make the related calculation about it simpler.
For example, in the following, we want to know the distribution of , and this simple pmf can make
the resulting distribution also quite simple.

**Remark.**
For the things involved in the bootstrap method ("bootstrapped" things), there is usually an additional "*" in each of their notations.

In the following, we will discuss an application of the bootstrap method (or *resampling*) mentioned above, namely using bootstrap method to *approximate*
the distribution of a statistic (the inputs of the functions are random variables and is a function).
The reason for approximating, rather than finding the distribution exactly, is that the latter is usually infeasible (or may be too complicated).

To do this, consider the "bootstrapped statistic" and the statistic .
is the bootstrap random sample (with bootstrap sample size ) from the distribution of and is the random sample from the distribution of .
When is large, since the distribution of is similar to that of , the bootstrap random sample and the random sample are also similar.
It follows that and are similar as well, or to be more precise, the *distributions* of and are close.
As a result, we can utilize the distribution of (which is easier to find and simpler, since the pmf of is simple as in above) to approximate the distribution of .
A procedure to do this is as follows:

- Generate a
*bootstrapped realization*from the*bootstrap random sample*, which is from the distribution of . - Calculate a realization of the bootstrapped statistic , .
- Repeat 1. to 2. times to get a sequence of realizations of : .
- Plot the relative frequency historgram of the realizations .

This histogram of the realizations (which are a realization of a random sample from with sample size ) is close to the pmf of ^{[6]}, and thus close to the pmf of .

- ↑ Intuitively, given a candidate for the maximum, we can always add "a little bit" to it to get a greater candidate. So, there is no "greatest" element in the set.
- ↑ This is because and .
- ↑ This is different from the empirical cdf .
- ↑ For Glivenko-Cantelli theorem, the empirical cdf is a good estimate of the cdf , regardless of what the actual values (realization) of the random sample are, i.e. for each realization of the empirical cdf, it is a good estimate of the cdf , when is large.
- ↑ That is, for a realization of random sample , say , the probability for to equal (which corresponds to the realization of respectively), is each.
- ↑ The reason is mentioned similarly above: the histogram should be close to the pmf of since the cdf corresponding to the histogram (i.e. the realization of the empirical cdf of the random sample ) is close to the cdf of