IB Mathematics (HL)/Further Statistics



Topic 8: Option - Statistics and Probability


As learnt in the statistics and probability section in the core syllabus, the expected value E(X), or the mean μ, of a distribution X is:

E(X) = μ = Σ x·P(X = x)

The variance of a distribution is defined as:

Var(X) = σ² = E[(X − μ)²] = E(X²) − [E(X)]²

If X is discrete, the variance can be defined as:

Var(X) = Σ (x − μ)²·P(X = x) = Σ x²·P(X = x) − μ²
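
As a quick check of these definitions, the following sketch (plain Python; the probability table is purely illustrative) computes the mean and variance of a small discrete distribution directly:

    # Mean and variance of a discrete distribution from its probability table.
    values = [1, 2, 3, 4]          # possible values x
    probs = [0.1, 0.2, 0.3, 0.4]   # P(X = x), must sum to 1

    mean = sum(x * p for x, p in zip(values, probs))               # E(X) = sum of x * P(X = x)
    var = sum((x - mean) ** 2 * p for x, p in zip(values, probs))  # sum of (x - mean)^2 * P(X = x)
    var_alt = sum(x ** 2 * p for x, p in zip(values, probs)) - mean ** 2  # E(X^2) - mean^2

    print(mean, var, var_alt)  # 3.0 1.0 1.0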

Expectation Algebra


Linear Transformations


If one displaces and scales a distribution, the mean and the variance change. The values change in accordance with the following formulas:

E(aX + b) = a·E(X) + b
Var(aX + b) = a²·Var(X)

Notice that the variance is unaffected by the value of b: the variance of a distribution is never changed by a displacement of the distribution. Only the value a, representing a horizontal stretch of the distribution, modifies the spread.
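
The following sketch (plain Python; a, b and the distribution are illustrative) verifies E(aX + b) = a·E(X) + b and Var(aX + b) = a²·Var(X) numerically:

    # Verify the linear transformation rules on a small discrete distribution.
    values = [1, 2, 3, 4]
    probs = [0.1, 0.2, 0.3, 0.4]
    a, b = 2, 5

    def mean_var(vals, ps):
        m = sum(x * p for x, p in zip(vals, ps))
        v = sum((x - m) ** 2 * p for x, p in zip(vals, ps))
        return m, v

    m, v = mean_var(values, probs)
    m2, v2 = mean_var([a * x + b for x in values], probs)  # distribution of aX + b

    print(m2, a * m + b)   # both 11.0
    print(v2, a ** 2 * v)  # both 4.0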

Linear Combinations of Independent Random Variables


When one takes multiple samples of a random variable X, these are treated as independent random variables. In that case the combined distribution must be treated as follows:

E(X ± Y) = E(X) ± E(Y)
Var(X ± Y) = Var(X) + Var(Y)

Notice that the variances are always added, never subtracted. Also, the random variables need not come from the same population X.

These rules also apply to situations with n independent random variables.

E(a₁X₁ + a₂X₂ + … + aₙXₙ) = a₁E(X₁) + a₂E(X₂) + … + aₙE(Xₙ)
Var(a₁X₁ + a₂X₂ + … + aₙXₙ) = a₁²Var(X₁) + a₂²Var(X₂) + … + aₙ²Var(Xₙ)

The derivation of this rule is beyond the scope of the syllabus.
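
A short simulation sketch (assuming NumPy is available; the parameters are illustrative) showing that for independent X and Y the variance of X − Y is Var(X) + Var(Y), not the difference:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(10, 2, size=1_000_000)  # Var(X) = 4
    Y = rng.normal(4, 3, size=1_000_000)   # Var(Y) = 9, independent of X

    D = X - Y
    print(D.mean())  # close to E(X) - E(Y) = 6
    print(D.var())   # close to Var(X) + Var(Y) = 13, not 4 - 9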

Unbiased estimators of mean and variance


x̄ = (Σx)/n is an unbiased estimator of the population mean μ, and sₙ₋₁² = n/(n − 1)·sₙ² = Σ(x − x̄)²/(n − 1) is an unbiased estimator of the population variance σ².

Unbiased estimators of variance are calculated by multiplying the original (divide-by-n) sample variance sₙ² by n/(n − 1).
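
In NumPy terms (a sketch assuming NumPy; the data are illustrative) the unbiased estimate corresponds to ddof=1, and multiplying the divide-by-n variance by n/(n − 1) gives the same number:

    import numpy as np

    sample = np.array([4.1, 5.3, 4.8, 6.0, 5.5])
    n = len(sample)

    biased = sample.var(ddof=0)    # divides by n, i.e. s_n^2
    unbiased = sample.var(ddof=1)  # divides by n - 1, i.e. the unbiased estimate

    print(unbiased, biased * n / (n - 1))  # identical values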

Probability Distribution Functions


Discrete Distributions


Binomial, B(n,p)
When X ~ B(n,p), P(X = x) denotes the probability of x successes when n trials, each with a probability of success p, are performed.
Applies when:

  • There are exactly two possible outcomes (success and failure)
  • The number of trials is fixed
  • Each trial is independent of the outcomes of the other trials
  • The probability of success remains constant from trial to trial.

Negative Binomial, NB(r,p)
Models the number of Bernoulli trials B(1,p) required to achieve r successes. The combinatorial coefficient in the probability mass function is merely to account for the number of ways such a number of successes could be arranged.

Geometric, Geo(p)
Models the number of Bernoulli trials B(1,p) which will be needed until the first success, i.e. it is the same as NB(1,p). No combinatorial coefficient is needed because "counting" stops once the first success has been achieved; hence there is only one possible arrangement of outcomes.

Poisson, Po(m)
A Poisson distribution measures the number of successes that occur in

  • a fixed interval,
  • over an effectively infinite number of trials.

For instance, there could potentially be a huge number of phone calls per hour, but if the mean is two per hour, the chance of a very large number is slim. NB: this assumes that m is constant, which in real situations is rarely true (e.g. the frequency of phone calls depends on the time of day, the day of the week, etc.). Questions of this type often involve converting between time intervals: for instance, if the mean is two calls per hour and the probability of a certain number of calls in 5 hours is needed, the mean used would be 10. Also specific to this distribution is the fact that E(X) = Var(X) = m.
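
The sketch below (assuming SciPy is available; all numbers are illustrative) evaluates the four discrete distributions described above. Note that SciPy's geom matches the convention used here (the trial on which the first success occurs), but nbinom.pmf(k, r, p) counts failures before the r-th success, so the number of trials is k + r.

    from scipy.stats import binom, poisson, geom, nbinom

    # P(X = 3) for X ~ B(10, 0.4): exactly 3 successes in 10 trials
    print(binom.pmf(3, 10, 0.4))

    # P(X = 5) for X ~ Po(2): five events in an interval whose mean is 2
    print(poisson.pmf(5, 2))

    # P(first success occurs on trial 4) for X ~ Geo(0.3)
    print(geom.pmf(4, 0.3))

    # P(7 trials are needed for r = 3 successes) with p = 0.3:
    # SciPy counts failures, so 7 trials means 4 failures before the 3rd success.
    print(nbinom.pmf(4, 3, 0.3))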

Continuous Distributions


These can follow any function f where:

  • f(x) ≥ 0 for all x in the domain of f.
  • The integral of f over its whole domain equals 1, i.e. ∫ f(x) dx = 1 over a ≤ x ≤ b if the domain of f is a ≤ x ≤ b.

Furthermore, X can take any real value within the domain of f. Cumulative probabilities are calculated by integrating f, e.g. P(X ≤ x) is the integral of f from a to x.
In addition, the syllabus expects knowledge regarding three particular continuous probability distributions.

Exponential, Exp(λ)
This distribution models the interval between events (assumed to be instantaneous) in a Poisson process. E.g. for Po(2) calls an hour, the expected number of calls in one hour is 2; the expected value of the exponential distribution, E(X) = 1/λ, is half an hour.
The exponential distribution can also be seen as the continuous equivalent of the geometric distribution, which models the time until the first success.

Normal, N(μ, σ²)
This is the most interesting distribution, and the most relevant to the statistics option. Due to the central limit theorem, a large portion of the Statistics option is based on the normal distribution. In questions about the normal distribution, the question must state that the data at hand "follows a normal distribution", "is normally distributed", etc. This makes it easy to identify.
The standard normal variable Z follows the distribution N(0, 1); converting values of X to z-scores via Z = (X − μ)/σ is important for calculating confidence intervals and for hypothesis testing.

Normal approximation to the binomial distribution
For large values of n, X ~ B(n,p) can be approximated by X ~ N(np, npq), where q = 1 − p. (This can be shown on a histogram.)
There are different estimates for how large n should be; np and nq both greater than 5 usually gives a good approximation, yet the IB states np > 10 and nq > 10 as rules. In situations which do not satisfy these conditions it should be clearly stated that the approximation is not good.
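
A small sketch (assuming SciPy; the parameters are illustrative) comparing an exact binomial probability with its normal approximation, using a continuity correction of 0.5:

    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 100, 0.4
    q = 1 - p

    exact = binom.cdf(45, n, p)                                # P(X <= 45), X ~ B(100, 0.4)
    approx = norm.cdf(45.5, loc=n * p, scale=sqrt(n * p * q))  # N(np, npq) with continuity correction

    print(exact, approx)  # the two agree closely, since np = 40 and nq = 60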

Summary of Distributions


A summary of the notation, probability functions, means and variances is shown below for each distribution (with q = 1 − p throughout).

Discrete Distributions

Distribution        Notation       Probability Mass Function                              Mean    Variance
Binomial            X ~ B(n,p)     P(X = x) = C(n,x) p^x q^(n−x)  for x = 0, 1, …, n      np      npq
Poisson             X ~ Po(m)      P(X = x) = e^(−m) m^x / x!  for x = 0, 1, 2, …         m       m
Geometric           X ~ Geo(p)     P(X = x) = p q^(x−1)  for x = 1, 2, 3, …               1/p     q/p²
Negative Binomial   X ~ NB(r,p)    P(X = x) = C(x−1, r−1) p^r q^(x−r)  for x = r, r+1, …  r/p     rq/p²

Continuous Distributions

Distribution   Notation       Probability Density Function               Mean   Variance
Exponential    X ~ Exp(λ)     f(x) = λ e^(−λx)  for x ≥ 0                1/λ    1/λ²
Normal         X ~ N(μ, σ²)   f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))      μ      σ²

Normal Distribution


Linear Combinations


It is often useful to combine variables, e.g. to determine the probability that X > Y. This is done by rearranging the inequality and defining a new variable, for instance D, in terms of the original random variables. For instance:

P(X > Y)
= P(X − Y > 0)
= P(D > 0), where D = X − Y

We now need to find E(D) and Var(D) using the rules:

  • E(X ± Y) = E(X) ± E(Y)
  • Var(X ± Y) = Var(X) + Var(Y)

Note that the variance is always added. It is also important to convert standard deviation to variance before attempting to combine variables.

Questions also discuss combinations where multiple picks are made. Note the difference between

  • X₁ + X₂ + X₃ + X₄ and
  • 4X.

In the first, four separate picks are made. The variance becomes Var(X₁) + Var(X₂) + Var(X₃) + Var(X₄) = 4σ².
In the second, the value of one pick is multiplied by four. The variance in this case is 4²·Var(X) = 16σ².
In short, separate picks should be treated as separate variables, despite having the same μ and σ.
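
The difference between four separate picks and four times one pick can be checked by simulation (a sketch assuming NumPy; μ and σ are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, N = 50, 4, 1_000_000

    picks = rng.normal(mu, sigma, size=(N, 4))  # four independent picks per row

    total_of_four = picks.sum(axis=1)   # X1 + X2 + X3 + X4
    four_times_one = 4 * picks[:, 0]    # 4X

    print(total_of_four.var())   # close to 4 * sigma^2 = 64
    print(four_times_one.var())  # close to 16 * sigma^2 = 256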

Central limit theorem


When taking samples from a non-normal population X whose mean is μ and whose variance is σ², the averages of these samples will be distributed approximately normally as X̄ ~ N(μ, σ²/n), where n is the number of data points each sample is based on (the sample size). This applies when n ≥ 30.

The samples must be independent. Although a proof of the CLT is not required, it may be useful to know that the idea is related to the binomial distribution: each observation either falls below a given value or it does not, with constant probability and independently of the others, so counts of such observations follow a binomial distribution. We already know by the normal approximation to the binomial distribution that when the sample size is large enough, the distribution will be approximately normal.

This is a reliable approximation for sample sizes n ≥ 30. Note that the variance, σ²/n, of the normal distribution decreases with larger values of n, meaning that the probability distribution will be narrower, i.e. more precise. The "standard deviation" of the normal distribution when using the CLT, σ/√n, is also known as the sampling error or standard error.
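
The sketch below (assuming NumPy; the exponential population and the sample size are illustrative) draws many samples of size n = 30 from a clearly non-normal population and shows that the sample means have mean ≈ μ and variance ≈ σ²/n:

    import numpy as np

    rng = np.random.default_rng(2)
    mu, n, repeats = 2.0, 30, 100_000

    # Exponential population with mean 2 (so sigma^2 = 4): strongly skewed, not normal.
    samples = rng.exponential(scale=mu, size=(repeats, n))
    means = samples.mean(axis=1)  # one sample mean per row

    print(means.mean())  # close to mu = 2
    print(means.var())   # close to sigma^2 / n = 4/30, about 0.133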

Normality of a proportion


Proportions with large sample sizes also follow an approximately normal distribution. Following similar logic as for sample means, each member of a sample can either be a success or a failure. The probability of this is considered fixed, meaning that the number of successes is binomial. When the sample size is large enough, there is therefore an approximately normal distribution for p̂, the sample proportion:

p̂ = X/n

If p is the true proportion of successes and n is the sample size then X ~ B(n,p). Hence we can show that

  • Expected value E(p̂) = p

  • Variance Var(p̂) = p(1 − p)/n = pq/n

By the Central Limit Theorem, we can say that for large values of n, p̂ is approximately N(p, p(1 − p)/n).

Confidence Intervals

[Figure: 90% confidence interval. 90% of the data lies within 0 ± a.]

A confidence interval is a range measured from the mean of a distribution in which a certain fraction of samples lie. It is often represented as a percentage, for instance saying that "90% of samples weigh 2±0.01 kg".
Confidence intervals work the same way for sample means and for proportions. In each case let x̄ be the sample mean or p̂ the sample proportion. The difference arises when the population variance is not known. These two situations are explored below.

When σ is known


The data booklet gives the expression for a confidence interval as:
x̄ ± z·σ/√n (given n ≥ 30).
The same expression written in terms of a sample proportion is:
p̂ ± z·√(p̂(1 − p̂)/n) (when np ≥ 10 and nq ≥ 10).

z is the z-score corresponding to the percentage of the confidence interval. This can be looked up using the tables which occupy the last few pages of the data booklet or using the invNorm function on the calculator. Note, however, that entering invNorm(0.9) will not give the z-score of a 90% confidence interval. The remaining 10% must be distributed evenly above and below the target range that lies within 90% of the mean, so z = invNorm(0.95). This can be clearly seen in the illustration above: we want to find the value of a, and should therefore use either 0.95 or 0.05. This is the same concept as a two-tailed test (described below); if we were saying that 90% of the values were below a certain value, then we would use invNorm(0.9).
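
A minimal sketch of the same calculation (assuming SciPy; the sample statistics are illustrative):

    from math import sqrt
    from scipy.stats import norm

    x_bar, sigma, n = 2.00, 0.05, 40  # sample mean, known population sigma, sample size
    conf = 0.90

    z = norm.ppf(1 - (1 - conf) / 2)  # invNorm(0.95) for a 90% interval
    half_width = z * sigma / sqrt(n)

    print(x_bar - half_width, x_bar + half_width)  # the 90% confidence interval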

Calculator functions
ZInterval: Either enter a set of data or statistics. Note that in both cases σ is explicitly requested. Enter the C(onfidence)-Level as a fraction. When using data, select the list name and set Frequency=1.
1-PropZInt: x is the number of successes in n trials. C(onfidence)-Level as a fraction.

When σ is unknown


When the population standard deviation, σ, is unknown, we must approximate it using the sample data. sₙ₋₁, the unbiased estimate, is used in place of σ. Note that the sample standard deviation sₙ may be known without the population standard deviation being known. When σ was known, we said that
Z = (X̄ − μ)/(σ/√n) follows the standard normal distribution N(0,1). Likewise, the corresponding statistic when σ is not known is
T = (X̄ − μ)/(sₙ₋₁/√n), which follows the t-distribution. It is simply a "fatter" version of the standard normal curve N(0,1).

When using a t-distribution, state the degrees of freedom: ν = n-1 where, as usual, n is the sample size. (This gains significance in hypothesis testing.)


Calculator function
TInterval: Data or statistics. When using data, the same input method as for ZInterval is used. Note that Sx is the same as sₙ₋₁, and has to be calculated manually from the sample standard deviation sₙ using sₙ₋₁² = n/(n − 1)·sₙ² (from the data booklet).
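
A sketch of the equivalent calculation when σ is unknown (assuming SciPy; the data are illustrative), using the unbiased standard deviation and ν = n − 1 degrees of freedom:

    from math import sqrt
    import numpy as np
    from scipy.stats import t

    data = np.array([9.8, 10.4, 10.1, 9.7, 10.6, 10.2, 9.9, 10.3])
    n = len(data)
    x_bar = data.mean()
    s = data.std(ddof=1)  # the unbiased estimate of sigma

    t_crit = t.ppf(0.975, df=n - 1)  # 95% interval, nu = n - 1 degrees of freedom
    half_width = t_crit * s / sqrt(n)

    print(x_bar - half_width, x_bar + half_width)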

Determining appropriate sample sizes


In order for an estimate to be reliable, the sample size must be large enough. As n increases, the variance falls, increasing the precision (the distribution narrows).
The example below is taken from Haese & Harris' IBDP Mathematics (Options):

"How large should a sample be if we wish to be 98% confident that the sample mean will differ from the population mean by less than 0.3 if we know that population standard deviation σ = 1.365?"

This means that |x̄ − μ| < 0.3 [where x̄ is the furthest acceptable sample mean from μ].

From the data booklet formula

x̄ ± z·σ/√n

we know that:

z·σ/√n < 0.3

and invNorm(0.99) [not 0.98!] is 2.326:

2.326 × 1.365/√n < 0.3

√n > 2.326 × 1.365/0.3 ≈ 10.58

n > 10.58² ≈ 111.9

So a sample of 112 should be taken to be 98% sure that sample means will differ from the population mean by less than 0.3 (n was rounded up to 112).

Note that for proportions, p̂ might not always be known. In such a case the largest possible error should be used, i.e. the value at p = 0.5:

z·√(0.5 × 0.5/n) = z/(2√n)

As above, this is set equal to the maximum acceptable error, e.g. 0.03 if the proportion must be "within 3%".
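
As a quick check (a sketch assuming SciPy), the margin of error for the sample size found in the worked example above can be recomputed:

    from math import sqrt
    from scipy.stats import norm

    sigma, n = 1.365, 112
    z = norm.ppf(0.99)          # 98% confidence means 1% in each tail
    print(z * sigma / sqrt(n))  # approximately 0.300, the margin of error for n = 112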

Significance/Hypothesis Testing


The aim of hypothesis testing is to consider the validity of hypotheses at particular levels of significance and come to a conclusion regarding their accuracy. The idea is to

  • Formulate a hypothesis
  • Collect sample data
  • Determine whether the data supports the hypothesis

A level of significance plays much the same role as the confidence level of a confidence interval. For example, a 90% confidence interval contains 90% of the spread, while a test at a 10% level of significance accepts roughly a 10% chance of wrongly rejecting a true hypothesis.

Null and alternative hypothesis


In any hypothesis test, there will be two mutually exclusive hypotheses:

  • H₀, the null hypothesis, which states equality. This is assumed true until proven false.
  • H₁, the alternative hypothesis, which is adopted if H₀ has been proven false by random sample data.

For example, testing that the mean number of phone-calls per hour is greater than 6:

  • H₀: μ = 6
  • H₁: μ > 6

An alternative hypothesis can either be one-sided, as above, or two-sided. If we wanted to prove that the mean number of phone-calls is not 6, we would say that H₁: μ ≠ 6. This would imply either that μ < 6 or that μ > 6. This creates a slight difference in how the probability is calculated (see the invNorm() argument in Confidence Intervals) but this is largely handled by the calculator.

Significance Testing for Mean and Proportion


To perform a test, data must be collected, giving a value for x̄. Then the z- or t-score (z* or t*) is calculated, depending on whether the population σ is known, using

  • z* = (x̄ − μ₀)/(σ/√n) or
  • t* = (x̄ − μ₀)/(sₙ₋₁/√n) (remembering to state ν = n − 1 degrees of freedom).

When using proportions:

  • p̂ = x/n, the observed sample proportion
  • z* = (p̂ − p₀)/√(p₀(1 − p₀)/n)

All the CLT requirements need to be met for each respective method: n ≥ 30 for sample means and np ≥ 10 and nq ≥ 10 for proportions.

The next step is to determine the p-value, the probability of a z- or t-score at least as extreme as the one calculated occurring. For z-scores this can be done with normalcdf(), but for t-scores the whole process must be done on the calculator (explained below). The p-value measures the likelihood of a sample mean as extreme as x̄ occurring if the true mean is μ₀ and the standard deviation is σ. If the p-value is low, there is a high possibility that either the sample is unrepresentative (the sample size could be increased to verify this) or the mean is not μ₀ (reject the null hypothesis). The cut-off at which the p-value is considered "too" low is determined by the level of significance: given a 0.05 (5%) level of significance, H₀ will be rejected if the p-value is below 0.05.

For two-tailed tests (H₁: μ ≠ μ₀), the p-value is the probability 2·P(Z ≥ |z*|), whereas for a one-tailed test (H₁: μ > μ₀ or H₁: μ < μ₀) it is simply P(Z ≥ z*) or P(Z ≤ z*), or the equivalent in terms of t and ν.


Steps to hypothesis testing

  1. State H₀, H₁ and whether the test is one- or two-tailed.
  2. State whether a z- or t-distribution is used, and calculate the corresponding test statistic z* or t*.
  3. State the decision rule (reject H₀ if the p-value is...). Calculate the p-value for the test statistic.
  4. Make the decision: "Reject H₀" or "Accept H₀".
  5. Brief statement putting the decision into context.

The "brief" statement will involve as much of the information from the question as possible. For example, "Based on a sample of 200 cookies, insufficient evidence is provided to accept at the 1% level of significance that more than 60% of cookies contain chocolate".

There is also a slightly different way of determining whether the calculated z-score is acceptable. Instead of converting the z-score into a probability which is then compared to the significance level, calculate the critical z-score based on the level of significance. Then use logic to determine whether the calculated z-score falls within the rejection region. For instance, when testing at a 5% level of significance, a "<" one-tailed test H₀ can be rejected if z* < invNorm(0.05). For a ">" one-tailed test, reject if z* > invNorm(0.95), and for a two-sided test reject if |z*| > invNorm(0.975).
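
A sketch of a one-sample z-test done "by hand" (assuming SciPy; the figures are illustrative), following the steps above for H₀: μ = 6 against H₁: μ > 6:

    from math import sqrt
    from scipy.stats import norm

    mu0, sigma, n, x_bar, alpha = 6, 1.5, 36, 6.45, 0.05

    z_star = (x_bar - mu0) / (sigma / sqrt(n))  # test statistic
    p_value = 1 - norm.cdf(z_star)              # one-tailed ">" test

    print(z_star, p_value)  # 1.8 and about 0.036
    print("Reject H0" if p_value < alpha else "Accept H0")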

Calculators (TI)
Calculator functions are quite straightforward to use. In general:

  • μ₀ and the μ inequality setting combine to form the alternative hypothesis. For example, if H₁: μ > 6 then μ₀ = 6 and ">μ₀" is selected. The same applies to p₀ and the "prop" setting in the 1-PropZTest.
  • x is the number of successes
  • n is the number of trials
  • x̄ is the sample mean
  • σ is the population standard deviation (z-tests), and Sx is sₙ₋₁, the unbiased estimate of the population standard deviation (t-tests).

Sorting out which variables are known will usually point to the right function when it isn't clear which one to use.

Type I and II errors

  • Falsely rejecting H₀ is a Type I error. The chance of this occurring is equal to the level of significance at which the test is performed.
  • Falsely accepting H₀ is a Type II error. The chance of making a Type II error increases with stricter levels of significance, as the critical region shrinks. Calculating the probability of a Type II error requires an alternative value, i.e. the "true" mean. A Type II error is the chance of accepting that the mean is a when it is in fact b, meaning that the probability of a Type II error depends on b. Therefore the Type II error probability is the chance of the sample mean x̄ falling outside the rejection region when the true mean is b; this can be calculated using normalcdf (see the sketch below).
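
A sketch of a Type II error calculation (assuming SciPy; all numbers are illustrative): for a one-tailed test of H₀: μ = a against H₁: μ > a, the rejection region starts at the critical value c, and β is the probability of a sample mean below c when the true mean is b:

    from math import sqrt
    from scipy.stats import norm

    a, b, sigma, n, alpha = 6.0, 6.5, 1.5, 36, 0.05
    se = sigma / sqrt(n)

    c = a + norm.ppf(1 - alpha) * se     # critical value: reject H0 if x_bar > c
    beta = norm.cdf(c, loc=b, scale=se)  # P(accept H0 | true mean is b), the Type II error

    print(c, beta)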

Chi-Squared Distribution


This distribution can be used to test whether a data set follows a particular distribution by comparing expected values to observed data. It can likewise be used to test whether two variables are independent.

The chi-squared (χ²) distribution depends on the degrees of freedom. The higher the degrees of freedom, the closer its shape becomes to the normal curve.

Note that all χ² tests are one-tailed. So don't go dividing the p-value in two or any of that funny stuff.

Goodness of fit


For GOF tests, H₀ states that the data follows a distribution while H₁ states that it does not follow the distribution. For example:

  • H₀: The data is from a uniform distribution
  • H₁: The data is not from a uniform distribution.

Degrees of freedom ν = number of classes (n) − number of restrictions (k). When there do not appear to be any restrictions (most cases), k = 1 due to the fact that the class frequencies must add up to a fixed total: once the values of all but one class have been found, the last class is unable to fluctuate. This means that in general for GOF tests ν = n − 1.

To calculate the probability that a data set follows a particular distribution, enter the observed and expected values into separate lists. If any of the expected frequencies are below five, combine that group with a neighbouring group. This is to avoid dividing by small numbers when calculating the test statistic, which would produce disproportionately large values. Because the number of classes falls when this is done, decrease the degrees of freedom accordingly. Additionally, subtract 1 from ν for each parameter (e.g. the mean) which is used to calculate the expected data but which is itself estimated from the observed data.

Then perform a χ²GOF-Test using that data. As with hypothesis testing, the p-value shows how likely it is to get a χ²-score greater than the one achieved with this data set, so if it is smaller than the level of significance, H₀ should be rejected and it can be concluded that the data does not follow the distribution at this particular level.
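
A sketch of a goodness-of-fit test (assuming SciPy; the die-roll counts are illustrative), testing whether observed frequencies fit a uniform distribution:

    from scipy.stats import chisquare

    observed = [22, 17, 20, 26, 21, 14]  # e.g. counts from 120 rolls of a die
    expected = [20] * 6                  # uniform distribution: 120/6 per face

    chi2_stat, p_value = chisquare(observed, f_exp=expected)  # 6 - 1 = 5 degrees of freedom by default

    print(chi2_stat, p_value)
    # Reject H0 (conclude the data is not uniform) only if p_value is below the significance level.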

Contingency tables


When testing for the independence of variables, a two-variable contingency table is used. This lists the combinations of frequencies of the two variables. For instance:

Contingency table relating smoking to hypertension
Amount of smoking
Degree of hypertension None Moderate Heavy Total
Severe 10 14 20 44
Mild 20 18 31 69
None 40 22 25 87
Total 70 54 76 200

When entering the data into the calculator the "total" column and row are omitted. Expected values do not have to be calculated manually (this would be done by multiplying the row total by the column total and dividing by the "total total", e.g. 70×44/200 = 15.4, the first cell in the expected value table). Instead, a χ²-test is performed and the expected values will be inserted into the "Expected" matrix.
Degrees of freedom in a χ²-test are (rows − 1)(columns − 1).
When there is a two-by-two contingency table, ν = 1. Normally one would use Yates' continuity correction, however this has been removed from the syllabus. Therefore just proceed as normal with ν = 1.

For independence test:

  • H₀: The variables are independent.
  • H₁: The variables are dependent.

As always, the p-value is the probability of a χ²-value larger than that observed occurring. H₀ is therefore rejected if the p-value is lower than the level of significance.
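
The smoking/hypertension table above can be tested directly (a sketch assuming SciPy): chi2_contingency returns the test statistic, the p-value, the degrees of freedom and the matrix of expected frequencies, so the 15.4 computed above appears in the output.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: severe, mild, no hypertension; columns: none, moderate, heavy smoking.
    observed = np.array([[10, 14, 20],
                         [20, 18, 31],
                         [40, 22, 25]])

    chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

    print(chi2_stat, p_value, dof)  # dof = (3 - 1)(3 - 1) = 4
    print(expected)                 # expected[0, 0] = 70 * 44 / 200 = 15.4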