Statistics/Testing Data/Chi-SquaredTest

General idea

Assume you have observed absolute frequencies ${\displaystyle o_{i}}$ and expected absolute frequencies ${\displaystyle e_{i}}$ under the null hypothesis of your test. Then it holds that

${\displaystyle V=\sum _{i}{\frac {(o_{i}-e_{i})^{2}}{e_{i}}}\approx \chi _{f}^{2}.}$

${\displaystyle i}$ might denote a simple index running from ${\displaystyle 1,...,I}$, or even a multi-index ${\displaystyle (i_{1},...,i_{p})}$ running from ${\displaystyle (1,...,1)}$ to ${\displaystyle (I_{1},...,I_{p})}$.

The test statistic ${\displaystyle V}$ is approximately ${\displaystyle \chi ^{2}}$-distributed if

1. all absolute expected frequencies ${\displaystyle e_{i}}$ satisfy ${\displaystyle e_{i}\geq 1}$ and
2. at least 80% of the absolute expected frequencies ${\displaystyle e_{i}}$ satisfy ${\displaystyle e_{i}\geq 5}$.

Note: Different books state slightly different approximation conditions; please feel free to add further ones.
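As a quick illustration, here is a minimal Python sketch (assuming NumPy; the helper name conditions_hold and the sample frequencies are ours) that checks both approximation conditions for a vector of expected frequencies:

```python
# A minimal sketch: check the two chi-squared approximation conditions.
import numpy as np

def conditions_hold(e):
    """True if all e_i >= 1 and at least 80% of the e_i are >= 5."""
    e = np.asarray(e, dtype=float)
    return bool(np.all(e >= 1) and np.mean(e >= 5) >= 0.8)

print(conditions_hold([12.5, 7.1, 6.0, 5.3, 1.2]))  # True: all >= 1, 80% >= 5
print(conditions_hold([12.5, 7.1, 6.0, 0.8, 1.2]))  # False: one cell below 1
```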

The degrees of freedom equal the number of absolute observed frequencies that can be chosen freely. We know that the sum of the absolute observed frequencies is

${\displaystyle \sum _{i}o_{i}=n}$

which means that the maximum number of degrees of freedom is ${\displaystyle I-1}$. We might have to subtract from this number the number of parameters we need to estimate from the sample, since each estimated parameter implies a further relationship between the observed frequencies.
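For concreteness, the following sketch (assuming SciPy; the die-roll counts are made up for illustration) computes ${\displaystyle V}$ for a goodness-of-fit test with ${\displaystyle I=6}$ categories and the maximal ${\displaystyle I-1=5}$ degrees of freedom:

```python
# A minimal sketch: Pearson's chi-squared test for a (hypothetically) fair die.
import numpy as np
from scipy.stats import chisquare

o = np.array([18, 24, 16, 14, 17, 31])   # observed frequencies, n = 120
e = np.full(6, o.sum() / 6)              # expected under H0: all faces equally likely

V, p = chisquare(f_obs=o, f_exp=e)       # V = sum((o-e)^2/e), df = I - 1 = 5
print(V, p)                              # reject H0 for small p

# If r parameters had been estimated from the sample, passing ddof=r would
# make SciPy use I - 1 - r degrees of freedom instead.
```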

Derivation of the distribution of the test statistic

Following Boero, Smith and Wallis (2002), we need some knowledge of multivariate statistics to understand the derivation.

The random variable ${\displaystyle O}$ describing the absolute observed frequencies ${\displaystyle (o_{1},...,o_{k})}$ in a sample has a multinomial distribution ${\displaystyle O\sim M(n;p_{1},...,p_{k})}$, with ${\displaystyle n}$ the number of observations in the sample and ${\displaystyle p_{i}}$ the unknown true probabilities. Under certain approximation conditions (central limit theorem) it holds that

${\displaystyle O\sim M(n;p_{1},...,p_{k})\approx N_{k}(\mu ;\Sigma )}$

with ${\displaystyle N_{k}}$ the ${\displaystyle k}$-dimensional multivariate normal distribution, ${\displaystyle \mu =(np_{1},...,np_{k})}$ and

${\displaystyle \Sigma =(\sigma _{ij})_{i,j=1,...,k}={\begin{cases}-np_{i}p_{j},&{\mbox{if }}i\neq j\\np_{i}(1-p_{i}),&{\mbox{otherwise}}\end{cases}}}$.

The covariance matrix ${\displaystyle \Sigma }$  has only rank ${\displaystyle k-1}$ , since ${\displaystyle p_{1}+...+p_{k}=1}$ .
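The following sketch (assuming NumPy; the probabilities are made up) builds ${\displaystyle \Sigma }$ for ${\displaystyle k=4}$ and confirms numerically that its rank is ${\displaystyle k-1}$:

```python
# A minimal sketch: the multinomial covariance matrix has rank k - 1.
import numpy as np

n = 100
p = np.array([0.2, 0.3, 0.1, 0.4])         # k = 4 cell probabilities, sum to 1
Sigma = n * (np.diag(p) - np.outer(p, p))  # sigma_ii = n p_i (1 - p_i),
                                           # sigma_ij = -n p_i p_j for i != j
print(np.linalg.matrix_rank(Sigma))        # 3 = k - 1, since the p_i sum to 1
```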

If we consider the generalized inverse ${\displaystyle \Sigma ^{-}}$, then it holds that

${\displaystyle (O-\mu )^{T}\Sigma ^{-}(O-\mu )=\sum _{i}{\frac {(o_{i}-e_{i})^{2}}{e_{i}}}\sim \chi _{k-1}^{2}}$

(for a proof see Pringle and Rayner, 1971).

Since the multinomial distribution is approximately multivariate normally distributed, it follows that

${\displaystyle \sum _{i}{\frac {(o_{i}-e_{i})^{2}}{e_{i}}}\approx \chi _{k-1}^{2}.}$

If there are further relations between the observed probabilities, the rank of ${\displaystyle \Sigma }$ will decrease further.
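We can check the identity and the approximation numerically. The sketch below (assuming NumPy and SciPy; all concrete numbers are made up) computes the quadratic form with the Moore-Penrose generalized inverse, compares it with Pearson's statistic, and estimates the rejection rate of the ${\displaystyle \chi _{k-1}^{2}}$ test under the null hypothesis by Monte Carlo:

```python
# A minimal sketch: quadratic form with a generalized inverse vs. Pearson's V.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
p = np.array([0.2, 0.3, 0.1, 0.4])            # k = 4
mu = n * p                                    # expected frequencies e_i
Sigma = n * (np.diag(p) - np.outer(p, p))
Sigma_pinv = np.linalg.pinv(Sigma)            # one choice of generalized inverse

o = rng.multinomial(n, p)                     # one multinomial sample
quad = (o - mu) @ Sigma_pinv @ (o - mu)       # (O - mu)^T Sigma^- (O - mu)
pearson = np.sum((o - mu) ** 2 / mu)          # sum (o_i - e_i)^2 / e_i
print(quad, pearson)                          # agree up to rounding error

# Monte Carlo check of the chi-squared approximation with k - 1 = 3 df:
sims = rng.multinomial(n, p, size=20000)
stats = np.sum((sims - mu) ** 2 / mu, axis=1)
print(np.mean(stats > chi2.ppf(0.95, df=3)))  # close to the nominal 0.05
```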

A common situation is that the parameters on which the expected probabilities depend need to be estimated from the observed data. As said above, it is usually stated that the degrees of freedom for the chi-squared distribution are ${\displaystyle k-1-r}$, with ${\displaystyle r}$ the number of estimated parameters. In the case of parameter estimation with the maximum-likelihood method this is only true if the estimator is efficient (Chernoff and Lehmann, 1954). In general, the degrees of freedom lie somewhere between ${\displaystyle k-1-r}$ and ${\displaystyle k-1}$.
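As an illustration, the following sketch (assuming SciPy; the counts are made up) fits a Binomial(4, ${\displaystyle \theta }$) distribution to grouped data. Since ${\displaystyle \theta }$ is estimated by maximum likelihood from the grouped counts themselves, the estimator is efficient and ${\displaystyle k-1-r=5-1-1=3}$ degrees of freedom apply:

```python
# A minimal sketch: goodness-of-fit with one parameter estimated from the data.
import numpy as np
from scipy.stats import binom, chisquare

o = np.array([20, 41, 49, 32, 8])               # counts of 0..4 successes, n = 150
n = o.sum()
theta_hat = (np.arange(5) * o).sum() / (4 * n)  # ML estimate from the grouped data
e = n * binom.pmf(np.arange(5), 4, theta_hat)   # expected frequencies under H0

V, p = chisquare(f_obs=o, f_exp=e, ddof=1)      # ddof = 1 for the estimated theta
print(theta_hat, V, p)
```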

Examples

The most famous examples will be handled in detail in further sections: the ${\displaystyle \chi ^{2}}$ test for independence, the ${\displaystyle \chi ^{2}}$ test for homogeneity and the ${\displaystyle \chi ^{2}}$ test for distributions.

The ${\displaystyle \chi ^{2}}$ test can also be used to construct "quick and dirty" tests, e.g.

${\displaystyle H_{0}:}$  The random variable ${\displaystyle X}$  is symmetrically distributed versus

${\displaystyle H_{1}:}$  the random variable ${\displaystyle X}$  is not symmetrically distributed.

We know that in the case of a symmetrical distribution the arithmetic mean ${\displaystyle {\bar {x}}}$ and the median should be nearly the same. So a simple way to test this hypothesis is to count how many observations are less than the mean (${\displaystyle n_{-}}$) and how many observations are larger than the arithmetic mean (${\displaystyle n_{+}}$). If mean and median are the same, then 50% of the observations should be smaller than the mean and 50% should be larger than the mean. It holds

${\displaystyle V={\frac {(n_{-}-n/2)^{2}}{n/2}}+{\frac {(n_{+}-n/2)^{2}}{n/2}}\approx \chi _{1}^{2}}$ .
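A minimal sketch of this symmetry test (assuming NumPy and SciPy; the exponential sample merely serves as a clearly skewed example):

```python
# A minimal sketch: "quick and dirty" chi-squared test of symmetry.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
x = rng.exponential(size=200)                   # clearly skewed sample
n_minus = np.sum(x < x.mean())                  # observations below the mean
n_plus = np.sum(x > x.mean())                   # observations above the mean
n = n_minus + n_plus                            # ties with the mean are dropped

V = (n_minus - n / 2) ** 2 / (n / 2) + (n_plus - n / 2) ** 2 / (n / 2)
p = chi2.sf(V, df=1)                            # upper tail of chi^2 with 1 df
print(V, p)                                     # small p: reject symmetry
```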

References

• Boero, G., Smith, J., Wallis, K.F. (2002). The properties of some goodness-of-fit tests. University of Warwick, Department of Economics, The Warwick Economics Research Paper Series 653. http://www2.warwick.ac.uk/fac/soc/economics/research/papers/twerp653.pdf
• Chernoff, H., Lehmann, E.L. (1954). The use of maximum likelihood estimates in ${\displaystyle \chi ^{2}}$ tests for goodness of fit. The Annals of Mathematical Statistics, 25:579-586.
• Pringle, R.M., Rayner, A.A. (1971). Generalized Inverse Matrices with Applications to Statistics. London: Charles Griffin.
• Wikipedia, Pearson's chi-square test: http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test