Statistics Ground Zero/Descriptive Statistics
Descriptive Statistics
Descriptive statistics summarize the quantitative character of your data set. They are used to describe data so as to illustrate how some phenomenon appears in the cases observed. Descriptive statistics answer questions like What proportion of the cases have blue eyes? or What is the typical household income of the cases observed?. In computing descriptive statistics, we do not intend to make any inference from the data collected to any population larger than or outside of the cases observed.
Population and Sample
Where I give a calculation in Descriptive Statistics I will give the formula for the population parameter unless otherwise specified. The population parameter is used when data have been collected for all the cases under investigation and all of them enter the calculation. If the data set is a sample (perhaps data collected for some representative number of cases) then the sample statistic is used instead. The mean is computed in the same way in both situations: the sum of the scores divided by the number of cases (N). The parameter and the statistic differ in the measures of dispersion built on the mean, such as the variance: for the population parameter the sum of squared deviations is divided by N, whereas for the sample statistic it is divided by the number of cases minus one (N-1). Dividing by N-1 compensates for the tendency of a sample to understate the spread of the population it was drawn from. For a small number of cases subtracting one from N has a very large effect; for a large number of cases subtracting one from N has a much smaller effect, so the difference between the sample statistic and the population parameter shrinks as N increases.
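The effect of the two divisors can be sketched in Python. This is a minimal illustration, assuming the standard library `statistics` module, whose `pvariance` divides the sum of squared deviations by N and whose `variance` divides by N-1; the ten scores are made up for the example.

```python
import statistics

# Hypothetical scores, treated first as a whole population and then as a sample.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

mean = statistics.mean(scores)                      # sum / N in both situations
population_variance = statistics.pvariance(scores)  # sum of squares / N
sample_variance = statistics.variance(scores)       # sum of squares / (N - 1)

print(mean, population_variance, sample_variance)
```

With only ten cases the two variances differ noticeably; with thousands of cases they would be almost identical.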
Frequency
Frequencies can be computed for nominal, ordinal and continuous variables, though with a slightly different meaning for the last. For a discrete (that is, nominal or ordinal) variable, the frequency is the count of instances of each level in the data. For a continuous variable, it is common to bin the observed values into groups of a particular width (for example, one bin might contain scores between 0 and 5, the next scores between 6 and 10, and so on) and to count the cases falling in each bin.
Dealing with a single variable (univariate)
If we imagine a class of school students and test scores, we know that a number of students might score 50/100, another group 65/100 and so on. The number scoring at each level is the frequency of that score. If we record these frequencies we have the frequency distribution of that variable. We can tabulate frequencies as counts, percentages and cumulative percentages of the data.
The following table tabulates age data. For each age to the nearest year encountered in the data, the frequency is counted and the absolute figure recorded, and percentages calculated.
AGE | Frequency | Percent | Valid Percent | Cumulative Percent |
---|---|---|---|---|
10.00 | 5 | 17.9 | 17.9 | 0 + 17.9 = 17.9 |
11.00 | 10 | 35.7 | 35.7 | 17.9 + 35.7 = 53.6 |
12.00 | 10 | 35.7 | 35.7 | 53.6 + 35.7 = 89.3 |
13.00 | 3 | 10.7 | 10.7 | 89.3 + 10.7 = 100.0 |
Total | 28 | 100.0 | 100.0 | |
In this example, the valid percent column is identical to the percent column because there are no missing data, that is cases for which the age is unknown.
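A frequency table of this kind can be built in a few lines. The sketch below assumes the raw data are simply a list of 28 ages, reconstructed here from the counts in the table; the variable names are illustrative only.

```python
from collections import Counter

# Ages reconstructed from the frequency table above (28 cases, none missing).
ages = [10] * 5 + [11] * 10 + [12] * 10 + [13] * 3

counts = Counter(ages)
n_cases = len(ages)

rows = []
cumulative = 0.0
for age in sorted(counts):
    percent = 100 * counts[age] / n_cases
    cumulative += percent
    # (age, frequency, percent, cumulative percent), rounded for display
    rows.append((age, counts[age], round(percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

The running total is accumulated from the unrounded percentages and rounded only for display, which avoids compounding rounding error in longer tables.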
Dealing with more variables (bivariate/multivariate): Cross Tabulation
We can describe the intersection of two categorical (that is, nominal or ordinal) variables by crosstabulation. Here, I crosstabulate eye colour by gender for a group of 76 students, equally divided by gender. In this case there are five rows and two columns: this is a five by two table. No significance is attached to which variable goes in rows and which in columns.
Crosstabulation eye colour x gender

eyecolour | gender: f | gender: m |
---|---|---|
blue | 6 | 6 |
brown | 12 | 12 |
green | 7 | 7 |
grey | 4 | 6 |
other | 9 | 7 |
Total | 38 | 38 |
Each cell in the table holds the count of how many students of each gender were observed to have just that eye colour. So, twelve of the male students had brown eyes and four of the female students had grey eyes and so on. These are the observed counts for the crosstabulation of these variables. Later we will see that these can be compared to the expected counts predicted by probabilities.
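Tallying a crosstabulation from case data can be sketched as follows. The example assumes each student is recorded as an (eye colour, gender) pair; since the raw cases are not given, they are rebuilt here from the observed counts in the table.

```python
from collections import Counter

# Observed counts copied from the crosstabulation above.
observed = {
    "blue": {"f": 6, "m": 6},
    "brown": {"f": 12, "m": 12},
    "green": {"f": 7, "m": 7},
    "grey": {"f": 4, "m": 6},
    "other": {"f": 9, "m": 7},
}

# Expand the counts into one (eye colour, gender) pair per student,
# which is how raw case data would normally arrive.
cases = [
    (eye, gender)
    for eye, by_gender in observed.items()
    for gender, count in by_gender.items()
    for _ in range(count)
]

cells = Counter(cases)                                  # one count per table cell
gender_totals = Counter(gender for _, gender in cases)  # column totals
eye_totals = Counter(eye for eye, _ in cases)           # row totals

print(cells[("grey", "f")], gender_totals["m"], eye_totals["brown"])
```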
Central Tendency
A common summary for numerical data is the location of the central point or middle of the data. This point is taken as an indicative answer to the question what is the most typical value for this variable? There is more than one way to determine the centre. I explain the three most common measures of central tendency below.
Mode
The mode is the most frequently occurring value in the data. If we go through the observations and tick off once for each occurrence of a particular score we obtain the frequency count for the data. The mode is the value with the highest frequency count. There is no guarantee that there will be a single modal value and so sometimes we hear data described as bi-modal or multi-modal.
The mode is the least powerful of the measures of central tendency since it exploits so little information from the data.
The mode is the only measure of central tendency that can be computed for nominal data. It can also be computed for ordinal data.
The mode is often visualised with a histogram.
Example
Suppose that we tally the age to the nearest whole year of a class of schoolchildren and we get the following result
Age in Years | Frequency of Occurrence |
---|---|
10 | 5 |
11 | 10 |
12 | 10 |
13 | 3 |
Here there are two modal values: 11 years and 12 years. This is a bimodal distribution of scores.
[Figure: histogram of the age data, with the two tallest bars, of equal height, at 11 and 12 years.]
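Finding every modal value, rather than just one, can be sketched with `statistics.multimode` from the Python standard library (available from Python 3.8). The list of ages is reconstructed from the frequency table in this example.

```python
import statistics

# Ages reconstructed from the frequency table in this example.
ages = [10] * 5 + [11] * 10 + [12] * 10 + [13] * 3

# multimode returns every value tied for the highest frequency count.
modes = statistics.multimode(ages)
print(modes)
```

A result with two entries indicates a bimodal set of scores.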
Median
The median is the middle score in the data set. The scores are first ranked; if the number of cases is odd, the median is the middle-ranking score. If the number of cases is even, the two middle-ranking scores are summed and divided by two to calculate the median.
The median exploits more information about the data than the mode since the data are ranked, and is a more powerful expression of the central tendency.
The median can be computed for ordinal, interval and ratio data.
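Example
Consider the data in the table above. There are 28 cases, an even number, so we rank the ages and take the two middle-ranking cases, the 14th and the 15th. Counting up through the frequencies, ranks 1 to 5 are aged 10 and ranks 6 to 15 are aged 11, so both middle cases are aged 11. Adding them and dividing by two gives 11 years as the median value. Note that the median is taken over the 28 cases, not over the four distinct values (10, 11, 12, 13) that happen to be present in the data.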
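The ranking procedure can be sketched in Python. The sketch expands the frequency table into the 28 individual ages, so that the median is taken over all the cases, and then picks out the two middle ranks (the 14th and the 15th); the variable names are illustrative only.

```python
import statistics

# One entry per pupil, reconstructed from the frequency table above.
ages = [10] * 5 + [11] * 10 + [12] * 10 + [13] * 3

ranked = sorted(ages)
n_cases = len(ranked)

# With an even number of cases, average the two middle-ranking scores.
middle_pair = (ranked[n_cases // 2 - 1], ranked[n_cases // 2])
median_age = sum(middle_pair) / 2

# The standard library applies the same rule.
assert median_age == statistics.median(ages)
print(middle_pair, median_age)
```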
Mean
Here, I take mean to be the arithmetic mean or average - ignoring items like the geometric and harmonic means.
The mean is calculated as the sum of all scores for a variable divided by the number of cases. The mean is the most powerful indicator of central tendency, exploiting the most information from the data. The formula for the population mean is often written as
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where $x_i$ is the score of the $i$-th case and $N$ is the number of cases.
The mean can be computed only for interval and ratio data. [1]
Example
Consider the data in the table above. There are four values present in the data: 10, 11, 12, 13. Each case counts once, so the sum of the ages in years for this class is (5 x 10) + (10 x 11) + (10 x 12) + (3 x 13) = 319. The total number of cases is 28, that is, N = 28. So we divide 319 by 28 to get the mean age: 11.39 years.
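The same arithmetic can be sketched directly from the frequency table, weighting each age by its count. The dictionary below is just one convenient way to hold the table.

```python
# Age -> frequency, copied from the table above.
frequencies = {10: 5, 11: 10, 12: 10, 13: 3}

total_age = sum(age * count for age, count in frequencies.items())
n_cases = sum(frequencies.values())
mean_age = total_age / n_cases

print(total_age, n_cases, round(mean_age, 2))
```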
Dispersion
Dispersion is the degree of spread of values in a data set. Variation is central to statistical thinking. I will introduce some of the main indicators of the dispersion of values.
Dispersion is important in descriptive statistics because two groups or two variables might have similar means, medians or modes, for example, but differ widely in dispersion. For example, it is possible that the mean income in Mumbai and the mean income in Los Angeles are the same (I have no idea; I haven't checked) but you would not be surprised to discover that the spread of income across the populations of these two cities was very different.
Range
The range of a data set is the distance between the highest observed value and the lowest observed value of a variable.
Example
Consider the data in the table above. There are four values present in the data: 10, 11, 12, 13. The maximum value of age is 13 and the minimum value of age is 10. The range is therefore 13 - 10, which is 3.
Quartiles and the Interquartile Range
Quartiles
The quartiles are three points in the data that divide the cases into equal fourths. One of the quartile points is the median - quartile two. The first quartile cuts off the lowest 25% of the data set and the third quartile cuts off the highest 25% of the data set.
Interquartile Range
The interquartile range (IQR) is defined as quartile 3 minus quartile 1. This is a robust indication of the spread of values around the median. One common characterisation of an outlier defines it as a point lying more than one and a half times the IQR below quartile 1 or above quartile 3. An outlier is understood as a value so extreme as to be untypical.
Example
Consider the following data consisting of 32 pupils' scores in mathematics. The table shows frequency counts and cumulative percentages.
Exam Score | Count | Cumulative Percent |
---|---|---|
39 | 1 | 3.125 |
42 | 1 | 6.250 |
44 | 1 | 9.375 |
45 | 1 | 12.500 |
47 | 1 | 15.625 |
48 | 1 | 18.750 |
50 | 3 | 28.125 |
51 | 1 | 31.250 |
52 | 3 | 40.625 |
53 | 2 | 46.875 |
54 | 1 | 50.000 |
55 | 2 | 56.250 |
56 | 3 | 65.625 |
57 | 1 | 68.750 |
58 | 2 | 75.000 |
59 | 1 | 78.125 |
60 | 2 | 84.375 |
62 | 2 | 90.625 |
63 | 1 | 93.750 |
64 | 2 | 100.00 |
Total | 32 | 100 |
The median score is 54.5. The quartiles are
Quartile | Score |
---|---|
First (lowest 25%) | 50 |
Second (to median) | 54.5 |
Third (to 75%) | 58.5 |
So the interquartile range is 58.5 - 50 = 8.5. This is visualised by a boxplot.
The interquartile range is shown as a shaded box with a line indicating the location of the median score. Also indicated are the minimum and maximum scores, shown by the 'whiskers'. On this box plot the whiskers represent the actual minimum and maximum values. Some plots instead limit the whiskers to quartile 1 - 1.5(IQR) and quartile 3 + 1.5(IQR) and mark any points beyond them individually.
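The quartiles, the IQR and the 1.5(IQR) outlier rule can be sketched as below. Several conventions for computing quartiles exist and they can give slightly different answers; this sketch assumes the simple "median of each half" convention, which reproduces the figures in the table above. The function name is hypothetical.

```python
import statistics

# Exam score -> count, copied from the table above (32 pupils).
score_counts = {
    39: 1, 42: 1, 44: 1, 45: 1, 47: 1, 48: 1, 50: 3, 51: 1, 52: 3, 53: 2,
    54: 1, 55: 2, 56: 3, 57: 1, 58: 2, 59: 1, 60: 2, 62: 2, 63: 1, 64: 2,
}
scores = sorted(s for s, count in score_counts.items() for _ in range(count))


def quartiles(ranked):
    """Return (Q1, Q2, Q3) as the medians of the lower half, whole set and upper half."""
    half = len(ranked) // 2
    lower = ranked[:half]                 # for odd N the median itself is left out
    upper = ranked[len(ranked) - half:]
    return (statistics.median(lower),
            statistics.median(ranked),
            statistics.median(upper))


q1, q2, q3 = quartiles(scores)
iqr = q3 - q1

# Fences for the 1.5 x IQR outlier rule.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < low_fence or s > high_fence]

print(q1, q2, q3, iqr, outliers)
```

For these scores no value falls outside the fences, so a boxplot drawn under either whisker convention would look the same.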
Deviation
Deviation measures the distance between an observed score and the expected value for the variable under consideration (or perhaps the distance from some ideal value, in which case we often call the deviation the error). For a continuous variable, the expected value is the mean.
Consider the data in the table above. There are four values present in the data: 10, 11, 12, 13. The mean age is 11.39 years. To calculate the deviation of a score, for example 13, from the mean we take 11.39 from 13 to get 1.61. We notice that the deviation can be a positive or negative distance. Thus, taking a score of 11 years we calculate the deviation from the mean to be -0.39 years.
It would be useful to be able to characterise the dispersion in a data set by the average deviation from the mean, but we will see that initially it turns out to be more straightforward to deal with the average squared deviation from the mean.
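Why the plain average deviation is awkward can be seen in a short sketch. Using the 28 ages from the earlier table (reconstructed here from their frequencies), the signed deviations cancel each other out, so their sum is zero apart from floating-point rounding.

```python
# Ages reconstructed from the frequency table used earlier.
ages = [10] * 5 + [11] * 10 + [12] * 10 + [13] * 3
mean_age = sum(ages) / len(ages)

# Signed distance of each case from the mean.
deviations = [age - mean_age for age in ages]

deviation_of_13 = round(13 - mean_age, 2)   # a score above the mean
deviation_of_11 = round(11 - mean_age, 2)   # a score below the mean
total_deviation = sum(deviations)           # positive and negative parts cancel

print(deviation_of_13, deviation_of_11, total_deviation)
```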
Variance
Variance is the mean squared deviation of a data set. If we remember that variance is a mean, the definition becomes very easy to understand. The formula for the population variance is
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where $x_i$ is each score, $\mu$ is the mean and $N$ is the number of cases.
The top half of this formula is the sum of squares, that is, the sum of the squared deviations. The sum of squares divided by the number of cases is the variance. It is the average squared distance of a score in the data set from the mean for that variable.
Example
Consider the following set of values {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. This set has a mean of 5.5. If we sum the deviations we get zero, because the positive and negative deviations cancel, and dividing zero by N tells us nothing about spread. To avoid this we square the deviations. Tabulated, we get
Value | Deviation | Squared Deviation |
---|---|---|
1 | -4.5 | 20.25 |
2 | -3.5 | 12.25 |
3 | -2.5 | 6.25 |
4 | -1.5 | 2.25 |
5 | -0.5 | 0.25 |
6 | 0.5 | 0.25 |
7 | 1.5 | 2.25 |
8 | 2.5 | 6.25 |
9 | 3.5 | 12.25 |
10 | 4.5 | 20.25 |
Sum of Squares | | 82.50 |
After the squaring operation, we arrive at a figure for the sum of the squared deviations which we can divide by N to get the variance. Since N is 10, the variance is 8.25.
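The tabulated calculation can be reproduced step by step. This is a plain-Python sketch of the population variance (divisor N), using the same ten values.

```python
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n_cases = len(values)
mean = sum(values) / n_cases

deviations = [x - mean for x in values]          # the Deviation column
squared_deviations = [d ** 2 for d in deviations]  # the Squared Deviation column

sum_of_squares = sum(squared_deviations)
variance = sum_of_squares / n_cases              # population variance: divide by N

print(sum_of_squares, variance)
```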
Understanding of variance
This measure, the variance, is a very useful summary statistic for the dispersion in our data. Moreover, variance plays a central role in statistical thinking. Many common statistical techniques involve the computation and comparison of the variances of samples, populations or between variables. It suffers however from one drawback: suppose that the original variable represented height in meters; the variance is then expressed in meters squared. We have transformed a linear measure into a measure of area, a geometric measure. Squaring the deviations avoids a zero result, but the final figure is expressed in different units than the original. The solution lies in the derivation of the standard deviation.
Standard Deviation
The standard deviation is calculated straightforwardly as the square root of the variance. So the formula for the population standard deviation can be written
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
where $x_i$ is each score, $\mu$ is the mean and $N$ is the number of cases.
This quantity is now in the same units as the original values, overcoming the limitation on interpreting the variance.
Informally we might say that for an approximately normally distributed variable, observations are typically within one or two standard deviations of the mean; we will see below that we can be more precise.
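Taking the square root of the variance can be sketched as follows, again for the values 1 to 10 treated as a population. The final step, counting the values within one standard deviation of the mean, is only an illustration of the informal rule of thumb; this small set is not normally distributed.

```python
import math
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean = sum(values) / len(values)
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Back in the original units of the data.
standard_deviation = math.sqrt(variance)

# The standard library's population standard deviation agrees.
assert math.isclose(standard_deviation, statistics.pstdev(values))

within_one_sd = [x for x in values if abs(x - mean) <= standard_deviation]
print(round(standard_deviation, 2), within_one_sd)
```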
Shape
Skewness
Skewness tells you to what extent the distribution of values is symmetrical around the mean. If the distribution of values is symmetrical around the mean then the skew is zero. The normal or Gaussian distribution is the best-known symmetrical distribution.
[Figure: the bell-shaped curve of the normal distribution.]
This distribution of values can be expressed in terms of standard deviation. Around 68% of values lie within one standard deviation of the mean. About 95% are within two standard deviations of the mean. Much smaller fractions of the data set have values beyond two standard deviations. Further, in a normal distribution the median and the mean will be very close in value; in fact, for an ideal normal distribution mean = median = mode.
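The proportions of a normal distribution lying within a given number of standard deviations can be checked with a short sketch. It assumes `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_one = share_within(1)
within_two = share_within(2)
within_three = share_within(3)

print(round(within_one, 3), round(within_two, 3), round(within_three, 3))
```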
Distributions may be skewed with a long tail to the left - negative skew; or a long tail to the right - positive skew.
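The sign of the skew can be illustrated numerically. The text does not give a formula, so the sketch below assumes one common moment-based definition (the third central moment divided by the variance raised to the power 1.5); the data sets are made up.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
right_tailed = [1, 1, 1, 2, 2, 3, 10]       # hypothetical: long tail to the right
left_tailed = [-x for x in right_tailed]    # mirror image: long tail to the left

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
```

Statistical packages often apply small-sample corrections to this quantity, so their output may differ slightly from this uncorrected version.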
Kurtosis [2]
Kurtosis refers to the tailedness of the data. A distribution with a high kurtosis has tails (occasional extreme values) that are more extreme (heavier) than the tails of a normal distribution. The red line D in the graph below shows such a distribution, but high kurtosis generally does not correspond to such a pointy peak. A distribution with a low kurtosis has tails that are less extreme (lighter) than the tails of the normal distribution. The blue line W in the graph is an example of such a distribution, but low kurtosis generally does not tell you anything about the peak (the beta(.5,1) distribution is an example of a low-kurtosis distribution with an infinitely pointy peak). The normal distribution (the black line, N) has a kurtosis of zero.
In a data set with high kurtosis - long tails - more of the variability in the data is due to relatively infrequent extreme deviations from the mean for that variable. In a data set with low kurtosis, more of the variability in the data is due to moderate but frequent deviations.
The following graph illustrates the kurtosis of some well known distributions. Note, however, that the tails are not easily seen in such density graphs: Even when the distribution has "fat tails," the tails are still close to zero and not easily compared. Thus, it is difficult to discern kurtosis from these graphs. A better way to visualize the tails in reference to the normal distribution (i.e., kurtosis) is to use a normal quantile-quantile plot.
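Excess kurtosis can also be computed directly rather than read off a graph. The sketch below assumes the usual moment-based definition (the fourth central moment divided by the squared variance, minus 3, so that the normal distribution scores zero); both data sets are made up.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment over variance ** 2, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical data: evenly spread values with no extreme cases (light tails).
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Hypothetical data: mostly identical values plus two rare extremes (heavy tails).
heavy_tailed = [0] * 20 + [-10, 10]

print(round(excess_kurtosis(flat), 2), round(excess_kurtosis(heavy_tailed), 2))
```

In the second data set almost all of the variance comes from the two extreme cases, which is what a high kurtosis indicates.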
Notes
- [1] Occasionally you will see the mean given as the measure of central tendency for ordinal data and this can be justified if you are persuaded that underlying the rank scale there is a relatively isomorphous interval scale.
- [2] Terminology: in this section we will talk about what is technically excess kurtosis. In calculating excess kurtosis we adjust so that the kurtosis of the normal distribution is zero (rather than 3).