Analysis of a single sample
Before getting started with the theory, we will consider a simple example. The example will expose the basic components of a statistical analysis and aim to give an intuitive understanding of the type of results that can be obtained from such an analysis.
A naive question about human body temperature is whether it is, on average, 37°C. To answer this question, a study measured the temperature of 130 individuals. The data are available as normtemp.dat, and documentation can be found in (Shoemaker, 1996) and references therein. A brief summary of the data is given by the following so-called stem-and-leaf plot:
Stem-and-leaf plot for tempC (Body Temperature / Celsius)

tempC rounded to nearest multiple of .1
plot in units of .1

  35s | 7
  35. | 899
  36* | 011
  36t | 222222333333
  36f | 444444555
  36s | 66666666666677777777777777
  36. | 888888888888888999999999999
  37* | 0000000000111111111111111111
  37t | 2222222333333
  37f | 44445
  37s | 7
  37. | 8
  38* |
  38t | 2
This shows, for example, that one person (first from the top) had a body temperature of 35.7°C, another (second from the top) had 35.8°C, while two (third and fourth from the top) had 35.9°C, and so on. Most had temperatures between about 36.6°C and 37.3°C.
A very similar graphical presentation of the same data is given in the histogram below:
If we want to summarize the data using numbers, two important measures are the mean and the standard deviation (SD). For this sample of data, the mean is 36.805°C and the SD is 0.407°C. We will later give a more precise definition of these two summary statistics, but for now we can think of them as giving the central point of the data (the mean) and a measure of the variability of the data around their mean (the SD). Graphically, the mean can be thought of as the balancing point of the distribution if each column of the histogram had a weight proportional to its area. For this kind of data the SD can be interpreted as follows: if we take the mean and subtract two times the SD, and take the mean and add two times the SD, then approximately 95% of all the subjects will have body temperatures between these two limits.
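The two-SD rule of thumb is plain arithmetic on the two summary statistics quoted above. A minimal Python sketch follows; the variable names are illustrative and not part of the study or its data file.

```python
# Summary statistics quoted in the text (degrees Celsius).
mean_c = 36.805
sd_c = 0.407

# Rule of thumb: roughly 95% of individual temperatures
# fall between mean - 2*SD and mean + 2*SD.
lower = mean_c - 2 * sd_c
upper = mean_c + 2 * sd_c

print(f"about 95% of subjects between {lower:.2f} and {upper:.2f} degrees C")
```

The limits come out just below 36.0°C and just above 37.6°C, which fits the stem-and-leaf plot: only a handful of the 130 values lie in the rows at the very top and the very bottom.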
After this first look at the data, let us return to the question of interest: is the true average body temperature 37°C? One way of answering this is to turn the question around and start by assuming that 37°C actually is the true average temperature. The question then becomes: what would samples of size 130 from such a hypothetical population look like? This can be answered with statistical theory or, in this computerized age, with simulated samples based on random number generation. Specifically, let us assume that this hypothetical population has the distribution of body temperatures shown below:
Imagine that we next draw 1,000 samples, each consisting of 130 individuals. What would their means look like? This is shown here:
Again we see an approximately bell-shaped histogram. The averages of the samples are centered around 37°C (as the hypothetical distribution was), but with a much smaller spread than before: all of the one thousand averages lie between 36.89°C and 37.10°C. This is the first important observation we can make: the variability of an average is smaller (usually much smaller) than the variability of a single observation. We shall see later that an explicit formula exists describing this reduction in variability. The second important observation is that none of the one thousand samples had a mean of 36.805°C or less (remember that 36.805°C was the mean actually observed in the "real" individuals). It is thus extremely unlikely to get a sample mean as low as 36.805°C if the true average temperature in the population were 37°C. This is what is known as a statistically significant result, which is technical jargon for stating that the actually observed data are very different from what would be expected had the investigated hypothesis (that the true average body temperature is 37°C) been true. The answer to our question is thus that, while we cannot completely rule out that the true average body temperature is 37°C, it would be extremely unlikely to observe what we actually did observe had the true average been 37°C. Hence we conclude that the true average body temperature is not 37°C, i.e. we reject this hypothesis.
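The simulation described above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the code behind the figures: the hypothetical population is assumed to be normally distributed with mean 37°C and SD 0.407°C, and the seed and all names are arbitrary choices. Exact numbers will therefore differ somewhat from those quoted in the text.

```python
import random
import statistics

POP_MEAN = 37.0         # hypothesised true average (from the text)
POP_SD = 0.407          # SD of the hypothetical population (from the text)
SAMPLE_SIZE = 130       # individuals per sample (from the text)
N_SAMPLES = 1000        # number of simulated samples (from the text)
OBSERVED_MEAN = 36.805  # mean of the real sample (from the text)


def simulate_sample_means(n_samples, sample_size, rng):
    """Return the mean of each simulated sample.

    ASSUMPTION: individual temperatures are normally distributed.
    """
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


rng = random.Random(2024)  # arbitrary seed, used only for reproducibility
means = simulate_sample_means(N_SAMPLES, SAMPLE_SIZE, rng)

# How many simulated means are as low as the one actually observed?
n_as_low = sum(m <= OBSERVED_MEAN for m in means)

print(f"range of sample means: {min(means):.2f} to {max(means):.2f}")
print(f"SD of sample means: {statistics.stdev(means):.3f}")
print(f"SD of single observations: {POP_SD}")
print(f"sample means <= {OBSERVED_MEAN}: {n_as_low} of {N_SAMPLES}")
```

Two things are worth checking in the output: the SD of the sample means is far smaller than the SD of single observations, and the count of simulated means at or below the observed mean.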
So the average is not 37°C, which in itself is not a very informative finding. The far more interesting question is what the true average body temperature could then be, judged from what we observed in this sample. This is typically answered with a so-called confidence interval. The fundamental rationale behind a confidence interval is to use the data to obtain the best possible (i.e. narrowest) interval in which the true value of a parameter can be expected to be found with a given probability, typically 95%. Formulas for computing the interval depend on the type of data and the way they were sampled, but we will defer this to later chapters and for now concentrate on the interpretation.
Hence let us again consider the hypothetical population described above, with a mean of 37°C and an SD of 0.407°C. In the figure below we see the result of computing 95% confidence intervals for 50 random samples from this hypothetical distribution. Each sample is represented by a dot indicating its mean and a vertical line indicating the confidence interval computed from it.
Note that only two of the 50 confidence intervals do not contain 37°C, the true value for this hypothetical population. This matches the definition rather closely: 95% of such intervals should contain the true value. In other words, 5% of the intervals should not contain the true value, which for 50 samples corresponds to an average of 2.5 intervals not covering the true value.
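The coverage property can be checked by simulation. The sketch below rests on two assumptions that the text does not state: the hypothetical population is normal, and each interval is computed with the common large-sample formula, mean ± z · SD / √n. The formulas in later chapters may differ. All function names are illustrative.

```python
import math
import random
import statistics

TRUE_MEAN = 37.0   # true mean of the hypothetical population (from the text)
POP_SD = 0.407     # SD of the hypothetical population (from the text)
SAMPLE_SIZE = 130  # individuals per sample (from the text)
Z_95 = statistics.NormalDist().inv_cdf(0.975)  # about 1.96


def confidence_interval(sample):
    """Approximate 95% CI for the mean.

    ASSUMPTION: large-sample formula, mean +/- z * SD / sqrt(n).
    """
    centre = statistics.fmean(sample)
    half_width = Z_95 * statistics.stdev(sample) / math.sqrt(len(sample))
    return centre - half_width, centre + half_width


def count_misses(n_intervals, rng):
    """Count simulated intervals that fail to contain TRUE_MEAN."""
    misses = 0
    for _ in range(n_intervals):
        # ASSUMPTION: individual temperatures are normally distributed.
        sample = [rng.gauss(TRUE_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
        low, high = confidence_interval(sample)
        if not (low <= TRUE_MEAN <= high):
            misses += 1
    return misses


rng = random.Random(7)  # arbitrary seed, used only for reproducibility
print("misses among 50 intervals:", count_misses(50, rng))
print("misses among 2000 intervals:", count_misses(2000, rng))
```

With only 50 intervals the number of misses varies noticeably from run to run. With a few thousand intervals, the proportion of misses settles close to 5%.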
For our specific sample based on real observations, the 95% confidence interval is (36.73°C; 36.88°C). The main conclusion regarding the average human body temperature based on this dataset is thus that the best estimate of the true mean is 36.81°C, and we are 95% confident that the true value is somewhere in the range (36.73°C; 36.88°C). It is the latter part of the statement that is crucial to appreciating the value of a statistical analysis: if conducted properly, it yields a measure of the uncertainty of the results. In everyday words, we have measured the likely distance to the true value, even though we do not know its exact value.
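As a rough check, the quoted interval can be approximated from the three summary numbers alone. The sketch below assumes the common large-sample formula, mean ± z · SD / √n. That formula is an assumption here, since the text defers the exact method to later chapters. Because the inputs are rounded and the method may differ, small discrepancies in the last digit are expected.

```python
import math
import statistics

mean_c = 36.805  # sample mean (from the text)
sd_c = 0.407     # sample SD (from the text)
n = 130          # sample size (from the text)

# ASSUMPTION: large-sample normal approximation, mean +/- z * SD / sqrt(n).
z_95 = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
standard_error = sd_c / math.sqrt(n)

lower = mean_c - z_95 * standard_error
upper = mean_c + z_95 * standard_error

print(f"approximate 95% CI: ({lower:.3f}; {upper:.3f})")
```

The result agrees with the interval quoted above to within about 0.01°C.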