Social Statistics/Chapter 5

The Role of Error in Statistical Models

The United States has a long and troubled history of discrimination and repression based on race. Until 1865, slavery was widespread in the United States, with rich white people legally owning, oppressing, and abusing black people. For the next century, between 1865 and 1964, race-based segregation of schools, businesses, and other public places was legal, and in much of the country black Americans were prohibited from fully participating in society. Any American over age 50 today was born in a segregated country that did not give equal rights to its black citizens. It is not surprising that racial discrimination is still a major problem in America despite the election of America's first black President. After all, Barack Obama himself was born in a legally segregated America. One outcome of the long history of racial discrimination in America is a continuing wage gap between blacks and whites. Even blacks born long after the end of official segregation in America earn substantially lower salaries than whites of the same age. The race gap in wages can be illustrated using data from the 2008 US Survey of Income and Program Participation (SIPP). Wave 2 of the 2008 SIPP includes wage income data for 4964 employed Americans aged 20-29 (633 of them black and 4331 of them white). The overall mean income in the SIPP sample is $36,633 with a standard deviation of $29,341. The conditional mean incomes of the 633 blacks and 4331 whites in the SIPP sample are reported in Figure 5-1. The black mean is $6656 lower than the white mean.

 
Figure 5-1. Means and standard deviations of wage income for employed twentysomething Americans by race, 2008 (SIPP data)

In a mean model for black wage income, the expected value of wage income for twentysomething black Americans is $30,826. Observed incomes less than or greater than $30,826 would be error from the standpoint of this model. The standard deviation of the error ($22,723) indicates that there is a wide spread in the actual incomes of black Americans. The expected value of wage income for twentysomething white Americans is $37,482. The standard deviation of the error in that model ($30,096) indicates an even wider spread in incomes for white Americans than for black Americans. The mean model for black Americans uses one parameter (its mean) and is based on 633 cases, so it has 632 degrees of freedom. The mean model for white Americans has 4331 data points and 1 parameter, so it has 4330 degrees of freedom. Both models have plenty of degrees of freedom (anything more than 10 or so is fine). Another way to model the difference in incomes between black and white Americans would be to use a regression model. The coefficients of the regression of income on race are reported in Figure 5-2. In this regression model, the independent variable is race (coded as "blackness": 0 for whites and 1 for blacks) and the dependent variable is wage income. The regression model has an intercept of 37482 and a slope of -6656. In other words, the equation for the regression line is Income = 37482 - 6656 x Black. For whites (Black = 0), the expected value of wage income is 37482 - 6656 x 0 = 37482 - 0 = $37,482. For blacks (Black = 1), the expected value of wage income is 37482 - 6656 x 1 = 37482 - 6656 = $30,826. These expected values from the regression model are identical to the conditional means from the two mean models in Figure 5-1.
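
This arithmetic can be checked directly from the regression equation. The sketch below is purely illustrative (the function name is ours, not part of the SIPP materials); it simply plugs the coefficients reported in Figure 5-2 into the regression line.

```python
# A minimal sketch (coefficient values as reported in Figure 5-2; the function
# name is hypothetical) checking the expected incomes implied by the regression line.
intercept = 37482   # expected income when Black = 0
slope = -6656       # change in expected income when Black = 1

def expected_income(black):
    return intercept + slope * black

print(expected_income(0))   # 37482, the conditional mean for whites
print(expected_income(1))   # 30826, the conditional mean for blacks
```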

 
Figure 5-2. Regression of wage income on race for twentysomething Americans, 2008 (SIPP data)

The regression model uses all 4964 cases and has 2 parameters, leaving it with 4962 degrees of freedom. The error standard deviation in the regression model is $29,263 (regression error standard deviations usually aren't reported in tables of results, but they can be computed using statistical software programs). The slope of the regression line represents the race gap in incomes. The fact that the slope is negative means that the blacks in the SIPP sample reported earning less money than the whites in the SIPP sample. Does this mean that racial discrimination is still going on? That's hard to say. The high regression error standard deviation means that there is a lot of variability in people's incomes that is not captured by the regression model. The observed race gap of $6656 seems pretty big, but further analysis will be needed to determine whether or not it truly represents real racial differences in American society.
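
The equivalence between the regression model and the two conditional means is general: whenever the independent variable is a 0/1 dummy, the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. A short sketch with made-up numbers (not SIPP data) illustrates the point.

```python
# A sketch with hypothetical data showing why the intercept equals the mean for
# the group coded 0 and the slope equals the gap between the two group means
# when the independent variable is a 0/1 dummy.
import numpy as np

income = np.array([25000, 41000, 38000, 52000, 30000, 27000])  # hypothetical incomes
black = np.array([1, 0, 0, 0, 1, 1])                           # hypothetical race dummy

white_mean = income[black == 0].mean()
black_mean = income[black == 1].mean()

slope, intercept = np.polyfit(black, income, 1)   # ordinary least squares line

print(intercept, white_mean)             # intercept matches the conditional mean for Black = 0
print(slope, black_mean - white_mean)    # slope matches the gap between the conditional means
```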

This chapter introduces the concept of statistical inference in the context of mean and regression models. First, inferential statistics are used to make conclusions about the social world as a whole (Section 5.1). This contrasts with descriptive statistics, which merely describe the data that are actually observed and recorded in databases. Second, all inferential statistics are based on the idea that the model errors represented in the observed data are a random sample of all the errors that could have happened in the real world (Section 5.2). Different kinds of non-random sampling have different effects on model parameters. Third, all parameters estimated in statistical models are associated with error (Section 5.3). The error associated with the estimate of a parameter is called its standard error. An optional section (Section 5.4) explores how sample size is related to the power of a statistical model to make inferences about the world. Finally, this chapter ends with an applied case study of how well rich countries are meeting their obligations under the Monterrey Consensus on aid for poor countries (Section 5.5). This case study illustrates how standard errors can be used to make inferences in statistical models. All of this chapter's key concepts are used in this case study. By the end of this chapter, you should be able to make informed inferences about parameters like means and regression slopes and to use these inferences to more accurately describe the social world.

5.1. From descriptive statistics to inferential statistics

Like most people around the world, Americans are getting fatter. This is a serious problem, because obesity is closely linked to a range of health problems including diabetes, joint problems, and heart disease. Many people also consider obesity unattractive, and want to weigh less than they do. According to data from the US Health and Nutrition Examination Survey (NHANES), the mean weight of Americans aged 20-29 is 155.9 lbs. for women and 188.3 lbs. for men. These figures are up dramatically from the first NHANES, conducted in the early 1960s, when the means for Americans in their 20s were 127.7 lbs. for women and 163.9 lbs. for men. Means and standard deviations for the weights of American twentysomethings, broken down by gender, are reported in Figure 5-3.

 
Figure 5-3. Weight in pounds of Americans aged 20-29 (NHANES data)

Clearly, the 672 women surveyed in the 1960-1962 NHANES recorded much lower weights than the 706 women surveyed in the 2003-2006 NHANES. Does that mean that women really were lighter in the 1960s? It probably does, but both means are associated with large amounts of error. There is error in the mean model because every person in the NHANES database deviates from the national mean for various different reasons. Potential reasons why an individual person's weight might deviate from the mean weight for people of the same gender nationwide include things like:
  • A person's height
  • How much a person eats
  • How much a person exercises
  • A person's genetic tendency to store energy as fat
The 672 women represented in the first column of Figure 5-3 had a mean weight of 127.7 lbs. Of course, they didn't all weigh 127.7 lbs. Even in the early 1960s, not everyone looked like Marilyn Monroe. Given the standard deviation of 23.3 lbs. reported in Figure 5-3, most women in their 20s would have weighed between 104.4 and 151.0 pounds. A made-up sample of some of the 672 women from the 1960-1962 NHANES and the reasons why they might have deviated from the national mean weight is presented in Figure 5-4. In reality, each woman would have had hundreds or thousands of individual reasons for deviating from the mean. Everything we eat or drink, every step we take, and even the amount of time we spend sleeping can affect our weight. Even a woman who weighs exactly the mean weight might have reasons for being heavier than the mean and reasons for weighing less than the mean that just happen to cancel each other out.

 
Figure 5-4. Illustration of potential reasons why women surveyed in the 1960-1962 NHANES may have deviated from the mean national weight

The mean and standard deviation of women's weights are compared to the mean and standard deviation of the error in the mean model for weight in Figure 5-5. The only difference between the two sides of Figure 5-5 is the scale. On the left side, the women's weights are spread around the mean (127.7 lbs.). On the right side, the women's weights are spread around 0. The amount of spread in both cases is the same (standard deviation = 23.3 lbs.).

 
Figure 5-5. Comparison of the standard deviation of weight (left side) and the standard deviation of error from the mean weight (right side) for 13 illustrative women
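
The point of Figure 5-5 can also be checked numerically: subtracting the mean from each weight recenters the values on 0 without changing their spread. The sketch below uses a few of the illustrative weights described in the text plus made-up values, not actual NHANES records.

```python
# Illustrative only: a handful of weights (some from the text's description of
# Figure 5-4, the rest invented) showing that errors from the mean model spread
# around 0 with the same standard deviation the weights have around their mean.
import numpy as np

weights = np.array([140.0, 99.5, 109.1, 115.6, 127.7, 151.2, 177.7])
errors = weights - weights.mean()   # error = observed weight minus the model's expected value

print(weights.mean())                            # the observed mean (the mean model's one parameter)
print(errors.mean())                             # the errors average out to (essentially) zero
print(weights.std(ddof=1), errors.std(ddof=1))   # identical spreads; only the center shifts
```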

The mean model used to describe women's weights summarizes the characteristics of the data we actually have on weights in a simple descriptive model. Descriptive statistics is the use of statistics to describe the data we actually have in hand. A mean model of women's weights tells us the observed mean weight of the specific women in our databases. Similarly, regression models tell us the observed slopes and intercepts of regression lines for the data in our databases. These means, slopes, and intercepts are the parameters of models as observed using actual data. Observed parameters are the actually observed values of parameters like means, intercepts, and slopes based on the data we actually have in hand. Descriptive statistics is focused on finding and reporting observed parameters. It may seem like finding and reporting observed parameters is what statistics is all about, but the fact is that observed parameters are only the beginning of the story. We're not really interested in the actually observed weights of the 672 twentysomething American women who were included in the 1960-1962 NHANES database. What we're really interested in is making inferences about the true mean weight of American women in general or about the true difference between women's weights in 1960-1962 and women's weights in 2003-2006. Inferential statistics is the use of statistics to make conclusions about characteristics of the real world underlying our data. We've already used mean and regression models to make inferences about the real world, but we've done so informally. With the move from descriptive statistics to inferential statistics, we'll start using statistics to make formal inferences about characteristics of the real world behind our data. Observed parameters are descriptive statistics. They say something about the data themselves, but nothing about the larger world. They say that the weights of these 672 particular women averaged out to 127.7 lbs. on the particular days they were weighed using the particular scales in their particular doctors' offices. We can use this information to make inferences about the larger world, but it's like circumstantial evidence in a criminal case. After all, the NHANES was conducted over a three-year period, but every hour of every day you gain or lose weight. Your weight changes every time you eat or drink, or even breathe. You're sweating, losing hairs, and shedding skin all the time. Your body structure is always changing as you gain or lose fat, muscle, or bone. In short, your weight is constantly changing. As a result, your observed weight at any one point in time is not the same thing as your "true" weight. True parameters are the true values of parameters like means, intercepts, and slopes based on the real (but unobserved) characteristics of the world. Your observed weight may be changing all the time, but it still tends to stay roughly the same from month to month and year to year. At any one point in time there's a weight around which your body varies. This is your true weight. If you weighed yourself every hour on the hour for a whole year and took the mean of all these observed weights, the mean would be something like your true weight. The goal of inferential statistics is to make inferences about the true values of parameters. The observed values of parameters are a good guide to the likely true values of parameters, but observed parameters always include some error.
Inferential statistics is focused on understanding the amount of error in observed parameters. This amount is then used to make inferences about how much true parameters might differ from observed parameters. For example, the observed mean weight of American twentysomething women in 1960-1962 was 127.7 lbs. Is it possible that the true mean weight of American twentysomething women in 1960-1962 was 128 lbs.? Maybe. Is it possible that their true mean was 130 lbs.? Unlikely. Is it possible that their true mean was 155.9 lbs., the same as women in 2003-2006? Essentially impossible. Inferential statistics will allow us to make conclusions like this with confidence.

5.2. Types of error

The island of Taiwan has had a difficult history. Long a part of China, it was subjected to 50 years of Japanese occupation from 1895 to 1945. Then, in 1949, 1.5 million mainland Chinese refugees from the Communist takeover of China flooded into Taiwan, swelling the population from 6 million to 7.5 million in one year. From 1950 through 1991 Taiwan was ruled by a military government that was dominated mainly by Chinese who had fled to the island in 1949. In short, for nearly a century before 1991 Taiwan was ruled by one form of dictatorship or another. No one alive in Taiwan today ever experienced democracy before the first free elections in 1991. As a result, younger Taiwanese have grown up under democracy, but older Taiwanese have strong memories of living under dictatorship. Are Taiwanese people today happy with the state of their democracy? Everywhere in the world social scientists find that people desire more democracy than they feel they have. The difference between people's desire for democracy and people's perception of how much democracy they actually have is called the "democratic deficit." Like people around the world, people in Taiwan feel that they have less democracy than they would like. People's ratings of democracy in Taiwan can be studied using data from the World Values Survey (WVS), which was conducted in Taiwan in 2006. The democracy rating has been scored on a scale from 0 to 100 where:
  • Rating = 0 means the respondent thinks there is not enough democracy in Taiwan
  • Rating = 50 means the respondent thinks there is just the right amount of democracy in Taiwan
  • Rating = 100 means the respondent thinks there is too much democracy in Taiwan
The results of a mean model for the democracy rating in Taiwan are summarized in Figure 5-6. The mean rating of 38.8 indicates that most people in Taiwan think there is less democracy than they would like, just as in the rest of the world. Since the democracy rating score is less than 50, there is a democratic deficit in Taiwan. Of course, not everyone in Taiwan feels this way. The standard deviation of 14.1 indicates that there is a wide spread in attitudes toward democracy. Still, it is pretty clear that Taiwanese people as a whole would like more democracy than they feel they have. The mean score (38.8 points) is almost one full standard deviation below 50.

 
Figure 5-6. Mean model for democracy rating in Taiwan, 2006 (WVS data)
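
The last comparison above is simple arithmetic, sketched here for reference (the figures are the mean and standard deviation reported in the text).

```python
# A quick, illustrative check of how far the observed mean rating sits below
# the "just right" score of 50, measured in standard deviation units.
mean_rating = 38.8
sd_rating = 14.1

print((50 - mean_rating) / sd_rating)   # about 0.79, almost one full standard deviation
```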

In the mean model, each person in Taiwan is modeled as having a score of 38.8, plus or minus some deviation or error. This error is known as model error. It doesn't necessarily mean that there was a mistake in measuring someone's democracy rating. It means that the model gave an expected rating -- 38.8 -- that for many people was in error. Most people don't have a democracy rating of exactly 38.8. They have scores that are either lower or higher. These lower and higher scores average out to an observed mean of 38.8 points. The goal of the mean model summarized in Figure 5-6 is to find the true mean of how people in Taiwan feel about democracy. We don't know the true mean, but we do know that the observed mean is 38.8 on a scale from 0 to 100. The observed mean might differ from the true mean due to error. Broadly speaking, there are three different types of model error in a mean model:
  • Measurement error
  • Sampling error
  • Case-specific error
Measurement error is error resulting from accidents, mistakes, or misunderstandings in the measurement of a variable. For example, a respondent might mark the wrong oval on a survey form, or a question might be badly worded. Respondents might not remember the answer to a question, or might misunderstand the question. In a telephone survey, the researcher might not hear the respondent correctly, or might type in the wrong answer. Accidents happen. Since the observed mean democracy rating is calculated from people's actual answers as recorded on the survey, it might differ from the true mean if these recorded answers are wrong.
Sampling error is error resulting from the random chance of which research subjects are included in a sample. Taiwan today is home to 22.8 million people. Only 1216 of them were included in the survey. It is possible that these 1216 people are not truly representative of the Taiwanese population. Every person's rating of democracy in Taiwan is the result of millions of influences and experiences. Ideally, all of these typical Taiwanese experiences should be reflected in the people chosen for the survey. If the sum total of all the influences experienced by the people answering the survey differs from the sum total of the influences experienced by the population as a whole, the observed mean from the survey will differ from the true mean of the population as a whole. For example, the survey design might not include sampling for hospitalized or homeless people, and so their experiences would not be reflected in the observed mean.
Case-specific error is error resulting from any of the millions of influences and experiences that may cause a specific case to have a value that is different from its expected value. Most of the error in any statistical model is case-specific error. Each person's unique experience of the world determines that person's views on subjects like democracy. Since everyone has a different experience of the world, everyone differs from the mean for different reasons and in different ways. People with different identities, backgrounds, or even moods the day the question is asked will give different answers. Since these characteristics of people are always changing, the observed mean at any one time may differ from the true mean of the research subjects in the study. Case-specific error is so large because every person's answer to any question represents a kind of random sample of all the potential influences that can possibly be experienced in a society.
In the mean model, the results of all of these different and unique experiences are lumped together into the model error. Linear regression models, on the other hand, take some of those unique experiences and bring them into the model. The independent variable in a regression model represents some part of what makes each case unique. For example, one thing that shapes people's views on democracy is their age. Older Taiwanese people grew up under a military dictatorship. We might theorize that people who grew up under a military dictatorship would be thankful for any kind of democracy. One hypothesis based on this theory would be that older people would rate Taiwan's democracy more highly than younger people. The results of a linear regression model using age as the independent variable and democracy rating as the dependent variable are reported in Figure 5-7.

 
Figure 5-7. Regression of democracy rating on age in Taiwan, 2006 (WVS)

The slope reported in Figure 5-7 is positive. Each additional year of age is associated with a rise of 0.105 in the expected value of a person's democracy rating. Using the coefficients in Figure 5-7, we could calculate the expected value of a 20-year-old Taiwanese person's rating of Taiwan's democracy as 34.223 + 20 x 0.105 = 36.323 on a scale from 0 to 100. The expected democracy rating for a 60-year-old Taiwanese would be 34.223 + 60 x 0.105 = 40.523, or about 4 points higher. That's not a lot, but it does tend to confirm the theory that age affects people's ratings of democracy in Taiwan. At least part of the case-specific error in Taiwanese democracy ratings can be traced back to age. In fact, one way to think about what regression models do is to think of them as explaining part of the case-specific error in a mean model. This is very clearly illustrated in Figure 4-10 and Figure 4-16 in Chapter 4. In Figure 4-10, a large part of the case-specific error in smoking rates in the mean model for Canadian provinces (left side of the figure) was attributed to the average temperature in each province (right side). The standard deviation of the error in the mean model was 5.3%. After taking temperature into account, the standard deviation of the error in the regression model was just 3.8%. A big chunk of the case-specific error in the mean model disappeared in the regression model. This error that disappeared was the error due to the differences in temperature across Canadian provinces. In the example of democracy ratings in Taiwan, the mean model has an error standard deviation of 14.1 (on a scale from 0 to 100). The regression model error standard deviation (not reported in the regression table) is 14.0 (on a scale from 0 to 100). Only a very small portion (0.1) of the case-specific error in Taiwanese democracy ratings is due to age. It is small because the effect of age reported in the regression model (Figure 5-7) is very small. Age isn't a big determinant of democracy ratings in Taiwan, but it is a factor. It's a small part of what makes people differ from the overall mean for Taiwan. Measurement error, sampling error, and case-specific error can be present in any statistical model, but most of inferential statistics focuses on case-specific error. Regression models in particular focus on attributing part of the case-specific error in dependent variables to the research subjects' scores on the independent variables. Measurement error and sampling error do affect regression models, but in very subtle ways. These are discussed in Chapter 12. Until then, when discussing model error we will focus exclusively on case-specific error.
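
The two predicted ratings above can be reproduced directly from the regression line. A minimal sketch (coefficients from Figure 5-7; the function name is ours):

```python
# A minimal sketch computing expected democracy ratings at different ages from
# the coefficients reported in Figure 5-7.
intercept = 34.223   # expected rating at age 0 (an extrapolation, not meaningful by itself)
slope = 0.105        # expected change in rating per additional year of age

def expected_rating(age):
    return intercept + slope * age

print(expected_rating(20))   # 36.323
print(expected_rating(60))   # 40.523, about 4 points higher than at age 20
```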

5.3. The standard error of a parameter

The large amount of error in statistical models can make it difficult to make inferences. Returning to the example of the race gap in wages (Figure 5-1), can we have any confidence that the true mean wages of blacks and whites are at all close to the observed means of $30,826 and $37,482? On the one hand, there is a very large amount of error in these mean models. On the other hand, the means in both models are based on very large samples of cases (633 blacks and 4331 whites). When a model is estimated using a large number of cases, the case-specific errors tend to cancel each other out. There may be an enormous amount of case-specific error (as in Figure 5-1), but if all the positive errors are balanced by negative errors, the observed mean might be very close to the true mean. Error is only a problem if, just by chance, there is too much positive error or too much negative error. The power of large numbers of cases to even out errors and produce a more accurate observed mean can be illustrated using the sample data on the weights of American women presented in Figure 5-4. Imagine if we tried to calculate the mean weight of American women in the 1960s using the weight of just one random woman. We might pick woman 3 and get a mean weight of 140.0 lbs. or woman 6 and get a mean weight of 115.6 lbs. If we based our mean model on just one woman's weight, there would be a lot of error in our observed mean. In fact, using just one case to calculate a mean in a mean model would give a range of means that had exactly the same spread as the women's weights themselves. The mean calculated based on one case could be anything from 99.5 lbs. (the weight of woman 4 in Figure 5-4) to 177.7 lbs. (the weight of woman 9). A mean model estimated using just two cases would give a much more accurate observed mean. The two lightest women in Figure 5-4 weigh 99.5 lbs. (woman 4) and 109.1 lbs. (woman 5). The mean of these two cases is 104.3 lbs. The mean of the two heaviest women (women 3 and 9) is 158.85 lbs. Thus a mean model based on any two random cases from Figure 5-4 would come up with an observed mean somewhere between 104.3 and 158.85 lbs. This compares to a range for one case of between 99.5 lbs. and 177.7 lbs. The range of possible means is narrower for two cases than for one case. For three cases, it would be even narrower. Once you get up to 672 cases, case-specific error is almost guaranteed to average out across all the cases. It turns out that the accuracy of parameters like means, slopes, and intercepts increases rapidly as more and more cases are used in their estimation. As sample sizes get larger, the observed levels of parameters get closer and closer to their true levels. There is always the potential for error in the observed parameters, because there is always case-specific error in the variables used in the models. Nonetheless, when models use large numbers of cases, the amount of error in observed parameters can be very small. Standard error is a measure of the amount of error associated with an observed parameter. The standard error of an observed parameter tells us how close it is likely to be to the true parameter. This is extremely important, because it enables us to make inferences about the levels of true parameters like means, slopes, and intercepts. Standard error depends on the number of cases used and on the overall amount of error in the model.
Standard error is easy to calculate in mean models, but follows a more complicated formula in regression models. The calculation of standard error is covered in Section 5.4. As with the standard deviations of variables, statistical software programs routinely calculate the standard errors of all parameters. For the purpose of understanding where standard errors come from, it's enough to know that as the number of cases goes up, the standard errors of parameters go down. Smaller standard errors mean that observed parameters more accurately reflect true parameters. Returning to the race gap in income (Figure 5-1), the observed mean income for twentysomething blacks was $30,826. This mean model had a very high error standard deviation ($22,723). It turns out that the standard error of the observed mean in this model is just $903. The standard error of a parameter can be interpreted in roughly the same way as the standard deviation of a variable: most of the time, the true mean is somewhere within one or two standard errors of the observed mean. So in Figure 5-1 the observed mean income for blacks is $30,826 with a standard error of $903. This implies that the true mean income for blacks is probably somewhere in the neighborhood of $29,900 to $31,700. The standard error of the mean income of whites is even smaller. Because of the large number of cases for whites (4331), the standard error of the mean is just $457. The regression of income on race in Figure 5-2 reported a slope of -6656, meaning that the observed race gap in income was $6656. The regression model had a very high level of error (regression error standard deviation = $29,263). Nonetheless, the standard error of the slope is just $1245. This means that the true race gap in income is likely somewhere between $5400 and $7900. The true race gap might be equal to exactly $6656 (the observed gap), but it probably isn't. Nonetheless, it's probably close. Based on the standard error of $1245, we can infer that it almost certainly isn't $0. In other words, we can infer that the race gap in incomes really exists. It's not just a result of random error in our data.
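
The "within one or two standard errors" rule of thumb used above can be written as a tiny helper. This is only a sketch of the interpretation (the function name is ours), not a formal confidence interval procedure.

```python
# A sketch of the rough inference rule: the true parameter usually lies within
# one or two standard errors of the observed parameter.
def se_range(observed, se, k=1):
    """Return the interval observed +/- k standard errors."""
    return observed - k * se, observed + k * se

print(se_range(30826, 903))      # roughly $29,900 to $31,700 for the black mean income
print(se_range(-6656, 1245))     # roughly -$7,900 to -$5,400 for the race gap (slope)
print(se_range(-6656, 1245, 2))  # widening to two standard errors still excludes $0
```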

5.4. Sample size and statistical power (optional/advanced)

The calculation of the standard error of a mean in a mean model is relatively straightforward. It is equal to the standard deviation of the variable divided by the square root of the number of cases. The calculation of the standard error of a regression slope is much more complicated. Like the standard error of a mean, it depends on the regression error standard deviation and the number of cases, but it also depends on the amount of spread in the independent variable. From a conceptual standpoint, the standard error of the slope is like stretching out the standard error of the mean of the dependent variable over the range of the independent variable, much like the values of the dependent variable are spread over the range of the independent variable in Figure 4-10. The calculation of the standard error of a regression intercept is even more complicated. With all parameters, though, the standard error declines with the square root of the number of cases. This means that you can make more accurate inferences when you have more cases to work with. Because of the square root relationship, the number of cases is usually more important than the amount of model error for achieving a low standard error. Even models with enormous amounts of error (like the regression of Taiwanese democracy ratings on age) can have very low standard errors for their parameters if enough cases are used. The relationship between the number of cases used in a mean model (N) and the standard error of the observed mean (SE) is depicted graphically in Figure 5-8. The line on the graph can be read as the standard error of the mean of a variable when the standard deviation of the variable is equal to 1. The standard error of a mean goes down very rapidly as the number of cases rises from 1 to 20. Between 20 and 100 cases the standard error of a mean also declines rapidly, but not so steeply as before. After about 100 cases the standard error of a mean continues to fall, but at a very slow pace. Broadly speaking, once you have 1000 or so cases in hand, enormous numbers of additional cases are needed to make any real difference to the standard error of the mean. Sample sizes of N = 800-1,000 cases are sufficient for most social science applications.

 
Figure 5-8. Relationship between number of cases and the standard error of the mean
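
The curve in Figure 5-8 follows directly from the formula given above (standard error = standard deviation divided by the square root of N). A short sketch, with the function name ours:

```python
# A minimal sketch of the relationship graphed in Figure 5-8.
from math import sqrt

def standard_error_of_mean(sd, n):
    return sd / sqrt(n)

# With the standard deviation fixed at 1 (as in Figure 5-8), the standard error
# falls quickly at first and then flattens out as N grows.
for n in (1, 20, 100, 1000, 10000):
    print(n, round(standard_error_of_mean(1, n), 3))

# Checking the black income example from Section 5.3: SD = $22,723, N = 633.
print(round(standard_error_of_mean(22723, 633)))   # about 903
```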

In the regression model for Taiwanese democracy ratings (Figure 5-7) the observed slope was just 0.105, meaning that every extra year of age was associated with a 0.105 point increase in a person's democracy rating. We found that only a tiny portion (0.1 point out of 14.1 points) of the total case-specific error in people's democracy ratings could be attributed to age. Nonetheless, due to the large number of cases used in the model (1216 people), the standard error of the observed regression slope is just 0.025 points. Based on this figure, we can infer that the true effect of a year of age on people's democracy ratings likely lies somewhere between (roughly) 0.080 and 0.130. In other words, we can infer that the true effect of age is almost certainly not 0. Despite the enormous amount of error in the regression model, we can still make conclusions about how attitudes change with age with confidence. This ability to make conclusions about a true parameter using an estimate of that parameter based on real data is called the power of a statistical model. The power of any statistical model rises as the number of cases rises, both because more cases mean lower standard errors and (much less importantly) because more cases mean more degrees of freedom in the model, and thus smaller error standard deviations. Both of these contributions to the power of a statistical model show diminishing returns once the sample size reaches 1,000 or so cases. Since most quantitative research in the social sciences is based on survey data and most surveys cost a fixed amount of time and money for each additional respondent, most studies are based on 800 or so cases. After surveys have about 800 respondents, they gain very little additional power from each additional person.
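
The text does not spell out the slope formula, but the standard simple-regression result is that the standard error of the slope equals the regression error standard deviation divided by the square root of the sum of squared deviations of the independent variable. The sketch below uses made-up ages and ratings (not WVS data) simply to show how the error standard deviation, the number of cases, and the spread of the independent variable all enter the calculation.

```python
# A sketch, under the standard simple-regression formula, of how the slope's
# standard error depends on the error SD and the spread of x. Data are invented.
import numpy as np

x = np.array([22.0, 31.0, 45.0, 52.0, 60.0, 68.0])   # hypothetical ages
y = np.array([35.0, 38.0, 36.0, 41.0, 39.0, 42.0])   # hypothetical democracy ratings

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
error_sd = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))   # 2 parameters used up

se_slope = error_sd / np.sqrt(np.sum((x - x.mean()) ** 2))
print(slope, se_slope)   # more cases or more spread in x would shrink se_slope
```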

5.5. Case study: Aid generosity and the Monterrey Consensus

At the 2002 United Nations International Conference on Financing for Development in Monterrey, Mexico, the rich countries of the world made a commitment to raise their levels of foreign aid to 0.70% of their national incomes. Some of the richest countries in the world have done so, but most have not. Figure 5-9 shows levels of rich countries' overseas development assistance (ODA) spending as a proportion of national income for 20 rich countries. Each country's level of foreign aid is represented by a bar. Descriptive statistics can be used to describe the observed distribution of ODA spending. The observed mean level of aid across all 20 countries is 0.52% of national income. This and the Monterrey target of 0.70% of national income are marked on the chart. The observed mean is 0.18% less than the target. Though the observed mean is well below the target level, is it possible that the true mean of ODA spending as a percent of national income might really be equal to 0.70%?

 
Figure 5-9. Levels of overseas development assistance (ODA) for 20 rich countries, 2008 (OECD data)

The observed mean may differ from the true mean for a variety of reasons. Although the observed mean level of aid spending across all 20 countries is less than the target level of 0.70%, 5 countries have aid levels above the target and one more comes reasonably close to the target. If all 20 countries were targeting aid levels of 0.70%, it seems possible that 5 would overshoot, 1 would come close, and 14 would undershoot the target. There may be measurement error in countries' reported levels of ODA spending due to poor accounting practices or researcher mistakes. More likely, there is probably a lot of case-specific error. Countries may have targeted 0.70% but underspent due to the recession or overspent due to emergency spending on a humanitarian crisis. There's no sampling error in this example because the data represent all of the world's richest countries, not a sample of rich countries. The standard deviation of ODA spending is 0.27%. There are 20 countries in the analysis. These two figures can be used to calculate the standard error of the mean of ODA spending, which comes out to 0.06%. Based on this standard error, inferential statistics can be used to make inferences about the true mean level of ODA spending. A standard error of 0.06% means that the true mean level of ODA spending is probably in the range of 0.46% to 0.58% (plus or minus one standard error from the observed mean). It is possible that the true mean is even farther from the observed mean, but it is very unlikely that the true mean is 0.70%. The Monterrey target of 0.70% is a full three standard errors away from the observed mean of 0.52%. The true mean level of ODA spending may not be exactly 0.52%, but it is almost certainly not 0.70%. The rich countries of the world must increase ODA spending dramatically in order to meet their Monterrey obligations.
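
The case-study arithmetic can be verified in a few lines (the figures are taken from the text; the rounding is ours).

```python
# Checking the Monterrey case-study calculations.
from math import sqrt

mean_oda = 0.52    # observed mean ODA as a percent of national income, 20 rich countries
sd_oda = 0.27      # standard deviation across the 20 countries
n = 20
target = 0.70      # Monterrey Consensus target

se = sd_oda / sqrt(n)
print(round(se, 2))                        # about 0.06
print(mean_oda - se, mean_oda + se)        # roughly 0.46 to 0.58
print(round((target - mean_oda) / se, 1))  # the target is about 3 standard errors away
```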

Chapter 5 Key Terms

  • Case-specific error is error resulting from any of the millions of influences and experiences that may cause a specific case to have a value that is different from its expected value.
  • Descriptive statistics is the use of statistics to describe the data we actually have in hand.
  • Inferential statistics is the use of statistics to make conclusions about characteristics of the real world underlying our data.
  • Measurement error is error resulting from accidents, mistakes, or misunderstandings in the measurement of a variable.
  • Observed parameters are the actually observed values of parameters like means, intercepts, and slopes based on the data we actually have in hand.
  • Sampling error is error resulting from the random chance of which research subjects are included in a sample.
  • Standard error is a measure of the amount of error associated with an observed parameter.
  • True parameters are the true values of parameters like means, intercepts, and slopes based on the real (but unobserved) characteristics of the world.
