# Introduction to Multiple Linear Regression

Parents and politicians are forever convinced that their children are not getting a good enough education. In part to meet parents' and politicians' demands to compare educational outcomes across countries, the Organization for Economic Cooperation and Development (OECD) administers the Program for International Student Assessment (PISA). The program organizes standardized tests comparing the knowledge of 15-year-old students across OECD member and other countries. The PISA tests focus very heavily on the topics that parents and politicians seem to worry most about -- math and science -- while ignoring other subjects that are probably much more useful for ensuring a happy and successful life, like literature, the arts, and of course the social sciences. Nonetheless, countries' PISA test results can be used to help answer important social scientific questions about national educational outcomes.

One concern that is often raised by parents and politicians is the small number of women who choose to enter scientific and engineering professions. This may or may not be a problem -- after all, saying that there's a shortage of women in the sciences is the same as saying that there's a shortage of men in other areas -- but it is widely perceived to be a problem. Many OECD countries (including the United States) have special government-funded programs to increase the numbers of girls who study science and the numbers of women who choose scientific careers. Parents and politicians are particularly concerned that teenage girls don't seem to do as well in science in high school as teenage boys. Do teenage girls really underperform teenage boys in science? Cross-national data from the PISA tests can be used to answer this question. Data on PISA science scores are reported in Figure 7-1.

In addition to the usual metadata items, Figure 7-1 contains six variables:

• BOYS -- The national mean PISA science score for boys
• GIRLS -- The national mean PISA science score for girls
• GAP -- The gender gap in science education (BOYS - GIRLS)
• INCOME -- National income per person in US Dollars
• SPEND -- Education spending as a percentage of total national income
• TEACHERS -- The number of teachers per 100 students

The PISA scores are constructed to have a mean of 500 for the OECD as a whole. National scores above 500 are above the OECD mean, while national scores below 500 are below the OECD mean. Since each country has a science score for both girls and boys, the database in Figure 7-1 is a paired sample. The mean difference between boys' and girls' scores is 0.36 points (the boys' mean is 0.36 points higher than the girls' mean). This difference is associated with a t statistic of t = 0.31 with 44 degrees of freedom. Based on this t statistic, there is a probability of 0.759 that the true mean difference between boys and girls could be 0. Since this probability of 75.9% is very high, we would infer that it is entirely possible that there is no true difference between boys' and girls' performance in science.

Even though there is no evidence for a gender gap overall across the 45 countries, there are many individual countries that have large gender gaps. The thirteen countries with a gender gap greater than 5 points are listed in Figure 7-2. Policy makers in these countries might ask social scientists to explain the gender gap and then recommend policies that could help reduce it.

Three theories that might explain the gender gap in science scores are:

(1) Income -- Richer countries have greater gender equality and so girls in richer countries are encouraged to study science more than girls in poorer countries
(2) Spending -- High levels of educational spending tend to even out performance for all students, while countries that spend very little on education may give preference to boys over girls
(3) Teachers -- Girls tend more than boys to learn through personal interaction, so having more teachers and smaller class sizes benefits girls' education more than boys' education
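
The paired-sample comparison of boys' and girls' scores described above rests on a simple calculation: take the difference within each pair, then divide the mean difference by its standard error. The sketch below shows that calculation in Python. The function name `paired_t` and the four pairs of scores are made up for illustration; they are not the PISA data from Figure 7-1.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = mean(diffs)
    # Standard error of the mean difference: sample SD divided by sqrt(n).
    standard_error = stdev(diffs) / sqrt(n)
    return mean_diff, mean_diff / standard_error, n - 1


# Hypothetical national mean scores for four countries (NOT the PISA data).
boys = [500, 510, 490, 505]
girls = [498, 512, 489, 503]

mean_diff, t, df = paired_t(boys, girls)
print(mean_diff, round(t, 2), df)
```

With 45 countries, the same calculation has 45 - 1 = 44 degrees of freedom, which matches the figure quoted in the text.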

Each of these theories can be operationalized into a specific hypothesis using the data reported in Figure 7-1. The income theory predicts that countries that have higher income levels will have smaller gender gaps (as incomes go up, the gap goes down). The spending theory predicts that countries that spend more on education will have smaller gender gaps (as spending goes up, the gap goes down). Finally, the teachers theory predicts that countries that have more teachers will have smaller gender gaps (as the number of teachers goes up, the gap goes down). The results of regression models associated with each of these hypotheses are reported in Figure 7-3.

The results completely contradict the income theory. The slope in Model 1 of Figure 7-3 indicates that richer countries actually have bigger gender gaps than poor countries (though the effect is not statistically significant). On the other hand, the results do tend to confirm the spending theory, but the effect of spending is not statistically significant. In Model 2, the probability of 0.693 indicates that there is a very large probability that the true effect of spending is 0. The only strong result in Figure 7-3 is the slope for teachers. According to the slope in Model 3, every additional teacher per 100 students tends to reduce the gender gap by more than 1 point. This result is unlikely to have occurred by chance (probability less than 2.3%).

The policy implication seems to be that more teachers are required if a country wants to reduce its gender gap in science education. Obviously, hiring more teachers costs money. Yet the relationship between spending and the gender gap is not significantly different from 0. Moreover, it's possible that only rich countries can afford to increase their spending on education. In short, it is difficult to change any one of these three determinants of the gender gap without changing the others at the same time. What we really need is an integrated model that takes all three variables into account at the same time. For that, new statistical tools are required.

This chapter introduces the multiple linear regression model. First, there is no reason why a regression model can't have two, three, or even dozens of independent variables (Section 7.1). The potential number of independent variables is limited only by the degrees of freedom available, but if there are too many independent variables none of them will be statistically significant. Second, like any statistical model, multiple regression models can be used to predict values of the dependent variable (Section 7.2). Prediction in multiple regression works exactly the same way as when there is only one independent variable, just with additional variables. Third, the slopes of multiple regression models represent the independent effects of all of the independent variables on the dependent variable (Section 7.3). Regression models are often used to study the effect of one independent variable while "controlling for" the effects of others. An optional section (Section 7.4) explains how control variables can be used to reduce the amount of error in regression models and thus indirectly boost the significance of regression coefficients. Finally, this chapter ends with an applied case study of the determinants of child mortality rates in sub-Saharan African countries (Section 7.5). This case study illustrates how regression coefficients can either increase or decrease when additional variables are added to a regression model. All of this chapter's key concepts are used in this case study. By the end of this chapter, you should be able to use multiple regression to make basic inferences about the effects of multiple independent variables on a single dependent variable.

## 7.1. The multiple regression model

Social scientists often have many competing theories to explain the same phenomenon. The gender gap in education might be due to national income, spending, or teachers. Countries' foreign aid spending levels might depend on their national incomes, European status, or aid efficiency levels. People's incomes might depend on their ages, races, genders, and levels of education. Moreover, these theories are not mutually exclusive. People's incomes differ by both race and gender, not one or the other. Most outcomes in the social sciences are the result of multiple causes. Models designed to study them must have multiple causes as well. Multicausal models are statistical models that have one dependent variable but two or more independent variables.

Though many different kinds of multicausal models are possible, the most commonly used multicausal model is a straightforward extension of the linear regression model. Multiple linear regression models are statistical models in which expected values of the dependent variable are thought to rise or fall in straight lines according to values of two or more independent variables. Multiple regression models work the same way as simple linear regression models except that they have additional independent variables. They produce expected values that are the values that a dependent variable would be expected to have based solely on values of the independent variables. They do this by determining the combination of regression coefficients (slopes and intercepts) that minimizes the regression error standard deviation. In effect, multiple regression models take the observed values of the dependent variable and spread them out according to the values of two or more independent variables at the same time.

An example of a social science phenomenon that has multiple causes is foreign aid. A multiple linear regression model for Official Development Assistance (ODA) spending is presented in Figure 7-4. This model integrates the three ODA spending models that were presented in Figure 6-9: one based on income (Model 1), one based on European status (Model 2), and one based on administrative costs (Model 3). In the simple linear regression models, both national income and European country status were found to be significantly related to ODA spending levels (the effect of administrative costs was non-significant). The multiple linear regression model (Model 4) spreads the total variability in ODA levels in the 20 rich countries across all three explanations at the same time. The coefficients in Model 4 represent the unique combination of coefficients that results in the smallest possible regression error standard deviation for the model as a whole.
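
To make the idea of "the combination of coefficients with the smallest possible error" concrete, the following is a minimal sketch of a least squares fit with any number of independent variables, written in plain Python. It is not the procedure used to produce Figure 7-4. The function name `fit_ols` and the small dataset are hypothetical. The sketch minimizes the sum of squared errors, which for a fixed number of coefficients is the same as minimizing the error standard deviation.

```python
def fit_ols(rows, y):
    """Return [intercept, slope_1, ..., slope_k] that minimize squared error."""
    # Design matrix with a leading column of 1s for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("independent variables are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data built so that y = 1 + 2*x1 + 3*x2 exactly.
rows = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 4)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_ols(rows, y)
print([round(c, 6) for c in coefficients])
```

Because the made-up outcome is an exact linear function of the two predictors, the fit should recover an intercept of 1 and slopes of 2 and 3 (up to rounding). Real data always leave some error behind.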

In Model 4, the slope for national income is slightly smaller than it was in Model 1 (0.010 versus 0.013). Although it is smaller, it is still statistically significant (probability = .007 or 0.7%). As countries get richer, they give more of their national incomes in ODA spending. The slope for European status has also declined in Model 4, but much more so (from 0.328 to 0.199). The new, smaller slope for European status is no longer statistically significant (probability = 0.128 or 12.8%). European countries still have observed ODA spending levels that are 0.199% higher than non-European countries, but this difference is not statistically significant. In other words, the results reported in Model 4 indicate that the ODA spending difference between European and non-European countries could be due to random chance error. In Model 4 as in Model 3 administrative costs have no measurable impact on ODA spending. The mean level of ODA spending across all 20 countries is 0.52% of national income with a standard deviation of 0.268%. The regression error standard deviation for Model 4 is 0.185%. The multiple regression model has substantially less error than the simple mean model. A portion of countries' overall deviations from the mean spending level of 0.52% can be traced to countries' European status (European or non-European), but much more can be traced to countries' national income levels (rich versus poor). The contrast between the coefficients of European status in Model 2 versus Model 4 indicates that part of the difference between European and non-European countries' levels of ODA spending is due to the fact that European countries tend to be richer than non-European countries. This is illustrated in Figure 7-5.

The mean level of ODA spending for European countries is much higher than the mean level for non-European countries, but so is the mean level of national income. Do European countries spend so much on foreign aid because they're European, or because they're rich? The multiple regression model suggests that the true answer is a combination of the two explanations. European countries do spend a lot on ODA just like other rich countries do, but they spend even more than would be expected just based on their national income levels. How much more? The best estimate is that European countries spend 0.199% more of their national income on ODA than do other countries of similar income levels. This figure comes from the coefficient for European status in Model 4. The difference of 0.199% is not statistically significantly different from 0%, but it is still the best estimate of the difference. In other words, our best guess is that being European makes a country spend 0.199% more on aid than it otherwise would based on its income level alone.

Just as European countries may spend more on aid because they have higher incomes, it is possible that higher income countries spend more on aid in part because many of them are European. In Figure 7-4, the slope for national income is 0.013 in Model 1, but this drops to 0.010 in Model 4. The slope for national income is lower in the multiple regression model (Model 4) than in the simple linear regression model (Model 1) because in the multiple regression model the total variability in ODA spending levels is split between national income and European status. Ultimately, what multiple linear regression does is split the total variability in the dependent variable among all the independent variables. In essence, the multiple independent variables are all competing for the same available variability. This usually (but not always) shows up as smaller slopes in the multiple regression model.

In Figure 7-3 three different independent variables were used to explain the gender gap in science scores in three separate linear regression models. The three independent variables were national income, educational spending, and teachers per 100 students. Figure 7-6 presents a multiple linear regression model of the gender gap in science that uses all three variables. The slopes in Figure 7-6 are actually stronger, not weaker, than those from the original three models. This can only happen when the multiple independent variables complement each other, capturing different aspects of the dependent variable. Some countries have large gender gaps because they have high incomes but also have small gender gaps because they have lots of teachers. In the simple linear regression models these two effects cancel each other out, but in the multiple linear regression model the two separate effects are revealed.
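
Both patterns described above (slopes that shrink when variables compete, and slopes that grow when variables complement each other) can be reproduced with a toy example. The sketch below uses the standard two-predictor least squares formulas. All data are invented, and the function names are hypothetical; nothing here comes from Figures 7-3 through 7-6.

```python
def centered_sum(a, b):
    """Sum of products of deviations from the means of a and b."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


def simple_slope(x, y):
    """Slope from regressing y on x alone."""
    return centered_sum(x, y) / centered_sum(x, x)


def multiple_slopes(x1, x2, y):
    """Slopes for x1 and x2 when both are in the model at the same time."""
    s11, s22 = centered_sum(x1, x1), centered_sum(x2, x2)
    s12 = centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Hypothetical predictors that tend to rise together.
x1 = [1, 2, 3, 4]
x2 = [1, 2, 2, 3]

# Competing case: both predictors push y in the same direction.
y_compete = [2 * a + 3 * b for a, b in zip(x1, x2)]
# Complementary case: the predictors push y in opposite directions.
y_complement = [2 * a - 3 * b for a, b in zip(x1, x2)]

print(simple_slope(x1, y_compete), multiple_slopes(x1, x2, y_compete))
print(simple_slope(x1, y_complement), multiple_slopes(x1, x2, y_complement))
```

In the competing case the simple slope for `x1` is about 3.8, but it falls to 2 once `x2` is in the model. In the complementary case the simple slope for `x1` is only about 0.2 because the two effects largely cancel out, and it rises to 2 in the multiple regression.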

Multiple linear regression is by far the most commonly used statistical model in the social sciences. It summarizes an enormous amount of information about how variables are related in a very compact space. Multiple regression tables always report the model intercept and the slopes of each of the independent variables. Sometimes they report the standard errors of the coefficients, sometimes the t statistics, and sometimes the probabilities of the t statistics. When social scientists want to report a large number of results in a single table, they report only the coefficients and use footnotes to indicate the probabilities of their associated t statistics, as illustrated in Figure 7-7. Because multiple regression tables contain so much information, an entire paper can be written around a single table of results. In short, multiple linear regression analysis is the workhorse method of social statistics.

## 7.2. Prediction using multiple regression

Multiple linear regression models can be used to calculate predicted values of dependent variables in exactly the same way as simple linear regression. Since multiple linear regression models include more predictors than simple regression models, they tend to produce more accurate predictions. Predictors are the independent variables in regression models. A multiple regression model using four predictors to predict the incomes of employed American twentysomethings is presented in Figure 7-8. All four predictors (age, race, gender, and education) have highly significant slopes. Based on the t statistics, race is the least important of the four independent variables, but even race is highly significant statistically. Note that the intercept of this model is not very meaningful on its own, but is nonetheless necessary for calculating predicted values.

The equation for wage income based on the regression coefficients reported in Figure 7-8 is spelled out in Figure 7-9. Predicted income starts out at -\$68,933 for a black female age 0 who has no education. Of course, this is a meaningless extrapolation of the regression analysis: newborn babies don't have incomes or education. Nonetheless, it is the starting point for calculating predicted values. Starting at -\$68,933, each additional year of age brings \$1843 in income, being white adds \$4901 to a person's predicted income, being male adds \$7625 to a person's predicted income, and each additional year of education brings \$3599 in income. Using the equation in Figure 7-9, the incomes of any white or black Americans in their twenties can be predicted. The predictions may not be correct, but they will be more correct than simply predicting people's incomes based on the mean income for American twentysomethings.

The calculations for predicted wage income levels for 10 American twentysomethings are illustrated in Figure 7-10. The values in the table illustrate how a single regression model (Figure 7-8) can produce a very wide variety of predictions. The predicted incomes range from \$21,885 for a 21-year-old white male high school dropout to \$61,822 for a 29-year-old white male with an MBA. Lower and higher incomes are also possible. For example, a 21-year-old black female high school dropout would be expected to earn just \$9,359 per year. This is below the US minimum wage for a full-time worker, but the SIPP data are based on all employed people, including part-time employees. As predicted by the regression model, a 21-year-old high school dropout might have trouble finding a full-time job.
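
The arithmetic behind these predictions can be sketched in a few lines of Python. The coefficients below are the ones quoted in the text for the equation in Figure 7-9. The function name `predict_income` is hypothetical, and the years of education (11 for a high school dropout, 18 for an MBA) are assumptions chosen because they reproduce the predicted incomes quoted above; they are not values read from Figure 7-10.

```python
# Coefficients as quoted in the text (baseline: black female, age 0, no education).
INTERCEPT = -68933
PER_YEAR_OF_AGE = 1843
WHITE = 4901
MALE = 7625
PER_YEAR_OF_EDUCATION = 3599


def predict_income(age, white, male, education_years):
    """Predicted annual wage income in US dollars from the four predictors."""
    return (INTERCEPT
            + PER_YEAR_OF_AGE * age
            + WHITE * int(white)      # adds 4901 only if the person is white
            + MALE * int(male)        # adds 7625 only if the person is male
            + PER_YEAR_OF_EDUCATION * education_years)


# Education years are assumed values (see the note above the code).
print(predict_income(21, white=True, male=True, education_years=11))
print(predict_income(29, white=True, male=True, education_years=18))
print(predict_income(21, white=False, male=False, education_years=11))
```

Under those assumptions, the three calls return 21885, 61822, and 9359, matching the three predicted incomes mentioned in the text.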

Of course, most people have incomes that are very different from their predicted values. How different? Figure 7-11 reports the model error standard deviation for six different ways of predicting people's incomes. In the mean model, each person's income is predicted using the observed mean income for all 4964 American twentysomethings in the sample. The four simple regression models each use a single independent variable to calculate predicted values for income, while the multiple regression model uses all four independent variables together. The error standard deviation is based on the deviation from their expected incomes in each model for all 4964 people. The multiple linear regression model has less model error than any of the other models, but not much less. Even knowing people's age, race, gender, and education, it is very difficult to predict their incomes with accuracy.
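
The comparison of model error across different models can be illustrated with a small sketch. It computes the error standard deviation for a mean model and for a regression model fitted to the same made-up data. To keep it short it uses a single predictor rather than four. The data are hypothetical, the function name `error_sd` is invented, and dividing by the number of cases minus the number of estimated coefficients is an assumption about how the error standard deviation is defined, not something stated in this chapter.

```python
from math import sqrt


def error_sd(observed, expected, n_coefficients):
    """Error standard deviation: root of squared error per degree of freedom."""
    squared_error = sum((o - e) ** 2 for o, e in zip(observed, expected))
    return sqrt(squared_error / (len(observed) - n_coefficients))


# Hypothetical data: one predictor (x) and one outcome (y) for five cases.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Mean model: every case is predicted to have the overall mean.
mean_y = sum(y) / len(y)
mean_model_sd = error_sd(y, [mean_y] * len(y), n_coefficients=1)

# Regression model: least squares slope and intercept for one predictor.
mean_x = sum(x) / len(x)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x
expected = [intercept + slope * a for a in x]
regression_sd = error_sd(y, expected, n_coefficients=2)

print(round(mean_model_sd, 3), round(regression_sd, 3))
```

In this toy example the regression model leaves less error than the mean model, but a good deal of error remains, which is the same pattern the text describes for the income models.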

Controlling for European status reduces the slope for National income because the variable European status competes with the variable National income in explaining levels of ODA spending across rich countries. Competing controls are control variables that compete with an independent variable of interest by splitting its explanatory power in a multiple regression model. From the standpoint of National income, European status is a competing control. On the other hand, from the standpoint of European status, National income is a competing control. They both compete to explain the same fact, that rich European countries have higher ODA spending than other countries. This was illustrated in Figure 7-5. The fact that the coefficient for National income remains significant in Model 3 while the coefficient for European status is not significant suggests that National income is the stronger of the two predictors of ODA spending. Any independent variable in a multiple regression model can be thought of as a control variable from the perspective of other independent variables. Whether or not a variable should be thought of as a control variable is up to the judgment of the researcher. If a variable is used with the intent that it should be held constant in order to bring out the true effect of another variable, it is a control variable. If a variable is of interest in its own right, then it is not. From a purely statistical standpoint, every independent variable in a multiple regression model is a control variable for all the other variables in the model. From a social science standpoint, a variable is a control variable if the researcher thinks it is, and if not, not.

## 7.4. Controlling for error (optional/advanced)

Control variables are usually used to hold constant or control for one variable in trying to understand the true effect of another. Depending on the situation, the control variable might have no effect on the observed coefficient of the variable of interest, or it might complement or compete with the variable of interest. In all of these situations, the impact of the control variable is straightforward and easy to see: the observed slope for the variable of interest changes (or in the case of an ineffectual control variable, doesn't change) in response to the inclusion of the control variable. It may seem like these three possibilities (complement, compete, no effect) are the only possible effects of a control variable, but in fact there is one more way in which a control variable can affect a regression model. The control variable might reduce the amount of error in the model.

Just such a situation is illustrated in Figure 7-13. Model 1 of Figure 7-13 repeats the regression of Canadian provincial smoking rates on average temperatures from Figure 4-8. Smoking rates fall with higher temperatures. Every 1 degree Fahrenheit increase in Temperature is associated with a 0.44% decline in smoking rates, and this result is highly significant statistically. Model 2 of Figure 7-13 takes the simple regression from Model 1 and adds a control for the Heavy drinking rate. Alcohol consumption is closely associated with smoking all across the rich countries of the world, including Canada (though, interestingly, in many poor countries it is not). Controlling for rates of Heavy drinking in Model 2 has no effect whatsoever on the slope for Temperature, which remains -0.44. It does, however, affect the standard error of the slope for Temperature.

In Model 1, the standard error of the slope for Temperature is 0.087, but the standard error declines to 0.062 in Model 2. The smaller standard error in Model 2 results in a larger t statistic. In this example, the effect of Temperature on the smoking rate is already highly significant (the probability that the true slope for Temperature is 0 is less than 0.001), so the higher t statistic doesn't change our interpretation of the model. Nonetheless, the slope for Temperature is more statistically significant in Model 2 than it is in Model 1.

Why does the standard error of the slope go down when a control variable is introduced? Heavy drinking is completely unrelated to Temperature, but it is related to the smoking rate. In fact, Heavy drinking accounts for an important part of the total variability in smoking rates. As a result, there is less model error in Model 2 than in Model 1. Standard error is a function of the strength of the relationship between the independent variable and the dependent variable, the number of cases used to estimate the model, and the amount of error in the model. From Model 1 to Model 2 the strength of the relationship hasn't changed (it's still -0.44) and the number of cases hasn't changed (it's still 13), but the amount of model error has declined (due to the effect of Heavy drinking). The net effect is that the standard error associated with Temperature has declined. Temperature is an even more significant predictor of smoking after controlling for Heavy drinking than it was before.
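
The link between a smaller standard error and a larger t statistic is simple division: the t statistic for a slope is the slope divided by its standard error. The sketch below applies that to the slope and standard errors quoted above for Temperature. The function name `t_statistic` is hypothetical, and the t values it prints are computed here from those quoted numbers rather than copied from Figure 7-13.

```python
def t_statistic(slope, standard_error):
    """t statistic for a regression slope: the slope divided by its standard error."""
    return slope / standard_error


SLOPE = -0.44  # slope for Temperature, identical in both models

t_model_1 = t_statistic(SLOPE, 0.087)  # before controlling for Heavy drinking
t_model_2 = t_statistic(SLOPE, 0.062)  # after controlling for Heavy drinking

print(round(t_model_1, 2), round(t_model_2, 2))
```

The slope is the same in both models, so the larger t statistic in Model 2 (about -7.1 versus about -5.06) comes entirely from the smaller standard error.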

## 7.5. Case study: Child mortality in sub-Saharan Africa

Out of every 1000 children born in Africa, only 850 live to their fifth birthdays. This mortality rate of 150 per 1000 children is shockingly high. By comparison, the child mortality rate in rich countries is typically around 5-6 per 1000 children. The United States has the highest child mortality rate in the developed world, with the loss of 7.7 out of 1000 children by age 5. Child mortality rates in African countries are typically 20 times as high. Child mortality and related statistics for 44 sub-Saharan African countries are reported in Figure 7-14. In addition to the metadata items, four variables are included:

• MORT -- The under-5 mortality rate per 1000 births
• INCOME -- National income per person in US Dollars
• FERT -- The fertility rate (mean births per woman of childbearing age)
• IMMUN -- The DPT (Diphtheria-Pertussis-Tetanus) childhood immunization rate

A multicausal model of child mortality would predict that childhood mortality rates should decline with income (richer countries should have lower mortality), decline with immunization (countries with better immunization should have lower mortality), and rise with fertility (countries with more children should have higher mortality).

The results of three regression models to predict child mortality in sub-Saharan Africa are reported in Figure 7-15. Model 1 is a simple linear regression model with just one predictor, National income. Each additional \$1000 of national income is associated with a decline in the child mortality rate of 6.84 children per 1000 children born in the country. This result is highly significant statistically.

Models 2 and 3 of Figure 7-15 are multiple linear regression models. Model 2 introduces the DPT immunization rate as a control variable. The inclusion of DPT immunization actually increases the size of the slope for National income from 6.84 to 8.49. This indicates that DPT immunization is complementary to national income. Counter-intuitively, immunization rates in Africa fall as national income rises, in part due to parental resistance to immunization in the richer African countries. As a result, controlling for immunization reveals an even stronger impact of National income in reducing child mortality rates. Model 3 introduces the Fertility rate as a control variable. Controlling for the Fertility rate dramatically reduces the size of the slope for National income. In fact, the slope for National income in Model 3 is not significantly different from 0. Fertility strongly competes with National income as an explanation of child mortality rates. It also competes with DPT immunization. The slope for DPT immunization is much smaller in Model 3 than in Model 2, but it is still statistically significant. How can child mortality be reduced in Africa? Obviously, higher incomes wouldn't hurt, but Model 3 suggests that immunization and family planning would be much more effective in reducing child mortality. That's good news, because social scientists know much more about ways to improve immunization and family planning than about ways to raise incomes. Model 3 suggests that rich countries' official development assistance (ODA) spending should focus on expanding immunization and family planning programs to support African families in their efforts to improve their children's health.

## Chapter 7 Key Terms

• Complementary controls are control variables that complement an independent variable of interest by unmasking its explanatory power in a multiple regression model.
• Competing controls are control variables that compete with an independent variable of interest by splitting its explanatory power in a multiple regression model.
• Control variables are variables that are "held constant" in a multiple regression analysis in order to highlight the effect of a particular independent variable of interest.
• Multicausal models are statistical models that have one dependent variable but two or more independent variables.
• Multiple linear regression models are statistical models in which expected values of the dependent variable are thought to rise or fall in straight lines according to values of two or more independent variables.
• Predictors are the independent variables in regression models.