Econometric Theory/Dummy Variables

Health insurance companies often charge different prices to different types of people. They know from their own data that young adults generally need a doctor less often, and so they charge them less for insurance. Their data show that age and health costs are positively correlated, and indeed that one "causes" the other. There are other demographic characteristics on which they base their prices of coverage. One is whether or not the customer is a smoker, and some even use gender. But how did they come to the conclusion that smokers cost more to cover, or that the cost of covering a man is different from that of covering a woman? These are not quantitative pieces of data, so they cannot be regressed, right? Actually, they can: we can code qualitative data so that it enters a regression as if it were quantitative.

Dummy Variables

Dummy variables, also called indicator variables, are qualitative data coded as numbers so that they can be used in a regression. In the case of relating health costs to smoking habits, we can say that a smoker is a 1 and a non-smoker is a 0. Our dependent variable is health care costs.

Our model will look like this: $Y_{i}=\alpha +\beta D_{i}+\epsilon _{i}$ where D is our dummy (smoking) and Y is our dependent variable (health care costs). Say that health care costs average 50 dollars for non-smokers and 60 dollars for smokers. Our model would then be $Y_{i}=50+10D_{i}+\epsilon _{i}$ . When the person we are looking at is a non-smoker, D = 0 and the expected cost is 50. When the person we are looking at is a smoker, D = 1 and the expected cost is 50 + 10 = 60.
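The following is a minimal sketch of how such a model could be estimated by ordinary least squares. The six cost figures and the helper name `ols_simple` are made up for illustration; they were chosen so that the group averages match the 50 and 60 used above. With a single dummy regressor, the estimated intercept is the average cost of the D = 0 group and the estimated coefficient on the dummy is the difference between the two group averages.

```python
# Hypothetical health care costs (illustrative numbers, not real data).
costs = [48.0, 52.0, 50.0, 61.0, 59.0, 60.0]
smoker = [0, 0, 0, 1, 1, 1]  # D_i: 1 = smoker, 0 = non-smoker


def ols_simple(x, y):
    """Return (alpha, beta) for y = alpha + beta * x by ordinary least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance and variance numerators (the 1/n factors cancel).
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


alpha_hat, beta_hat = ols_simple(smoker, costs)

# Group averages, for comparison with the regression estimates.
mean_non_smoker = sum(c for c, d in zip(costs, smoker) if d == 0) / smoker.count(0)
mean_smoker = sum(c for c, d in zip(costs, smoker) if d == 1) / smoker.count(1)

print(alpha_hat, beta_hat)           # intercept and dummy coefficient
print(mean_non_smoker, mean_smoker)  # averages of the two groups
```

Here the intercept estimate equals the non-smoker average, and the intercept plus the dummy coefficient equals the smoker average.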

We can regress on multiple pieces of information (variables) too. We can also mix ordinary quantitative variables with several dummies. $HealthCare_{i}=\beta _{0}age_{i}+\beta _{1}Smoke_{i}+\beta _{2}Gender_{i}+\epsilon _{i}$ (Smoke = 1 for a smoker, Smoke = 0 for a non-smoker; Gender = 1 for male, Gender = 0 for female)

Our model estimated from the data would be $HealthCare_{i}={\hat {\beta _{0}}}age_{i}+{\hat {\beta _{1}}}Smoke_{i}+{\hat {\beta _{2}}}Gender_{i}+\epsilon _{i}$

For a 29-year-old male who does not smoke, age = 29, Smoke = 0 and Gender = 1, so the smoking term drops out and the formula would be $HealthCare_{i}={\hat {\beta _{0}}}\cdot 29+{\hat {\beta _{2}}}+\epsilon _{i}$
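As a small sketch of that substitution, the function below evaluates the fitted part of the model (without the error term) for one person. The function name and the three coefficient values are assumptions chosen only to make the arithmetic visible; they are not estimates from any data.

```python
def predict_health_care(age, smoke, gender, b_age, b_smoke, b_gender):
    """Fitted value b_age*age + b_smoke*Smoke + b_gender*Gender (no error term)."""
    return b_age * age + b_smoke * smoke + b_gender * gender


# Hypothetical coefficient values, for illustration only.
B_AGE, B_SMOKE, B_GENDER = 2.0, 10.0, 5.0

# 29-year-old male non-smoker: Smoke = 0, Gender = 1.
cost = predict_health_care(29, 0, 1, B_AGE, B_SMOKE, B_GENDER)
print(cost)  # 2.0 * 29 + 5.0 = 63.0
```

Because Smoke = 0, the smoking coefficient has no effect on this person's predicted cost, whatever its value.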

Dummy variables can also describe qualities with more than two categories. Say health insurance companies found out that happiness leads to better health, and they wanted to use that in their price discrimination scheme. They can ask "How happy are you?" with the possible answers "Very Happy", "Kind of Happy" and "Sad". A single 0/1 variable cannot capture three categories, so they will need two dummies: D1 will be 1 and D2 will be 0 if "Very Happy", D1 will be 0 and D2 will be 1 if "Kind of Happy", and D1 will be 0 and D2 will be 0 if "Sad". "Sad" is the base category against which the other two are compared.

Adding these two dummies to our model gives $HealthCare_{i}={\hat {\beta _{0}}}age_{i}+{\hat {\beta _{1}}}Smoke_{i}+{\hat {\beta _{2}}}Gender_{i}+{\hat {\beta _{3}}}VeryHappy_{i}+{\hat {\beta _{4}}}KindOfHappy_{i}+\epsilon _{i}$
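The coding of the three answers into two dummies can be sketched as follows. The function name `happiness_dummies` is hypothetical, and the answer strings are taken from the survey question above; "Sad" is the category for which both dummies are 0.

```python
VALID_ANSWERS = ("Very Happy", "Kind of Happy", "Sad")


def happiness_dummies(answer):
    """Map a survey answer to the pair (VeryHappy, KindOfHappy)."""
    if answer not in VALID_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    very_happy = 1 if answer == "Very Happy" else 0
    kind_of_happy = 1 if answer == "Kind of Happy" else 0
    return very_happy, kind_of_happy


for answer in VALID_ANSWERS:
    print(answer, happiness_dummies(answer))
```

Three categories need only two dummies because the third category is identified by both dummies being 0.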

Slope vs Intercept shifts

Dummies can affect the model in two ways. A dummy can either shift the intercept up or down, or make the slope shallower or steeper. The examples described above are all intercept shifts: the line stayed where it was for non-smokers and moved up for smokers. For a slope shift, the dummy is multiplied by the standard variable in the same term, as in $Y_{i}=\alpha +\beta _{1}X_{i}+\beta _{2}D_{i}X_{i}+\epsilon _{i}$ where if D = 1, $Y_{i}=\alpha +(\beta _{1}+\beta _{2})X_{i}+\epsilon _{i}$ and if D = 0, $Y_{i}=\alpha +\beta _{1}X_{i}+\epsilon _{i}$

Note: The product of the dummy and the standard variable in this case is an interaction term. It is often treated as a single variable, as in $D_{i}X_{i}=Z_{i}$
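A short sketch of the slope shift, using made-up values for $\alpha$, $\beta_{1}$ and $\beta_{2}$ (the function names and numbers are illustrative assumptions): the interaction term leaves the intercept unchanged and adds $\beta_{2}$ to the slope only when D = 1.

```python
def predict(x, d, alpha, beta_1, beta_2):
    """Fitted value alpha + beta_1*X + beta_2*(D*X), without the error term."""
    z = d * x  # interaction term Z = D * X
    return alpha + beta_1 * x + beta_2 * z


def slope(d, beta_1, beta_2):
    """Effect of a one-unit increase in X for a given value of the dummy."""
    return beta_1 + beta_2 * d


# Hypothetical parameter values, for illustration only.
ALPHA, BETA_1, BETA_2 = 50.0, 2.0, 0.5

print(predict(0, 0, ALPHA, BETA_1, BETA_2), predict(0, 1, ALPHA, BETA_1, BETA_2))    # same intercept: 50.0 50.0
print(predict(10, 0, ALPHA, BETA_1, BETA_2), predict(10, 1, ALPHA, BETA_1, BETA_2))  # 70.0 75.0
print(slope(0, BETA_1, BETA_2), slope(1, BETA_1, BETA_2))                            # 2.0 2.5
```

Both groups start from the same intercept, and the gap between them grows with X, which is what distinguishes a slope shift from an intercept shift.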