R Programming/Linear Models
Standard linear model
editIn this section we present estimation functions for the standard linear model estimated by ordinary least squares (OLS). Heteroskedasticity and endogeneity are treated below. The main estimation function is lm().
Fake data simulations
editWe first generate a fake dataset such that there is no hetereoskedasticity, no endogeneity and no correlation between the error terms. Therefore the ordinary least square estimator is unbiased and efficient. We choose a model with two variables and take all the coefficients equal to one.
> N <- 1000
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- 1 + x1 + rnorm(N)
> y <- 1 + x1 + x2 + u
> df <- data.frame(y,x1,x2)
Least squares estimation
edit- The standard function to estimate a simple linear model is
lm()
. - lsfit() performs the least square procedure but the output is not formatted in fashionable way.
- ols() (Design) is another alternative.
We estimate the model using lm(). We store the results in fit and print the result using summary() which is the standard function.
> fit <- lm(y ~ x1 + x2, data = df)
> summary(fit)
There are some alternative to display the results.
- display() in the arm package is one of them.
- coefplot() (arm) graphs the estimated coefficients with confidence intervals. This is a good way to present the results.
- mtable() in the memisc package can display the results of a set of regressions in the same table.
> library("arm")
> display(fit)
> coefplot(fit)
fit is a list of objects. You can see the list of these objects by typing names(fit). We can also apply functions to fit.
We can get the estimated coefficients using fit$coeff or coef(fit).
> fit$coeff
(Intercept) x1 x2
1.2026522 0.8427403 1.5146775
> coef(fit)
(Intercept) x1 x2
0.7541 1.7844 0.7222
> output <- summary(fit)
> coef(output)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1945847 0.2298888 5.196359 0.001258035
x1 0.6458170 0.3423214 1.886581 0.101182585
x2 0.6175165 0.2083628 2.963660 0.020995713
se.coef() (arm) returns the standard error of the estimated coefficients.
The vector of fitted values can be returned via fit$fitted, fitted(fit) or the predict() function. The predict() function also returns standard error and confidence intervals for predictions.
> fit$fitted
> fitted(fit)
The vector of residuals:
> fit$resid
> residuals(fit)
The number of degrees of freedom :
> fit$df
Confidence intervals
editWe can get the confidence intervals using confint() or conf.intervals() in the alr3 package.
> confint(fit, level = .9)
5 % 95 %
(Intercept) -0.7263261 1.200079
x1 -0.5724022 1.909924
x2 0.6185011 2.475079
> confint(fit, level = .95)
2.5 % 97.5 %
(Intercept) -0.9652970 1.439050
x1 -0.8803353 2.217858
x2 0.3881923 2.705388
> confint(fit, level = .99)
0.5 % 99.5 %
(Intercept) -1.5422587 2.016012
x1 -1.6237963 2.961319
x2 -0.1678559 3.261436
> library(alr3)
> conf.intervals(fit)
2.5 % 97.5 %
(Intercept) -0.9652970 1.439050
x1 -0.8803353 2.217858
x2 0.3881923 2.705388
Tests
editcoeftest() (lmtest) performs the Student t test and z test on coefficients.
> library("lmtest")
> coeftest(fit) # t-test
> coeftest(fit,df=Inf) # z-test (for large samples)
linear.hypothesis() (car) performs a finite sample F test on a linear hypothesis or an asymptotic Wald test using statistics.
> library("car")
> linear.hypothesis(fit,"x1 = x2") # tests Beta1 = Beta2
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(1,3)) # Tests Beta0 = Beta1 = Beta2 = 1
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(0,3)) # Tests Beta0 = Beta1 = Beta2 = 0
> linear.hypothesis(fit,c("x1","x2"),rep(0,2)) # Tests Beta1 = Beta2 = 0
See also waldtest() (lmtest) for nested models.
Analysis of variance
editWe can also make an analysis of variance using anova().
> anova(fit)
Model Search and information criteria
edit> # Akaike Information Criteria
> AIC(fit)
[1] 26.72857
> # Bayesian Information Criteria
> AIC(fit,k=log(N))
[1] 27.93891
The stats4 package includes AIC() and BIC() function:
> library(stats4)
> ?BIC
> lm1 <- lm(Fertility ~ . , data = swiss)
> AIC(lm1)
[1] 326.0716
> BIC(lm1)
[1] 339.0226
The step() functions performs a model search using the Akaike Information Criteria.
> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> x3 <- rnorm(N)
> y <- 1+ x1 + x2 + u
> fit <- lm(y~x1+x2 + x3)
> step.fit <- step(fit)
Zelig
edit- The method is also supported in Zelig
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> z.out <- zelig(y ~ x, model = "ls", data = mydat)
> x.out <- setx(z.out, x = 10)
> s.out <- sim(z.out, x.out)
> summary(s.out)
Bayesian estimation
edit- MCMCregress() (MCMCpack)
- BLR() (BLR)
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
>
> posterior <- MCMCregress(y ~ x, data = mydat)
> summary(posterior)
> plot(posterior)
Heteroskedasticity
edit- See the lmtest and sandwich packages.
- gls() (nlme) computes the generalized least squares estimator.
- See "Cluster-robust standard errors using R" (pdf) by Mahmood Arai. He suggests two functions for cluster robust standard errors. clx() allow for one-way clustering and mclx() for two-way clustering. They can be loaded with the following command source("http://people.su.se/~ma/clmclx.R").
> N <- 10 # 10 people
> T <- 5 # 5 times
> id <- rep(1:N,T)
> f <- rep(rnorm(N),T) # is individual specific
> u <- rnorm(N*T)
> x1 <- rnorm(N*T)
> x2 <- rnorm(N*T) + x1
> y <- 1 + x1 + x2 + f + u
> fit <- lm(y ~ x1 + x2 )
> source("http://people.su.se/~ma/clmclx.R")
> clx(fit, 1, id)
Robustness
edit>library(car)
> cookd(fit)
1 2 3 4 5
0.0006205008 0.0643213760 0.2574810866 1.2128206779 0.2295047699
6 7 8 9 10
0.3130578329 0.0003365221 0.0671830241 0.0048474954 0.0714255871
Influence plot:
> influence.plot(fit)
Leverage plots:
> leverage.plot(fit,term.name=x1)
> leverage.plot(fit,term.name=x2)
Bonferroni's outlier test:
> outlier.test(fit)
max|rstudent| = 2.907674, degrees of freedom = 6,
unadjusted p = 0.02706231, Bonferroni p = 0.2706231
Observation: 3
See also outlier.t.test() in the alr3 package.
- inf.index() in the alr3 package computes all the robustness statistics (Cook's distance, studentized residuals, outlier test, etc)
- rlm() performs a robust estimation
- See UCLA example
- See also the robustbase package
Instrumental Variables
edit- ivreg() in the AER package[1]
- tsls() in the sem package.
- It is also possible to use the gmm() command in the gmm package. See Methods of moments for an example.
Fake data simulations
editWe first simulate a fake data set with x correlated to u, z and u independent and x correlated with z. Thus x is an endogenous explanatory variable of y and z is a valid instrument for x.
> N <- 1000
> z <- rnorm(N)
> u <- rnorm(N)
> x <- 1 + z + u + rnorm(N) # x is correlated with the error term u (endogeneity) and the instrument z
> y <- 1 + x + u
Two stage least squares
editThen we estimate the model with OLS (lm()) and IV using z as an instrument for x.
> ols <- lm(y ~ x)
> summary(ols) # ols are biased
> library("AER")
> iv <- ivreg(y ~ x | z)
> summary(iv) # IV estimates are unbiased
> library("sem")
> iv2 <- tsls(y ~ x, instruments = ~ z)
> summary(iv2)
> library("gmm")
> iv3 <- gmm(y ~ x, z)
> summary(iv3)
We plot the results :
> plot(y ~ x, col = "gray")
> abline(a = 1,b = 1, lty = 1, col = 1, lwd = 2)
> abline(ols, lty = 2, col = 2 , lwd = 2)
> abline(iv, lty = 3, col = 3, lwd = 2)
> legend("topleft", legend = c("True values","OLS","IV"), col = 1:3, lwd = rep(2,3), lty = 1:3)
Panel Data
editplm() (plm) implements the standard random effect, fixed effect, first differences methods[2]. It is similar to Stata's xtreg command.
Note that plm output are not compatible with xtable() and mtable() for publication quality output.
- lme4 and gee implements random effect and multilevel models.
- See also BayesPanel
Random effects model
editTo implement a random effects model we generate a fake data set with 1000 observations over 5 time periods.
> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T)
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])
We estimate the random effect model with the plm() function and the model = "random" option.
> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> # panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> re <- plm(eq, model = "random", data=panel)
> summary(re)
Fixed effects model
editFor a fixed effects model we generate a fake dataset and we correlate the fixed effects f with covariates :
> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) + long$f
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])
We first transform our data in a plm data frame using plm.data(). We estimate the fixed model using plm() with model = "within" as an option. Then, we compare the estimate with the random effect model and perform an Hausman test. At the end, we plot the density of the fixed effects.
> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> #panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> fe <- plm(eq, model = "within", data=panel)
> summary(fe)
> re <- plm(eq, model = "random", data=panel)
> summary(re)
> phtest(fe, re)
> plot(density(fixef(fe)))
> rug(fixef(fe))
Dynamic panel data
editSimultaneous equations model
editFor a [:w:Simultaneous_equations_model|simultaneous equations model] the following packages are needed :
- sem package
- systemfit package
References
edit- ↑ Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
- ↑ Yves Croissant, Giovanni Millo (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software 27(2). URL http://www.jstatsoft.org/v27/i02/.
- ↑ M Arellano, S Bond "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations" - The Review of Economic Studies, 1991
- ↑ David Roodman, XTABOND2: Stata module to extend xtabond dynamic panel data estimator, http://ideas.repec.org/c/boc/bocode/s435901.html