R Programming/Binomial Models

In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory variables.

R Basics
Data Management
Graphics
- Grammar of graphics
Publication quality output
Descriptive Statistics
Mathematics
Statistical Core Methods
Regression Models
Time Series
Factor Analysis
Classification
- Ordination
- Clustering
Network Analysis
High Performance Computing
- Profiling R code
- Parallel computing with R
Appendix
- Sources
- Index

This kind of model can be analyzed using a linear probability model. However a drawback of this model for the parameter of the Bernoulli distribution is that, unless restrictions are placed on $\beta$ , the estimated coefficients can imply probabilities outside the unit interval $[0,1]$ . For this reason, models such as the logit model or the probit model are more commonly used. If you want to estimate a linear probability model, have a look at the linear models page.

Logit model

The model takes the form : $y_{i}\sim Bernoulli(\pi _{i})$ with the inverse link function : $\pi _{i}={\frac {exp(x_{i}'\beta )}{(1+exp(x_{i}'\beta ))}}$ . It can be estimated using maximum likelihood or using bayesian methods.

Fake data simulations

> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> df <- data.frame(y,x)

Maximum likelihood estimation

The standard way to estimate a logit model is glm() function with family binomial and link logit.
lrm() (Design) is another implementation of the logistic regression model.
There is an implementation in the Zelig package^[1].

In this example, we simulate a model with one continuous predictor and estimate this model using the glm() function.

> res <- glm(y ~ x , family  = binomial(link=logit))
> summary(res) # results
> confint(res) # confindence intervals
> names(res) 
> exp(res$coefficients) # odds ratio
> exp(confint(res)) # Confidence intervals for odds ratio (delta method)
> predict(res) # prediction on a linear scale
> predict(res, type = "response") # predicted probabilities
> plot(x, predict(res, type = "response")) # plot the predicted probabilities

Zelig

The Zelig' package makes it easy to compute all the quantities of interest.

We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we estimate the model using zelig() with the model = "logit" option.

We the look at the predicted values of y at the mean of x1 and x2
Then we look at the predicted values when x1 = 0 and x2 = 0
We also look at what happens when x1 changes from the 3rd to the 1st quartile.

> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
> 
> z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model
> summary(z.out)
> x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables
> s.out <- sim(z.out, x = x.out) # simulating the quantities of interest
> summary(s.out)
> plot(s.out) # plot the quantities of interest

> # the same with other values
> x.out <- setx(z.out, x1 = 0, x2 = 0)
> s.out <- sim(z.out, x = x.out)
> summary(s.out)

> # What happens if x1 change from the 3rd quartile to the 1st quartile ? 
> x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2)) 
> x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(x2)) 
> s.out2<-sim(z.out, x=x.high, x1=x.low) 
> plot(s.out2)

ROC Curve in the verification package.
Zelig has a rocplot() function.

See UCLA Statistical Computing example

Bayesian estimation

bayesglm() in the arm package
MCMClogit() in the MCMCpack for a bayesian estimation of the logit model.

> # Data generating process
> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> 
> library(MCMCpack)
> res <- MCMClogit(y ~ x)
> summary(res)

> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)

Probit model

The probit model is a binary model in which we assume that the link function is the cumulative density function of a normal distribution.

We simulate fake data. First, we draw two random variables x1 and x2 in any distributions (this does not matter). Then we create the vector xbeta as a linear combination of x1 and x2. We apply the link function to that vector and we draw the binary variable y as Bernouilli random variable.

> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- pnorm(xbeta)
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)

Maximum likelihood

We can use the glm() function with family=binomial(link=probit) option or the probit() function in the sampleSelection package which is a wrapper of the former one.

> res <- glm(y ~ x1 + x2 , family = binomial(link=probit), data = mydat)
> summary(res)
> 
> library("sampleSelection")
> probit(y ~ x1 + x2, data = mydat)
> summary(res)

Bayesian estimation

MCMCprobit() (MCMCpack)

> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)

Semi-Parametric models

Klein and Spady estimator^[3] is implemented in the np package^[4] (see npindex() with method = "kleinspady" option).

References

↑ Kosuke Imai, Gary King, and Oliva Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig
↑ UCLA statistical computing probit example http://www.ats.ucla.edu/stat/R/dae/probit.htm
↑ Klein, R. W. and R. H. Spady (1993), “An efficient semiparametric estimator for binary response models,” Econometrica, 61, 387-421.
↑ Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.

Previous: Quantile Regression

Index

Next: Multinomial Ordered Models

[zeliglogit-1] Kosuke Imai, Gary King, and Oliva Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig

[uclaprobit-2] UCLA statistical computing probit example http://www.ats.ucla.edu/stat/R/dae/probit.htm

[3] Klein, R. W. and R. H. Spady (1993), “An efficient semiparametric estimator for binary response models,” Econometrica, 61, 387-421.

[4] Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.

[1]

[2]

[3]

[4]

R Programming/Binomial Models

Contents

Logit model

Fake data simulations

Maximum likelihood estimation

Zelig

Bayesian estimation

Probit model

Maximum likelihood

Bayesian estimation

See Also

Semi-Parametric models

References