Last modified on 18 June 2013, at 08:07

# R Programming/Binomial Models

In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory variables.

This kind of model can be analyzed using a linear probability model. However a drawback of this model for the parameter of the Bernoulli distribution is that, unless restrictions are placed on $\beta$, the estimated coefficients can imply probabilities outside the unit interval $[0,1]$ . For this reason, models such as the logit model or the probit model are more commonly used. If you want to estimate a linear probability model, have a look at the linear models page.

## Logit modelEdit

The model takes the form : $y_i \sim Bernoulli(\pi_i)$ with the inverse link function : $\pi_i = \frac{exp(x_i'\beta)} {(1 + exp(x_i'\beta))}$. It can be estimated using maximum likelihood or using bayesian methods.

### Fake data simulationsEdit

> x <- 1 + rnorm(1000,1)
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> df <- data.frame(y,x)


### Maximum likelihood estimationEdit

• The standard way to estimate a logit model is glm() function with family binomial and link logit.
• lrm() (Design) is another implementation of the logistic regression model.
• There is an implementation in the Zelig package[1].

In this example, we simulate a model with one continuous predictor and estimate this model using the glm() function.

> res <- glm(y ~ x , family  = binomial(link=logit))
> summary(res) # results
> confint(res) # confindence intervals
> names(res)
> exp(res$coefficients) # odds ratio > exp(confint(res)) # Confidence intervals for odds ratio (delta method) > predict(res) # prediction on a linear scale > predict(res, type = "response") # predicted probabilities > plot(x, predict(res, type = "response")) # plot the predicted probabilities  #### ZeligEdit The Zelig' package makes it easy to compute all the quantities of interest. We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we estimate the model using zelig() with the model = "logit" option. • We the look at the predicted values of y at the mean of x1 and x2 • Then we look at the predicted values when x1 = 0 and x2 = 0 • We also look at what happens when x1 changes from the 3rd to the 1st quartile. > x1 <- 1 + rnorm(1000) > x2 <- -1 + x1 + rnorm(1000) > xbeta <- -1 + x1 + x2 > proba <- exp(xbeta)/(1 + exp(xbeta)) > y <- ifelse(runif(1000,0,1) < proba,1,0) > mydat <- data.frame(y,x1,x2) > table(y) > > z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model > summary(z.out) > x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables > s.out <- sim(z.out, x = x.out) # simulating the quantities of interest > summary(s.out) > plot(s.out) # plot the quantities of interest > # the same with other values > x.out <- setx(z.out, x1 = 0, x2 = 0) > s.out <- sim(z.out, x = x.out) > summary(s.out) > # What happens if x1 change from the 3rd quartile to the 1st quartile ? > x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2)) > x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(x2))
> s.out2<-sim(z.out, x=x.high, x1=x.low)
> plot(s.out2)

• ROC Curve in the verification package.
• Zelig has a rocplot() function.

### Bayesian estimationEdit

• bayesglm() in the arm package
• MCMClogit() in the MCMCpack for a bayesian estimation of the logit model.
> # Data generating process
> x <- 1 + rnorm(1000,1)
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
>
> library(MCMCpack)
> res <- MCMClogit(y ~ x)
> summary(res)

> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)


## Probit modelEdit

The probit model is a binary model in which we assume that the link function is the cumulative density function of a normal distribution.

We simulate fake data. First, we draw two random variables x1 and x2 in any distributions (this does not matter). Then we create the vector xbeta as a linear combination of x1 and x2. We apply the link function to that vector and we draw the binary variable y as Bernouilli random variable.

> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- pnorm(xbeta)
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)


### Maximum likelihoodEdit

We can use the glm() function with family=binomial(link=probit) option or the probit() function in the sampleSelection package which is a wrapper of the former one.

> res <- glm(y ~ x1 + x2 , family = binomial(link=probit), data = mydat)
> summary(res)
>
> library("sampleSelection")
> probit(y ~ x1 + x2, data = mydat)
> summary(res)


### Bayesian estimationEdit

• MCMCprobit() (MCMCpack)
> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)


### See AlsoEdit

• There is an example of a probit model with R on the UCLA statistical computing website[2].

## ReferencesEdit

1. Kosuke Imai, Gary King, and Oliva Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig
2. UCLA statistical computing probit example http://www.ats.ucla.edu/stat/R/dae/probit.htm
3. Klein, R. W. and R. H. Spady (1993), “An efficient semiparametric estimator for binary response models,” Econometrica, 61, 387-421.
4. Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.