Data Mining Algorithms In R/Classification/adaboost

Boosting is one of the most important developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. This seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion.
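In the notation of the additive logistic regression paper listed in the references (Friedman, Hastie and Tibshirani), this statistical view can be written compactly as a brief sketch:

F(x) = \sum_{m=1}^{M} c_m f_m(x), \qquad
F(x) \approx \tfrac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}, \qquad
P(y = 1 \mid x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}

Here the f_m are the stage classifiers and the c_m their weights, as defined in the algorithm below; fitting the expansion stage by stage approximately maximises the Bernoulli likelihood on the logistic scale.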

Technique/Algorithm

Algorithm

While boosting has evolved somewhat over the years, we describe the most commonly used version of the AdaBoost procedure (Freund and Schapire, 1996), which we call Discrete AdaBoost. This is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i$ a vector-valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, where each $f_m(x)$ is a classifier producing values $\pm 1$ and the $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. AdaBoost trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage.
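The following short R sketch illustrates the procedure with decision stumps (one-split trees) as the weak classifier; the names discrete.adaboost, X, y and n.iter are illustrative only, since the ada package described below implements all of this internally:

library(rpart)

discrete.adaboost <- function(X, y, n.iter = 50) {
  stopifnot(all(y %in% c(-1, 1)))        # two-class response coded as -1 / +1
  n <- nrow(X)
  w <- rep(1 / n, n)                     # start with uniform observation weights
  stumps <- vector("list", n.iter)
  cm <- numeric(n.iter)                  # stage weights c_m

  for (m in seq_len(n.iter)) {
    dat <- data.frame(X, .y = factor(y))
    fit <- rpart(.y ~ ., data = dat, weights = w, method = "class",
                 control = rpart.control(maxdepth = 1, cp = -1,
                                         minsplit = 0, xval = 0))
    pred <- as.numeric(as.character(predict(fit, dat, type = "class")))

    err <- sum(w * (pred != y))          # weighted misclassification error
    if (err <= 0 || err >= 0.5) break    # stop if the stump is perfect or useless
    cm[m] <- log((1 - err) / err)        # stage weight
    w <- w * exp(cm[m] * (pred != y))    # up-weight currently misclassified cases
    w <- w / sum(w)                      # renormalise
    stumps[[m]] <- fit
  }
  list(stumps = stumps, cm = cm)         # predict new x by sign(sum_m cm[m] * f_m(x))
}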

Implementation

AdaBoost is implemented in the ada package. This section describes how to install and use it in the R environment.

Type the following commands in the R console to install and load the ada package:

install.packages("ada")
library("rpart")
library("ada")

The function used to run the AdaBoost algorithm is:

ada(x, y, test.x, test.y = NULL, loss = c("exponential", "logistic"),
    type = c("discrete", "real", "gentle"), iter = 50, nu = 0.1, bag.frac = 0.5,
    model.coef = TRUE, bag.shift = FALSE, max.iter = 20, delta = 10^(-10),
    verbose = FALSE, ..., na.action = na.rpart)

The arguments are:

x: matrix of descriptors.

y: vector of responses. 'y' may have only two unique values.

test.x: testing matrix of descriptors (optional)

test.y: vector of testing responses (optional)

loss: loss="exponential" ("ada", "e", or any variation) corresponds to the default boosting under exponential loss; loss="logistic" ("l2", "l") provides boosting under logistic loss.

type: type of boosting algorithm to perform. "discrete" performs Discrete Boost (the default), "real" performs Real Boost, and "gentle" performs Gentle Boost.

iter: number of boosting iterations to perform. Default = 50.

nu: shrinkage parameter for boosting; the default is 0.1.

bag.frac: sampling fraction for samples taken out-of-bag. This allows one to use random permutation, which improves performance.

model.coef: flag to use stageweights in boosting. If FALSE then the procedure corresponds
to epsilon-boosting.

bag.shift: flag to determine whether the stageweights should go to one as nu goes to zero.
This only makes sense if bag.frac is small. The rationale behind this parameter is discussed in (Culp et al., 2006).

max.iter: number of iterations to perform in the Newton step to determine the coefficient.

delta: tolerance for convergence of the Newton step to determine the coefficient.

verbose: print the number of iterations necessary for convergence of a coefficient.

formula: a symbolic description of the model to be fit.

data: an optional data frame containing the variables in the model.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

na.action: a function that indicates how to process ‘NA’ values. Default=na.rpart.

...: arguments passed to rpart.control. For stumps, use rpart.control(maxdepth=1, cp=-1, minsplit=0, xval=0): maxdepth controls the depth of trees, and cp controls the complexity of trees. The priors should also be fixed through the parms argument, as discussed in the second reference. An example call follows this list.
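For example, boosted stumps can be requested by passing an rpart.control object through this argument; a minimal sketch ('train' is a hypothetical data frame whose response column is named y):

stump.control <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0)
fit <- ada(y ~ ., data = train, type = "discrete", iter = 100, control = stump.control)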

Type the following commands to show the results from this algorithm:

summary(AdaObject)
varplot(VariableImportanceObject)

When using the 'ada(x, y)' interface, x may be a data.frame or a matrix, and y may be a data.frame, factor, matrix, array, or table. Missing values must be removed from the data prior to execution.

When using the formula interface 'ada(y ~ .)', the data must be in a data frame. The response can have factor or numeric values. Missing values may be present in the descriptor data whenever na.action is set to any option other than na.pass. A sketch of both calling conventions follows.
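As a sketch of the two calling conventions (using the soldat data from the case study below, with descriptors in columns 1 to 72 and the response in column 73):

data("soldat")
x <- soldat[, -73]                       # descriptor columns
y <- soldat[, 73]                        # two-class response

# 'ada(x, y)' interface: missing values must be removed first
cc <- complete.cases(x)
fit.xy <- ada(x[cc, ], y[cc], type = "discrete", iter = 50)

# formula interface: NAs in the descriptors are handled by na.action (default na.rpart)
fit.formula <- ada(y ~ ., data = soldat, type = "discrete", iter = 50)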

After the model is fit, 'ada' prints a summary of the function call, the method used for boosting, the number of iterations, the final confusion matrix (observed vs. predicted classification, with class labels as in the response), the error on the training set, and the testing, training, and kappa estimates of the appropriate number of iterations.

A summary of this information can also be obtained with the command 'print(x)'. Several companion functions are available (use help for summary.ada, predict.ada, ..., varplot for additional information on these commands); a short usage sketch follows the list.

summary: prints a summary of the original function call, the method used for boosting, the number of iterations, the final confusion matrix, the accuracy, and the kappa statistic (a measure of agreement between the observed and predicted classifications). 'summary' can be used for training, testing, or validation data.

predict: predicts the response for any data set (train, test, or validation).

plot: plots the performance of the algorithm across boosting iterations. The default plot is iteration number (x-axis) versus prediction error (y-axis) for the data set used to build the model. The function can also simultaneously produce an error plot for an external test set and a kappa plot for training and test sets.

pairs: produces pairwise plots of descriptors. Descriptors are arranged by decreasing frequency of selection by boosting (upper left = most frequently chosen). The color of a marker represents class membership; the size of a marker represents the predicted class probability: the larger the marker, the higher the probability of classification.

varplot: plots the variables ordered by the variable importance measure (based on improvement).

addtest: adds a testing data set to the ada object, so the testing errors only have to be computed once.

update: adds more trees to the ada object.
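A typical post-fit workflow using these functions might look like the following sketch ('fit', 'test.x' and 'test.y' are placeholder names for a fitted ada object and a held-out data set, not objects defined elsewhere on this page):

fit <- addtest(fit, test.x, test.y)                 # attach a test set once; its errors are stored in the object
summary(fit)                                        # call, method, iterations, confusion matrix, accuracy, kappa
plot(fit, TRUE, TRUE)                               # error and kappa across iterations, training and test sets
varplot(fit)                                        # descriptors ordered by variable importance
pred.class <- predict(fit, test.x)                  # predicted class labels
pred.probs <- predict(fit, test.x, type = "probs")  # class probabilities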

Case Study

Scenario

The data set contains information about compounds used in drug discovery. Specifically, it consists of 5631 compounds on which an in-house solubility screen (the ability of a compound to dissolve in a water/solvent mixture) was performed. Based on this screen, compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. One of these descriptors contained missing values for approximately 14% (n=787) of the observations. The objective of the analysis is to model the relationship between the structural descriptors and the solubility class. The data will be called soldat.

Data

Input format:

x1 a numeric vector
x2 a numeric vector
x3 a numeric vector
x4 a numeric vector
x5 a numeric vector
x6 a numeric vector
x7 a numeric vector
x8 a numeric vector
x9 a numeric vector
x10 a numeric vector
x11 a numeric vector
x12 a numeric vector
x13 a numeric vector
x14 a numeric vector
x15 a numeric vector
x16 a numeric vector
x17 a numeric vector
x18 a numeric vector
x19 a numeric vector
x20 a numeric vector
.
.
.
x72 a numeric vector with missing data
y a numeric vector
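A quick structural check of the data (a sketch, assuming the soldat data set has been loaded from the ada package as in the execution below):

library("ada")
data("soldat")
dim(soldat)             # 5631 rows, 73 columns (72 descriptors plus the response y)
table(soldat$y)         # class sizes for the two solubility classes
sum(is.na(soldat$x72))  # x72 is the descriptor with missing values (about 14% of rows)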

Execution

data("soldat")
n <- nrow(soldat)
set.seed(100)
ind <- sample(1:n)
trainval <- ceiling(n * .5)
testval <- ceiling(n * .3)
train <- soldat[ind[1:trainval],]
test <- soldat[ind[(trainval + 1):(trainval + testval)],]
valid <- soldat[ind[(trainval + testval + 1):n],]

control <- rpart.control(cp = -1, maxdepth = 14,maxcompete = 1,xval = 0)
gen1 <- ada(y~., data = train, test.x = test[,-73], test.y = test[,73], type = "gentle", control = control, iter = 70)
gen1 <- addtest(gen1, valid[,-73], valid[,73])
summary(gen1)
varplot(gen1)

Output

Loss: exponential Method: gentle Iteration: 70
Training Results
Accuracy: 0.987 Kappa: 0.972
Testing Results
Accuracy: 0.765 Kappa: 0.487

Analysis

Testing accuracy rates are printed in the order they are entered, so the accuracy on the testing set is 0.765 and on the validation set 0.781. For this type of early drug discovery data, the Gentle AdaBoost algorithm performs adequately, with a test-set accuracy of 76.5% (kappa approximately 0.5). To enhance our understanding of the relationship between the descriptors and the response, the varplot function was employed.
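The validation figure quoted above can be checked directly from the fitted object; a minimal sketch (predict.ada returns class labels when type is left at its default):

val.pred <- predict(gen1, valid[, -73])                      # predicted classes for the validation set
table(observed = valid[, 73], predicted = val.pred)          # confusion matrix
mean(as.character(val.pred) == as.character(valid[, 73]))    # overall accuracy (about 0.78 here)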

References

1. Meira Jr., W.; Zaki, M. Fundamentals of Data Mining Algorithms.
2. CBA R package.
3. Friedman, J.; Hastie, T.; Tibshirani, R. Additive Logistic Regression: A Statistical View of Boosting. The Annals of Statistics, 2000.