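Although boosting has evolved over the years, we describe the most commonly used version of the AdaBoost procedure (Freund and Schapire, 1996), which we call Discrete AdaBoost. It is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$, where $x_i$ is a vector-valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, where each $f_m(x)$ is a classifier producing values plus or minus 1 and the $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. AdaBoost trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and the final classifier is then defined as a linear combination of the classifiers from each stage.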
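The procedure described above can be sketched in a few lines of R. This is a minimal illustration and not the implementation used inside the ada package. The text does not spell out the stage weight or the reweighting rule, so the sketch assumes the standard Discrete AdaBoost form, $c_m = \log((1 - err_m)/err_m)$ with weights multiplied by $\exp(c_m)$ for misclassified cases. The function names `discrete_adaboost` and `predict_adaboost`, the column name `cls`, and the toy data are hypothetical.

```r
library(rpart)  # recommended package that ships with R

# Discrete AdaBoost with decision stumps; y must be coded as -1 / +1.
# Assumes x has no column named "cls".
discrete_adaboost <- function(x, y, n_iter = 20) {
  dat <- data.frame(x, cls = factor(y, levels = c(-1, 1)))
  n <- nrow(dat)
  w <- rep(1 / n, n)                       # start from uniform weights
  stumps <- vector("list", n_iter)
  coefs <- numeric(n_iter)
  for (m in seq_len(n_iter)) {
    # Fit a stump f_m on the weighted sample
    fit <- rpart(cls ~ ., data = dat, weights = w, method = "class",
                 control = rpart.control(maxdepth = 1, cp = -1,
                                         minsplit = 0, xval = 0))
    pred <- ifelse(predict(fit, dat, type = "class") == "1", 1, -1)
    miss <- as.numeric(pred != y)
    err <- sum(w * miss) / sum(w)          # weighted error of f_m
    err <- min(max(err, 1e-10), 1 - 1e-10) # guard against log(0)
    coefs[m] <- log((1 - err) / err)       # stage weight c_m (assumed form)
    w <- w * exp(coefs[m] * miss)          # upweight misclassified cases
    w <- w / sum(w)
    stumps[[m]] <- fit
  }
  list(stumps = stumps, coefs = coefs)
}

# Prediction is sign(F(x)) with F(x) = sum_m c_m f_m(x).
predict_adaboost <- function(model, newx) {
  newdat <- data.frame(newx)
  score <- rep(0, nrow(newdat))
  for (m in seq_along(model$stumps)) {
    f_m <- ifelse(predict(model$stumps[[m]], newdat, type = "class") == "1", 1, -1)
    score <- score + model$coefs[m] * f_m
  }
  sign(score)
}

# Hypothetical toy data: two classes separated by a diagonal boundary.
set.seed(1)
x <- data.frame(x1 = runif(200), x2 = runif(200))
y <- ifelse(x$x1 + x$x2 > 1, 1, -1)
model <- discrete_adaboost(x, y, n_iter = 20)
train_acc <- mean(predict_adaboost(model, x) == y)
```

A single stump cannot represent the diagonal boundary in the toy data, so the example shows how the weighted combination of stumps improves on any one of them.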
install.packages("ada")
library("rpart")
library("ada")
ada(x, y, test.x, test.y=NULL, loss=c("exponential","logistic"),
    type=c("discrete","real","gentle"), iter=50, nu=0.1, bag.frac=0.5,
    model.coef=TRUE, bag.shift=FALSE, max.iter=20, delta=10^(-10),
    verbose=FALSE, ..., na.action=na.rpart)
x: matrix of descriptors.
y: vector of responses. ‘y’ may have only two unique values.
test.x: testing matrix of descriptors (optional).
test.y: vector of testing responses (optional).
loss: loss="exponential", "ada", "e" or any variation corresponds to the default boosting under exponential loss. loss="logistic", "l2", "l" provides boosting under logistic loss.
type: type of boosting algorithm to perform. "discrete" performs Discrete Boost (default). "real" performs Real Boost. "gentle" performs Gentle Boost.
iter: number of boosting iterations to perform. Default = 50.
nu: shrinkage parameter for boosting. The usage line above sets the default to 0.1.
bag.frac: sampling fraction for samples taken out-of-bag. This allows one to use random permutation, which improves performance.
model.coef: flag to use stage weights in boosting. If FALSE, the procedure corresponds to epsilon-boosting.
bag.shift: flag to determine whether the stage weights should go to one as nu goes to zero. This only makes sense if bag.frac is small. The rationale behind this parameter is discussed in (Culp et al., 2006).
max.iter: number of iterations to perform in the Newton step to determine the coefficient.
delta: tolerance for convergence of the Newton step to determine the coefficient.
verbose: print the number of iterations necessary for convergence of a coefficient.
formula: a symbolic description of the model to be fit.
data: an optional data frame containing the variables in the model.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
na.action: a function that indicates how to process ‘NA’ values. Default = na.rpart.
...: arguments passed to rpart.control. For stumps, use rpart.control(maxdepth=1, cp=-1, minsplit=0, xval=0). maxdepth controls the depth of trees, and cp controls the complexity of trees. The priors should also be fixed through the parms argument as discussed in the second reference.
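As a minimal sketch of how the stump settings above might be passed along, the example below builds the control object with `rpart.control` and hands it to `ada` through the `control` argument, the same way the soldat example in this document does. The data frame `toy` is hypothetical and simulated for illustration only, and the fit is attempted only when the ada package is installed.

```r
library(rpart)

# Stump settings quoted in the argument list above
stump_control <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0)

# Hypothetical toy data, for illustration only; y takes two values, -1 and 1
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- ifelse(toy$x1 + toy$x2 > 0, 1, -1)

# Fit Discrete AdaBoost on stumps only when the ada package is available
if (requireNamespace("ada", quietly = TRUE)) {
  fit <- ada::ada(y ~ ., data = toy, type = "discrete", iter = 20,
                  control = stump_control)
  print(fit)
}
```

With maxdepth = 1 every base learner is a single split, which keeps each stage weak and leaves the combination across iterations to do the work.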
summary(AdaObject)
varplot(VariableImportanceObject)
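A summary of this information can also be obtained with the command "print(x)". The corresponding functions are listed below (use help on summary.ada, predict.ada, ... varplot for additional information on these commands):
summary: prints a summary of the original function call, the method used for boosting, the number of iterations, the final confusion matrix, the accuracy, and the kappa statistic (a measure of agreement between the observed and the predicted classification). ‘summary’ can be used for training, testing, or validation data.
predict: predicts the response for any data set (train, test, or validation).
plot: plots the performance of the algorithm across boosting iterations. The default plot shows the iteration number (x-axis) against the prediction error (y-axis) for the data set used to build the model. The function can also produce an error plot for an external test set, as well as kappa plots for the training and test sets.
pairs: produces pairwise plots of the descriptors. The descriptors are arranged in decreasing order of how often boosting selected them (upper left = most frequently selected). The color of a marker indicates class membership; the size of a marker indicates the predicted class probability. The larger the marker, the higher the class probability.
varplot: plots the variables ordered by a variable importance measure (based on improvement).
addtest: adds a test data set to the ada object, so that the test errors only have to be computed once.
update: adds more trees to the ada object.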
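To make the accuracy and kappa figures reported by summary more concrete, here is a small sketch that computes both from observed and predicted labels in base R. It assumes the reported kappa is Cohen's kappa, which matches the description above (agreement between observed and predicted classification) but is not confirmed by the text. The function name `accuracy_kappa` and the label vectors are hypothetical.

```r
# Accuracy and Cohen's kappa from observed and predicted class labels.
accuracy_kappa <- function(observed, predicted) {
  lev <- sort(unique(c(observed, predicted)))
  # Confusion matrix with the same level order on rows and columns
  tab <- table(factor(observed, levels = lev), factor(predicted, levels = lev))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  c(accuracy = p_obs, kappa = (p_obs - p_exp) / (1 - p_exp))
}

# Hypothetical labels, for illustration only
observed  <- c(1, 1, 1, 1, -1, -1, -1, -1, -1, -1)
predicted <- c(1, 1, 1, -1, -1, -1, -1, -1, 1, 1)
res <- accuracy_kappa(observed, predicted)
print(res)
```

For these labels the accuracy is 0.7 and kappa is 0.4: seven of ten cases agree, while chance agreement alone would give 0.5. This is why kappa is reported alongside accuracy.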
x1: a numeric vector
x2: a numeric vector
x3: a numeric vector
x4: a numeric vector
x5: a numeric vector
x6: a numeric vector
x7: a numeric vector
x8: a numeric vector
x9: a numeric vector
x10: a numeric vector
x11: a numeric vector
x12: a numeric vector
x13: a numeric vector
x14: a numeric vector
x15: a numeric vector
x16: a numeric vector
x17: a numeric vector
x18: a numeric vector
x19: a numeric vector
x20: a numeric vector
. . .
x72: a numeric vector with missing data
y: a numeric vector
data("soldat")
n <- nrow(soldat)
set.seed(100)
ind <- sample(1:n)                  # random permutation of the row indices
trainval <- ceiling(n * .5)         # 50% of the rows for training
testval <- ceiling(n * .3)          # 30% for testing; the remainder is validation
train <- soldat[ind[1:trainval],]
test <- soldat[ind[(trainval + 1):(trainval + testval)],]
valid <- soldat[ind[(trainval + testval + 1):n],]
# Column 73 of soldat is the response y; columns 1 to 72 are the descriptors
control <- rpart.control(cp = -1, maxdepth = 14, maxcompete = 1, xval = 0)
gen1 <- ada(y~., data = train, test.x = test[,-73], test.y = test[,73],
            type = "gentle", control = control, iter = 70)
# Add the validation set as a second test set
gen1 <- addtest(gen1, valid[,-73], valid[,73])
summary(gen1)
varplot(gen1)
Loss: exponential Method: gentle Iteration: 70

Training Results
Accuracy: 0.987 Kappa: 0.972

Testing Results
Accuracy: 0.765 Kappa: 0.487
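Testing accuracies are printed in the order in which the data sets were entered, so the accuracy on the test set is 0.765 and the accuracy on the validation set is 0.781. For this type of early drug discovery data, the Gentle AdaBoost algorithm performs well, with a test set accuracy of 76.5% (kappa of approximately 0.5). To improve our understanding of the relationship between the descriptors and the response, the varplot function was used.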