Data Mining Algorithms In R/Classification/AdaBoost

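Boosting is one of the most important developments in classification methodology. It works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the resulting sequence of classifiers. For many classification algorithms, this simple strategy yields a marked improvement in performance. This seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale, using maximum Bernoulli likelihood as the criterion.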
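The connection to the logistic scale can be made concrete with a result from Friedman, Hastie and Tibshirani (listed in the references). As a brief sketch, assuming labels $y\in\{-1,1\}$ and writing $F^{*}$ for the population minimizer of the exponential loss (a symbol introduced here for illustration):

$$
F^{*}(x)=\arg\min_{F}\,E\left[e^{-yF(x)}\mid x\right]=\frac{1}{2}\log\frac{P(y=1\mid x)}{P(y=-1\mid x)},
\qquad
P(y=1\mid x)=\frac{e^{F^{*}(x)}}{e^{F^{*}(x)}+e^{-F^{*}(x)}}
$$

In other words, the function that boosting estimates is half the log-odds of class 1, which is why the additive model can be read as living on the logistic scale.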

Technique/Algorithm

Algorithm

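Although boosting has evolved over the years, we describe the most commonly used version of the AdaBoost procedure (Freund and Schapire, 1996), which we call Discrete AdaBoost. It is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, where $x_{i}$ is a vector-valued feature and $y_{i}=-1$ or $1$. We define $F(x)=\sum_{m=1}^{M}c_{m}f_{m}(x)$, where each $f_{m}(x)$ is a classifier producing values plus or minus 1 and the $c_{m}$ are constants; the corresponding prediction is $\operatorname{sign}(F(x))$. AdaBoost trains the classifiers $f_{m}(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and the final classifier is then defined as a linear combination of the classifiers from each stage.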
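The procedure described above can be sketched in a few lines of base R. This is a minimal illustration, not the code used by the ada package: the weak learner is a hand-written decision stump, the stage weight follows the Discrete AdaBoost form $c_{m}=\log((1-err_{m})/err_{m})$ given by Friedman, Hastie and Tibshirani, and all function names (`fit_stump`, `predict_stump`, `adaboost_fit`, `adaboost_predict`) and the simulated data are assumptions made for this example.

```r
# Weighted decision stump: choose the feature, threshold and direction
# with the smallest weighted misclassification error.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {
    for (s in unique(x[, j])) {
      for (d in c(1, -1)) {
        pred <- ifelse(x[, j] <= s, -d, d)
        err <- sum(w * (pred != y))
        if (err < best$err) best <- list(err = err, j = j, s = s, d = d)
      }
    }
  }
  best
}

predict_stump <- function(st, x) ifelse(x[, st$j] <= st$s, -st$d, st$d)

# Discrete AdaBoost; y must be coded as -1 / +1.
adaboost_fit <- function(x, y, M = 20) {
  n <- nrow(x)
  w <- rep(1 / n, n)                        # start from uniform weights
  stumps <- vector("list", M)
  cm <- numeric(M)
  for (m in seq_len(M)) {
    st <- fit_stump(x, y, w)
    miss <- predict_stump(st, x) != y
    # weighted error, clamped away from 0 and 1 to keep the log finite
    err <- min(max(sum(w * miss), 1e-10), 1 - 1e-10)
    cm[m] <- log((1 - err) / err)           # stage weight c_m
    w <- w * exp(cm[m] * miss)              # up-weight misclassified cases
    w <- w / sum(w)                         # renormalize
    stumps[[m]] <- st
  }
  list(stumps = stumps, cm = cm)
}

# Final classifier: sign of the weighted sum F(x) of the stage classifiers.
adaboost_predict <- function(fit, x) {
  Fx <- rep(0, nrow(x))
  for (m in seq_along(fit$cm)) {
    Fx <- Fx + fit$cm[m] * predict_stump(fit$stumps[[m]], x)
  }
  sign(Fx)
}

# Simulated two-class data: points inside a circle are class +1.
set.seed(1)
x <- matrix(runif(400, -1, 1), ncol = 2)
y <- ifelse(x[, 1]^2 + x[, 2]^2 < 0.5, 1, -1)
fit <- adaboost_fit(x, y, M = 30)
train_acc <- mean(adaboost_predict(fit, x) == y)
print(train_acc)
```

Each stump alone can only split on one variable, yet the weighted vote over 30 reweighted stumps can approximate the circular class boundary, which illustrates why the final classifier is defined as a linear combination of the stage classifiers.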

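Implementation

AdaBoost is provided by the ada package. This section describes how to install and use it in the R environment.

Type the following commands in the R console to install and load the ada package: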

install.packages("ada")
library("rpart")
library("ada")

The function used to run the AdaBoost algorithm is

ada(x, y, test.x, test.y=NULL, loss=c("exponential","logistic"), type=c("discrete","real","gentle"), iter=50, nu=0.1, bag.frac=0.5, model.coef=TRUE, bag.shift=FALSE, max.iter=20, delta=10^(-10), verbose=FALSE, ..., na.action=na.rpart)

The arguments are

x: matrix of descriptors.

y: vector of responses. ‘y’ may have only two unique values.

test.x: testing matrix of descriptors (optional)

test.y: vector of testing responses (optional)

loss: loss="exponential", "ada","e" or any variation corresponds to the default boosting
under exponential loss. loss="logistic","l2","l" provides boosting under logistic
loss.

type: type of boosting algorithm to perform. “discrete” performs discrete Boosting
(default). “real” performs Real Boost. “gentle” performs Gentle Boost.

iter: number of boosting iterations to perform. Default = 50.

nu: shrinkage parameter for boosting. Default = 0.1 in the usage shown above.

bag.frac: sampling fraction for samples taken out-of-bag. This allows one to use random
permutation which improves performance.

model.coef: flag to use stageweights in boosting. If FALSE then the procedure corresponds
to epsilon-boosting.

bag.shift: flag to determine whether the stageweights should go to one as nu goes to zero.
This only makes sense if bag.frac is small. The rationale behind this parameter is discussed in (Culp et al., 2006).

max.iter: number of iterations to perform in the newton step to determine the coefficient.

delta: tolerance for convergence of the newton step to determine the coefficient.

verbose: print the number of iterations necessary for convergence of a coefficient.

formula: a symbolic description of the model to be fit.

data: an optional data frame containing the variables in the model.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

na.action: a function that indicates how to process ‘NA’ values. Default=na.rpart.

...: arguments passed to rpart.control. For stumps, use rpart.control(maxdepth=1,cp=-1,minsplit=0,xval=0). maxdepth controls the depth of trees, and cp controls the complexity of trees. The priors should also be fixed through the parms argument as discussed in the second reference.
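As an illustration of how these arguments fit together, the following sketch boosts stumps on a two-class subset of the built-in iris data, using the stump settings quoted in the argument list. The choice of data set, the seed, and the object names `dat`, `stump` and `fit` are assumptions made for this example; the code only runs if the ada package is installed.

```r
if (requireNamespace("ada", quietly = TRUE)) {
  library(rpart)
  library(ada)

  # ada needs a response with exactly two unique values,
  # so drop one of the three iris species.
  dat <- droplevels(iris[iris$Species != "setosa", ])

  # Stump settings from the description of the '...' argument
  stump <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0)

  set.seed(1)
  fit <- ada(Species ~ ., data = dat,
             loss = "exponential",   # default boosting loss
             type = "discrete",      # Discrete AdaBoost
             iter = 50, nu = 0.1,    # iterations and shrinkage
             control = stump)
  print(fit)
}
```

Replacing `type = "discrete"` with `"real"` or `"gentle"` switches the boosting variant while keeping the same weak learner.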

Type the following commands to display the results of the algorithm:

summary(AdaObject)
varplot(VariableImportanceObject)

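When using the “ada(x,y)” usage: the x data can be in the form data.frame or as.matrix. The y data can be in the form data.frame, as.factor, as.matrix, as.array, or as.table. Missing values must be removed from the data prior to execution.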

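When using the “ada(y~.)” usage: the data must be in a data frame. The response can have factor or numeric values. Missing values can be present in the descriptor data whenever na.action is set to any option other than na.pass.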

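After the model is fit, “ada” prints a summary of the function call, the method used for boosting, the number of iterations, the final confusion matrix (observed classification vs. predicted classification; class labels are the same as in the response), the error for the training set, and the testing, training, and kappa estimates of the appropriate number of iterations.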

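A summary of this information can also be obtained with the command “print(x)”. Corresponding functions (use help to see summary.ada, predict.ada, ... varplot for additional information on these commands):

summary : function to print a summary of the original function call, the method used for boosting, the number of iterations, the final confusion matrix, the accuracy, and the kappa statistic (a measure of agreement between the observed classification and the predicted classification). ‘summary’ can be used for training, testing, or validation data.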

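predict : function to predict the response for any data set (training, testing, or validation).

plot : function to plot the performance of the algorithm across boosting iterations. The default plot is iteration number (x-axis) versus prediction error (y-axis) for the data set used to build the model. The function can also produce, at the same time, an error plot for an external test set and kappa plots for the training and test sets.

pairs : function to produce pairwise plots of the descriptors. The descriptors are arranged by decreasing frequency of selection by boosting (upper left = most frequently selected). The color of a marker indicates class membership; the size of a marker indicates the predicted class probability. The larger the marker, the higher the probability of classification.

varplot : plot of the variables ordered by the variable importance measure (based on improvement).

addtest : adds a test data set to the ada object, so that the testing errors only have to be computed once.

update : adds more trees to the ada object.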
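To make the accuracy and kappa figures reported by summary easier to interpret, both can be computed by hand from a confusion matrix. The sketch below uses the usual definition of Cohen's kappa and assumes this is the definition ada reports; the function name `accuracy_kappa` and the counts in the matrix are hypothetical and chosen only for illustration.

```r
# Accuracy and Cohen's kappa from a square confusion matrix
# (rows = observed classes, columns = predicted classes).
accuracy_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  c(accuracy = po, kappa = (po - pe) / (1 - pe))
}

# Hypothetical counts, for illustration only
cm <- matrix(c(40, 10,
                5, 45), nrow = 2, byrow = TRUE,
             dimnames = list(observed  = c("-1", "1"),
                             predicted = c("-1", "1")))
res <- accuracy_kappa(cm)
print(res)
```

For these counts the accuracy is 0.85 and kappa is 0.70: kappa is lower because half of the agreement would be expected by chance alone given the row and column totals.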

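Case Study

Scenario

The data set contains information about compounds used in drug discovery. Specifically, it contains 5631 compounds on which an in-house solubility screen (the ability of a compound to dissolve in a water/solvent mixture) was performed. Based on this screen, compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. Of these descriptors, one contained missing values for approximately 14% (n=787) of the observations. The objective of the analysis is to model the relationship between the structural descriptors and the solubility class. The data set will be referred to as soldat.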

Data

Input format

x1 a numeric vector
x2 a numeric vector
x3 a numeric vector
x4 a numeric vector
x5 a numeric vector
x6 a numeric vector
x7 a numeric vector
x8 a numeric vector
x9 a numeric vector
x10 a numeric vector
x11 a numeric vector
x12 a numeric vector
x13 a numeric vector
x14 a numeric vector
x15 a numeric vector
x16 a numeric vector
x17 a numeric vector
x18 a numeric vector
x19 a numeric vector
x20 a numeric vector
.
.
.
x72 a numeric vector with missing data
y a numeric vector

Execution

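# Load the solubility data shipped with the ada package
data("soldat")
n <- nrow(soldat)
set.seed(100)
# Shuffle the row indices, then split into 50% training, 30% test,
# and the remaining 20% validation
ind <- sample(1:n)
trainval <- ceiling(n * .5)
testval <- ceiling(n * .3)
train <- soldat[ind[1:trainval],]
test <- soldat[ind[(trainval + 1):(trainval + testval)],]
valid <- soldat[ind[(trainval + testval + 1):n],]

# Column 73 is the response y; columns 1-72 are the descriptors
control <- rpart.control(cp = -1, maxdepth = 14, maxcompete = 1, xval = 0)
# Fit Gentle AdaBoost for 70 iterations, tracking the error on the test set
gen1 <- ada(y~., data = train, test.x = test[,-73], test.y = test[,73], type = "gentle", control = control, iter = 70)
# Add the validation set as a second test set
gen1 <- addtest(gen1, valid[,-73], valid[,73])
summary(gen1)
varplot(gen1)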
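The sizes of the three partitions follow directly from the split logic. The sketch below redoes the arithmetic without loading the data, assuming soldat has the 5631 rows stated in the scenario; `validval` is a name introduced here for illustration.

```r
n <- 5631                            # number of compounds stated in the scenario
trainval <- ceiling(n * 0.5)         # 2816 training rows
testval  <- ceiling(n * 0.3)         # 1690 test rows
validval <- n - trainval - testval   # 1125 validation rows
sizes <- c(train = trainval, test = testval, valid = validval)
print(sizes)
```

Because both proportions are rounded up with ceiling, the validation set receives slightly less than 20% of the rows.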

Output

Loss: exponential Method: gentle Iteration: 70
Training Results
Accuracy: 0.987 Kappa: 0.972
Testing Results
Accuracy: 0.765 Kappa: 0.487

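Analysis

The testing accuracies are printed in the order in which they were entered, so the accuracy on the testing set is 0.765 and the accuracy on the validation set is 0.781. For this type of early drug discovery data, the Gentle AdaBoost algorithm performs well, with a test set accuracy of 76.5% (kappa of approximately 0.5). To enhance our understanding of the relationship between the descriptors and the response, the varplot function was used.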
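One way to put the 76.5% figure in context is to compare it with the accuracy of always predicting the majority class. The sketch below uses the class counts given in the scenario and assumes the test split has roughly the same class balance as the full data set; the variable names are introduced here for illustration.

```r
n_insoluble <- 3493
n_soluble   <- 2138
n_total     <- n_insoluble + n_soluble             # 5631 compounds
baseline    <- max(n_insoluble, n_soluble) / n_total
print(round(baseline, 3))                          # about 0.62
```

Under that assumption, the reported test accuracy of 0.765 is roughly 14 percentage points above the majority-class baseline, which is consistent with a kappa of about 0.5 rather than one near 1.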

References

Meira Jr., W.; Zaki, M. Fundamentals of Data Mining Algorithms.
CBA R package.
Friedman, J.; Hastie, T.; Tibshirani, R. Additive Logistic Regression: a Statistical View of Boosting.