R语言机器学习：caret包使用及其黑箱模型解释（离散变数预测）

作者：黄天元，复旦大学博士在读，热爱数据科学与开源工具（R），致力于利用数据科学迅速积累行业经验优势和学术知识发现。知乎专栏：R语言数据挖掘 邮箱：[email protected].欢迎合作交流。

上次在

HopeR：R语言机器学习：caret包使用及其黑箱模型解释（连续变数预测)?

zhuanlan.zhihu.com

中介绍了如何使用caret包建模并采用DALEX包进行模型的解释。当时是针对连续型变数进行探索，这次我们针对响应变数为离散变数（分类变数）的模型进行黑箱解释。

1 包的载入与数据导入

安装4个包。

library(pacman) p_load(DALEX,caret,tidyverse,breakDown)

观察我们要使用的目标数据：

library(breakDown) data(wine) ? wine %>% as_tibble

# A tibble: 4,898 x 12
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dio~ total.sulfur.di~ density
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7 0.27 0.36 20.7 0.045 45 170 1.00
2 6.3 0.3 0.34 1.6 0.049 14 132 0.994
3 8.1 0.28 0.4 6.9 0.05 30 97 0.995
4 7.2 0.23 0.32 8.5 0.058 47 186 0.996
5 7.2 0.23 0.32 8.5 0.058 47 186 0.996
6 8.1 0.28 0.4 6.9 0.05 30 97 0.995
7 6.2 0.32 0.16 7 0.045 30 136 0.995
8 7 0.27 0.36 20.7 0.045 45 170 1.00
9 6.3 0.3 0.34 1.6 0.049 14 132 0.994
10 8.1 0.22 0.43 1.5 0.044 28 129 0.994
# ... with 4,888 more rows, and 4 more variables: pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <fct>

下面，我们对酒的质量进行因子花，如果质量数值大于5，则标志为1，否则标志为0.

wine$quality <- factor(ifelse(wine$quality > 5, 1, 0))

2 使用caret包迅速建模

本次建模过程中，我们会手动划分训练集和测试集。其中，训练集的比例为60%。

trainIndex <- createDataPartition(wine$quality, p = 0.6, list = FALSE, times = 1) wineTrain <- wine[ trainIndex,] wineTest <- wine[-trainIndex,]

随后，我们开始建立模型。我们分别创建随机森林、逻辑回归和支持向量机模型。除了酒的质量以外，其他变数都设为解释变数。

classif_rf <- train(quality~., data = wineTrain, method="rf", ntree = 100, tuneLength = 1) ? classif_glm <- train(quality~., data = wineTrain, method="glm", family="binomial") ? classif_svm <- train(quality~., data = wineTrain, method="svmRadial", prob.model = TRUE, tuneLength = 1)

3 对模型进行解释

这里直接利用DALEX包的explain函数对三个模型进行解释性分析。需要注意的是，做这个分析需要包含4个信息：1.模型信息；2.标签信息（如果没有，会自动从模型抽取）；3.验证数据集；4.验证数据集中哪个是响应变数。除此以外，在这个问题中特别需要注意的是：这些建模函数默认响应变数是因子变数的时候，才会进行分类。但是就得到的结果而言，必须是似然函数得到的概率值，才能够对其进行更加精细的解释。所以，必须把预测函数进行修饰，让它能够返回极大似然估计，而不是0和1。此外，对于验证数据集而言，必须要把它的响应变数转化为数值，才能够与预测值进行比较。代码如下：

#定义求得似然估计的函数 p_fun <- function(object, newdata){predict(object, newdata=newdata, type="prob")[,2]} ? #把测试集的响应变数转化为数值 yTest <- as.numeric(as.character(wineTest$quality)) ? explainer_classif_rf <- DALEX::explain(classif_rf, label = "rf", data = wineTest, y = yTest, predict_function = p_fun) ? explainer_classif_glm <- DALEX::explain(classif_glm, label = "glm", data = wineTest, y = yTest, predict_function = p_fun) ? explainer_classif_svm <- DALEX::explain(classif_svm, label = "svm", data = wineTest, y = yTest, predict_function = p_fun)

建模可能很久，但是解释性验证是非常快的，直接是黑箱的映射关系。

4 模型表现

后面的步骤，基本与之前连续变数的相似，不再进行赘述，感兴趣请观看https://zhuanlan.zhihu.com/p/59329696。

mp_classif_rf <- model_performance(explainer_classif_rf) mp_classif_glm <- model_performance(explainer_classif_glm) mp_classif_svm <- model_performance(explainer_classif_svm) ? plot(mp_classif_rf, mp_classif_glm, mp_classif_svm)

plot(mp_classif_rf, mp_classif_glm, mp_classif_svm, geom = "boxplot")

就这个结果看来，随机森林的效果最好。

5 变数重要性分析

需要看每个模型中，不同变数对于模型预测的相对重要性，可以用如下方法。

vi_classif_rf <- variable_importance(explainer_classif_rf, loss_function = loss_root_mean_square) vi_classif_glm <- variable_importance(explainer_classif_glm, loss_function = loss_root_mean_square) vi_classif_svm <- variable_importance(explainer_classif_svm, loss_function = loss_root_mean_square) ? plot(vi_classif_rf, vi_classif_glm, vi_classif_svm)

损失函数使用的是RMSE，这里解释为：如果模型少了这个变数，将会给响应变数的预测值带来多大影响（我们已经把因子编程数值，因此最小化的还是RMSE）？自信观察可以发现，rf和glm两个模型大相径庭，也就是说这些模型对特征的利用方法不同，导致它们对特征看重的程度不同。我自己把它称之为模型的特征利用率。

6 变数解析

下面，我们分别做PDP图和ALE图。我们先对pH值这个变数进行分析，PDP图如下：

pdp_classif_rf <- variable_response(explainer_classif_rf, variable = "pH", type = "pdp") pdp_classif_glm <- variable_response(explainer_classif_glm, variable = "pH", type = "pdp") pdp_classif_svm <- variable_response(explainer_classif_svm, variable = "pH", type = "pdp") ? plot(pdp_classif_rf, pdp_classif_glm, pdp_classif_svm)

我们看到，在前面认为残差较小的随机森林模型中，随著pH值的变化，整个响应变数的变化是非线性的。SVM也稍微能够捕捉到这个信息，但是逻辑回归则完全不行。下面来看ALE图：

ale_classif_rf <- variable_response(explainer_classif_rf, variable = "alcohol", type = "ale") ale_classif_glm <- variable_response(explainer_classif_glm, variable = "alcohol", type = "ale") ale_classif_svm <- variable_response(explainer_classif_svm, variable = "alcohol", type = "ale") ? plot(ale_classif_rf, ale_classif_glm, ale_classif_svm)