Machine Learning in R: Using the caret Package and Explaining Its Black-Box Models (Predicting a Continuous Variable)

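Author: Huang Tianyuan, PhD candidate at Fudan University. He is passionate about data science and open-source tools (R), and works on using data science to quickly build up industry experience and to discover academic knowledge. Zhihu column: R語言數據挖掘 (R Language Data Mining). Email: [email protected]. Collaboration and discussion are welcome.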

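caret is one of R's general-purpose machine learning packages. It lets you use many different models within a unified framework, with friendly function wrappers for everything from preprocessing and model fitting to prediction and evaluation. DALEX, a package I learned recently, is a powerful tool for making black-box models interpretable. In fact it is not limited to black-box models: for any model it can provide valuable information such as a performance assessment and variable importance. Following the official documentation, this article tries to work out a general recipe for using DALEX to explain models built with caret.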

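1 Loading packages and importing data

Load the three packages (p_load from pacman also installs any that are missing).

library(pacman)
p_load(DALEX, caret, tidyverse)

Take a look at the data we will use: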

apartments %>% as_tibble

# A tibble: 1,000 x 6
   m2.price construction.year surface floor no.rooms district
      <dbl>             <dbl>   <dbl> <int>    <dbl> <fct>
 1     5897              1953      25     3        1 Srodmiescie
 2     1818              1992     143     9        5 Bielany
 3     3643              1937      56     1        2 Praga
 4     3517              1995      93     7        3 Ochota
 5     3013              1992     144     6        5 Mokotow
 6     5795              1926      61     6        2 Srodmiescie
 7     2983              1970     127     8        5 Mokotow
 8     2346              1985     105     8        4 Ursus
 9     4745              1928     145     6        6 Srodmiescie
10     4284              1949     112     9        4 Srodmiescie
# ... with 990 more rows

2 Building models quickly with caret

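Here m2.price is the response variable and all other variables are explanatory variables. The models tried are a random forest, GBM and a neural network. The random forest uses 100 trees and GBM uses its default settings. For the neural network, the predictors are centered and scaled during preprocessing, the maximum number of iterations is 500, and a linear output unit is used. The tuneGrid option is where a grid of hyperparameters would be supplied for tuning. Here it holds a single combination: size = 2, meaning two units in the hidden layer (nnet fits a network with a single hidden layer), and a weight decay of 0. Since only one value is given for each, no grid search is actually performed. The code is as follows: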

# The code below may take a while to run

set.seed(123)
# Random forest with 100 trees
regr_rf <- train(m2.price ~ ., data = apartments, method = "rf", ntree = 100)

# GBM with default settings
regr_gbm <- train(m2.price ~ ., data = apartments, method = "gbm")

# Neural network; preProcess expects the method names as strings
regr_nn <- train(m2.price ~ ., data = apartments,
                 method = "nnet",
                 linout = TRUE,
                 preProcess = c("center", "scale"),
                 maxit = 500,
                 tuneGrid = expand.grid(size = 2, decay = 0),
                 trControl = trainControl(method = "none", seeds = 1))
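To make the preprocessing and tuning-grid options concrete, here is a minimal base-R sketch. It is not what caret runs internally. The toy vector surface_toy (the first five surface values from the table above) and the larger two-by-two grid are assumptions made only for illustration.

```r
# Toy values: the first five surface values shown earlier (illustration only)
surface_toy <- c(25, 143, 56, 93, 144)

# "center" subtracts the mean, "scale" divides by the standard deviation
surface_std <- (surface_toy - mean(surface_toy)) / sd(surface_toy)

# The single-row grid used in the article: one candidate, so nothing to search
grid_single <- expand.grid(size = 2, decay = 0)

# A hypothetical larger grid: every combination of the supplied values
grid_search <- expand.grid(size = c(2, 4), decay = c(0, 0.1))

print(round(surface_std, 3))
print(nrow(grid_single))
print(nrow(grid_search))
```

A grid with more than one row only makes sense together with a resampling scheme in trainControl; with method = "none" caret fits exactly one model, which is why the article passes a single combination.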

3 Explaining the models

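Here we use the explain function from DALEX directly to build an explainer for each of the three models. Note that an explainer needs four pieces of information: 1. the model; 2. a label (if omitted, it is extracted from the model automatically); 3. a validation dataset; 4. the values of the response variable in that validation dataset. The code is as follows: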

data(apartmentsTest)

explainer_regr_rf <- DALEX::explain(regr_rf, label="rf",
data = apartmentsTest, y = apartmentsTest$m2.price)

explainer_regr_gbm <- DALEX::explain(regr_gbm, label = "gbm",
data = apartmentsTest, y = apartmentsTest$m2.price)

explainer_regr_nn <- DALEX::explain(regr_nn, label = "nn",
data = apartmentsTest, y = apartmentsTest$m2.price)

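Model fitting can take a long time, but building the explainers is very fast: an explainer simply treats the fitted black box as a mapping from inputs to predictions.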
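The idea of an explainer can be sketched in a few lines of base R. This is a hypothetical simplification, not the DALEX implementation: make_explainer and explainer_residuals are made-up helper names, and the lm model on the built-in mtcars data is a stand-in chosen so the sketch runs without extra packages.

```r
# Hypothetical helper, NOT the DALEX implementation: bundle a model with
# validation data, true responses, a prediction function and a label.
make_explainer <- function(model, data, y, label,
                           predict_function = function(m, d) predict(m, newdata = d)) {
  list(model = model, data = data, y = y,
       predict_function = predict_function, label = label)
}

# Residuals need only the bundled pieces, never the model internals
explainer_residuals <- function(explainer) {
  explainer$y - explainer$predict_function(explainer$model, explainer$data)
}

# Toy model on a built-in dataset (an assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
toy_explainer <- make_explainer(fit, data = mtcars, y = mtcars$mpg, label = "lm")
res <- explainer_residuals(toy_explainer)
print(summary(res))
```

Because everything downstream goes through the prediction function, the same analysis code works for any model type, which is what makes the approach model-agnostic.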

4 Model performance

Next, the performance of each model needs to be analyzed:

mp_regr_rf <- model_performance(explainer_regr_rf)
mp_regr_gbm <- model_performance(explainer_regr_gbm)
mp_regr_nn <- model_performance(explainer_regr_nn)

Let us see what the result looks like:

mp_regr_rf

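What is printed is a summary of how the residuals are distributed over the samples. Let us visualize this distribution (a cumulative residual distribution plot):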

plot(mp_regr_rf, mp_regr_nn, mp_regr_gbm)

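The way to read this plot: a small number of samples (outliers) contribute a large share of the residuals (deviations from the true values). The higher a curve lies, the more samples have large residuals. The plot shows that for GBM most samples have fairly small residuals, while for the neural network many samples have larger residuals than under the tree-based models. Let us try another visualization: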

plot(mp_regr_rf, mp_regr_nn, mp_regr_gbm, geom = "boxplot")

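The difference is obvious at a glance. The boxes show the quantiles of the absolute residuals, and the red dot marks, according to the DALEX documentation, the root mean square of the residuals.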
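The two views above can be approximated by hand. The following base-R sketch uses made-up residuals for two hypothetical models (they are not results from this article) and computes, for a few thresholds, the share of samples whose absolute residual exceeds the threshold, plus the root mean square of the residuals. The helper names share_above and root_mean_square are assumptions.

```r
# Made-up residuals for two hypothetical models (not results from the article)
res_a <- c(-120, 35, 60, -15, 400, 22, -48, 10)
res_b <- c(-300, 250, -410, 90, 380, -220, 160, -75)

# Share of samples whose absolute residual exceeds a threshold
share_above <- function(residuals, threshold) {
  mean(abs(residuals) > threshold)
}

# Root mean square of the residuals
root_mean_square <- function(residuals) sqrt(mean(residuals^2))

thresholds <- c(0, 50, 100, 200)
curve_a <- sapply(thresholds, share_above, residuals = res_a)
curve_b <- sapply(thresholds, share_above, residuals = res_b)

print(data.frame(threshold = thresholds, model_a = curve_a, model_b = curve_b))
print(c(rms_a = root_mean_square(res_a), rms_b = root_mean_square(res_b)))
```

A model whose curve stays higher across thresholds has more samples with large errors. Model a here has one large outlier, which inflates its root mean square even though most of its residuals are small.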

5 Variable importance

To see how important each variable is, relative to the others, for each model's predictions, use the following.

vi_regr_rf <- variable_importance(explainer_regr_rf, loss_function = loss_root_mean_square)
vi_regr_gbm <- variable_importance(explainer_regr_gbm, loss_function = loss_root_mean_square)
vi_regr_nn <- variable_importance(explainer_regr_nn, loss_function = loss_root_mean_square)

plot(vi_regr_rf, vi_regr_gbm, vi_regr_nn)

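The loss function used is RMSE. The importance is read as: if the model loses the information in this variable, how much do its predictions of the response degrade? DALEX estimates this by permuting the values of the variable and measuring how much the loss increases.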
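The permutation idea behind this kind of importance measure can be sketched by hand. The code below is a simplified illustration, not the DALEX implementation: the lm model on the built-in mtcars data, the helper names rmse and permuted_loss, and the choice of 20 repetitions are all assumptions.

```r
set.seed(1)

# Toy model on a built-in dataset (an assumption for illustration)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

rmse <- function(y, pred) sqrt(mean((y - pred)^2))

# Loss of the model on the unmodified data
baseline <- rmse(mtcars$mpg, predict(fit, newdata = mtcars))

# Average loss after shuffling one variable, which breaks its link to the response
permuted_loss <- function(model, data, y, variable, n_rep = 20) {
  losses <- replicate(n_rep, {
    shuffled <- data
    shuffled[[variable]] <- sample(shuffled[[variable]])
    rmse(y, predict(model, newdata = shuffled))
  })
  mean(losses)
}

vars <- c("wt", "hp", "qsec")
drop_out_loss <- sapply(vars, function(v) permuted_loss(fit, mtcars, mtcars$mpg, v))

# Increase in RMSE over the baseline; larger means more important
print(sort(drop_out_loss - baseline, decreasing = TRUE))
```

Averaging over several shuffles reduces the noise from any single permutation. The model is never refitted, so the procedure only needs a prediction function and works for any model type.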

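6 Variable response

6.1 Continuous variables

Partial Dependence Plots (PDP) are a way to explain the relationship between a single continuous explanatory variable and the response. There is a dedicated package and a paper describing how the method works; for details, see the official documentation of the pdp package. For example, to study how the construction year (construction.year) affects the response, the price, we do this:

pdp_regr_rf <- variable_response(explainer_regr_rf, variable = "construction.year", type = "pdp")
pdp_regr_gbm <- variable_response(explainer_regr_gbm, variable = "construction.year", type = "pdp")
pdp_regr_nn <- variable_response(explainer_regr_nn, variable = "construction.year", type = "pdp")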

plot(pdp_regr_rf, pdp_regr_gbm, pdp_regr_nn)

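Both the random forest and the GBM model show that construction year has a nonlinear relationship with price. Very old apartments and newly built ones are both expensive, while those built between the 1940s and the 1990s are cheaper. The neural network, however, does not capture this pattern well. There is also a method called Accumulated Local Effects (ALE), designed to deal with correlated explanatory variables; it is essentially an extension of the PDP approach. It is used as follows:

ale_regr_rf <- variable_response(explainer_regr_rf, variable = "construction.year", type = "ale")
ale_regr_gbm <- variable_response(explainer_regr_gbm, variable = "construction.year", type = "ale")
ale_regr_nn <- variable_response(explainer_regr_nn, variable = "construction.year", type = "ale")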

plot(ale_regr_rf, ale_regr_gbm, ale_regr_nn)
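The partial dependence idea used in this section can be sketched by hand: fix the variable of interest at one value for every row, average the model's predictions, and repeat over a grid of values. The code below is a simplified illustration rather than what DALEX or pdp run internally; the lm model on the built-in mtcars data and the helper name partial_dependence are assumptions.

```r
# Toy model on a built-in dataset (an assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# For each grid value: overwrite the variable in every row, then average predictions
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified))
  })
}

wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10)
pd <- partial_dependence(fit, mtcars, "wt", wt_grid)

print(data.frame(wt = round(wt_grid, 2), mean_prediction = round(pd, 2)))
```

For a linear model the resulting curve is a straight line whose slope is the coefficient of the variable. With a random forest or GBM the same procedure can reveal nonlinear shapes such as the one described for construction.year.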

6.2 Discrete variables

For discrete (categorical) variables, DALEX currently performs the analysis by calling the mergeFactors function from the factorMerger package.

mpp_regr_rf <- variable_response(explainer_regr_rf, variable = "district", type = "factor")
mpp_regr_gbm <- variable_response(explainer_regr_gbm, variable = "district", type = "factor")
mpp_regr_nn <- variable_response(explainer_regr_nn, variable = "district", type = "factor")

plot(mpp_regr_rf, mpp_regr_gbm, mpp_regr_nn)