This article records my notes from learning about model-interpretability methods.

Updated: 2019/1/24

  1. LIME
  2. Shap
  3. Traditional model feature-importance methods

LIME

1. LIME: a solution for TPOT-exported models and ordinary machine-learning models

# import lime tools
import lime
import lime.lime_tabular
import numpy as np

# build an "explainer"; features with at most 10 distinct values are treated as categorical
categorical_features = np.argwhere(
    np.array([len(set(x_T.values[:, x])) for x in range(x_T.shape[1])]) <= 10
).flatten()
explainer = lime.lime_tabular.LimeTabularExplainer(
    x_T.values,
    feature_names=x_T.columns.values,
    categorical_features=categorical_features,
    verbose=False,
    mode="regression",
    discretize_continuous=False,
)

# generate an explanation for a single instance
i = 9
exp = explainer.explain_instance(x_T.values[i], model.predict, num_features=100)

%matplotlib inline
fig = exp.as_pyplot_figure()
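
If you want the raw numbers instead of a plot, the explanation object can also be read directly; exp.as_list() returns (feature, weight) pairs:

# print each feature's local weight in the explanation
for feature, weight in exp.as_list():
    print(feature, weight)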

Since my work mainly involves regression models, below are the results of running LIME on regression models exported by TPOT.
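
For context, this is roughly how such pipeline files are produced. A minimal sketch, assuming x_T / y_T hold the training data (the search settings are illustrative, not the ones I used); the exported file contains an exported_pipeline definition like the ones listed below:

from tpot import TPOTRegressor

# search for a good pipeline, then export it as a standalone script
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(x_T.values, y_T.values)
tpot.export("exported_pipeline.py")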

  • Models that worked

exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.8500000000000001, min_samples_leaf=4, min_samples_split=10, n_estimators=100)
)

exported_pipeline = make_pipeline(
    StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.3, min_samples_leaf=4, min_samples_split=6, n_estimators=100)),
    StackingEstimator(estimator=RidgeCV()),
    RandomForestRegressor(bootstrap=True, max_features=0.9500000000000001, min_samples_leaf=8, min_samples_split=12, n_estimators=100)
)

exported_pipeline = ElasticNetCV(l1_ratio=1.0, tol=0.1)

exported_pipeline = make_pipeline(
    make_union(
        FastICA(tol=0.7000000000000001),
        PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
    ),
    RandomForestRegressor(bootstrap=False, max_features=0.1, min_samples_leaf=1, min_samples_split=8, n_estimators=100)
)

exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=ExtraTreesRegressor(bootstrap=False, max_features=0.8500000000000001, min_samples_leaf=3, min_samples_split=5, n_estimators=100)),
    ExtraTreesRegressor(bootstrap=False, max_features=0.7500000000000001, min_samples_leaf=3, min_samples_split=5, n_estimators=100)
)

exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=RidgeCV()),
    StackingEstimator(estimator=XGBRegressor(learning_rate=1.0, max_depth=4, min_child_weight=5, n_estimators=100, nthread=1, subsample=0.25)),
    StackingEstimator(estimator=GradientBoostingRegressor(alpha=0.85, learning_rate=1.0, loss="huber", max_depth=8, max_features=0.3, min_samples_leaf=9, min_samples_split=11, n_estimators=100, subsample=0.15000000000000002)),
    ExtraTreesRegressor(bootstrap=False, max_features=0.45, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)

exported_pipeline = ExtraTreesRegressor(bootstrap=False, max_features=0.9000000000000001, min_samples_leaf=2, min_samples_split=2, n_estimators=100)

exported_pipeline = make_pipeline(
    ZeroCount(),
    MinMaxScaler(),
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    XGBRegressor(learning_rate=0.1, max_depth=1, min_child_weight=9, n_estimators=100, nthread=1, subsample=0.8)
)

  • Models that failed

exported_pipeline = make_pipeline(
    Normalizer(norm="l1"),
    StackingEstimator(estimator=LassoLarsCV(normalize=False)),
    RidgeCV()
)

exported_pipeline = make_pipeline(
    MinMaxScaler(),
    SelectPercentile(score_func=f_regression, percentile=95),
    StackingEstimator(estimator=RandomForestRegressor(bootstrap=False, max_features=0.35000000000000003, min_samples_leaf=2, min_samples_split=12, n_estimators=100)),
    KNeighborsRegressor(n_neighbors=85, p=1, weights="distance")
)
# Both pipelines raise:
# ValueError: Your model needs to output single-dimensional numpyarrays, not arrays of (5000, 1) dimensions

Bug fix:

Updated: 2019/1/25

https://github.com/keras-team/keras/issues/10123

After merging the idea from that thread into my own code (the key is to wrap model.predict so that its (n, 1) output is flattened to the 1-D array LIME expects), the result is:

import lime
import lime.lime_tabular
import numpy as np
import pandas as pd

# the instance to explain
qc = x_T.values[1]

# wrap model.predict so that its (n, 1) output is flattened to shape (n,),
# which is what LIME's regression mode expects
def predict(data):
    preds = model.predict(data)
    return preds.reshape(preds.shape[0])

# features with at most 10 distinct values are treated as categorical
categorical_features = np.argwhere(
    np.array([len(set(x_T.values[:, x])) for x in range(x_T.shape[1])]) <= 10
).flatten()
explainer = lime.lime_tabular.LimeTabularExplainer(
    x_T.values,
    feature_names=x_T.columns.values,
    categorical_features=categorical_features,
    verbose=False,
    mode="regression",
    discretize_continuous=False,
)

# generate an explanation using the flattening wrapper
exp = explainer.explain_instance(qc, predict, num_features=100)

%matplotlib inline
fig = exp.as_pyplot_figure()

Replacing the original LIME explanation and plotting code with the version above completely resolved the errors raised by the RidgeCV() and KNeighborsRegressor() pipelines, while the previously working models remained stable. I then ran the same code in a loop over 70 models with no errors, so I consider it reasonably robust.
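
The loop itself was simple. A minimal sketch, assuming models holds the 70 fitted pipelines (the name and the collection are illustrative):

# run the same explanation against every fitted pipeline; any pipeline whose
# predict() output shape still breaks LIME would raise here
for idx, m in enumerate(models):
    exp = explainer.explain_instance(qc, lambda d: m.predict(d).reshape(-1), num_features=100)
    print("model", idx, "ok; top feature:", exp.as_list()[0])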

2. LIME: explaining LSTM models

Reference: marcotcr/lime
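
I have not reproduced the full LSTM workflow here, but lime ships a RecurrentTabularExplainer for recurrent models. A minimal sketch, assuming a Keras LSTM lstm_model and training data X_3d shaped (n_samples, n_timesteps, n_features) with matching feature_names (these names are assumptions, not from my original code):

import lime.lime_tabular

# the recurrent explainer accepts 3-D training data directly
explainer = lime.lime_tabular.RecurrentTabularExplainer(
    X_3d,
    mode="regression",
    feature_names=feature_names,
    discretize_continuous=False,
)

# flatten the (n, 1) Keras output to 1-D, as in the tabular case above
exp = explainer.explain_instance(
    X_3d[0], lambda d: lstm_model.predict(d).reshape(-1), num_features=10
)
fig = exp.as_pyplot_figure()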


Shap

import shap
import numpy as np
import matplotlib.pyplot as plt

shap.initjs()

X = X_train
shap_values = shap.TreeExplainer(model_xgb).shap_values(X_train)

# global importance: mean absolute SHAP value per feature
global_shap_vals = np.abs(shap_values).mean(0)
inds = np.argsort(global_shap_vals)
y_pos = np.arange(X.shape[1])
for item in inds:
    print(str(X.columns[item]) + "---" + str(global_shap_vals[item]))
plt.barh(y_pos, global_shap_vals[inds], color="#1E88E5")
plt.yticks(y_pos, X.columns[inds])
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["top"].set_visible(False)
plt.xlabel("mean SHAP value magnitude (change in log odds)")
plt.gcf().set_size_inches(15, 6.5)
plt.show()
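
For comparison, shap's built-in summary plot produces essentially the same bar chart in one call:

# built-in global importance bar chart (mean |SHAP value| per feature)
shap.summary_plot(shap_values, X, plot_type="bar")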

Bug report:
