Introduction

This notebook is a simulation of the Kaggle machine learning competition "Home Credit Default Risk" with actual data. I purposely keep to a gentle introduction to the methods commonly used in machine learning competitions, in order to show the basics of how to get started in machine learning. Any comments or suggestions are much appreciated.

Research Objective

The objective of this competition is to use historical loan application data to predict whether or not a client will repay a loan or have difficulty repaying it.

Classification: The label is a binary variable, 0 (will repay loan on time), 1 (will have difficulty repaying loan)

Evaluation Metric

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

  • The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate
  • The Area Under the Curve (AUC) explains itself by its name! It is simply the area under the ROC curve

When we measure a classifier according to the ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), and we use more advanced metrics such as ROC AUC or the F1 score to more accurately reflect the performance of a classifier. A model with a high ROC AUC will also have a high accuracy, but the ROC AUC is a better representation of model performance.
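
As a minimal illustration (a toy example using scikit-learn's roc_auc_score; the labels and probabilities below are made up), the metric compares predicted probabilities against the observed 0/1 target:

from sklearn.metrics import roc_auc_score

# Toy example: two negatives and two positives with made-up predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_prob))  # 0.75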

Importing Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Reading the Data

data = pd.read_csv('C:/Users/JJJ/Desktop/Jane/data.csv', encoding='gbk')
data.head() # first 5 rows
data.shape # number of rows and columns
data.info()

Data Cleaning

Missing Values

# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

missing_values = missing_values_table(data) # missing-value summary
missing_values.head(20)

When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can handle missing values with no need for imputation. Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.

# Drop rows/columns with missing values directly

# Simple imputation: fill missing values with a constant (here 0)
data = data.fillna(0)

# Mean or median imputation
from sklearn.preprocessing import Imputer  # in newer scikit-learn: from sklearn.impute import SimpleImputer
# strategy can be 'mean', 'median', or 'most_frequent'
imputer = Imputer(strategy='mean')
imputer.fit_transform(data[['column_to_fill']])  # replace 'column_to_fill' with the column(s) to impute

# Model-based handling
# LightGBM and XGBoost can treat NaN as part of the data while learning, so no imputation is needed for them

Feature Processing

Data Types

data.dtypes.value_counts() # counts of each column dtype
data.select_dtypes('object').apply(pd.Series.nunique, axis=0) # number of unique values per object column

data = data.drop(['id_name'], axis=1) # drop the customer-name column

Datetime Data

  • Continuous: time differences
  • Discrete: year/month/day/hour, quarter, week number, day of year, day of week, weekday vs. weekend, morning/afternoon/evening

# Extract year/month/day from a date without separators (e.g. 20181024)
data['year_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[0:4], errors='coerce')
data['month_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[4:6], errors='coerce')
data['day_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[6:8], errors='coerce')

# Extract year/month/day from a date with separators (e.g. 2018/10/24)
data['year_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[0]
data['month_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[1]
data['day_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[2]

# Method 2
from datetime import datetime
data['year_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.year
# or: pd.to_datetime(data['latest_query_time']).map(lambda x: x.year)
data['month_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.month
data['day_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.day

data['year_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.year
data['month_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.month
data['day_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.day

# Flag whether each date falls on a weekend
day_of_week = pd.to_datetime(data['first_transaction_time'].astype(str), format='%Y%m%d', errors='coerce').dt.weekday + 1
data['first_transaction_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['latest_query_time']).dt.weekday + 1
data['latest_query_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['loans_latest_time']).dt.weekday + 1
data['loans_latest_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

# Compute the time difference (in days) between two datetime columns
data['days_between_query_and_loans'] = (
    pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])).dt.days
# or: (pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])).map(lambda x: x.days)

# Drop the original date columns
data = data.drop(['first_transaction_time', 'latest_query_time', 'loans_latest_time'], axis=1)

Categorical (Text) Data

Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables. A machine learning model unfortunately cannot deal with categorical variables directly (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. The main ways to carry out this process are:

  • Label encoding
  • One-hot encoding
  • Mean encoding
  • Cluster encoding

Label encoding: assign each unique category in a categorical variable with an integer. (No new columns are created)

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each category is arbitrary and does not reflect any inherent aspect of the category. For example, if "programmer" receives a 4 and "data scientist" a 1, repeating the process could yield reversed or completely different labels. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

For any categorical variable (dtype == object) with 2 unique categories, we will use label encoding. For label encoding, we use the Scikit-Learn LabelEncoder.

from sklearn.preprocessing import LabelEncoder # preprocessing for categorical variables

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in data:
    if data[col].dtype == object:
        # If 2 or fewer unique categories
        if len(list(data[col].unique())) <= 2:
            # Train
            le.fit(data[col])
            # Transform
            data[col] = le.transform(data[col])

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)

One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

For categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values to categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).

For any categorical variable with more than 2 unique categories, we will use one-hot encoding. For one-hot encoding, we use the pandas get_dummies(df) function.

data = pd.get_dummies(data) # the dummy columns are created as uint8

The number of features has grown significantly due to one-hot encoding. At some point we probably will want to try dimensionality reduction (removing features that are not relevant) to reduce the size of the datasets.

Numerical Data

  • Discretization (binning): split a continuous value into segments and treat each segment as its own feature (see the sketch after this list)
  • Transformation: e.g. a log transform
  • Min-max scaling (normalization)
  • Standardization
  • Mapping to a more normal (Gaussian-like) distribution
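
A minimal sketch of these transformations, assuming a numeric column in `data` named 'x' (the column name is only for illustration and should already be imputed):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Discretization: cut the continuous column 'x' into 5 equal-width bins
data['x_binned'] = pd.cut(data['x'], bins=5, labels=False)

# Log transform (log1p handles zeros; assumes non-negative values)
data['x_log'] = np.log1p(data['x'])

# Min-max scaling to [0, 1]
data['x_minmax'] = MinMaxScaler().fit_transform(data[['x']]).ravel()

# Standardization to zero mean and unit variance
data['x_std'] = StandardScaler().fit_transform(data[['x']]).ravel()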

Feature Construction

Simple Transformations of Numerical Features

  • Linear combinations (see the sketch after this list)
  • Polynomial features
  • Ratio features
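
A minimal sketch of these transformations, using two hypothetical numeric columns 'a' and 'b' (the column names are placeholders; the columns are assumed to be numeric with no missing values):

from sklearn.preprocessing import PolynomialFeatures

# Linear combination and ratio of the two columns
data['a_plus_b'] = data['a'] + data['b']
data['a_over_b'] = data['a'] / (data['b'] + 1e-6)  # small constant avoids division by zero

# Degree-2 polynomial features of the two columns (a, b, a^2, a*b, b^2)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['a', 'b']])
print(poly_features.shape)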

Combining Categorical and Numerical Features

Creating New Features with Genetic Programming

Creating New Features with Decision Trees

Domain Knowledge Features

Domain Knowledge Features: we can make a couple features that attempt to capture what we think may be important for telling whether a client will default on a loan.
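
As an illustration in the spirit of the original Home Credit kernel (the column names AMT_CREDIT, AMT_INCOME_TOTAL, and AMT_ANNUITY come from the Home Credit application table and stand in for whatever amount columns exist in your own data):

# Illustrative domain-knowledge ratios on the Home Credit application table (app_train)
app_train['CREDIT_INCOME_PERCENT'] = app_train['AMT_CREDIT'] / app_train['AMT_INCOME_TOTAL']   # loan size relative to income
app_train['ANNUITY_INCOME_PERCENT'] = app_train['AMT_ANNUITY'] / app_train['AMT_INCOME_TOTAL'] # payment burden relative to income
app_train['CREDIT_TERM'] = app_train['AMT_ANNUITY'] / app_train['AMT_CREDIT']                  # payment as a fraction of the loan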

Outliers

data['days_between_query_and_loans'].describe()

data['days_anomalies'] = data['days_between_query_and_loans'] == anomalous_value  # anomalous_value: the outlier value identified above
print('There are %d anomalies in the test data out of %d entries'
      % (data['days_anomalies'].sum(), len(data)))

# 3-sigma rule (see the sketch below)

# Box-plot (IQR) analysis (see the sketch below)

# Distance-based methods

# Clustering-based methods
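
A minimal sketch of the first two approaches, applied to the days_between_query_and_loans column constructed earlier:

col = data['days_between_query_and_loans']

# 3-sigma rule: flag values more than 3 standard deviations from the mean
three_sigma_outliers = (col - col.mean()).abs() > 3 * col.std()

# Box-plot (IQR) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

print(three_sigma_outliers.sum(), iqr_outliers.sum())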

Handling the anomalies depends on the exact situation, with no set rules. One of the safest approaches is just to set the anomalies to a missing value and then have them filled in (using Imputation) before machine learning. As a solution, we will fill in the anomalous values with not a number (np.nan) and then create a new boolean column indicating whether or not the value was anomalous.

data['days_between_query_and_loans'].replace({anomalous_value: np.nan}, inplace=True)  # anomalous_value: the outlier value identified above

Duplicate Values
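
A minimal sketch with pandas, checking for and dropping exact duplicate rows:

# Count exact duplicate rows, then drop them
print('Duplicate rows:', data.duplicated().sum())
data = data.drop_duplicates()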

Feature Engineering

For structured data, the winning models tend to be variants of gradient boosting, yet feature engineering has a greater return on investment than model building and hyperparameter tuning. As Andrew Ng is fond of saying: "applied machine learning is basically feature engineering." As the common saying goes, data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit. Two problems to watch for are redundancy (some features are so highly correlated with each other that they only waste computation) and noise (some features actively hurt the predictions). Garbage in, garbage out.

Feature Selection

Feature selection: choosing only the most important features or other methods of dimensionality reduction.

Filter

Filter methods evaluate the degree of correlation between each individual feature and the target, rank the features, and keep the most correlated ones. The correlation can be measured with the Pearson correlation coefficient, mutual information, or distance correlation. The drawback of this approach is that it ignores interactions between features, so useful but interacting features may be mistakenly discarded.

  • Correlation coefficient
  • Variance threshold
  • Chi-squared test
  • Information gain / mutual information: keep the features with the highest information gain and remove features that are linearly related to others

# Columns with no variation: their correlation is undefined (NaN)
sourceid NaN
bank_card_no NaN
first_transaction_time_is_weekend NaN
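
A minimal sketch of filter-based selection, assuming the numeric feature matrix X and target y that are constructed later in this notebook:

from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Drop constant (zero-variance) columns such as those listed above
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Keep the 20 features with the highest mutual information with the target
X_topk = SelectKBest(mutual_info_classif, k=20).fit_transform(X_var, y)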

Wrapper

Wrapper methods treat feature selection as a search over feature subsets: candidate subsets are generated and each is evaluated with a model.

  • Recursive feature elimination (RFE)
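
A minimal RFE sketch with a logistic-regression base estimator (X and y as defined later in this notebook; the number of features to keep is illustrative):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest features until 20 remain
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)  # boolean mask of the selected features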

Embedded

Embedded methods derive feature importance from the model itself (unlike the approaches above, the importance comes from the weights or structure of the trained model).

  • Selection based on regularization penalties (e.g. L1)
  • Selection based on tree models (feature importances)
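
A minimal sketch of both embedded approaches using scikit-learn's SelectFromModel (X and y as above; the thresholds and hyperparameters are illustrative):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Penalty-based: keep features with non-zero L1 coefficients
l1_model = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

# Tree-based: keep features whose importance exceeds the median importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
X_tree = SelectFromModel(rf, threshold='median').fit_transform(X, y)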

Feature Extraction

Feature extraction: constructing new features from the existing data.

  • PCA
  • LDA
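
A minimal PCA sketch: standardize first, then keep enough components to explain 95% of the variance (X as above; the 95% threshold is illustrative):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then keep enough components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape, pca.explained_variance_ratio_.sum())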

Modeling

Handling Class Imbalance

X = data.iloc[:,0:-1]
y = data.iloc[:,-1]
y.value_counts() # distribution of the target column
y.astype(int).plot.hist()

From this information, we see this is an imbalanced class problem. There are far more loans that were repaid on time than loans that were not repaid. Once we get into more sophisticated machine learning models, we can weight the classes by their representation in the data to reflect this imbalance.

  • Resampling (over-/under-sampling), sketched below
  • Generative adversarial networks (to synthesize minority-class samples)
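
A minimal sketch of the two simplest options, class weighting and random over-sampling (this assumes the X and y defined above, with class 1 as the minority):

from sklearn.utils import resample

# Option 1: let the model reweight the classes, e.g. LogisticRegression(class_weight='balanced')

# Option 2: randomly over-sample the minority class (1) to match the majority class (0)
train = pd.concat([X, y.rename('target')], axis=1)
minority = train[train['target'] == 1]
majority = train[train['target'] == 0]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_upsampled])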

Aligning the Training and Test Sets

Data leakage: mistakenly using the test data for feature engineering during training is known as data leakage.

Datetime features (mismatched feature counts)

Categorical features (mismatched feature counts)

There need to be the same features (columns) in both the training and testing data. One-hot encoding has created more columns in the training data because there were some categorical variables with categories not represented in the testing data. To remove the columns in the training data that are not in the testing data, we need to align the dataframes. Conversely, the test set may contain categories that never appeared in the training set; one-hot encoding will then break unless a separate "unknown" category is assigned.

train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join='inner', axis=1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

Numerical features (inconsistent preprocessing)

After normalizing the features of the training set, what should be done with the features of the test set? This is a critical question: once the training features have been normalized, the range of the test features may no longer match, and the model can break. There are a few common approaches:

Method 1: combine the training and test sets, normalize them together so the feature ranges are unified, then train on the training set and evaluate on the test set. Clearly, though, no information from the test set should be used while training the model; this approach introduces bias and should not be used.

Method 2: normalize the training set and the test set separately. This seems plausible, but it depends on the amount of data and how it is ordered. With a large, well-shuffled dataset it can work, but under that same assumption the conclusion of Method 1 would also hold.

Method 3: normalize the training set first, keep the normalization parameters (e.g. the minimum and maximum), and then apply those parameters to the test set. This also seems reasonable, but with limited data the training-set parameters may produce abnormal results on the test set, such as extremely large or small values.

In some sense the three approaches are equivalent: with a large, well-shuffled dataset, the training and validation sets can be assumed to share the same distribution, so any of them would make little difference. That assumption is overly optimistic, however, and in reality we often only have {the training set + one test example at a time}, so Method 2 is clearly not viable.

That seems to leave Methods 1 and 3. Method 1 is usually considered incorrect, because it introduces test data, which is unknown data, into training; even a single extreme test value can cause large swings in the output range, and if every test example required re-normalizing over the whole training set, the computational cost would be very high.

So the remaining option is Method 3: keep the normalization parameters fitted on the training set and apply them to the test set. This works, but still carries risks: not every feature-engineering step can have its parameters saved (many are quite elaborate), and if the test data differs greatly from the training data, applying the training-set parameters can produce abnormal values.
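
A minimal sketch of Method 3, fitting the scaler on the training features only and reusing its parameters on the test features (X_train and X_test as produced by the split in the next section):

from sklearn.preprocessing import MinMaxScaler

# Fit on the training set only, then apply the same min/max to the test set
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)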

Train/Test Split

from sklearn.model_selection import train_test_split

# To guard against overfitting, hold out part of the data as a test set (30% here) and fix the random seed (random_state) for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train
X_test
y_train
y_test

Model Fitting, Prediction, and Evaluation

K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

# Search for the best number of neighbors
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    training_accuracy.append(clf.score(X_train, y_train))
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

# The best number of neighbors is 6
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Logistic Regression and Softmax Regression

# Linear regression

# Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=100, penalty="l1", solver="liblinear")  # the L1 penalty requires the liblinear (or saga) solver
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Softmax regression (multinomial logistic regression)

Decision Trees and Random Forests

# Decision tree
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Random forest
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Boosting

Prefer models that demand little feature engineering, such as XGBoost.

#Adaptive Boosting(AdaBoost)
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,algorithm="SAMME.R", learning_rate=0.5)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#Gradient Boosting Decision Tree(GBDT)
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0, max_depth=1)
#clf = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#XGBoost
import xgboost as xgb

#LightGBM
import lightgbm as lgb
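
Only the imports are shown above; a minimal sketch of training both boosters through their scikit-learn interfaces and scoring with the competition metric (the hyperparameters below are illustrative, not tuned):

from sklearn.metrics import roc_auc_score

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print('XGBoost test AUC: {:.3f}'.format(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])))

clf = lgb.LGBMClassifier(n_estimators=200, num_leaves=31, learning_rate=0.1)
clf.fit(X_train, y_train)
print('LightGBM test AUC: {:.3f}'.format(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])))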

Support Vector Machines

# Linear SVM
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Nonlinear (kernel) SVM
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=10, gamma=0.1)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Naive Bayes

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))


Reflections:

At the beginning of my studies I assumed that machine learning theory and model tuning were the heart of the work. After walking through a full workflow, I found that in practice all kinds of problems appear before modeling even begins, so a great deal of effort has to go into data cleaning and feature engineering. I finally understand why people keep saying "data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit" and "garbage in, garbage out", and why experienced practitioners always stress going deep into "domain knowledge", that is, the business itself.

"Learning by doing" is a great way to learn. In practice you will run into many problems you never anticipated, and how you solve them is a real test of your ability to learn. The moments when learning feels most painful are exactly when your knowledge and understanding are about to make another leap, and when you look back you realize how far you have already come.

"Many begin well, but few see it through to the end." "If you practice, you win at the starting line; if you persist, you win at the finish line."

Let us encourage each other.


References:

Start Here: A Gentle Introduction

【乾貨合集】一文讀懂特徵工程

人工智慧之路(特徵工程篇)

【持續更新】機器學習特徵工程實用技巧大全

警惕「特徵工程」中的陷阱

apachecn/feature-engineering-for-ml-zh
