Introduction

This notebook is a walkthrough of the Kaggle machine learning competition "Home Credit Default Risk" using the actual competition data. I deliberately keep to a gentle introduction to the methods commonly used in machine learning competitions, in order to show the basics of how to get started in machine learning. Any comments or suggestions are much appreciated.

Objective

The objective of this competition is to use historical loan application data to predict whether a client will repay a loan on time or have difficulty repaying.

Classification: the label is a binary variable, 0 (will repay the loan on time) or 1 (will have difficulty repaying the loan).

Evaluation Metric

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

  • The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate
  • The Area Under the Curve (AUC) is exactly what its name says: the area under the ROC curve

When we measure a classifier by ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case here), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), so we use more advanced metrics such as ROC AUC or the F1 score to more accurately reflect the performance of a classifier. A model with a high ROC AUC will also tend to have high accuracy, but the ROC AUC is a better representation of model performance.
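
As a quick illustration, scikit-learn's roc_auc_score takes predicted probabilities rather than hard 0/1 labels. This is a minimal sketch with made-up labels and scores, not competition data:

from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9]

print(roc_auc_score(y_true, y_score))  # area under the ROC curve, between 0 and 1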

Importing Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Reading the Data

data = pd.read_csv('C:/Users/JJJ/Desktop/Jane/data.csv', encoding='gbk')
data.head()  # first 5 rows
data.shape   # number of rows and columns
data.info()

Data Cleaning

Missing Values

# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

missing_values = missing_values_table(data)  # missing-value summary
missing_values.head(20)

When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can handle missing values with no need for imputation. Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.

# Drop the missing values directly

# Simple imputation: fill every missing value with 0
data = data.fillna(0)

# Mean or median imputation
from sklearn.preprocessing import Imputer  # renamed to SimpleImputer in newer scikit-learn versions
# strategy can be 'mean', 'median' or 'most_frequent'
imputer = Imputer(strategy='mean')
imputer.fit_transform(data[['column_to_fill']])  # replace with the name of the column to fill

# Model-based handling
# LightGBM and XGBoost can both treat NaN as part of the data during training,
# so missing values do not need to be imputed for them.

Feature Processing

Data Types

data.dtypes.value_counts()  # number of columns of each dtype
data.select_dtypes('object').apply(pd.Series.nunique, axis=0)  # unique values per object column

data = data.drop(['id_name'], axis=1)  # drop the name/ID column

Datetime Features

  • Continuous: time differences
  • Discrete: year, month, day, hour, quarter, week number, number of days, day of week, weekday/weekend, morning/afternoon/evening

# Extract year, month and day from dates stored without separators (e.g. '20190315')
data['year_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[0:4], errors='coerce')
data['month_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[4:6], errors='coerce')
data['day_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[6:8], errors='coerce')

# Extract year, month and day from dates with separators (e.g. '2019/03/15')
data['year_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[0]
data['month_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[1]
data['day_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[2]

# Method 2: parse the column as a datetime
data['year_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.year
# or: pd.to_datetime(data['latest_query_time']).apply(lambda x: x.year)

data['month_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.month
data['day_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.day

data['year_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.year
data['month_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.month
data['day_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.day

# Flag whether each date falls on a weekend
day_of_week = pd.to_datetime(data['first_transaction_time']).dt.weekday + 1  # Monday=1 ... Sunday=7
data['first_transaction_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['latest_query_time']).dt.weekday + 1
data['latest_query_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['loans_latest_time']).dt.weekday + 1
data['loans_latest_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

# Compute the time difference in days between two datetime columns
data['days_between_query_and_loans'] = (
    pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])).dt.days
# or: (...).apply(lambda x: x.days) on the timedelta

# Drop the original date columns
data = data.drop(['first_transaction_time', 'latest_query_time', 'loans_latest_time'], axis=1)

Categorical (Text) Features

Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables. A machine learning model unfortunately cannot deal with categorical variables (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. There are several main ways to carry out this process:

  • Label encoding
  • One-hot encoding
  • Mean (target) encoding
  • Cluster encoding

Label encoding: assign each unique category in a categorical variable with an integer. (No new columns are created)

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each of the categories is random and does not reflect any inherent aspect of the category. In the example above, programmer receives a 4 and data scientist a 1, but if we did the same process again, the labels could be reversed or completely different. The actual assignment of the integers is arbitrary. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

For any categorical variable (dtype == object) with 2 unique categories, we will use label encoding. For label encoding, we use the Scikit-Learn LabelEncoder.

from sklearn.preprocessing import LabelEncoder  # preprocessing for categorical variables

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in data:
    if data[col].dtype == object:
        # If 2 or fewer unique categories
        if len(list(data[col].unique())) <= 2:
            # Train
            le.fit(data[col])
            # Transform
            data[col] = le.transform(data[col])

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)

One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

For categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values to categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).

For any categorical variable with more than 2 unique categories, we will use one-hot encoding. For one-hot encoding, we use the pandas get_dummies(df) function.

data = pd.get_dummies(data)  # the new dummy columns have dtype uint8

The number of features has grown significantly due to one-hot encoding. At some point we probably will want to try dimensionality reduction (removing features that are not relevant) to reduce the size of the datasets.

Numeric Features

  • Discretization (binning): split a continuous variable into segments and treat each segment as its own feature, so that each range can be handled separately
  • Transformation: e.g. taking the log
  • Min-max scaling
  • Standardization
  • Normalization towards a Gaussian distribution (a sketch of several of these follows below)
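
A minimal sketch of several of these transformations, assuming a hypothetical numeric column named 'income' (not a column from this dataset); np and pd are the aliases imported at the top:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Discretization: cut the continuous column into 5 equal-width bins
data['income_bin'] = pd.cut(data['income'], bins=5, labels=False)

# Log transform; log1p handles zero values safely
data['income_log'] = np.log1p(data['income'])

# Min-max scaling to [0, 1]
data[['income_minmax']] = MinMaxScaler().fit_transform(data[['income']])

# Standardization to zero mean and unit variance
data[['income_std']] = StandardScaler().fit_transform(data[['income']])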

Feature Construction

Simple Transformations of Numeric Features

  • Linear combinations
  • Polynomial features
  • Ratio features (a sketch of the last two follows below)
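
For example, scikit-learn's PolynomialFeatures can generate squares and pairwise products of existing numeric columns, and a ratio feature is simply one column divided by another. The column names 'feat_a' and 'feat_b' below are hypothetical placeholders:

from sklearn.preprocessing import PolynomialFeatures

# Degree-2 polynomial features: feat_a, feat_b, feat_a^2, feat_a*feat_b, feat_b^2
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['feat_a', 'feat_b']])

# A ratio feature
data['a_to_b_ratio'] = data['feat_a'] / data['feat_b']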

Combinations of Categorical and Numeric Features

Creating New Features with Genetic Programming

Creating New Features with Decision Trees

Domain Knowledge Features

Domain Knowledge Features: we can make a couple features that attempt to capture what we think may be important for telling whether a client will default on a loan.
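
As an illustration only (the column names below are hypothetical and should be replaced with real ones), domain features in a credit setting are often the ratios a loan officer would look at:

# Hypothetical domain-knowledge ratios
data['credit_to_income'] = data['credit_amount'] / data['annual_income']
data['payment_rate'] = data['annuity'] / data['credit_amount']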

Outliers

data['days_between_query_and_loans'].describe()

data['days_anomalies'] = data['days_between_query_and_loans'] == anomalous_value  # replace anomalous_value with the value identified above
print('There are %d anomalies in the data out of %d entries'
      % (data['days_anomalies'].sum(), len(data)))

# 3-sigma rule

# Box-plot (IQR) analysis

# Distance-based methods

# Clustering-based methods
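
As an illustration of the first two rules listed above, here is a minimal sketch applied to the engineered 'days_between_query_and_loans' column; the thresholds (3 standard deviations, 1.5 × IQR) are the conventional defaults, not values tuned for this data:

col = data['days_between_query_and_loans']

# 3-sigma rule: flag values more than 3 standard deviations from the mean
mean, std = col.mean(), col.std()
sigma_outliers = (col - mean).abs() > 3 * std

# Box-plot (IQR) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

print(sigma_outliers.sum(), iqr_outliers.sum())  # number of rows flagged by each rule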

Handling the anomalies depends on the exact situation, with no set rules. One of the safest approaches is just to set the anomalies to a missing value and then have them filled in (using Imputation) before machine learning. As a solution, we will fill in the anomalous values with not a number (np.nan) and then create a new boolean column indicating whether or not the value was anomalous.

data['days_between_query_and_loans'].replace({anomalous_value: np.nan}, inplace=True)  # replace anomalous_value with the value identified above

Duplicates
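
No code was given for this step; a one-line pandas sketch removes fully duplicated rows:

data = data.drop_duplicates()  # keep the first occurrence of each duplicated row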

Feature Engineering

For structured data, the winning models tend to be variants of gradient boosting. Feature engineering has a greater return on investment than model building and hyperparameter tuning; as Andrew Ng is fond of saying, "applied machine learning is basically feature engineering." As the common saying goes, data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit. Redundancy: some features are so highly correlated with each other that they only waste computation. Noise: some features actively hurt the predictions. Garbage in, garbage out.

Feature Selection

Feature selection: choosing only the most important features or other methods of dimensionality reduction.

Filter

Filter methods evaluate how strongly each individual feature is related to the target, rank the features, and keep the top-ranked ones. The strength of the relationship can be measured with the Pearson correlation coefficient, mutual information, or distance correlation. The drawback of this approach is that interactions between features are ignored, so useful correlated features may be discarded by mistake.

  • Correlation coefficient
  • Variance threshold
  • Chi-square test
  • Information gain / mutual information: keep the features with the highest information gain and remove features that are linearly related to others

# Constant (zero-variance) columns: their correlation with the target is undefined (NaN)
sourceid                              NaN
bank_card_no                          NaN
first_transaction_time_is_weekend     NaN
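
A minimal sketch of two filter approaches, assuming X and y are the feature matrix and target defined in the modeling section below:

from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Drop zero-variance (constant) columns such as the ones listed above
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Keep the 20 features with the highest mutual information with the target
kbest = SelectKBest(score_func=mutual_info_classif, k=20)
X_top = kbest.fit_transform(X_reduced, y)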

Wrapper

Wrapper methods treat feature selection as a search over subsets of features: candidate subsets are generated and each is evaluated by training a model on it.

  • Recursive feature elimination (RFE), sketched below
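
A sketch of recursive feature elimination; the base estimator and the number of features to keep are arbitrary choices here:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Repeatedly fit the model and drop the weakest features until 20 remain
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=20)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)  # boolean mask of the selected features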

Embedded

Embedded methods derive feature importance from the model itself (unlike the methods above, the importance comes from the weights and similar quantities of the fitted model).

  • Penalty-based selection (e.g. L1 regularization)
  • Tree-model-based selection (a sketch of both approaches follows below)
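
A sketch of both embedded approaches with scikit-learn's SelectFromModel; the models and thresholds are illustrative choices:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# L1 penalty drives the weights of unhelpful features to exactly zero
l1_selector = SelectFromModel(LogisticRegression(C=0.1, penalty='l1', solver='liblinear'))
X_l1 = l1_selector.fit_transform(X, y)

# Tree-based importances: keep the features above the median importance
tree_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                                threshold='median')
X_tree = tree_selector.fit_transform(X, y)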

Feature Extraction

Feature extraction: constructing new features from the existing data.

  • PCA
  • LDA
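
A sketch of PCA on the one-hot-encoded feature matrix; the number of components (50) is an arbitrary choice, and the features should normally be scaled first:

from sklearn.decomposition import PCA

# Project the features onto the 50 directions of largest variance
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())  # share of the variance retained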

Modeling

Handling Class Imbalance

X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
y.value_counts()  # distribution of the target
y.astype(int).plot.hist()

From this information, we see this is an imbalanced class problem. There are far more loans that were repaid on time than loans that were not repaid. Once we get into more sophisticated machine learning models, we can weight the classes by their representation in the data to reflect this imbalance.

  • Resampling (over- or under-sampling; see the sketch below)
  • Generative adversarial networks
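
Two simple options, sketched below: re-weight the classes inside the model, or naively over-sample the minority class with pandas (libraries such as imbalanced-learn offer more principled resamplers like SMOTE):

from sklearn.linear_model import LogisticRegression

# Option 1: weight each class inversely to its frequency
clf = LogisticRegression(class_weight='balanced')

# Option 2: naive random over-sampling of the minority class (label 1)
minority = data[data.iloc[:, -1] == 1]
majority = data[data.iloc[:, -1] == 0]
oversampled = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])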

Aligning the Training and Test Sets

Data leakage: mistakenly using the test data during training-time feature engineering is known as data leakage.

Datetime features (mismatched numbers of features)

Categorical features (mismatched numbers of features)

There need to be the same features (columns) in both the training and testing data. One-hot encoding has created more columns in the training data because there were some categorical variables with categories not represented in the testing data. To remove the columns in the training data that are not in the testing data, we need to align the dataframes. The test set can also contain categories that never appear in the training set; one-hot encoding would then break unless an extra category such as "unknown" is introduced.

train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join='inner', axis=1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

Numeric features (mismatched preprocessing)

After the training-set features have been normalized, what should be done with the test-set features? This is a key question: if the test-set feature ranges no longer match those used for training, the model breaks down. There are several possible approaches.

Method 1: combine the training and test sets and normalize them together so that the feature ranges are unified, then train on the training set and evaluate on the test set. Clearly, though, training should not use any information from the test set; this introduces bias into the model and should not be used.

Method 2: normalize the training set and the test set separately. This looks plausible; the key issues are the amount of data and how it is ordered. If the data are plentiful and well shuffled, this works, but under that same assumption the conclusion of method 1 would also hold.

Method 3: normalize the training set first and keep the normalization parameters (such as the minimum and maximum), then apply those training-set parameters to the test set. This also looks reasonable, but with limited data the training-set parameters can produce abnormal results on the test set, such as extremely large or small values.

In a sense the three methods are equivalent: if the data are plentiful and well shuffled, the training and validation sets share the same distribution, so any of the three gives much the same result. However, that assumption is too optimistic, and in reality we often only have the training set plus a single test example, so method 2 is clearly not viable.

Methods 1 and 3 both seem possible, but method 1 is usually considered incorrect: it introduces test (i.e. unseen) data into training, and even a single extreme test value can noticeably shift the output range; moreover, re-normalizing over the whole training set for every test example is computationally expensive.

That leaves method 3: keep the normalization parameters fitted on the training set and apply them to the test set. This works, but it carries risks: not every feature-engineering step can save its parameters (many are quite involved), and if the test data differ greatly from the training data, applying the training-set parameters can produce abnormal values.
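
Method 3 in code: a minimal sketch that fits the scaler on the training features only and reuses the fitted parameters on the test features (using the X_train / X_test split produced in the next section):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)                       # learn min/max on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training-set parameters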

Splitting the Data for Modeling

from sklearn.model_selection import train_test_split

# To guard against overfitting, hold out a validation set (30% of the data here) with a fixed random seed (random_state) so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train
X_test
y_train
y_test

Model Training, Prediction and Evaluation

K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

# Search for the best number of neighbors
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    training_accuracy.append(clf.score(X_train, y_train))
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

# The best number of neighbors found above is 6
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Logistic Regression and Softmax Regression

# Linear regression

# Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=100, penalty="l1", solver="liblinear")  # liblinear supports the L1 penalty
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Softmax regression

Decision Trees and Random Forests

# Decision tree
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Random forest
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Boosting

Prefer models that demand relatively little feature engineering, such as XGBoost.

#Adaptive Boosting(AdaBoost)
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,algorithm="SAMME.R", learning_rate=0.5)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#Gradient Boosting Decision Tree(GBDT)
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0, max_depth=1)
#clf = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#XGBoost
import xgboost as xgb

#LightGBM
import lightgbm as lgb
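
The two imports above are left as stubs in the original; a minimal sketch of training a LightGBM classifier through its scikit-learn interface and scoring it with the competition metric (the hyperparameters are arbitrary choices, not tuned values):

from sklearn.metrics import roc_auc_score

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=0)
clf.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard labels
y_proba = clf.predict_proba(X_test)[:, 1]
print("Test ROC AUC: {:.3f}".format(roc_auc_score(y_test, y_proba)))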

Support Vector Machines

# Linear SVM
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Kernel (non-linear) SVM
from sklearn.svm import SVC

clf = SVC(kernel='rbf', C=10, gamma=0.1)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Naive Bayes

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))


Reflections:

In the early stages of learning I always assumed that machine-learning theory and model tuning were the heart of the subject. After walking through a full pipeline, I found that all kinds of problems can appear before modeling even begins, so most of the effort goes into data cleaning and feature engineering. I finally understand why people keep repeating that "data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit" and "garbage in, garbage out", and why experienced practitioners always stress deep "domain knowledge", that is, understanding the business.

"Learning by doing" is an excellent way to study. In practice you run into many problems you never anticipated, and how you solve them is the real test of your ability to learn. Whenever learning feels painful, your knowledge and understanding are about to take another leap, and when you look back you will be surprised how far you have already come.

"Many make a good start, but few see it through to the end." "By practicing, you win at the starting line; by persisting, you win at the finish line."

Let us encourage one another.


References:

Start Here: A Gentle Introduction

【干货合集】一文读懂特征工程

人工智慧之路(特征工程篇)

【持续更新】机器学习特征工程实用技巧大全

警惕「特征工程」中的陷阱

apachecn/feature-engineering-for-ml-zh
