Risk-Control Modeling in Practice: Home Credit Default Risk
Introduction
This notebook simulates the Kaggle machine learning competition "Home Credit Default Risk" with real data. It is deliberately a gentle introduction to methods commonly used in machine learning competitions, in order to show the basics of how to get started. Any comments or suggestions are much appreciated.
Objective
The objective of this competition is to use historical loan application data to predict whether a client will repay a loan or have difficulty doing so.
Classification: the label is a binary variable, 0 (will repay the loan on time) or 1 (will have difficulty repaying the loan)
Evaluation Metric
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
- The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate
- The Area Under the Curve (AUC) explains itself by its name! It is simply the area under the ROC curve
When we measure a classifier according to the ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), so we use more advanced metrics such as ROC AUC or the F1 score to more accurately reflect the performance of a classifier. A model with a high ROC AUC will also have a high accuracy, but the ROC AUC is a better representation of model performance.
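To make the imbalance point concrete, here is a small sketch using scikit-learn on synthetic labels (not the competition data): a degenerate model that always predicts the majority class scores near-perfect accuracy but only a chance-level AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)

# Imbalanced labels: roughly 1% positives, mimicking the "terrorist detector" example
y_true = (rng.random(10_000) < 0.01).astype(int)

# Degenerate model: always predict the majority class (probability 0)
always_negative = np.zeros_like(y_true, dtype=float)

# Accuracy looks great, but the model never finds a single positive case
print(accuracy_score(y_true, always_negative.round()))  # roughly 0.99
print(roc_auc_score(y_true, always_negative))           # 0.5: no better than chance
```

Because every predicted score is identical, the ROC curve is the diagonal and the AUC collapses to 0.5, exposing the model as useless despite its accuracy.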
Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load the Data
data = pd.read_csv('C:/Users/JJJ/Desktop/Jane/data.csv', encoding='gbk')
data.head()   # first 5 rows
data.shape    # (rows, columns)
data.info()
Data Cleaning
Missing Values
# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Sort the table by percentage of missing, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
missing_values = missing_values_table(data)  # missing-value summary
missing_values.head(20)
When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can handle missing values with no need for imputation. Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.
# Option 1: drop rows/columns with missing values directly
# data = data.dropna()
# Option 2: constant imputation (simple fallback)
data = data.fillna(0)
# Option 3: mean or median imputation
# (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; use SimpleImputer)
from sklearn.impute import SimpleImputer
# strategy: 'mean', 'median', 'most_frequent'
imputer = SimpleImputer(strategy='mean')
data[cols_to_fill] = imputer.fit_transform(data[cols_to_fill])  # cols_to_fill: list of columns to impute
# Option 4: model-based. LightGBM and XGBoost can treat NaN as part of the data,
# so no imputation is needed for those models.
Feature Engineering
Data Types
data.dtypes.value_counts()  # number of columns of each dtype
data.select_dtypes('object').apply(pd.Series.nunique, axis=0)  # unique values per object column
data = data.drop(['id_name'], axis=1)  # drop the name column
Datetime Features
- Continuous: time differences
- Discrete: year/month/day/hour, quarter, week number, day of month, day of week, weekday/weekend, morning/afternoon/evening
# Extract year/month/day from dates without separators (yyyymmdd)
s = data['first_transaction_time'].astype(str)
data['year_of_first_transaction_time'] = pd.to_numeric(s.str[0:4], errors='coerce')
data['month_of_first_transaction_time'] = pd.to_numeric(s.str[4:6], errors='coerce')
data['day_of_first_transaction_time'] = pd.to_numeric(s.str[6:8], errors='coerce')
# Extract year/month/day from dates with separators
data['year_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[0]
data['month_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[1]
data['day_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[2]
# Method 2: parse to datetime and use the .dt accessor
# (equivalently: pd.to_datetime(...).map(lambda x: x.year), etc.)
data['year_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.year
data['month_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.month
data['day_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.day
data['year_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.year
data['month_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.month
data['day_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.day
# Flag weekends (weekday: Monday=0 ... Sunday=6; +1 makes Monday=1 ... Sunday=7)
day_of_week = pd.to_datetime(data['first_transaction_time'].astype(str), format='%Y%m%d').dt.weekday + 1
data['first_transaction_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1
day_of_week = pd.to_datetime(data['latest_query_time']).dt.weekday + 1
data['latest_query_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1
day_of_week = pd.to_datetime(data['loans_latest_time']).dt.weekday + 1
data['loans_latest_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1
# Time difference between two datetime columns, in days
# (equivalently: (...).map(lambda x: x.days))
data['days_between_query_and_loans'] = (
    pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])).dt.days
# Drop the original date columns
data = data.drop(['first_transaction_time', 'latest_query_time', 'loans_latest_time'], axis=1)
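To see these transformations end to end, here is a self-contained sketch on a toy frame that reuses the column names above (the values themselves are invented):

```python
import pandas as pd

# Toy frame with the two date formats handled above
df = pd.DataFrame({
    'first_transaction_time': [20170312, 20161225],     # no separator: yyyymmdd
    'latest_query_time': ['2017/05/06', '2017/01/01'],  # with separator
    'loans_latest_time': ['2017/04/28', '2016/12/30'],
})

# Year from the undelimited integer date, via string slicing
s = df['first_transaction_time'].astype(str)
df['year_of_first_transaction_time'] = pd.to_numeric(s.str[0:4], errors='coerce')

# Year via the datetime accessor
q = pd.to_datetime(df['latest_query_time'])
df['year_of_latest_query_time'] = q.dt.year

# Weekend flag (weekday: Monday=0 ... Sunday=6)
df['latest_query_time_is_weekend'] = (q.dt.weekday >= 5).astype(int)

# Day difference between the two delimited date columns
l = pd.to_datetime(df['loans_latest_time'])
df['days_between_query_and_loans'] = (q - l).dt.days

print(df[['year_of_first_transaction_time',
          'latest_query_time_is_weekend',
          'days_between_query_and_loans']])
```

Both sample query dates fall on a weekend (a Saturday and a Sunday), so the flag is 1 for both rows, and the timedelta subtraction yields whole-day gaps of 8 and 2 days.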
Categorical (Text) Features
Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables. A machine learning model unfortunately cannot deal with categorical variables directly (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. The main ways to carry out this process are:
- Label encoding
- One-hot encoding
- Mean encoding
- Cluster encoding
Label encoding: assign each unique category in a categorical variable an integer (no new columns are created)
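A short sketch contrasting label encoding with one-hot encoding, on an invented categorical column (the category values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'region': ['rural', 'urban', 'rural', 'suburban']})

# Label encoding: one integer per category, no new columns
# (LabelEncoder assigns integers in alphabetical order of the categories)
le = LabelEncoder()
df['region_label'] = le.fit_transform(df['region'])

# One-hot encoding: one new indicator column per category
onehot = pd.get_dummies(df['region'], prefix='region')

print(df['region_label'].tolist())  # [0, 2, 0, 1]
print(onehot.shape)                 # (4, 3): 3 categories -> 3 columns
```

Label encoding keeps the frame narrow but imposes an arbitrary ordering on the categories, while one-hot encoding avoids the ordering at the cost of one column per category, which matters when a variable has many unique values.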