Risk Control Modeling in Practice: Home Credit Default Risk
Introduction
This notebook is a simulation of the Kaggle machine learning competition "Home Credit Default Risk" using the actual data. It is a deliberately gentle introduction to methods commonly used in machine learning competitions, intended to show the basics of how to get started in machine learning. Any comments or suggestions are much appreciated.
Objective
The objective of this competition is to use historical loan application data to predict whether or not a client will repay a loan or have difficulty repaying.
Classification: The label is a binary variable, 0 (will repay loan on time), 1 (will have difficulty repaying loan)
Evaluation Metric
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
- The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate
- The Area Under the Curve (AUC) explains itself by its name! It is simply the area under the ROC curve
When we measure a classifier according to the ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), and we use more advanced metrics such as ROC AUC or the F1 score to more accurately reflect the performance of a classifier. A model with a high ROC AUC will also have a high accuracy, but the ROC AUC is a better representation of model performance.
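As a minimal sketch of how this metric is computed, here is a hypothetical example using scikit-learn's roc_auc_score on predicted probabilities (the labels and scores below are made up for illustration):
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]               # observed targets (hypothetical)
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probabilities, not hard 0/1 labels
print(roc_auc_score(y_true, y_prob))      # area under the ROC curve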
Import Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load the Data
data = pd.read_csv('C:/Users/JJJ/Desktop/Jane/data.csv', encoding='gbk')
data.head()   # first 5 rows
data.shape    # (number of rows, number of columns)
data.info()
Data Cleaning
Missing Values
# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Sort the table by percentage of missing, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
missing_values = missing_values_table(data)  # missing-value summary
missing_values.head(20)
When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can handle missing values with no need for imputation. Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.
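If we did later decide to drop high-missing columns, a minimal sketch using the missing_values table above (with an arbitrary 90% threshold chosen purely for illustration) might look like:
# Hypothetical: drop columns where more than 90% of values are missing
cols_to_drop = missing_values[missing_values['% of Total Values'] > 90].index
data = data.drop(columns=cols_to_drop)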
# Option 1: direct deletion
# Option 2: simple imputation (here, filling with a constant)
data = data.fillna(0)
# Option 3: mean or median replacement
from sklearn.preprocessing import Imputer  # in scikit-learn >= 0.20, use sklearn.impute.SimpleImputer instead
# strategy can be 'mean', 'median', or 'most_frequent'
imputer = Imputer(strategy='mean')
data[['column_to_fill']] = imputer.fit_transform(data[['column_to_fill']])  # 'column_to_fill' is a placeholder
# Option 4: model-based
LightGBM and XGBoost can both treat NaN as part of the data they learn from, so missing values need no special handling for them.
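As a minimal sketch of this model-based route (assuming lightgbm is installed, the categorical columns have already been numerically encoded, and a hypothetical binary target column named 'label'):
import lightgbm as lgb

# NaN values in the features are left as-is; LightGBM handles them natively during tree splitting
X = data.drop(['label'], axis=1)  # 'label' is a hypothetical target column
y = data['label']
clf = lgb.LGBMClassifier()
clf.fit(X, y)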
Feature Processing
Data Types
data.dtypes.value_counts()  # counts of each column dtype
data.select_dtypes('object').apply(pd.Series.nunique, axis=0)  # unique values per object column
data = data.drop(['id_name'], axis=1)  # drop the name variable
Temporal Data
- Continuous: time differences
- Discrete: year/month/day/hour, quarter, week of year, day count, day of week, weekday/weekend, morning/afternoon/evening
# Extract year/month/day from dates stored without separators (e.g., 20190321)
data['year_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[0:4], errors='coerce')
data['month_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[4:6], errors='coerce')
data['day_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[6:8], errors='coerce')
# Extract year/month/day from dates with separators (e.g., 2019/03/21)
data['year_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[0]
data['month_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[1]
data['day_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[2]
# Alternative: parse with pd.to_datetime and use the .dt accessor
data['year_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.year
# equivalently: pd.to_datetime(data['latest_query_time']).apply(lambda x: x.year)
data['month_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.month
data['day_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.day
data['year_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.year
data['month_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.month
data['day_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.day
# Flag weekends (dt.weekday is 0-based: Monday=0, so +1 maps Monday to 1 and Sunday to 7)
day_of_week = pd.to_datetime(data['first_transaction_time'].astype(str), format='%Y%m%d').dt.weekday + 1  # separator-free date, parsed explicitly
data['first_transaction_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1
day_of_week = pd.to_datetime(data['latest_query_time']).dt.weekday + 1
data['latest_query_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1
day_of_week = pd.to_datetime(data['loans_latest_time']).dt.weekday + 1
data['loans_latest_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1
# Compute the time difference between two datetime columns
data['days_between_query_and_loans'] = (
    pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])
).dt.days
# equivalently: (...).apply(lambda x: x.days)
# note: subtracting the .dt.day attributes would only compare days of the month, not elapsed days
# Drop the original date columns
data = data.drop(['first_transaction_time', 'latest_query_time', 'loans_latest_time'], axis=1)
Text (Categorical) Data
Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables. A machine learning model unfortunately cannot deal with categorical variables (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. There are several ways to carry out this process, including:
- Label encoding
- One-hot encoding
- Mean encoding
- Cluster encoding
Label encoding: assign each unique category in a categorical variable an integer. (No new columns are created.)
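As a minimal sketch of label encoding with scikit-learn (casting the object columns to strings first so NaNs do not break the encoder):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in data.select_dtypes('object'):
    # map each unique category in the column to an integer; no new columns are created
    data[col] = le.fit_transform(data[col].astype(str))
One-hot encoding, by contrast, creates a new 0/1 column for each category and can be done in one call with pd.get_dummies(data).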