Introduction

This notebook is a simulation of the Kaggle machine learning competition "Home Credit Default Risk" with actual data. I purposely keep to a gentle introduction to the methods commonly used in machine learning competitions, in order to show the basics of how to get started in machine learning. Any comments or suggestions are much appreciated.

Research Objective

The objective of this competition is to use historical loan application data to predict whether or not a client will repay a loan or have difficulty repaying it.

Classification: The label is a binary variable, 0 (will repay loan on time), 1 (will have difficulty repaying loan)

Evaluation Metric

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

  • The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate
  • The Area Under the Curve (AUC) explains itself by its name! It is simply the area under the ROC curve

When we measure a classifier according to the ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), and we use more advanced metrics such as ROC AUC or the F1 score to more accurately reflect the performance of a classifier. A model with a high ROC AUC will also have a high accuracy, but the ROC AUC is a better representation of model performance.
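
As a minimal illustration (a toy example using scikit-learn's roc_auc_score; the labels and probabilities below are made up), the metric compares predicted probabilities against the observed 0/1 target:

from sklearn.metrics import roc_auc_score

# Toy example: two negatives and two positives with made-up predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_prob))  # 0.75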

Importing Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Reading the Data

data = pd.read_csv('C:/Users/JJJ/Desktop/Jane/data.csv', encoding='gbk')
data.head() # first 5 rows
data.shape # number of rows and columns
data.info()

Data Cleaning

Missing Values

# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

missing_values = missing_values_table(data) # missing-value summary
missing_values.head(20)

When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can handle missing values with no need for imputation. Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.

# Drop rows/columns with missing values directly

# Simple imputation: fill missing values with a constant (here 0)
data = data.fillna(0)

# Mean or median imputation
from sklearn.preprocessing import Imputer  # in newer scikit-learn: from sklearn.impute import SimpleImputer
# strategy can be 'mean', 'median', or 'most_frequent'
imputer = Imputer(strategy='mean')
imputer.fit_transform(data[['column_to_fill']])  # replace 'column_to_fill' with the column(s) to impute

# Model-based handling
# LightGBM and XGBoost can treat NaN as part of the data while learning, so no imputation is needed for them

Feature Processing

Data Types

data.dtypes.value_counts() # counts of each column dtype
data.select_dtypes('object').apply(pd.Series.nunique, axis=0) # number of unique values per object column

data = data.drop(['id_name'], axis=1) # drop the customer-name column

Datetime Data

  • Continuous: time differences
  • Discrete: year/month/day/hour, quarter, week number, day of year, day of week, weekday vs. weekend, morning/afternoon/evening

# Extract year/month/day from a date without separators (e.g. 20181024)
data['year_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[0:4], errors='coerce')
data['month_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[4:6], errors='coerce')
data['day_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[6:8], errors='coerce')

# Extract year/month/day from a date with separators (e.g. 2018/10/24)
data['year_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[0]
data['month_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[1]
data['day_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[2]

# Method 2
from datetime import datetime
data['year_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.year
# or: pd.to_datetime(data['latest_query_time']).map(lambda x: x.year)
data['month_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.month
data['day_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.day

data['year_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.year
data['month_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.month
data['day_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.day

# Flag whether each date falls on a weekend
day_of_week = pd.to_datetime(data['first_transaction_time'].astype(str), format='%Y%m%d', errors='coerce').dt.weekday + 1
data['first_transaction_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['latest_query_time']).dt.weekday + 1
data['latest_query_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['loans_latest_time']).dt.weekday + 1
data['loans_latest_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

# Compute the time difference (in days) between two datetime columns
data['days_between_query_and_loans'] = (
    pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])).dt.days
# or: (pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])).map(lambda x: x.days)

# Drop the original date columns
data = data.drop(['first_transaction_time', 'latest_query_time', 'loans_latest_time'], axis=1)

Categorical (Text) Data

Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables. A machine learning model unfortunately cannot deal with categorical variables directly (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. The main ways to carry out this process are:

  • Label encoding
  • One-hot encoding
  • Mean encoding
  • Cluster encoding

Label encoding: assign each unique category in a categorical variable with an integer. (No new columns are created)

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each category is arbitrary and does not reflect any inherent aspect of the category. For example, if "programmer" receives a 4 and "data scientist" a 1, repeating the process could yield reversed or completely different labels. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

For any categorical variable (dtype == object) with 2 unique categories, we will use label encoding. For label encoding, we use the Scikit-Learn LabelEncoder.

from sklearn.preprocessing import LabelEncoder # preprocessing for categorical variables

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in data:
    if data[col].dtype == object:
        # If 2 or fewer unique categories
        if len(list(data[col].unique())) <= 2:
            # Train
            le.fit(data[col])
            # Transform
            data[col] = le.transform(data[col])

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)

One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

For categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values to categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).

For any categorical variable with more than 2 unique categories, we will use one-hot encoding. For one-hot encoding, we use the pandas get_dummies(df) function.

data = pd.get_dummies(data) # the dummy columns are created as uint8

The number of features has grown significantly due to one-hot encoding. At some point we probably will want to try dimensionality reduction (removing features that are not relevant) to reduce the size of the datasets.

Numerical Data

  • Discretization (binning): split a continuous value into segments and treat each segment as its own feature (see the sketch after this list)
  • Transformation: e.g. a log transform
  • Min-max scaling (normalization)
  • Standardization
  • Mapping to a more normal (Gaussian-like) distribution
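
A minimal sketch of these transformations, assuming a numeric column in `data` named 'x' (the column name is only for illustration and should already be imputed):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Discretization: cut the continuous column 'x' into 5 equal-width bins
data['x_binned'] = pd.cut(data['x'], bins=5, labels=False)

# Log transform (log1p handles zeros; assumes non-negative values)
data['x_log'] = np.log1p(data['x'])

# Min-max scaling to [0, 1]
data['x_minmax'] = MinMaxScaler().fit_transform(data[['x']]).ravel()

# Standardization to zero mean and unit variance
data['x_std'] = StandardScaler().fit_transform(data[['x']]).ravel()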

Feature Construction

Simple Transformations of Numerical Features

  • Linear combinations (see the sketch after this list)
  • Polynomial features
  • Ratio features
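
A minimal sketch of these transformations, using two hypothetical numeric columns 'a' and 'b' (the column names are placeholders; the columns are assumed to be numeric with no missing values):

from sklearn.preprocessing import PolynomialFeatures

# Linear combination and ratio of the two columns
data['a_plus_b'] = data['a'] + data['b']
data['a_over_b'] = data['a'] / (data['b'] + 1e-6)  # small constant avoids division by zero

# Degree-2 polynomial features of the two columns (a, b, a^2, a*b, b^2)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['a', 'b']])
print(poly_features.shape)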

Combining Categorical and Numerical Features

Creating New Features with Genetic Programming

Creating New Features with Decision Trees

Domain Knowledge Features

Domain Knowledge Features: we can make a couple features that attempt to capture what we think may be important for telling whether a client will default on a loan.
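
As an illustration in the spirit of the original Home Credit kernel (the column names AMT_CREDIT, AMT_INCOME_TOTAL, and AMT_ANNUITY come from the Home Credit application table and stand in for whatever amount columns exist in your own data):

# Illustrative domain-knowledge ratios on the Home Credit application table (app_train)
app_train['CREDIT_INCOME_PERCENT'] = app_train['AMT_CREDIT'] / app_train['AMT_INCOME_TOTAL']   # loan size relative to income
app_train['ANNUITY_INCOME_PERCENT'] = app_train['AMT_ANNUITY'] / app_train['AMT_INCOME_TOTAL'] # payment burden relative to income
app_train['CREDIT_TERM'] = app_train['AMT_ANNUITY'] / app_train['AMT_CREDIT']                  # payment as a fraction of the loan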

Outliers

data['days_between_query_and_loans'].describe()

data['days_anomalies'] = data['days_between_query_and_loans'] == anomalous_value  # anomalous_value: the outlier value identified above
print('There are %d anomalies in the test data out of %d entries'
      % (data['days_anomalies'].sum(), len(data)))

# 3-sigma rule (see the sketch below)

# Box-plot (IQR) analysis (see the sketch below)

# Distance-based methods

# Clustering-based methods
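
A minimal sketch of the first two approaches, applied to the days_between_query_and_loans column constructed earlier:

col = data['days_between_query_and_loans']

# 3-sigma rule: flag values more than 3 standard deviations from the mean
three_sigma_outliers = (col - col.mean()).abs() > 3 * col.std()

# Box-plot (IQR) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

print(three_sigma_outliers.sum(), iqr_outliers.sum())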

Handling the anomalies depends on the exact situation, with no set rules. One of the safest approaches is just to set the anomalies to a missing value and then have them filled in (using Imputation) before machine learning. As a solution, we will fill in the anomalous values with not a number (np.nan) and then create a new boolean column indicating whether or not the value was anomalous.

data['days_between_query_and_loans'].replace({anomalous_value: np.nan}, inplace=True)  # anomalous_value: the outlier value identified above

Duplicate Values
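
A minimal sketch with pandas, checking for and dropping exact duplicate rows:

# Count exact duplicate rows, then drop them
print('Duplicate rows:', data.duplicated().sum())
data = data.drop_duplicates()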

Feature Engineering

For structured data, the winning models tend to be variants of gradient boosting, yet feature engineering has a greater return on investment than model building and hyperparameter tuning. As Andrew Ng is fond of saying: "applied machine learning is basically feature engineering." As the common saying goes, data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit. Two problems to watch for are redundancy (some features are so highly correlated with each other that they only waste computation) and noise (some features actively hurt the predictions). Garbage in, garbage out.

Feature Selection

Feature selection: choosing only the most important features or other methods of dimensionality reduction.

Filter

Filter methods evaluate the degree of correlation between each individual feature and the target, rank the features, and keep the most correlated ones. The correlation can be measured with the Pearson correlation coefficient, mutual information, or distance correlation. The drawback of this approach is that it ignores interactions between features, so useful but interacting features may be mistakenly discarded.

  • Correlation coefficient
  • Variance threshold
  • Chi-squared test
  • Information gain / mutual information: keep the features with the highest information gain and remove features that are linearly related to others

# Columns with no variation: their correlation is undefined (NaN)
sourceid NaN
bank_card_no NaN
first_transaction_time_is_weekend NaN
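
A minimal sketch of filter-based selection, assuming the numeric feature matrix X and target y that are constructed later in this notebook:

from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Drop constant (zero-variance) columns such as those listed above
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Keep the 20 features with the highest mutual information with the target
X_topk = SelectKBest(mutual_info_classif, k=20).fit_transform(X_var, y)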

Wrapper

Wrapper methods treat feature selection as a search over feature subsets: candidate subsets are generated and each is evaluated with a model.

  • Recursive feature elimination (RFE)
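
A minimal RFE sketch with a logistic-regression base estimator (X and y as defined later in this notebook; the number of features to keep is illustrative):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest features until 20 remain
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)  # boolean mask of the selected features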

Embedded

Embedded methods derive feature importance from the model itself (unlike the approaches above, the importance comes from the weights or structure of the trained model).

  • Selection based on regularization penalties (e.g. L1)
  • Selection based on tree models (feature importances)
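
A minimal sketch of both embedded approaches using scikit-learn's SelectFromModel (X and y as above; the thresholds and hyperparameters are illustrative):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Penalty-based: keep features with non-zero L1 coefficients
l1_model = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

# Tree-based: keep features whose importance exceeds the median importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
X_tree = SelectFromModel(rf, threshold='median').fit_transform(X, y)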

Feature Extraction

Feature extraction: constructing new features from the existing data.

  • PCA
  • LDA
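
A minimal PCA sketch: standardize first, then keep enough components to explain 95% of the variance (X as above; the 95% threshold is illustrative):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then keep enough components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape, pca.explained_variance_ratio_.sum())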

Modeling

Handling Class Imbalance

X = data.iloc[:,0:-1]
y = data.iloc[:,-1]
y.value_counts() # distribution of the target column
y.astype(int).plot.hist()

From this information, we see this is an imbalanced class problem. There are far more loans that were repaid on time than loans that were not repaid. Once we get into more sophisticated machine learning models, we can weight the classes by their representation in the data to reflect this imbalance.

  • Resampling (over-/under-sampling), sketched below
  • Generative adversarial networks (to synthesize minority-class samples)
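
A minimal sketch of the two simplest options, class weighting and random over-sampling (this assumes the X and y defined above, with class 1 as the minority):

from sklearn.utils import resample

# Option 1: let the model reweight the classes, e.g. LogisticRegression(class_weight='balanced')

# Option 2: randomly over-sample the minority class (1) to match the majority class (0)
train = pd.concat([X, y.rename('target')], axis=1)
minority = train[train['target'] == 1]
majority = train[train['target'] == 0]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_upsampled])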

Aligning the Training and Test Sets

Data leakage: mistakenly using the test data for feature engineering during training is known as data leakage.

Datetime features (mismatched feature counts)

Categorical features (mismatched feature counts)

There need to be the same features (columns) in both the training and testing data. One-hot encoding has created more columns in the training data because there were some categorical variables with categories not represented in the testing data. To remove the columns in the training data that are not in the testing data, we need to align the dataframes. Conversely, the test set may contain categories that never appeared in the training set; one-hot encoding will then break unless a separate "unknown" category is assigned.

train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join='inner', axis=1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

Numerical features (inconsistent preprocessing)

After normalizing the features of the training set, what should be done with the features of the test set? This is a critical question: once the training features have been normalized, the range of the test features may no longer match, and the model can break. There are a few common approaches:

Method 1: combine the training and test sets, normalize them together so the feature ranges are unified, then train on the training set and evaluate on the test set. Clearly, though, no information from the test set should be used while training the model; this approach introduces bias and should not be used.

Method 2: normalize the training set and the test set separately. This seems plausible, but it depends on the amount of data and how it is ordered. With a large, well-shuffled dataset it can work, but under that same assumption the conclusion of Method 1 would also hold.

Method 3: normalize the training set first, keep the normalization parameters (e.g. the minimum and maximum), and then apply those parameters to the test set. This also seems reasonable, but with limited data the training-set parameters may produce abnormal results on the test set, such as extremely large or small values.

In some sense the three approaches are equivalent: with a large, well-shuffled dataset, the training and validation sets can be assumed to share the same distribution, so any of them would make little difference. That assumption is overly optimistic, however, and in reality we often only have {the training set + one test example at a time}, so Method 2 is clearly not viable.

That seems to leave Methods 1 and 3. Method 1 is usually considered incorrect, because it introduces test data, which is unknown data, into training; even a single extreme test value can cause large swings in the output range, and if every test example required re-normalizing over the whole training set, the computational cost would be very high.

So the remaining option is Method 3: keep the normalization parameters fitted on the training set and apply them to the test set. This works, but still carries risks: not every feature-engineering step can have its parameters saved (many are quite elaborate), and if the test data differs greatly from the training data, applying the training-set parameters can produce abnormal values.
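
A minimal sketch of Method 3, fitting the scaler on the training features only and reusing its parameters on the test features (X_train and X_test as produced by the split in the next section):

from sklearn.preprocessing import MinMaxScaler

# Fit on the training set only, then apply the same min/max to the test set
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)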

Train/Test Split

from sklearn.model_selection import train_test_split

# To guard against overfitting, hold out part of the data as a test set (30% here) and fix the random seed (random_state) for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train
X_test
y_train
y_test

Model Fitting, Prediction, and Evaluation

K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

# Search for the best number of neighbors
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    training_accuracy.append(clf.score(X_train, y_train))
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

# The best number of neighbors is 6
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Logistic Regression and Softmax Regression

# Linear regression

# Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=100, penalty="l1", solver="liblinear")  # the L1 penalty requires the liblinear (or saga) solver
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Softmax regression (multinomial logistic regression)

Decision Trees and Random Forests

# Decision tree
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Random forest
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Boosting

Prefer models that demand little feature engineering, such as XGBoost.

#Adaptive Boosting(AdaBoost)
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,algorithm="SAMME.R", learning_rate=0.5)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#Gradient Boosting Decision Tree(GBDT)
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0, max_depth=1)
#clf = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#XGBoost
import xgboost as xgb

#LightGBM
import lightgbm as lgb
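
Only the imports are shown above; a minimal sketch of training both boosters through their scikit-learn interfaces and scoring with the competition metric (the hyperparameters below are illustrative, not tuned):

from sklearn.metrics import roc_auc_score

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print('XGBoost test AUC: {:.3f}'.format(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])))

clf = lgb.LGBMClassifier(n_estimators=200, num_leaves=31, learning_rate=0.1)
clf.fit(X_train, y_train)
print('LightGBM test AUC: {:.3f}'.format(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])))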

Support Vector Machines

# Linear SVM
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Nonlinear (kernel) SVM
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=10, gamma=0.1)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Naive Bayes

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))


Reflections:

At the beginning of my studies I assumed that machine learning theory and model tuning were the heart of the work. After walking through a full workflow, I found that in practice all kinds of problems appear before modeling even begins, so a great deal of effort has to go into data cleaning and feature engineering. I finally understand why people keep saying "data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit" and "garbage in, garbage out", and why experienced practitioners always stress going deep into "domain knowledge", that is, the business itself.

"Learning by doing" is a great way to learn. In practice you will run into many problems you never anticipated, and how you solve them is a real test of your ability to learn. The moments when learning feels most painful are exactly when your knowledge and understanding are about to make another leap, and when you look back you realize how far you have already come.

"Many begin well, but few see it through to the end." "If you practice, you win at the starting line; if you persist, you win at the finish line."

Let us encourage each other.


References:

Start Here: A Gentle Introduction

【乾貨合集】一文讀懂特徵工程

人工智慧之路(特徵工程篇)

【持續更新】機器學習特徵工程實用技巧大全

警惕「特徵工程」中的陷阱

apachecn/feature-engineering-for-ml-zh
