Introduction

This notebook is a walkthrough of the Kaggle machine learning competition "Home Credit Default Risk" using the actual competition data. I deliberately keep to a gentle introduction to the methods commonly used in machine learning competitions, in order to show the basics of how to get started in machine learning. Any comments or suggestions are much appreciated.

Objective

The objective of this competition is to use historical loan application data to predict whether a client will repay a loan on time or have difficulty repaying.

Classification: the label is a binary variable, 0 (will repay the loan on time) or 1 (will have difficulty repaying the loan).

Evaluation Metric

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

  • The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate
  • The Area Under the Curve (AUC) is exactly what its name says: the area under the ROC curve

When we measure a classifier by ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case here), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), so we use more advanced metrics such as ROC AUC or the F1 score to more accurately reflect the performance of a classifier. A model with a high ROC AUC will also tend to have high accuracy, but the ROC AUC is a better representation of model performance.
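
As a quick illustration, scikit-learn's roc_auc_score takes predicted probabilities rather than hard 0/1 labels. This is a minimal sketch with made-up labels and scores, not competition data:

from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9]

print(roc_auc_score(y_true, y_score))  # area under the ROC curve, between 0 and 1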

Importing Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Reading the Data

data = pd.read_csv('C:/Users/JJJ/Desktop/Jane/data.csv', encoding='gbk')
data.head()  # first 5 rows
data.shape   # number of rows and columns
data.info()

Data Cleaning

Missing Values

# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

missing_values = missing_values_table(data)  # missing-value summary
missing_values.head(20)

When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can handle missing values with no need for imputation. Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.

# Drop the missing values directly

# Simple imputation: fill every missing value with 0
data = data.fillna(0)

# Mean or median imputation
from sklearn.preprocessing import Imputer  # renamed to SimpleImputer in newer scikit-learn versions
# strategy can be 'mean', 'median' or 'most_frequent'
imputer = Imputer(strategy='mean')
imputer.fit_transform(data[['column_to_fill']])  # replace with the name of the column to fill

# Model-based handling
# LightGBM and XGBoost can both treat NaN as part of the data during training,
# so missing values do not need to be imputed for them.

Feature Processing

Data Types

data.dtypes.value_counts()  # number of columns of each dtype
data.select_dtypes('object').apply(pd.Series.nunique, axis=0)  # unique values per object column

data = data.drop(['id_name'], axis=1)  # drop the name/ID column

Datetime Features

  • Continuous: time differences
  • Discrete: year, month, day, hour, quarter, week number, number of days, day of week, weekday/weekend, morning/afternoon/evening

# Extract year, month and day from dates stored without separators (e.g. '20190315')
data['year_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[0:4], errors='coerce')
data['month_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[4:6], errors='coerce')
data['day_of_first_transaction_time'] = pd.to_numeric(data['first_transaction_time'].astype(str).str[6:8], errors='coerce')

# Extract year, month and day from dates with separators (e.g. '2019/03/15')
data['year_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[0]
data['month_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[1]
data['day_of_latest_query_time'] = data['latest_query_time'].str.split('/', expand=True)[2]

# Method 2: parse the column as a datetime
data['year_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.year
# or: pd.to_datetime(data['latest_query_time']).apply(lambda x: x.year)

data['month_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.month
data['day_of_latest_query_time'] = pd.to_datetime(data['latest_query_time']).dt.day

data['year_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.year
data['month_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.month
data['day_of_loans_latest_time'] = pd.to_datetime(data['loans_latest_time']).dt.day

# Flag whether each date falls on a weekend
day_of_week = pd.to_datetime(data['first_transaction_time']).dt.weekday + 1  # Monday=1 ... Sunday=7
data['first_transaction_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['latest_query_time']).dt.weekday + 1
data['latest_query_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

day_of_week = pd.to_datetime(data['loans_latest_time']).dt.weekday + 1
data['loans_latest_time_is_weekend'] = ((day_of_week == 6) | (day_of_week == 7)) * 1

# Compute the time difference in days between two datetime columns
data['days_between_query_and_loans'] = (
    pd.to_datetime(data['latest_query_time']) - pd.to_datetime(data['loans_latest_time'])).dt.days
# or: (...).apply(lambda x: x.days) on the timedelta

# Drop the original date columns
data = data.drop(['first_transaction_time', 'latest_query_time', 'loans_latest_time'], axis=1)

Categorical (Text) Features

Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables. A machine learning model unfortunately cannot deal with categorical variables (except for some models such as LightGBM). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. There are several main ways to carry out this process:

  • Label encoding
  • One-hot encoding
  • Mean (target) encoding
  • Cluster encoding

Label encoding: assign each unique category in a categorical variable with an integer. (No new columns are created)

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each of the categories is random and does not reflect any inherent aspect of the category. In the example above, programmer receives a 4 and data scientist a 1, but if we did the same process again, the labels could be reversed or completely different. The actual assignment of the integers is arbitrary. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

For any categorical variable (dtype == object) with 2 unique categories, we will use label encoding. For label encoding, we use the Scikit-Learn LabelEncoder.

from sklearn.preprocessing import LabelEncoder  # preprocessing for categorical variables

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in data:
    if data[col].dtype == object:
        # If 2 or fewer unique categories
        if len(list(data[col].unique())) <= 2:
            # Train
            le.fit(data[col])
            # Transform
            data[col] = le.transform(data[col])

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)

One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

For categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values to categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by PCA or other dimensionality reduction methods to reduce the number of dimensions (while still trying to preserve information).

For any categorical variable with more than 2 unique categories, we will use one-hot encoding. For one-hot encoding, we use the pandas get_dummies(df) function.

data = pd.get_dummies(data)  # the new dummy columns have dtype uint8

The number of features has grown significantly due to one-hot encoding. At some point we probably will want to try dimensionality reduction (removing features that are not relevant) to reduce the size of the datasets.

Numeric Features

  • Discretization (binning): split a continuous variable into segments and treat each segment as its own feature, so that each range can be handled separately
  • Transformation: e.g. taking the log
  • Min-max scaling
  • Standardization
  • Normalization towards a Gaussian distribution (a sketch of several of these follows below)
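
A minimal sketch of several of these transformations, assuming a hypothetical numeric column named 'income' (not a column from this dataset); np and pd are the aliases imported at the top:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Discretization: cut the continuous column into 5 equal-width bins
data['income_bin'] = pd.cut(data['income'], bins=5, labels=False)

# Log transform; log1p handles zero values safely
data['income_log'] = np.log1p(data['income'])

# Min-max scaling to [0, 1]
data[['income_minmax']] = MinMaxScaler().fit_transform(data[['income']])

# Standardization to zero mean and unit variance
data[['income_std']] = StandardScaler().fit_transform(data[['income']])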

Feature Construction

Simple Transformations of Numeric Features

  • Linear combinations
  • Polynomial features
  • Ratio features (a sketch of the last two follows below)
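
For example, scikit-learn's PolynomialFeatures can generate squares and pairwise products of existing numeric columns, and a ratio feature is simply one column divided by another. The column names 'feat_a' and 'feat_b' below are hypothetical placeholders:

from sklearn.preprocessing import PolynomialFeatures

# Degree-2 polynomial features: feat_a, feat_b, feat_a^2, feat_a*feat_b, feat_b^2
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['feat_a', 'feat_b']])

# A ratio feature
data['a_to_b_ratio'] = data['feat_a'] / data['feat_b']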

Combinations of Categorical and Numeric Features

Creating New Features with Genetic Programming

Creating New Features with Decision Trees

Domain Knowledge Features

Domain Knowledge Features: we can make a couple features that attempt to capture what we think may be important for telling whether a client will default on a loan.
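
As an illustration only (the column names below are hypothetical and should be replaced with real ones), domain features in a credit setting are often the ratios a loan officer would look at:

# Hypothetical domain-knowledge ratios
data['credit_to_income'] = data['credit_amount'] / data['annual_income']
data['payment_rate'] = data['annuity'] / data['credit_amount']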

Outliers

data['days_between_query_and_loans'].describe()

data['days_anomalies'] = data['days_between_query_and_loans'] == anomalous_value  # replace anomalous_value with the value identified above
print('There are %d anomalies in the data out of %d entries'
      % (data['days_anomalies'].sum(), len(data)))

# 3-sigma rule

# Box-plot (IQR) analysis

# Distance-based methods

# Clustering-based methods
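
As an illustration of the first two rules listed above, here is a minimal sketch applied to the engineered 'days_between_query_and_loans' column; the thresholds (3 standard deviations, 1.5 × IQR) are the conventional defaults, not values tuned for this data:

col = data['days_between_query_and_loans']

# 3-sigma rule: flag values more than 3 standard deviations from the mean
mean, std = col.mean(), col.std()
sigma_outliers = (col - mean).abs() > 3 * std

# Box-plot (IQR) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

print(sigma_outliers.sum(), iqr_outliers.sum())  # number of rows flagged by each rule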

Handling the anomalies depends on the exact situation, with no set rules. One of the safest approaches is just to set the anomalies to a missing value and then have them filled in (using Imputation) before machine learning. As a solution, we will fill in the anomalous values with not a number (np.nan) and then create a new boolean column indicating whether or not the value was anomalous.

data['days_between_query_and_loans'].replace({anomalous_value: np.nan}, inplace=True)  # replace anomalous_value with the value identified above

Duplicates
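
No code was given for this step; a one-line pandas sketch removes fully duplicated rows:

data = data.drop_duplicates()  # keep the first occurrence of each duplicated row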

Feature Engineering

For structured data, the winning models tend to be variants of gradient boosting. Feature engineering has a greater return on investment than model building and hyperparameter tuning; as Andrew Ng is fond of saying, "applied machine learning is basically feature engineering." As the common saying goes, data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit. Redundancy: some features are so highly correlated with each other that they only waste computation. Noise: some features actively hurt the predictions. Garbage in, garbage out.

Feature Selection

Feature selection: choosing only the most important features or other methods of dimensionality reduction.

Filter

Filter methods evaluate how strongly each individual feature is related to the target, rank the features, and keep the top-ranked ones. The strength of the relationship can be measured with the Pearson correlation coefficient, mutual information, or distance correlation. The drawback of this approach is that interactions between features are ignored, so useful correlated features may be discarded by mistake.

  • Correlation coefficient
  • Variance threshold
  • Chi-square test
  • Information gain / mutual information: keep the features with the highest information gain and remove features that are linearly related to others

# Constant (zero-variance) columns: their correlation with the target is undefined (NaN)
sourceid                              NaN
bank_card_no                          NaN
first_transaction_time_is_weekend     NaN
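
A minimal sketch of two filter approaches, assuming X and y are the feature matrix and target defined in the modeling section below:

from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Drop zero-variance (constant) columns such as the ones listed above
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Keep the 20 features with the highest mutual information with the target
kbest = SelectKBest(score_func=mutual_info_classif, k=20)
X_top = kbest.fit_transform(X_reduced, y)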

Wrapper

Wrapper methods treat feature selection as a search over subsets of features: candidate subsets are generated and each is evaluated by training a model on it.

  • Recursive feature elimination (RFE), sketched below
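
A sketch of recursive feature elimination; the base estimator and the number of features to keep are arbitrary choices here:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Repeatedly fit the model and drop the weakest features until 20 remain
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=20)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)  # boolean mask of the selected features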

Embedded

Embedded methods derive feature importance from the model itself (unlike the methods above, the importance comes from the weights and similar quantities of the fitted model).

  • Penalty-based selection (e.g. L1 regularization)
  • Tree-model-based selection (a sketch of both approaches follows below)
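
A sketch of both embedded approaches with scikit-learn's SelectFromModel; the models and thresholds are illustrative choices:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# L1 penalty drives the weights of unhelpful features to exactly zero
l1_selector = SelectFromModel(LogisticRegression(C=0.1, penalty='l1', solver='liblinear'))
X_l1 = l1_selector.fit_transform(X, y)

# Tree-based importances: keep the features above the median importance
tree_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                                threshold='median')
X_tree = tree_selector.fit_transform(X, y)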

Feature Extraction

Feature extraction: constructing new features from the existing data.

  • PCA
  • LDA
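
A sketch of PCA on the one-hot-encoded feature matrix; the number of components (50) is an arbitrary choice, and the features should normally be scaled first:

from sklearn.decomposition import PCA

# Project the features onto the 50 directions of largest variance
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())  # share of the variance retained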

Modeling

Handling Class Imbalance

X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
y.value_counts()  # distribution of the target
y.astype(int).plot.hist()

From this information, we see this is an imbalanced class problem. There are far more loans that were repaid on time than loans that were not repaid. Once we get into more sophisticated machine learning models, we can weight the classes by their representation in the data to reflect this imbalance.

  • Resampling (over- or under-sampling; see the sketch below)
  • Generative adversarial networks
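
Two simple options, sketched below: re-weight the classes inside the model, or naively over-sample the minority class with pandas (libraries such as imbalanced-learn offer more principled resamplers like SMOTE):

from sklearn.linear_model import LogisticRegression

# Option 1: weight each class inversely to its frequency
clf = LogisticRegression(class_weight='balanced')

# Option 2: naive random over-sampling of the minority class (label 1)
minority = data[data.iloc[:, -1] == 1]
majority = data[data.iloc[:, -1] == 0]
oversampled = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])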

Aligning the Training and Test Sets

Data leakage: mistakenly using the test data during training-time feature engineering is known as data leakage.

Datetime features (mismatched numbers of features)

Categorical features (mismatched numbers of features)

There need to be the same features (columns) in both the training and testing data. One-hot encoding has created more columns in the training data because there were some categorical variables with categories not represented in the testing data. To remove the columns in the training data that are not in the testing data, we need to align the dataframes. The test set can also contain categories that never appear in the training set; one-hot encoding would then break unless an extra category such as "unknown" is introduced.

train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join='inner', axis=1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

Numeric features (mismatched preprocessing)

After the training-set features have been normalized, what should be done with the test-set features? This is a key question: if the test-set feature ranges no longer match those used for training, the model breaks down. There are several possible approaches.

Method 1: combine the training and test sets and normalize them together so that the feature ranges are unified, then train on the training set and evaluate on the test set. Clearly, though, training should not use any information from the test set; this introduces bias into the model and should not be used.

Method 2: normalize the training set and the test set separately. This looks plausible; the key issues are the amount of data and how it is ordered. If the data are plentiful and well shuffled, this works, but under that same assumption the conclusion of method 1 would also hold.

Method 3: normalize the training set first and keep the normalization parameters (such as the minimum and maximum), then apply those training-set parameters to the test set. This also looks reasonable, but with limited data the training-set parameters can produce abnormal results on the test set, such as extremely large or small values.

In a sense the three methods are equivalent: if the data are plentiful and well shuffled, the training and validation sets share the same distribution, so any of the three gives much the same result. However, that assumption is too optimistic, and in reality we often only have the training set plus a single test example, so method 2 is clearly not viable.

Methods 1 and 3 both seem possible, but method 1 is usually considered incorrect: it introduces test (i.e. unseen) data into training, and even a single extreme test value can noticeably shift the output range; moreover, re-normalizing over the whole training set for every test example is computationally expensive.

That leaves method 3: keep the normalization parameters fitted on the training set and apply them to the test set. This works, but it carries risks: not every feature-engineering step can save its parameters (many are quite involved), and if the test data differ greatly from the training data, applying the training-set parameters can produce abnormal values.
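
Method 3 in code: a minimal sketch that fits the scaler on the training features only and reuses the fitted parameters on the test features (using the X_train / X_test split produced in the next section):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)                       # learn min/max on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training-set parameters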

Splitting the Data for Modeling

from sklearn.model_selection import train_test_split

# To guard against overfitting, hold out a validation set (30% of the data here) with a fixed random seed (random_state) so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train
X_test
y_train
y_test

Model Training, Prediction and Evaluation

K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

# Search for the best number of neighbors
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    training_accuracy.append(clf.score(X_train, y_train))
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

# The best number of neighbors found above is 6
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Logistic Regression and Softmax Regression

# Linear regression

# Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=100, penalty="l1", solver="liblinear")  # liblinear supports the L1 penalty
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Softmax regression

Decision Trees and Random Forests

# Decision tree
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Random forest
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Boosting

Prefer models that demand relatively little feature engineering, such as XGBoost.

#Adaptive Boosting(AdaBoost)
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,algorithm="SAMME.R", learning_rate=0.5)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#Gradient Boosting Decision Tree(GBDT)
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0, max_depth=1)
#clf = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

#XGBoost
import xgboost as xgb

#LightGBM
import lightgbm as lgb
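
The two imports above are left as stubs in the original; a minimal sketch of training a LightGBM classifier through its scikit-learn interface and scoring it with the competition metric (the hyperparameters are arbitrary choices, not tuned values):

from sklearn.metrics import roc_auc_score

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=0)
clf.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard labels
y_proba = clf.predict_proba(X_test)[:, 1]
print("Test ROC AUC: {:.3f}".format(roc_auc_score(y_test, y_proba)))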

Support Vector Machines

# Linear SVM
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

# Kernel (non-linear) SVM
from sklearn.svm import SVC

clf = SVC(kernel='rbf', C=10, gamma=0.1)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

Naive Bayes

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))


Reflections:

In the early stages of learning I always assumed that machine-learning theory and model tuning were the heart of the subject. After walking through a full pipeline, I found that all kinds of problems can appear before modeling even begins, so most of the effort goes into data cleaning and feature engineering. I finally understand why people keep repeating that "data and features determine the upper limit of machine learning, while models and algorithms merely approach that limit" and "garbage in, garbage out", and why experienced practitioners always stress deep "domain knowledge", that is, understanding the business.

"Learning by doing" is an excellent way to study. In practice you run into many problems you never anticipated, and how you solve them is the real test of your ability to learn. Whenever learning feels painful, your knowledge and understanding are about to take another leap, and when you look back you will be surprised how far you have already come.

"Many make a good start, but few see it through to the end." "By practicing, you win at the starting line; by persisting, you win at the finish line."

Let us encourage one another.


References:

Start Here: A Gentle Introduction

【干货合集】一文读懂特征工程

人工智慧之路(特征工程篇)

【持续更新】机器学习特征工程实用技巧大全

警惕「特征工程」中的陷阱

apachecn/feature-engineering-for-ml-zh
