基于LightGBM的金融信贷数据建模

发布时间:2024年01月19日

公众号:尤而小屋
作者:Peter
编辑:Peter

大家好,我是Peter~

本文是UCI金融信贷数据集的第二篇文章:基于LightGBM的二分类建模。主要内容包含:

  • 数据基本信息
  • 缺失值信息
  • 不同字段的统计信息
  • 目标变量的不均衡性
  • 变量间的相关性分析
  • 基于QQ图的字段的正态检验
  • 数据预处理(编码、归一化、降维等)
  • 分类模型评估标准
  • 基于LightGBM建立模型

1 导入库

第一步还是导入数据处理和建模所需要的各种库:

In [1]:

import pandas as pd 
import numpy as np
pd.set_option('display.max_columns', 100)
from IPython.display import display_html


import plotly_express as px
import plotly.graph_objects as go

import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"]=["SimHei"] # 设置字体
plt.rcParams["axes.unicode_minus"]=False # 解决“-”负号的乱码问题

import seaborn as sns
%matplotlib inline 

import missingno as ms 
import gc

from datetime import datetime 
from sklearn.model_selection import train_test_split,StratifiedKFold,GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import KMeansSMOTE, SMOTE
from sklearn.model_selection import KFold

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, auc
from sklearn.metrics import roc_auc_score,precision_recall_curve, confusion_matrix,classification_report

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from pydotplus import graph_from_dot_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb

from scipy import stats

import warnings 
warnings.filterwarnings("ignore")

2 导入数据

In [2]:

df = pd.read_csv("UCI.csv")

df.head()

Out[2]:

3 数据基本信息

1、整体数据量

整理的数据量大小:30000条记录,25个字段信息

In [3]:

df.shape

Out[3]:

(30000, 25)

2、数据字段信息

In [4]:

df.columns  # 全部的字段名

Out[4]:

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default.payment.next.month'],
      dtype='object')

不同的字段类型统计:

In [5]:

df.dtypes  # 查看数据的字段类型

Out[5]:

ID                              int64
LIMIT_BAL                     float64
SEX                             int64
EDUCATION                       int64
MARRIAGE                        int64
AGE                             int64
PAY_0                           int64
PAY_2                           int64
PAY_3                           int64
PAY_4                           int64
PAY_5                           int64
PAY_6                           int64
BILL_AMT1                     float64
BILL_AMT2                     float64
BILL_AMT3                     float64
BILL_AMT4                     float64
BILL_AMT5                     float64
BILL_AMT6                     float64
PAY_AMT1                      float64
PAY_AMT2                      float64
PAY_AMT3                      float64
PAY_AMT4                      float64
PAY_AMT5                      float64
PAY_AMT6                      float64
default.payment.next.month      int64
dtype: object

In [6]:

pd.value_counts(df.dtypes)  # 统计不同类型的个数

Out[6]:

float64    13
int64      12
Name: count, dtype: int64

从结果中能够看到全部是数值型字段,几乎各占一半。最后一个字段default.payment.next.month是我们最终的目标字段。

下面对字段名称的具体含义进行解释:

  • ID:ID唯一值
  • LIMIT_BAL:可透支金额(新台币计算,包含个人或者家庭)
  • SEX:性别:1-男, 2-女
  • EDUCATION:1-研究生;2-本科;3-高中;4-其他;0/5/6-未知
  • MARRIAGE:婚姻状态;1-已婚,2-单身;3-其他
  • AGE:年龄
  • PAY_0:2005年9月的还款状态(-2-未消费,-1-按时还款, 1-延迟一个月还款, 2-延迟两个月还款,…,8-延迟8个月还款, 9-延迟9个月还款)
  • PAY_2:2005年8月的还款状态(同上)
  • PAY_3:2005年7月的还款状态(同上)
  • PAY_4:2005年6月的还款状态(同上)
  • PAY_5:2005年5月的还款状态(同上)
  • PAY_6:2005年4月的还款状态(同上)
  • BILL_AMT1:2005年9月的账单金额
  • BILL_AMT2:2005年8月的账单金额
  • BILL_AMT3:2005年7月的账单金额
  • BILL_AMT4:2005年6月的账单金额
  • BILL_AMT5:2005年5月的账单金额
  • BILL_AMT6:2005年4月的账单金额
  • PAY_AMT1:2005年9月之前的付款金额
  • PAY_AMT2:2005年8月之前的付款金额
  • PAY_AMT3:2005年7月之前的付款金额
  • PAY_AMT4:2005年6月之前的付款金额
  • PAY_AMT5:2005年5月之前的付款金额
  • PAY_AMT6:2005年4月之前的付款金额
  • default.payment.next.month:最终目标变量,下个月还款违约情况(1-是,逾期;0-否,未逾期)

说明内容:

  1. PAY_ATM如果低于银行规定的最低还款额,则视为违约;
  2. PAY_ATM如果大于上月账单金额BILL_AMT,则视为及时还;
  3. PAY_AMT如果大于最低还款额但低于上月账单金额,则视为延迟还款。

3、数据的描述统计信息(展示部分字段)

In [7]:

df.describe().T  # 字段较多,转置后显示更直观  

4、字段整体信息

In [8]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   30000 non-null  float64
 14  BILL_AMT3                   30000 non-null  float64
 15  BILL_AMT4                   30000 non-null  float64
 16  BILL_AMT5                   30000 non-null  float64
 17  BILL_AMT6                   30000 non-null  float64
 18  PAY_AMT1                    30000 non-null  float64
 19  PAY_AMT2                    30000 non-null  float64
 20  PAY_AMT3                    30000 non-null  float64
 21  PAY_AMT4                    30000 non-null  float64
 22  PAY_AMT5                    30000 non-null  float64
 23  PAY_AMT6                    30000 non-null  float64
 24  default.payment.next.month  30000 non-null  int64  
dtypes: float64(13), int64(12)
memory usage: 5.7 MB

为了数据处理方便,将原始的default.payment.next.month字段重新命名成Label:

In [9]:

df.rename(columns={"default.payment.next.month":"Label"},inplace=True)

4 缺失值

4.1 缺失值统计

统计每个字段的缺失值个数:

In [10]:

df.isnull().sum().sort_values(ascending=False)

Out[10]:

ID           0
BILL_AMT2    0
PAY_AMT6     0
PAY_AMT5     0
PAY_AMT4     0
PAY_AMT3     0
PAY_AMT2     0
PAY_AMT1     0
BILL_AMT6    0
BILL_AMT5    0
BILL_AMT4    0
BILL_AMT3    0
BILL_AMT1    0
LIMIT_BAL    0
PAY_6        0
PAY_5        0
PAY_4        0
PAY_3        0
PAY_2        0
PAY_0        0
AGE          0
MARRIAGE     0
EDUCATION    0
SEX          0
Label        0
dtype: int64

In [11]:

# 缺失值个数
total = df.isnull().sum().sort_values(ascending=False)

In [12]:

# 缺失值比例
percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False) 

percent

Out[12]:

ID           0.0
BILL_AMT2    0.0
PAY_AMT6     0.0
PAY_AMT5     0.0
PAY_AMT4     0.0
PAY_AMT3     0.0
PAY_AMT2     0.0
PAY_AMT1     0.0
BILL_AMT6    0.0
BILL_AMT5    0.0
BILL_AMT4    0.0
BILL_AMT3    0.0
BILL_AMT1    0.0
LIMIT_BAL    0.0
PAY_6        0.0
PAY_5        0.0
PAY_4        0.0
PAY_3        0.0
PAY_2        0.0
PAY_0        0.0
AGE          0.0
MARRIAGE     0.0
EDUCATION    0.0
SEX          0.0
Label        0.0
dtype: float64

将个数和比例的合并,显示完整的缺失值信息:

In [13]:

pd.concat([total, percent],axis=1,keys=["Total","Percent"]).T

4.2 缺失值可视化

In [14]:

ms.bar(df,color="blue")                                                     

plt.show()

坐标轴标签的旋转:

In [15]:

# ms.matrix(df, labels=True,label_rotation=45)
# plt.show()

下面进行不同字段的详细数据探索过程:

In [16]:

df.columns

Out[16]:

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
      dtype='object')

ID字段对建模无效,直接删除:

In [17]:

df.drop("ID",inplace=True,axis=1) 

5 统计信息

5.1 Personal Information

查看用户的个人信息,比如信用额度、学历、婚姻状态、年龄等字段:

In [18]:

df[['LIMIT_BAL', 'EDUCATION', 'MARRIAGE', 'AGE']].describe()

Out[18]:

LIMIT_BALEDUCATIONMARRIAGEAGE
count30000.00000030000.00000030000.00000030000.000000
mean167484.3226671.8531331.55186735.485500
std129747.6615670.7903490.5219709.217904
min10000.0000000.0000000.00000021.000000
25%50000.0000001.0000001.00000028.000000
50%140000.0000002.0000002.00000034.000000
75%240000.0000002.0000002.00000041.000000
max1000000.0000006.0000003.00000079.000000

In [19]:

df["EDUCATION"].value_counts().sort_values(ascending=False)

Out[19]:

EDUCATION
2    14030
1    10585
3     4917
5      280
4      123
6       51
0       14
Name: count, dtype: int64

用户的学历中出现最多的是本科生EDUCATION=2

In [20]:

df["MARRIAGE"].value_counts().sort_values(ascending=False)        

Out[20]:

MARRIAGE
2    15964
1    13659
3      323
0       54
Name: count, dtype: int64

用户的婚姻状态中,出现最多的是MARRIAGE=2,已婚人群。

5.2 LIMIT_BAL

LIMIT_BAL的分布

In [21]:

df["LIMIT_BAL"].value_counts().sort_values(ascending=False)

Out[21]:

LIMIT_BAL
50000.0      3365
20000.0      1976
30000.0      1610
80000.0      1567
200000.0     1528
             ... 
800000.0        2
1000000.0       1
327680.0        1
760000.0        1
690000.0        1
Name: count, Length: 81, dtype: int64

可以看到信用额度最为频繁的是50,000

In [22]:

plt.figure(figsize = (14,6))
plt.title('Density Plot of LIMIT_BAL')

sns.set_color_codes("pastel")
sns.distplot(df['LIMIT_BAL'],kde=True,bins=200)

plt.show()  

5.3 PAY0-PAY6

每月之前的对应还款状态:

In [23]:

df[["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]].describe()

Out[23]:

PAY_0PAY_2PAY_3PAY_4PAY_5PAY_6
count30000.00000030000.00000030000.00000030000.00000030000.00000030000.000000
mean-0.016700-0.133767-0.166200-0.220667-0.266200-0.291100
std1.1238021.1971861.1968681.1691391.1331871.149988
min-2.000000-2.000000-2.000000-2.000000-2.000000-2.000000
25%-1.000000-1.000000-1.000000-1.000000-1.000000-1.000000
50%0.0000000.0000000.0000000.0000000.0000000.000000
75%0.0000000.0000000.0000000.0000000.0000000.000000
max8.0000008.0000008.0000008.0000008.0000008.000000

不同还款状态的对比:

In [24]:

repay = df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'Label']]

repay = pd.melt(repay, 
                id_vars="Label",
                var_name="Payment Status",
                value_name="Delay(Month)"
               )
repay.head()

Out[24]:

LabelPayment StatusDelay(Month)
01PAY_02
11PAY_0-1
20PAY_00
30PAY_00
40PAY_0-1

In [25]:

fig = px.box(repay, x="Payment Status", y="Delay(Month)",color="Label")

fig.show()

5.4 BILL_AMT1-BILL_AMT6

当月的账单金额

In [26]:

df[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].describe()

Out[26]:

是否违约客户的对比:

In [27]:

df.columns

Out[27]:

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
      dtype='object')

In [28]:

BILL_AMTS = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']

plt.figure(figsize=(12,6))

for i, col in enumerate(BILL_AMTS):
    plt.subplot(2,3,i+1)
    sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red",shade=True)
    sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue",shade=True)
    
    plt.xlim(-40000, 200000)
    plt.ylabel("")
    plt.xlabel(col, fontsize=12)
    plt.legend()
    plt.tight_layout()
    
plt.show()

5.5 PAY_AMT1-PAY_AMT6

每月之前的对应付款金额

In [29]:

df[['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']].describe()

Out[29]:

PAY_AMT1PAY_AMT2PAY_AMT3PAY_AMT4PAY_AMT5PAY_AMT6
count30000.0000003.000000e+0430000.0000030000.00000030000.00000030000.000000
mean5663.5805005.921163e+035225.681504826.0768674799.3876335215.502567
std16563.2803542.304087e+0417606.9614715666.15974415278.30567917777.465775
min0.0000000.000000e+000.000000.0000000.0000000.000000
25%1000.0000008.330000e+02390.00000296.000000252.500000117.750000
50%2100.0000002.009000e+031800.000001500.0000001500.0000001500.000000
75%5006.0000005.000000e+034505.000004013.2500004031.5000004000.000000
max873552.0000001.684259e+06896040.00000621000.000000426529.000000528666.000000

In [30]:

PAY_AMTS = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

plt.figure(figsize=(12,6))

for i, col in enumerate(PAY_AMTS):
    plt.subplot(2,3,i+1)
    sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red", shade=True)
    sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue", shade=True)
    
    plt.xlim(-10000, 70000)
    plt.ylabel("")
    plt.xlabel(col, fontsize=12)
    plt.legend()
    plt.tight_layout()
    
plt.show()

6 Label

是否发生违约(default.payment.next.month重命名为Label)的人数进行对比:

In [31]:

df["Label"].value_counts()

Out[31]:

Label
0    23364
1     6636
Name: count, dtype: int64

In [32]:

label = df["Label"].value_counts()
df_label = pd.DataFrame(label).reset_index()  

df_label

Out[32]:

Labelcount
0023364
116636

In [33]:

# plt.figure(figsize = (6,6))
# plt.title('Default = 0 & Not Default = 1')         
# sns.set_color_codes("pastel")

# sns.barplot(x = 'Label', y="count", data=df_label) 
# locs, labels = plt.xticks() 
# plt.show()

In [34]:

plt.figure(figsize = (5,5))
graph = sns.countplot(x="Label", data=df, palette=["red","blue"])

i = 0     

for p in graph.patches:
    print(type(p))
    h = p.get_height()
    percentage = round( 100 * df["Label"].value_counts()[i] / len(df),2)
    str_percentage = f"{percentage} %"
    graph.text(p.get_x()+p.get_width()/2., h - 100, str_percentage, ha="center")  
    
    i += 1
    
plt.title("class distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")

plt.show()

可以看到二者是很不均衡的。

In [35]:

# value_counts = df['Label'].value_counts()

# # 计算每个值的百分比
# percentages = value_counts / len(df)
# # 使用matplotlib绘制柱状图
# plt.bar(value_counts.index, value_counts.values)    

# # 在柱状图上添加百分比标签 
# for i, v in enumerate(percentages.values):                     
#     plt.text(i, v + 1, f'{v*100:.2f}%', ha='center',va="bottom")  
    
# # 设置xy轴标签、标题
# plt.title("Class Distribution")
# plt.xticks([0,1], ["Non-Default","Default"])
# plt.xlabel("Default Payment Next Month",fontsize=12)
# plt.ylabel("Number of Clients")

# plt.show()

In [36]:

value_counts = df['Label'].value_counts()  

# 计算每个值的百分比
percentages = value_counts / len(df)
# 使用matplotlib绘制柱状图
plt.bar(value_counts.index, value_counts.values)    

# 在柱状图上添加百分比标签 
for i, v in enumerate(percentages.values):
    plt.text(i, v + 1, f'{v*100:.2f}%', ha='center',va="bottom")
    
# 设置xy轴标签、标题
plt.title("Class Distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")

plt.show()

7 相关性分析

7.1 相关性热力图

In [37]:

numeric = ['LIMIT_BAL','AGE','PAY_0','PAY_2',
           'PAY_3','PAY_4','PAY_5','PAY_6',
           'BILL_AMT1','BILL_AMT2','BILL_AMT3',
           'BILL_AMT4','BILL_AMT5','BILL_AMT6']  # 全部数值型字段
numeric

Out[37]:

['LIMIT_BAL',
 'AGE',
 'PAY_0',
 'PAY_2',
 'PAY_3',
 'PAY_4',
 'PAY_5',
 'PAY_6',
 'BILL_AMT1',
 'BILL_AMT2',
 'BILL_AMT3',
 'BILL_AMT4',
 'BILL_AMT5',
 'BILL_AMT6']

In [38]:

corr = df[numeric].corr()
corr.head()

Out[38]:

相关系数的热力图绘制:

In [39]:

mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(12,10))
sns.heatmap(corr,
            mask=mask,
            vmin=-1,
            vmax=1,
            center=0,
            square=True,
            cbar_kws={'shrink': .5}, 
            annot=True, 
            annot_kws={'size': 10},
            cmap="Blues")

plt.show()

7.2 变量两两关系

In [40]:

plt.figure(figsize=(12,10))

pair_plot = sns.pairplot(df[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','Label']], 
                         hue='Label',
                         diag_kind='kde', 
                         corner=True)

pair_plot._legend.remove()

8 正态检验-QQ图

为了检查我们的数据是否为高斯分布,我们使用一种称为分位数-分位数(QQ图)图的图形方法进行定性评估。

在QQ图中,独立变量的分位数与正态分布的预期分位数相对应。如果变量是正态分布的,QQ图中的点应该沿着45度对角线排列。

In [41]:

sns.set_color_codes('pastel')  # 设置样式
fig, axs = plt.subplots(5, 3, figsize=(18,18))  # 图像大小和子图设置

numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
           'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']

i, j = 0, 0
for f in numeric:
    if j == 3:
        j = 0
        i = i + 1
    stats.probplot(df[f],  # 绘图数据:某个字段的全部取值
                   dist='norm', # 标准化
                   sparams=(df[f].mean(), df[f].std()), 
                   plot=axs[i,j])  # 子图位置
    
    axs[i,j].get_lines()[0].set_marker('.') 
    
    axs[i,j].grid() 
    axs[i,j].get_lines()[1].set_linewidth(3.0)
    j = j+1

fig.tight_layout()
axs[4,2].set_visible(False)
plt.show()

9 数据预处理

9.1 分类型数据处理

针对分类型数据的处理:

In [42]:

df["EDUCATION"].value_counts()

Out[42]:

EDUCATION
2    14030
1    10585
3     4917
5      280
4      123
6       51
0       14
Name: count, dtype: int64

In [43]:

df["GRAD_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")
df["UNIVERSITY"] = (df["EDUCATION"] == 2).astype("category")
df["HIGH_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")

df.drop("EDUCATION",axis=1,inplace=True)

In [44]:

df['MALE'] = (df['SEX'] == 1).astype('category')
df.drop('SEX', axis=1, inplace=True)

In [45]:

df['MARRIED'] = (df['MARRIAGE'] == 1).astype('category')
df.drop('MARRIAGE', axis=1, inplace=True)

9.2 数据切分

In [46]:

# 划分数据

y = df['Label']
X = df.drop('Label', axis=1, inplace=False)

根据y中的类别比例进行切分:

In [47]:

# 切分数据

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=24, stratify=y)

9.3 特征归一化/标准化

最值归一化:

In [48]:

mm = MinMaxScaler()

X_train_norm = X_train_raw.copy()
X_test_norm = X_test_raw.copy()

In [49]:

# LIMIT_BAL + AGE

X_train_norm['LIMIT_BAL'] = mm.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_norm['LIMIT_BAL'] = mm.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_norm['AGE'] = mm.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_norm['AGE'] = mm.transform(X_test_raw['AGE'].values.reshape(-1, 1))

In [50]:

pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]

for pay in pay_list:
    X_train_norm[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
    X_test_norm[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))

In [51]:

for i in range(1,7):
    X_train_norm['BILL_AMT' + str(i)] = mm.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_test_norm['BILL_AMT' + str(i)] = mm.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_train_norm['PAY_AMT' + str(i)] = mm.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
    X_test_norm['PAY_AMT' + str(i)] = mm.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))

标准化过程:

In [52]:

ss = StandardScaler()
X_train_std = X_train_raw.copy()
X_test_std = X_test_raw.copy()

X_train_std['LIMIT_BAL'] = ss.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_std['LIMIT_BAL'] = ss.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_std['AGE'] = ss.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_std['AGE'] = ss.transform(X_test_raw['AGE'].values.reshape(-1, 1))

In [53]:

pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]

for pay in pay_list:
    X_train_std[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
    X_test_std[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))

In [54]:

for i in range(1,7):
    X_train_std['BILL_AMT' + str(i)] = ss.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_test_std['BILL_AMT' + str(i)] = ss.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_train_std['PAY_AMT' + str(i)] = ss.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
    X_test_std['PAY_AMT' + str(i)] = ss.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))

绘制经过编码后的数据分布:

In [55]:

sns.set_color_codes('deep')
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
           'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']

fig, axs = plt.subplots(1, 2, figsize=(24,6))

sns.boxplot(data=X_train_norm[numeric], ax=axs[0])  
axs[0].set_title('Boxplot of normalized numeric features')
axs[0].set_xticklabels(labels=numeric, rotation=25)
axs[0].set_xlabel(' ')

sns.boxplot(data=X_train_std[numeric], ax=axs[1])
axs[1].set_title('Boxplot of standardized numeric features')
axs[1].set_xticklabels(labels=numeric, rotation=25)
axs[1].set_xlabel(' ')

fig.tight_layout()
plt.show()

9.4 数据降维

In [56]:

pc = len(X_train_norm.columns.values) # 25
pca = PCA(n_components=pc)  # 指定主成分个数
pca.fit(X_train_norm)

sns.reset_orig()
sns.set_color_codes('pastel') # 设置绘图颜色
plt.figure(figsize = (8,4)) # 图的大小
plt.grid()  # 网格设置
plt.title('Explained Variance of Principal Components') # 标题设置
plt.plot(pca.explained_variance_ratio_, marker='o')  # 绘制单个主成分的方差解释比例
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')  # 累计解释方差

plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"])  # 图例设置
plt.xlabel('Principal Component Indexes')  # x-y轴标题
plt.ylabel('Explained Variance Ratio')  
plt.tight_layout()  # 调整布局,更紧凑
plt.axvline(12, 0, ls='--')  # 设置虚线x=12
plt.show()  # 显示图像

代码的各部分含义如下:

  1. pc = len(X_train_norm.columns.values) # 25:计算训练集的特征数量,这里的结果是25。
  2. pca = PCA(n_components=pc) # 指定主成分个数:创建一个PCA对象,指定主成分的数量为pc,即25。
  3. pca.fit(X_train_norm):对训练集X_train_norm进行PCA拟合。
  4. sns.reset_orig()sns.set_color_codes('pastel'):这两行代码是使用seaborn库来设置绘图的颜色。reset_orig()会重置颜色到默认设置,set_color_codes('pastel')会将颜色设置为柔和色调。
  5. plt.figure(figsize = (8,4)):创建一个新的图形,设置其大小为8x4。
  6. plt.grid():在图形上显示网格。
  7. plt.title('Explained Variance of Principal Components'):设置图形的标题为“主成分的方差解释”。
  8. plt.plot(pca.explained_variance_ratio_, marker='o'):绘制单个主成分的方差解释比例。
  9. plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o'):绘制累积方差解释比例。
  10. plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"]):为图形添加图例,分别表示单个主成分的方差解释和累积方差解释。
  11. plt.xlabel('Principal Component Indexes'):设置x轴的标签为“主成分索引”。
  12. plt.ylabel('Explained Variance Ratio'):设置y轴的标签为“方差解释比例”。
  13. plt.tight_layout():自动调整图形布局,使其看起来紧凑。
  14. plt.axvline(12, 0, ls='--'):在x=12的位置画一条从y=0到y=1的虚线。这可能是为了标示某个特定的主成分。
  15. plt.show():显示图形。

根据PCA的定义,主成分的顺序是不重要的,它们只按照其方差大小进行排序。

9.4.1 计算累计解释方差

In [57]:

cumsum = np.cumsum(pca.explained_variance_ratio_)  # 计算累计解释性方差
cumsum

Out[57]:

array([0.44924877, 0.6321187 , 0.8046163 , 0.87590932, 0.92253799,
       0.95438576, 0.96762706, 0.97773098, 0.9842774 , 0.98824928,
       0.99088299, 0.99280785, 0.99444757, 0.99576128, 0.99690533,
       0.99781622, 0.99844676, 0.99890236, 0.99924315, 0.99955744,
       0.9997182 , 0.99983861, 0.99992993, 1.        , 1.        ])

In [58]:

indexes = ['PC' + str(i) for i in range(1, pc+1)]

cumsum_df = pd.DataFrame(data=cumsum, index=indexes, columns=['var1'])

cumsum_df.head()

Out[58]:

var1
PC10.449249
PC20.632119
PC30.804616
PC40.875909
PC50.922538

In [59]:

# 保留4位小数
cumsum_df['var2'] = pd.Series([round(val, 4) for val in cumsum_df['var1']], 
                              index = cumsum_df.index)
# 转成百分比
cumsum_df['Cumulative Explained Variance'] = pd.Series(["{0:.2f}%".format(val * 100) for val in cumsum_df['var2']], 
                                                       index = cumsum_df.index)

cumsum_df.head()

Out[59]:

In [60]:

cumsum_df = cumsum_df.drop(['var1','var2'], axis=1, inplace=False)
cumsum_df.T.iloc[:,:15]

9.4.2 指定主成分个数12

In [61]:

pc = 12
pca = PCA(n_components=pc)
pca.fit(X_train_norm)

X_train = pd.DataFrame(pca.transform(X_train_norm))
X_test = pd.DataFrame(pca.transform(X_test_norm))

# 列名设置
X_train.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_test.columns = ['PC' + str(i) for i in range(1, pc+1)]

X_train.head()

Out[61]:

模型评估

交叉验证

基于 k-fold cross-validation的交叉验证:将数据分为k折,前面k-1用于训练,剩下1折用于验证。

分类模型评价指标

1、混淆矩阵

?Predicted?Negative? ?Predicted?Positive? ?Actual?Negative? ?TN? ?FP? ?Actual?Positive? ?FN? ?TP? \begin{array}{ccc} & \text { Predicted Negative } & \text { Predicted Positive } \\ \hline \text { Actual Negative } & \text { TN } & \text { FP } \\ \text { Actual Positive } & \text { FN } & \text { TP } \end{array} ?Actual?Negative??Actual?Positive???Predicted?Negative??TN??FN???Predicted?Positive??FP??TP???

2、准确率

?Accuracy? = T P + T N T P + F P + T N + F N \text { Accuracy }=\frac{T P+T N}{T P+F P+T N+F N} ?Accuracy?=TP+FP+TN+FNTP+TN?

3、精确率

?Precision,? p = T P T P + F P \text { Precision, } p=\frac{T P}{T P+F P} ?Precision,?p=TP+FPTP?

4、召回率

?Recall,? r = T P T P + F N \text { Recall, } r=\frac{T P}{T P+F N} ?Recall,?r=TP+FNTP?

5、F1_score

F 1 s c o r e = 2 1 r + 1 p = 2 r p r + p { F1_{score} }=\frac{2}{\frac{1}{r}+\frac{1}{p}}=\frac{2 r p}{r+p} F1score?=r1?+p1?2?=r+p2rp?

11 基于LightGBM建立二分类模型

In [62]:

# 模型训练
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
[LightGBM] [Info] Number of positive: 4977, number of negative: 17523
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000619 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3060
[LightGBM] [Info] Number of data points in the train set: 22500, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.221200 -> initscore=-1.258687
[LightGBM] [Info] Start training from score -1.258687

Out[62]:

LGBMClassifier

LGBMClassifier()

In [63]:

# 模型预测

y_pred = lgb_clf.predict(X_test)
y_pred

Out[63]:

array([1, 0, 0, ..., 0, 0, 0], dtype=int64)

基于baseline的准确率acc:

In [64]:

acc = accuracy_score(y_test, y_pred)

print("模型的准确率:",acc)
模型的准确率: 0.8130666666666667

模型的分类报告:

In [65]:

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      5841
           1       0.64      0.36      0.46      1659

    accuracy                           0.81      7500
   macro avg       0.74      0.65      0.67      7500
weighted avg       0.79      0.81      0.79      7500

模型的混淆矩阵:

In [66]:

# 计算混淆矩阵
cm = confusion_matrix(y_test, y_pred)

# 将混淆矩阵转换为DataFrame
cm_df = pd.DataFrame(cm, index=['Non-Defaulters', 'Defaulters'], columns=['Non-Defaulters', 'Defaulters'])

# 使用seaborn绘制混淆矩阵热力图
plt.figure(figsize=(8, 5))
sns.heatmap(cm_df, annot=True, cmap='Blues', fmt='d')
plt.title('Confusion Metrics')
plt.xlabel('Predicted value')
plt.ylabel('True Value')
plt.show()

文章来源:https://blog.csdn.net/qq_25443541/article/details/135694028
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。