公众号:尤而小屋
作者:Peter
编辑:Peter
大家好,我是Peter~
本文是UCI金融信贷数据集的第二篇文章:基于LightGBM的二分类建模。主要内容包含:
第一步还是导入数据处理和建模所需要的各种库:
In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 100)
from IPython.display import display_html
import plotly_express as px
import plotly.graph_objects as go
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"]=["SimHei"] # 设置字体
plt.rcParams["axes.unicode_minus"]=False # 解决“-”负号的乱码问题
import seaborn as sns
%matplotlib inline
import missingno as ms
import gc
from datetime import datetime
from sklearn.model_selection import train_test_split,StratifiedKFold,GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import KMeansSMOTE, SMOTE
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, auc
from sklearn.metrics import roc_auc_score,precision_recall_curve, confusion_matrix,classification_report
# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from pydotplus import graph_from_dot_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
In [2]:
df = pd.read_csv("UCI.csv")
df.head()
Out[2]:
1、整体数据量
整理的数据量大小:30000条记录,25个字段信息
In [3]:
df.shape
Out[3]:
(30000, 25)
2、数据字段信息
In [4]:
df.columns # 全部的字段名
Out[4]:
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
'default.payment.next.month'],
dtype='object')
不同的字段类型统计:
In [5]:
df.dtypes # 查看数据的字段类型
Out[5]:
ID int64
LIMIT_BAL float64
SEX int64
EDUCATION int64
MARRIAGE int64
AGE int64
PAY_0 int64
PAY_2 int64
PAY_3 int64
PAY_4 int64
PAY_5 int64
PAY_6 int64
BILL_AMT1 float64
BILL_AMT2 float64
BILL_AMT3 float64
BILL_AMT4 float64
BILL_AMT5 float64
BILL_AMT6 float64
PAY_AMT1 float64
PAY_AMT2 float64
PAY_AMT3 float64
PAY_AMT4 float64
PAY_AMT5 float64
PAY_AMT6 float64
default.payment.next.month int64
dtype: object
In [6]:
pd.value_counts(df.dtypes) # 统计不同类型的个数
Out[6]:
float64 13
int64 12
Name: count, dtype: int64
从结果中能够看到全部是数值型字段,几乎各占一半。最后一个字段default.payment.next.month
是我们最终的目标字段。
下面对字段名称的具体含义进行解释:
说明内容:
3、数据的描述统计信息(展示部分字段)
In [7]:
df.describe().T # 字段较多,转置后显示更直观
4、字段整体信息
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 30000 non-null int64
1 LIMIT_BAL 30000 non-null float64
2 SEX 30000 non-null int64
3 EDUCATION 30000 non-null int64
4 MARRIAGE 30000 non-null int64
5 AGE 30000 non-null int64
6 PAY_0 30000 non-null int64
7 PAY_2 30000 non-null int64
8 PAY_3 30000 non-null int64
9 PAY_4 30000 non-null int64
10 PAY_5 30000 non-null int64
11 PAY_6 30000 non-null int64
12 BILL_AMT1 30000 non-null float64
13 BILL_AMT2 30000 non-null float64
14 BILL_AMT3 30000 non-null float64
15 BILL_AMT4 30000 non-null float64
16 BILL_AMT5 30000 non-null float64
17 BILL_AMT6 30000 non-null float64
18 PAY_AMT1 30000 non-null float64
19 PAY_AMT2 30000 non-null float64
20 PAY_AMT3 30000 non-null float64
21 PAY_AMT4 30000 non-null float64
22 PAY_AMT5 30000 non-null float64
23 PAY_AMT6 30000 non-null float64
24 default.payment.next.month 30000 non-null int64
dtypes: float64(13), int64(12)
memory usage: 5.7 MB
为了数据处理方便,将原始的default.payment.next.month字段重新命名成Label:
In [9]:
df.rename(columns={"default.payment.next.month":"Label"},inplace=True)
统计每个字段的缺失值个数:
In [10]:
df.isnull().sum().sort_values(ascending=False)
Out[10]:
ID 0
BILL_AMT2 0
PAY_AMT6 0
PAY_AMT5 0
PAY_AMT4 0
PAY_AMT3 0
PAY_AMT2 0
PAY_AMT1 0
BILL_AMT6 0
BILL_AMT5 0
BILL_AMT4 0
BILL_AMT3 0
BILL_AMT1 0
LIMIT_BAL 0
PAY_6 0
PAY_5 0
PAY_4 0
PAY_3 0
PAY_2 0
PAY_0 0
AGE 0
MARRIAGE 0
EDUCATION 0
SEX 0
Label 0
dtype: int64
In [11]:
# 缺失值个数
total = df.isnull().sum().sort_values(ascending=False)
In [12]:
# 缺失值比例
percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False)
percent
Out[12]:
ID 0.0
BILL_AMT2 0.0
PAY_AMT6 0.0
PAY_AMT5 0.0
PAY_AMT4 0.0
PAY_AMT3 0.0
PAY_AMT2 0.0
PAY_AMT1 0.0
BILL_AMT6 0.0
BILL_AMT5 0.0
BILL_AMT4 0.0
BILL_AMT3 0.0
BILL_AMT1 0.0
LIMIT_BAL 0.0
PAY_6 0.0
PAY_5 0.0
PAY_4 0.0
PAY_3 0.0
PAY_2 0.0
PAY_0 0.0
AGE 0.0
MARRIAGE 0.0
EDUCATION 0.0
SEX 0.0
Label 0.0
dtype: float64
将个数和比例的合并,显示完整的缺失值信息:
In [13]:
pd.concat([total, percent],axis=1,keys=["Total","Percent"]).T
In [14]:
ms.bar(df,color="blue")
plt.show()
坐标轴标签的旋转:
In [15]:
# ms.matrix(df, labels=True,label_rotation=45)
# plt.show()
下面进行不同字段的详细数据探索过程:
In [16]:
df.columns
Out[16]:
Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
dtype='object')
ID字段对建模无效,直接删除:
In [17]:
df.drop("ID",inplace=True,axis=1)
查看用户的个人信息,比如信用额度、学历、婚姻状态、年龄等字段:
In [18]:
df[['LIMIT_BAL', 'EDUCATION', 'MARRIAGE', 'AGE']].describe()
Out[18]:
LIMIT_BAL | EDUCATION | MARRIAGE | AGE | |
---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 167484.322667 | 1.853133 | 1.551867 | 35.485500 |
std | 129747.661567 | 0.790349 | 0.521970 | 9.217904 |
min | 10000.000000 | 0.000000 | 0.000000 | 21.000000 |
25% | 50000.000000 | 1.000000 | 1.000000 | 28.000000 |
50% | 140000.000000 | 2.000000 | 2.000000 | 34.000000 |
75% | 240000.000000 | 2.000000 | 2.000000 | 41.000000 |
max | 1000000.000000 | 6.000000 | 3.000000 | 79.000000 |
In [19]:
df["EDUCATION"].value_counts().sort_values(ascending=False)
Out[19]:
EDUCATION
2 14030
1 10585
3 4917
5 280
4 123
6 51
0 14
Name: count, dtype: int64
用户的学历中出现最多的是本科生EDUCATION=2
In [20]:
df["MARRIAGE"].value_counts().sort_values(ascending=False)
Out[20]:
MARRIAGE
2 15964
1 13659
3 323
0 54
Name: count, dtype: int64
用户的婚姻状态中,出现最多的是MARRIAGE=2,已婚人群。
LIMIT_BAL的分布
In [21]:
df["LIMIT_BAL"].value_counts().sort_values(ascending=False)
Out[21]:
LIMIT_BAL
50000.0 3365
20000.0 1976
30000.0 1610
80000.0 1567
200000.0 1528
...
800000.0 2
1000000.0 1
327680.0 1
760000.0 1
690000.0 1
Name: count, Length: 81, dtype: int64
可以看到信用额度最为频繁的是50,000
In [22]:
plt.figure(figsize = (14,6))
plt.title('Density Plot of LIMIT_BAL')
sns.set_color_codes("pastel")
sns.distplot(df['LIMIT_BAL'],kde=True,bins=200)
plt.show()
每月之前的对应还款状态:
In [23]:
df[["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]].describe()
Out[23]:
PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | |
---|---|---|---|---|---|---|
count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | -0.016700 | -0.133767 | -0.166200 | -0.220667 | -0.266200 | -0.291100 |
std | 1.123802 | 1.197186 | 1.196868 | 1.169139 | 1.133187 | 1.149988 |
min | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 |
25% | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 |
不同还款状态的对比:
In [24]:
repay = df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'Label']]
repay = pd.melt(repay,
id_vars="Label",
var_name="Payment Status",
value_name="Delay(Month)"
)
repay.head()
Out[24]:
Label | Payment Status | Delay(Month) | |
---|---|---|---|
0 | 1 | PAY_0 | 2 |
1 | 1 | PAY_0 | -1 |
2 | 0 | PAY_0 | 0 |
3 | 0 | PAY_0 | 0 |
4 | 0 | PAY_0 | -1 |
In [25]:
fig = px.box(repay, x="Payment Status", y="Delay(Month)",color="Label")
fig.show()
当月的账单金额
In [26]:
df[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].describe()
Out[26]:
是否违约客户的对比:
In [27]:
df.columns
Out[27]:
Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
dtype='object')
In [28]:
BILL_AMTS = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
plt.figure(figsize=(12,6))
for i, col in enumerate(BILL_AMTS):
plt.subplot(2,3,i+1)
sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red",shade=True)
sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue",shade=True)
plt.xlim(-40000, 200000)
plt.ylabel("")
plt.xlabel(col, fontsize=12)
plt.legend()
plt.tight_layout()
plt.show()
每月之前的对应付款金额
In [29]:
df[['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']].describe()
Out[29]:
PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | |
---|---|---|---|---|---|---|
count | 30000.000000 | 3.000000e+04 | 30000.00000 | 30000.000000 | 30000.000000 | 30000.000000 |
mean | 5663.580500 | 5.921163e+03 | 5225.68150 | 4826.076867 | 4799.387633 | 5215.502567 |
std | 16563.280354 | 2.304087e+04 | 17606.96147 | 15666.159744 | 15278.305679 | 17777.465775 |
min | 0.000000 | 0.000000e+00 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1000.000000 | 8.330000e+02 | 390.00000 | 296.000000 | 252.500000 | 117.750000 |
50% | 2100.000000 | 2.009000e+03 | 1800.00000 | 1500.000000 | 1500.000000 | 1500.000000 |
75% | 5006.000000 | 5.000000e+03 | 4505.00000 | 4013.250000 | 4031.500000 | 4000.000000 |
max | 873552.000000 | 1.684259e+06 | 896040.00000 | 621000.000000 | 426529.000000 | 528666.000000 |
In [30]:
PAY_AMTS = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
plt.figure(figsize=(12,6))
for i, col in enumerate(PAY_AMTS):
plt.subplot(2,3,i+1)
sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red", shade=True)
sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue", shade=True)
plt.xlim(-10000, 70000)
plt.ylabel("")
plt.xlabel(col, fontsize=12)
plt.legend()
plt.tight_layout()
plt.show()
是否发生违约(default.payment.next.month重命名为Label)的人数进行对比:
In [31]:
df["Label"].value_counts()
Out[31]:
Label
0 23364
1 6636
Name: count, dtype: int64
In [32]:
label = df["Label"].value_counts()
df_label = pd.DataFrame(label).reset_index()
df_label
Out[32]:
Label | count | |
---|---|---|
0 | 0 | 23364 |
1 | 1 | 6636 |
In [33]:
# plt.figure(figsize = (6,6))
# plt.title('Default = 0 & Not Default = 1')
# sns.set_color_codes("pastel")
# sns.barplot(x = 'Label', y="count", data=df_label)
# locs, labels = plt.xticks()
# plt.show()
In [34]:
plt.figure(figsize = (5,5))
graph = sns.countplot(x="Label", data=df, palette=["red","blue"])
i = 0
for p in graph.patches:
print(type(p))
h = p.get_height()
percentage = round( 100 * df["Label"].value_counts()[i] / len(df),2)
str_percentage = f"{percentage} %"
graph.text(p.get_x()+p.get_width()/2., h - 100, str_percentage, ha="center")
i += 1
plt.title("class distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")
plt.show()
可以看到二者是很不均衡的。
In [35]:
# value_counts = df['Label'].value_counts()
# # 计算每个值的百分比
# percentages = value_counts / len(df)
# # 使用matplotlib绘制柱状图
# plt.bar(value_counts.index, value_counts.values)
# # 在柱状图上添加百分比标签
# for i, v in enumerate(percentages.values):
# plt.text(i, v + 1, f'{v*100:.2f}%', ha='center',va="bottom")
# # 设置xy轴标签、标题
# plt.title("Class Distribution")
# plt.xticks([0,1], ["Non-Default","Default"])
# plt.xlabel("Default Payment Next Month",fontsize=12)
# plt.ylabel("Number of Clients")
# plt.show()
In [36]:
value_counts = df['Label'].value_counts()
# 计算每个值的百分比
percentages = value_counts / len(df)
# 使用matplotlib绘制柱状图
plt.bar(value_counts.index, value_counts.values)
# 在柱状图上添加百分比标签
for i, v in enumerate(percentages.values):
plt.text(i, v + 1, f'{v*100:.2f}%', ha='center',va="bottom")
# 设置xy轴标签、标题
plt.title("Class Distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")
plt.show()
In [37]:
numeric = ['LIMIT_BAL','AGE','PAY_0','PAY_2',
'PAY_3','PAY_4','PAY_5','PAY_6',
'BILL_AMT1','BILL_AMT2','BILL_AMT3',
'BILL_AMT4','BILL_AMT5','BILL_AMT6'] # 全部数值型字段
numeric
Out[37]:
['LIMIT_BAL',
'AGE',
'PAY_0',
'PAY_2',
'PAY_3',
'PAY_4',
'PAY_5',
'PAY_6',
'BILL_AMT1',
'BILL_AMT2',
'BILL_AMT3',
'BILL_AMT4',
'BILL_AMT5',
'BILL_AMT6']
In [38]:
corr = df[numeric].corr()
corr.head()
Out[38]:
相关系数的热力图绘制:
In [39]:
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(12,10))
sns.heatmap(corr,
mask=mask,
vmin=-1,
vmax=1,
center=0,
square=True,
cbar_kws={'shrink': .5},
annot=True,
annot_kws={'size': 10},
cmap="Blues")
plt.show()
In [40]:
plt.figure(figsize=(12,10))
pair_plot = sns.pairplot(df[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','Label']],
hue='Label',
diag_kind='kde',
corner=True)
pair_plot._legend.remove()
为了检查我们的数据是否为高斯分布,我们使用一种称为分位数-分位数(QQ图)图的图形方法进行定性评估。
在QQ图中,独立变量的分位数与正态分布的预期分位数相对应。如果变量是正态分布的,QQ图中的点应该沿着45度对角线排列。
In [41]:
sns.set_color_codes('pastel') # 设置样式
fig, axs = plt.subplots(5, 3, figsize=(18,18)) # 图像大小和子图设置
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
i, j = 0, 0
for f in numeric:
if j == 3:
j = 0
i = i + 1
stats.probplot(df[f], # 绘图数据:某个字段的全部取值
dist='norm', # 标准化
sparams=(df[f].mean(), df[f].std()),
plot=axs[i,j]) # 子图位置
axs[i,j].get_lines()[0].set_marker('.')
axs[i,j].grid()
axs[i,j].get_lines()[1].set_linewidth(3.0)
j = j+1
fig.tight_layout()
axs[4,2].set_visible(False)
plt.show()
针对分类型数据的处理:
In [42]:
df["EDUCATION"].value_counts()
Out[42]:
EDUCATION
2 14030
1 10585
3 4917
5 280
4 123
6 51
0 14
Name: count, dtype: int64
In [43]:
df["GRAD_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")
df["UNIVERSITY"] = (df["EDUCATION"] == 2).astype("category")
df["HIGH_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")
df.drop("EDUCATION",axis=1,inplace=True)
In [44]:
df['MALE'] = (df['SEX'] == 1).astype('category')
df.drop('SEX', axis=1, inplace=True)
In [45]:
df['MARRIED'] = (df['MARRIAGE'] == 1).astype('category')
df.drop('MARRIAGE', axis=1, inplace=True)
In [46]:
# 划分数据
y = df['Label']
X = df.drop('Label', axis=1, inplace=False)
根据y中的类别比例进行切分:
In [47]:
# 切分数据
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=24, stratify=y)
最值归一化:
In [48]:
mm = MinMaxScaler()
X_train_norm = X_train_raw.copy()
X_test_norm = X_test_raw.copy()
In [49]:
# LIMIT_BAL + AGE
X_train_norm['LIMIT_BAL'] = mm.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_norm['LIMIT_BAL'] = mm.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_norm['AGE'] = mm.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_norm['AGE'] = mm.transform(X_test_raw['AGE'].values.reshape(-1, 1))
In [50]:
pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]
for pay in pay_list:
X_train_norm[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
X_test_norm[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))
In [51]:
for i in range(1,7):
X_train_norm['BILL_AMT' + str(i)] = mm.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_test_norm['BILL_AMT' + str(i)] = mm.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_train_norm['PAY_AMT' + str(i)] = mm.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
X_test_norm['PAY_AMT' + str(i)] = mm.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
标准化过程:
In [52]:
ss = StandardScaler()
X_train_std = X_train_raw.copy()
X_test_std = X_test_raw.copy()
X_train_std['LIMIT_BAL'] = ss.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_std['LIMIT_BAL'] = ss.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_std['AGE'] = ss.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_std['AGE'] = ss.transform(X_test_raw['AGE'].values.reshape(-1, 1))
In [53]:
pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]
for pay in pay_list:
X_train_std[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
X_test_std[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))
In [54]:
for i in range(1,7):
X_train_std['BILL_AMT' + str(i)] = ss.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_test_std['BILL_AMT' + str(i)] = ss.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
X_train_std['PAY_AMT' + str(i)] = ss.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
X_test_std['PAY_AMT' + str(i)] = ss.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
绘制经过编码后的数据分布:
In [55]:
sns.set_color_codes('deep')
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
fig, axs = plt.subplots(1, 2, figsize=(24,6))
sns.boxplot(data=X_train_norm[numeric], ax=axs[0])
axs[0].set_title('Boxplot of normalized numeric features')
axs[0].set_xticklabels(labels=numeric, rotation=25)
axs[0].set_xlabel(' ')
sns.boxplot(data=X_train_std[numeric], ax=axs[1])
axs[1].set_title('Boxplot of standardized numeric features')
axs[1].set_xticklabels(labels=numeric, rotation=25)
axs[1].set_xlabel(' ')
fig.tight_layout()
plt.show()
In [56]:
pc = len(X_train_norm.columns.values) # 25
pca = PCA(n_components=pc) # 指定主成分个数
pca.fit(X_train_norm)
sns.reset_orig()
sns.set_color_codes('pastel') # 设置绘图颜色
plt.figure(figsize = (8,4)) # 图的大小
plt.grid() # 网格设置
plt.title('Explained Variance of Principal Components') # 标题设置
plt.plot(pca.explained_variance_ratio_, marker='o') # 绘制单个主成分的方差解释比例
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o') # 累计解释方差
plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"]) # 图例设置
plt.xlabel('Principal Component Indexes') # x-y轴标题
plt.ylabel('Explained Variance Ratio')
plt.tight_layout() # 调整布局,更紧凑
plt.axvline(12, 0, ls='--') # 设置虚线x=12
plt.show() # 显示图像
代码的各部分含义如下:
pc = len(X_train_norm.columns.values) # 25
:计算训练集的特征数量,这里的结果是25。pca = PCA(n_components=pc) # 指定主成分个数
:创建一个PCA对象,指定主成分的数量为pc
,即25。pca.fit(X_train_norm)
:对训练集X_train_norm
进行PCA拟合。sns.reset_orig()
和sns.set_color_codes('pastel')
:这两行代码是使用seaborn库来设置绘图的颜色。reset_orig()
会重置颜色到默认设置,set_color_codes('pastel')
会将颜色设置为柔和色调。plt.figure(figsize = (8,4))
:创建一个新的图形,设置其大小为8x4。plt.grid()
:在图形上显示网格。plt.title('Explained Variance of Principal Components')
:设置图形的标题为“主成分的方差解释”。plt.plot(pca.explained_variance_ratio_, marker='o')
:绘制单个主成分的方差解释比例。plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
:绘制累积方差解释比例。plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"])
:为图形添加图例,分别表示单个主成分的方差解释和累积方差解释。plt.xlabel('Principal Component Indexes')
:设置x轴的标签为“主成分索引”。plt.ylabel('Explained Variance Ratio')
:设置y轴的标签为“方差解释比例”。plt.tight_layout()
:自动调整图形布局,使其看起来紧凑。plt.axvline(12, 0, ls='--')
:在x=12的位置画一条从y=0到y=1的虚线。这可能是为了标示某个特定的主成分。plt.show()
:显示图形。根据PCA的定义,主成分的顺序是不重要的,它们只按照其方差大小进行排序。
In [57]:
cumsum = np.cumsum(pca.explained_variance_ratio_) # 计算累计解释性方差
cumsum
Out[57]:
array([0.44924877, 0.6321187 , 0.8046163 , 0.87590932, 0.92253799,
0.95438576, 0.96762706, 0.97773098, 0.9842774 , 0.98824928,
0.99088299, 0.99280785, 0.99444757, 0.99576128, 0.99690533,
0.99781622, 0.99844676, 0.99890236, 0.99924315, 0.99955744,
0.9997182 , 0.99983861, 0.99992993, 1. , 1. ])
In [58]:
indexes = ['PC' + str(i) for i in range(1, pc+1)]
cumsum_df = pd.DataFrame(data=cumsum, index=indexes, columns=['var1'])
cumsum_df.head()
Out[58]:
var1 | |
---|---|
PC1 | 0.449249 |
PC2 | 0.632119 |
PC3 | 0.804616 |
PC4 | 0.875909 |
PC5 | 0.922538 |
In [59]:
# 保留4位小数
cumsum_df['var2'] = pd.Series([round(val, 4) for val in cumsum_df['var1']],
index = cumsum_df.index)
# 转成百分比
cumsum_df['Cumulative Explained Variance'] = pd.Series(["{0:.2f}%".format(val * 100) for val in cumsum_df['var2']],
index = cumsum_df.index)
cumsum_df.head()
Out[59]:
In [60]:
cumsum_df = cumsum_df.drop(['var1','var2'], axis=1, inplace=False)
cumsum_df.T.iloc[:,:15]
In [61]:
pc = 12
pca = PCA(n_components=pc)
pca.fit(X_train_norm)
X_train = pd.DataFrame(pca.transform(X_train_norm))
X_test = pd.DataFrame(pca.transform(X_test_norm))
# 列名设置
X_train.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_test.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_train.head()
Out[61]:
基于 k-fold cross-validation的交叉验证:将数据分为k折,前面k-1用于训练,剩下1折用于验证。
1、混淆矩阵
?Predicted?Negative? ?Predicted?Positive? ?Actual?Negative? ?TN? ?FP? ?Actual?Positive? ?FN? ?TP? \begin{array}{ccc} & \text { Predicted Negative } & \text { Predicted Positive } \\ \hline \text { Actual Negative } & \text { TN } & \text { FP } \\ \text { Actual Positive } & \text { FN } & \text { TP } \end{array} ?Actual?Negative??Actual?Positive???Predicted?Negative??TN??FN???Predicted?Positive??FP??TP???
2、准确率
?Accuracy? = T P + T N T P + F P + T N + F N \text { Accuracy }=\frac{T P+T N}{T P+F P+T N+F N} ?Accuracy?=TP+FP+TN+FNTP+TN?
3、精确率
?Precision,? p = T P T P + F P \text { Precision, } p=\frac{T P}{T P+F P} ?Precision,?p=TP+FPTP?
4、召回率
?Recall,? r = T P T P + F N \text { Recall, } r=\frac{T P}{T P+F N} ?Recall,?r=TP+FNTP?
5、F1_score
F 1 s c o r e = 2 1 r + 1 p = 2 r p r + p { F1_{score} }=\frac{2}{\frac{1}{r}+\frac{1}{p}}=\frac{2 r p}{r+p} F1score?=r1?+p1?2?=r+p2rp?
In [62]:
# 模型训练
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
[LightGBM] [Info] Number of positive: 4977, number of negative: 17523
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000619 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3060
[LightGBM] [Info] Number of data points in the train set: 22500, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.221200 -> initscore=-1.258687
[LightGBM] [Info] Start training from score -1.258687
Out[62]:
LGBMClassifier
LGBMClassifier()
In [63]:
# 模型预测
y_pred = lgb_clf.predict(X_test)
y_pred
Out[63]:
array([1, 0, 0, ..., 0, 0, 0], dtype=int64)
基于baseline的准确率acc:
In [64]:
acc = accuracy_score(y_test, y_pred)
print("模型的准确率:",acc)
模型的准确率: 0.8130666666666667
模型的分类报告:
In [65]:
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.84 0.94 0.89 5841
1 0.64 0.36 0.46 1659
accuracy 0.81 7500
macro avg 0.74 0.65 0.67 7500
weighted avg 0.79 0.81 0.79 7500
模型的混淆矩阵:
In [66]:
# 计算混淆矩阵
cm = confusion_matrix(y_test, y_pred)
# 将混淆矩阵转换为DataFrame
cm_df = pd.DataFrame(cm, index=['Non-Defaulters', 'Defaulters'], columns=['Non-Defaulters', 'Defaulters'])
# 使用seaborn绘制混淆矩阵热力图
plt.figure(figsize=(8, 5))
sns.heatmap(cm_df, annot=True, cmap='Blues', fmt='d')
plt.title('Confusion Metrics')
plt.xlabel('Predicted value')
plt.ylabel('True Value')
plt.show()