Because anti-money-laundering (AML) data comes from inside banks, little public data is available for research, which makes AML research difficult. AML transaction data is also large, so a suitable environment and tooling are needed. The dataset and model in this case study are a good example for learning and reference. Although the data here is only 40 GB, that already approaches the real-world scale of 100 GB or even 400 GB, and the case includes detecting money-laundering nodes via GNN node classification.
https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml
Dataset size: 40 GB
Money laundering is a multibillion-dollar problem, and detecting it is very hard. Most automated algorithms suffer from high false-positive rates, flagging legitimate transactions as laundering. The opposite case is also a major problem: false negatives, i.e., laundering transactions that go undetected. Naturally, criminals work hard to cover their tracks.
Access to real financial transaction data is tightly restricted for proprietary and privacy reasons. Even with access, assigning a correct label (laundering or legitimate) to every transaction is itself a problem, as noted above. The IBM synthetic transaction data provided here avoids these issues.
The data is based on a virtual world of individuals, companies, and banks. Individuals interact with other individuals and with companies; likewise, companies interact with other companies and with individuals. These interactions take many forms: purchases of consumer goods and services, purchase orders for industrial supplies, salary payments, loan repayments, and so on. The financial transactions generally go through banks, i.e., both payer and payee have accounts, which come in many forms, from checking accounts to credit cards to Bitcoin.
A (small) fraction of the individuals and companies in the generative model engage in criminal behavior such as smuggling, illegal gambling, and extortion. Criminals obtain funds from these illicit activities and then try to hide the origin of the money through a series of financial transactions. The transactions used to hide illicit funds constitute money laundering. The data provided here is therefore labeled and can be used to train and test AML models, among other purposes.
The data generator not only models illicit activity but can also trace the proceeds through arbitrarily many transactions, so laundering transactions can be labeled even when they are several steps removed from their illicit source. On this basis, the generator can readily label each individual transaction as laundering or legitimate.
Note that this IBM generator models the entire money-laundering cycle.
A further advantage unique to synthetic data: a real bank or other institution typically sees only the subset of laundering transactions that involve that bank; transactions between other banks are invisible to it. A model built on one institution's real transactions therefore has only a limited view of the world.
By contrast, these synthetic transactions span the entire financial ecosystem. It may therefore be possible to build laundering-detection models that are aware of the broad range of cross-institution transactions, while applying them only to a specific bank's transactions at inference time.
                                      SMALL               MEDIUM               LARGE
                                    HI       LI         HI       LI          HI       LI
Date Range, HI + LI (2022)          Sep 1-10            Sep 1-16             Aug 1 - Nov 5
# of Days Spanned                   10       10         16       16          97       97
# of Bank Accounts                  515K     705K       2077K    2028K       2116K    2064K
# of Transactions                   5M       7M         32M      31M         180M     176M
# of Laundering Transactions        3.6K     4.0K       35K      16K         223K     100K
Laundering Rate (1 per N Trans)     981      1942       905      1948        807      1750
Note that the date ranges given are the "primary" periods of transaction activity. In the Discussion, Marco Pianta observed that some transactions occur after the stated date range, and that all of these are laundering transactions. See the response to Marco for a more detailed description of this situation and how it is handled. We thank Marco for raising the issue.
Finally, we provide two files for each of the six datasets:
A. a list of transactions in CSV format;
B. a text file listing the laundering transactions that follow one of 8 specific patterns introduced by Suzumura and Kanezashi in their AMLSim simulator.
We note that not all laundering in the data follows one of these 8 patterns. As with the other aspects of the data described above, knowing every transaction involved in a given laundering pattern is a major challenge.
Here is a list of the 12 files provided:
1a. HI-Small_Trans.csv: transactions
1b. HI-Small_Patterns.txt: laundering-pattern transactions
2a. HI-Medium_Trans.csv: transactions
2b. HI-Medium_Patterns.txt: laundering-pattern transactions
3a. HI-Large_Trans.csv: transactions
3b. HI-Large_Patterns.txt: laundering-pattern transactions
4a. LI-Small_Trans.csv: transactions
4b. LI-Small_Patterns.txt: laundering-pattern transactions
5a. LI-Medium_Trans.csv: transactions
5b. LI-Medium_Patterns.txt: laundering-pattern transactions
6a. LI-Large_Trans.csv: transactions
6b. LI-Large_Patterns.txt: laundering-pattern transactions
BEGIN LAUNDERING ATTEMPT - STACK
2022/08/09 05:14,00952,8139F54E0,0111632,8062C56E0,5331.44,US Dollar,5331.44,US Dollar,ACH,1
2022/08/13 13:09,0111632,8062C56E0,008456,81363F620,5602.59,US Dollar,5602.59,US Dollar,ACH,1
2022/08/15 07:40,0118693,823D5EB90,013729,801CF2E60,1400.54,US Dollar,1400.54,US Dollar,ACH,1
2022/08/15 14:19,013729,801CF2E60,0123621,81A7090F0,1467.94,US Dollar,1467.94,US Dollar,ACH,1
2022/08/13 12:40,0024750,81363F410,0213834,808757B00,16898.29,US Dollar,16898.29,US Dollar,ACH,1
2022/08/22 06:34,0213834,808757B00,000,800073EF0,17607.19,US Dollar,17607.19,US Dollar,ACH,1
END LAUNDERING ATTEMPT - STACK
BEGIN LAUNDERING ATTEMPT - CYCLE: Max 12 hops
2022/08/01 00:19,0134266,814167590,0036925,810E343A0,132713.46,Yuan,132713.46,Yuan,ACH,1
2022/08/01 13:05,0036925,810E343A0,0119211,814AB4F60,18264.20,US Dollar,18264.20,US Dollar,ACH,1
2022/08/03 13:28,0119211,814AB4F60,0132965,81B88A230,14567.69,Euro,14567.69,Euro,ACH,1
2022/08/09 02:32,0132965,81B88A230,0137089,810C71940,114329.26,Yuan,114329.26,Yuan,ACH,1
2022/08/11 07:16,0137089,810C71940,0216618,81D5302D0,14567.69,Euro,14567.69,Euro,ACH,1
2022/08/13 05:09,0216618,81D5302D0,0024083,81836B520,13629.75,Euro,13629.75,Euro,ACH,1
2022/08/15 18:04,0024083,81836B520,0038110,81B868730,97481.96,Yuan,97481.96,Yuan,ACH,1
2022/08/20 08:57,0038110,81B868730,0225015,81C6EA460,14054.71,US Dollar,14054.71,US Dollar,ACH,1
2022/08/22 12:08,0225015,81C6EA460,018112,8045CC910,13718.22,US Dollar,13718.22,US Dollar,ACH,1
2022/08/22 19:53,018112,8045CC910,007818,8037732C0,12908.33,US Dollar,12908.33,US Dollar,ACH,1
2022/08/27 07:10,007818,8037732C0,0121523,80D1BD2F0,10636.75,Euro,10636.75,Euro,ACH,1
2022/08/30 11:54,0121523,80D1BD2F0,0134266,814167590,1378736.88,Yen,1378736.88,Yen,ACH,1
END LAUNDERING ATTEMPT - CYCLE
This notebook implements the dataset class and GNN model training with the PyG library. In this example we use HI-Small_Trans.csv as our training and test data.
For more details, see https://github.com/issacchan26/AntiMoneyLaunderingDetectionWithGNN
pip install torch_geometric
Requirement already satisfied: torch_geometric in /opt/conda/lib/python3.10/site-packages (2.3.1)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (4.66.1)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (1.23.5)
Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (1.11.2)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (3.1.2)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (2.31.0)
Requirement already satisfied: pyparsing in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (3.0.9)
Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (1.2.2)
Requirement already satisfied: psutil>=5.8.0 in /opt/conda/lib/python3.10/site-packages (from torch_geometric) (5.9.3)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch_geometric) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->torch_geometric) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->torch_geometric) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests->torch_geometric) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->torch_geometric) (2023.7.22)
Requirement already satisfied: joblib>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from scikit-learn->torch_geometric) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn->torch_geometric) (3.1.0)
Note: you may need to restart the kernel to use updated packages.
# Import the required libraries
import datetime                         # date and time handling
import os                               # file and directory operations
from typing import Callable, Optional  # type hints
import pandas as pd                     # data processing and analysis
from sklearn import preprocessing      # data preprocessing
import numpy as np                      # numerical computing
import torch                            # building neural networks
from torch_geometric.data import (      # graph data handling
    Data,
    InMemoryDataset
)

# Make pandas display all columns
pd.set_option('display.max_columns', None)

# Path to the data file
path = '/kaggle/input/ibm-transactions-for-anti-money-laundering-aml/HI-Small_Trans.csv'
# Read the data file into a DataFrame
df = pd.read_csv(path)
Let's take a look at the dataset.
# Print the first few rows of the DataFrame
print(df.head())
Timestamp From Bank Account To Bank Account.1 \
0 2022/09/01 00:20 10 8000EBD30 10 8000EBD30
1 2022/09/01 00:20 3208 8000F4580 1 8000F5340
2 2022/09/01 00:00 3209 8000F4670 3209 8000F4670
3 2022/09/01 00:02 12 8000F5030 12 8000F5030
4 2022/09/01 00:06 10 8000F5200 10 8000F5200
Amount Received Receiving Currency Amount Paid Payment Currency \
0 3697.34 US Dollar 3697.34 US Dollar
1 0.01 US Dollar 0.01 US Dollar
2 14675.57 US Dollar 14675.57 US Dollar
3 2806.97 US Dollar 2806.97 US Dollar
4 36682.97 US Dollar 36682.97 US Dollar
Payment Format Is Laundering
0 Reinvestment 0
1 Cheque 0
2 Reinvestment 0
3 Reinvestment 0
4 Reinvestment 0
After inspecting the DataFrame, we suggest extracting every payer and payee account from all transactions so that suspicious accounts can be identified. We can cast the whole dataset as a node-classification problem, treating accounts as nodes and transactions as edges. The `object` columns should be encoded into categories with sklearn's `LabelEncoder`.
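As a quick illustration (toy values, not from the dataset), `LabelEncoder` maps each distinct string to an integer code, with the classes ordered alphabetically:

```python
from sklearn import preprocessing

# Toy example: encode a small object column with LabelEncoder.
# Classes are sorted alphabetically: ACH -> 0, Cheque -> 1, Wire -> 2.
le = preprocessing.LabelEncoder()
codes = le.fit_transform(['Cheque', 'ACH', 'Cheque', 'Wire'])
print(list(codes))        # [1, 0, 1, 2]
print(list(le.classes_))  # ['ACH', 'Cheque', 'Wire']
```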
# Print the dtype of every column in df
print(df.dtypes)
Timestamp object
From Bank int64
Account object
To Bank int64
Account.1 object
Amount Received float64
Receiving Currency object
Amount Paid float64
Payment Currency object
Payment Format object
Is Laundering int64
dtype: object
Check whether there are any null values.
# Print the number of missing values per column
print(df.isnull().sum())
Timestamp 0
From Bank 0
Account 0
To Bank 0
Account.1 0
Amount Received 0
Receiving Currency 0
Amount Paid 0
Payment Currency 0
Payment Format 0
Is Laundering 0
dtype: int64
Two columns record the amount paid and the amount received for each transaction. If their values are always identical, is it really necessary to keep both, unless there are transaction fees or cross-currency transactions? Let's find out.
# Check whether the two amount columns are identical
print('Amount Received equals to Amount Paid:')
print(df['Amount Received'].equals(df['Amount Paid']))
# Check whether the two currency columns are identical
print('Receiving Currency equals to Payment Currency:')
print(df['Receiving Currency'].equals(df['Payment Currency']))
Amount Received equals to Amount Paid:
False
Receiving Currency equals to Payment Currency:
False
It appears there are cross-currency transactions; let's print them out.
# Rows where the amounts differ
not_equal1 = df.loc[~(df['Amount Received'] == df['Amount Paid'])]
# Rows where the currencies differ
not_equal2 = df.loc[~(df['Receiving Currency'] == df['Payment Currency'])]
# Print the mismatching rows
print(not_equal1)
print('---------------------------------------------------------------------------')
print(not_equal2)
Timestamp From Bank Account To Bank Account.1 \
1173 2022/09/01 00:22 1362 80030A870 1362 80030A870
7156 2022/09/01 00:28 11318 800C51010 11318 800C51010
7925 2022/09/01 00:12 795 800D98770 795 800D98770
8467 2022/09/01 00:01 1047 800E92CF0 1047 800E92CF0
11529 2022/09/01 00:22 11157 80135FFC0 11157 80135FFC0
... ... ... ... ... ...
5078167 2022/09/10 23:30 23537 803949A90 23537 803949A90
5078234 2022/09/10 23:59 16163 803638A90 16163 803638A90
5078236 2022/09/10 23:55 16163 803638A90 16163 803638A90
5078316 2022/09/10 23:44 215064 808F06E11 215064 808F06E10
5078318 2022/09/10 23:45 215064 808F06E11 215064 808F06E10
Amount Received Receiving Currency Amount Paid Payment Currency \
1173 52.110000 Euro 61.06 US Dollar
7156 76.060000 Euro 89.12 US Dollar
7925 17.690000 Australian Dollar 12.52 US Dollar
8467 19.430000 Euro 22.77 US Dollar
11529 98.340000 Euro 115.24 US Dollar
... ... ... ... ...
5078167 26421.500000 Shekel 7823.96 US Dollar
5078234 47517.490000 Saudi Riyal 12667.62 US Dollar
5078236 11329.850000 Saudi Riyal 3020.41 US Dollar
5078316 0.000006 Bitcoin 0.07 US Dollar
5078318 0.000004 Bitcoin 0.05 US Dollar
Payment Format Is Laundering
1173 ACH 0
7156 ACH 0
7925 ACH 0
8467 ACH 0
11529 ACH 0
... ... ...
5078167 ACH 0
5078234 ACH 0
5078236 ACH 0
5078316 ACH 0
5078318 Wire 0
[72158 rows x 11 columns]
---------------------------------------------------------------------------
Timestamp From Bank Account To Bank Account.1 \
1173 2022/09/01 00:22 1362 80030A870 1362 80030A870
7156 2022/09/01 00:28 11318 800C51010 11318 800C51010
7925 2022/09/01 00:12 795 800D98770 795 800D98770
8467 2022/09/01 00:01 1047 800E92CF0 1047 800E92CF0
11529 2022/09/01 00:22 11157 80135FFC0 11157 80135FFC0
... ... ... ... ... ...
5078167 2022/09/10 23:30 23537 803949A90 23537 803949A90
5078234 2022/09/10 23:59 16163 803638A90 16163 803638A90
5078236 2022/09/10 23:55 16163 803638A90 16163 803638A90
5078316 2022/09/10 23:44 215064 808F06E11 215064 808F06E10
5078318 2022/09/10 23:45 215064 808F06E11 215064 808F06E10
Amount Received Receiving Currency Amount Paid Payment Currency \
1173 52.110000 Euro 61.06 US Dollar
7156 76.060000 Euro 89.12 US Dollar
7925 17.690000 Australian Dollar 12.52 US Dollar
8467 19.430000 Euro 22.77 US Dollar
11529 98.340000 Euro 115.24 US Dollar
... ... ... ... ...
5078167 26421.500000 Shekel 7823.96 US Dollar
5078234 47517.490000 Saudi Riyal 12667.62 US Dollar
5078236 11329.850000 Saudi Riyal 3020.41 US Dollar
5078316 0.000006 Bitcoin 0.07 US Dollar
5078318 0.000004 Bitcoin 0.05 US Dollar
Payment Format Is Laundering
1173 ACH 0
7156 ACH 0
7925 ACH 0
8467 ACH 0
11529 ACH 0
... ... ...
5078167 ACH 0
5078234 ACH 0
5078236 ACH 0
5078316 ACH 0
5078318 Wire 0
[72170 rows x 11 columns]
The sizes of the two DataFrames show that there are indeed transaction fees and cross-currency transactions, so we cannot merge or drop the amount columns.
Since we are going to encode these columns, we must make sure the categories of matching attributes are aligned.
Let's check whether the lists of receiving and payment currencies are identical.
# Print the unique values of df['Receiving Currency'], sorted alphabetically
print(sorted(df['Receiving Currency'].unique()))
# Print the unique values of df['Payment Currency'], sorted alphabetically
print(sorted(df['Payment Currency'].unique()))
['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']
['Australian Dollar', 'Bitcoin', 'Brazil Real', 'Canadian Dollar', 'Euro', 'Mexican Peso', 'Ruble', 'Rupee', 'Saudi Riyal', 'Shekel', 'Swiss Franc', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']
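The two lists match, so fitting a `LabelEncoder` on each column separately still assigns the same code to the same currency name. A toy sketch (illustrative values only, not the full 15-currency list):

```python
from sklearn import preprocessing

# When both columns share the same category set, per-column fits agree:
# the sorted classes are ['Euro', 'US Dollar', 'Yen'] in both cases.
le = preprocessing.LabelEncoder()
recv = le.fit_transform(['Euro', 'US Dollar', 'Yen'])
paid = le.fit_transform(['US Dollar', 'Euro', 'Yen'])
print(list(recv))  # [0, 1, 2]
print(list(paid))  # [1, 0, 2] -- 'US Dollar' encodes to 1 in both columns
```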
**First, we present the functions used in the PyG dataset; the dataset class and model training are provided in the bottom section.**
In data preprocessing we perform the following transformations:
# df_label_encoder: label-encode the given columns
# Args:
#   df: DataFrame to process
#   columns: list of column names to label-encode
# Returns:
#   the DataFrame with the columns label-encoded
def df_label_encoder(df, columns):
    # Create a LabelEncoder
    le = preprocessing.LabelEncoder()
    # Encode each requested column
    for i in columns:
        # Cast values to str, then label-encode
        df[i] = le.fit_transform(df[i].astype(str))
    return df
# preprocess: preprocess the raw transaction DataFrame
# Args:
#   df: DataFrame to process
# Returns:
#   df: the preprocessed DataFrame
#   receiving_df: receiving-side columns
#   paying_df: paying-side columns
#   currency_ls: sorted list of (encoded) receiving currencies
def preprocess(df):
    # Label-encode the categorical columns
    df = df_label_encoder(df, ['Payment Format', 'Payment Currency', 'Receiving Currency'])
    # Parse 'Timestamp' into datetimes
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    # Convert to integer timestamps and min-max normalize into [0, 1]
    df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
    df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())
    # Prefix account IDs with the bank code to form unique account names
    df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
    df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
    # Sort by payer account
    df = df.sort_values(by=['Account'])
    # Split out the receiving- and paying-side views
    receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
    paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
    # Rename 'Account.1' to 'Account' in the receiving view
    receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
    # Sorted list of unique receiving currencies
    currency_ls = sorted(df['Receiving Currency'].unique())
    return df, receiving_df, paying_df, currency_ls
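To see what the timestamp step in `preprocess` does, here is a minimal sketch on three made-up timestamps: each is converted to its nanosecond integer representation and then min-max scaled into [0, 1]:

```python
import pandas as pd

# Toy frame with invented timestamps spanning 9 days
toy = pd.DataFrame({'Timestamp': ['2022/09/01 00:00',
                                  '2022/09/05 12:00',
                                  '2022/09/10 00:00']})
toy['Timestamp'] = pd.to_datetime(toy['Timestamp'])
# Nanosecond integer representation
toy['Timestamp'] = toy['Timestamp'].apply(lambda x: x.value)
# Min-max normalization into [0, 1]
toy['Timestamp'] = ((toy['Timestamp'] - toy['Timestamp'].min())
                    / (toy['Timestamp'].max() - toy['Timestamp'].min()))
print(toy['Timestamp'].tolist())  # [0.0, 0.5, 1.0]
```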
Let's look at the processed df.
# Run preprocessing and collect the processed DataFrame, the receiving
# and paying views, and the currency list
df, receiving_df, paying_df, currency_ls = preprocess(df)
# Print the first few rows of the processed DataFrame
print(df.head())
Timestamp From Bank Account To Bank Account.1 \
4278714 0.456320 10057 10057_803A115E0 29467 29467_803E020C0
2798190 0.285018 10057 10057_803A115E0 29467 29467_803E020C0
2798191 0.284233 10057 10057_803A115E0 29467 29467_803E020C0
3918769 0.417079 10057 10057_803A115E0 29467 29467_803E020C0
213094 0.000746 10057 10057_803A115E0 10057 10057_803A115E0
Amount Received Receiving Currency Amount Paid Payment Currency \
4278714 787197.11 13 787197.11 13
2798190 787197.11 13 787197.11 13
2798191 681262.19 13 681262.19 13
3918769 681262.19 13 681262.19 13
213094 146954.27 13 146954.27 13
Payment Format Is Laundering
4278714 3 0
2798190 3 0
2798191 4 0
3918769 4 0
213094 5 0
The receiving and paying DataFrames:
# Print the first few rows of receiving_df
print(receiving_df.head())
# Print the first few rows of paying_df
print(paying_df.head())
Account Amount Received Receiving Currency
4278714 29467_803E020C0 787197.11 13
2798190 29467_803E020C0 787197.11 13
2798191 29467_803E020C0 681262.19 13
3918769 29467_803E020C0 681262.19 13
213094 10057_803A115E0 146954.27 13
Account Amount Paid Payment Currency
4278714 10057_803A115E0 787197.11 13
2798190 10057_803A115E0 787197.11 13
2798191 10057_803A115E0 681262.19 13
3918769 10057_803A115E0 681262.19 13
213094 10057_803A115E0 146954.27 13
The currency list:
# Print the currency list
print(currency_ls)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
We want to extract all unique accounts, from both payers and payees, as the nodes of our graph. Each node carries the unique account ID, the bank code, and the 'Is Laundering' label.
In this section, both the payer and the payee of an illicit transaction are treated as suspicious accounts, and we set 'Is Laundering' to 1 for both.
# get_all_account: collect every unique account with its bank and label
def get_all_account(df):
    # Payer and payee account/bank columns
    ldf = df[['Account', 'From Bank']]
    rdf = df[['Account.1', 'To Bank']]
    # Rows flagged as laundering
    suspicious = df[df['Is Laundering']==1]
    # Collect both sides of each laundering transaction
    s1 = suspicious[['Account', 'Is Laundering']]
    s2 = suspicious[['Account.1', 'Is Laundering']]
    s2 = s2.rename({'Account.1': 'Account'}, axis=1)
    suspicious = pd.concat([s1, s2], join='outer')
    suspicious = suspicious.drop_duplicates()
    # Merge payer and payee accounts into one deduplicated table
    ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
    rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
    df = pd.concat([ldf, rdf], join='outer')
    df = df.drop_duplicates()
    # Default every account to 0, then overwrite suspicious accounts with 1
    df['Is Laundering'] = 0
    df.set_index('Account', inplace=True)
    df.update(suspicious.set_index('Account'))
    df = df.reset_index()
    return df
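The `set_index`/`update` combination at the end of `get_all_account` is the labeling trick: every account starts at 0, and only accounts that appear in a laundering transaction are overwritten with 1, aligned by the Account index. A minimal sketch with made-up accounts:

```python
import pandas as pd

# All accounts default to label 0
accounts = pd.DataFrame({'Account': ['A', 'B', 'C'], 'Bank': [1, 2, 3]})
accounts['Is Laundering'] = 0

# Accounts seen in laundering transactions
suspicious = pd.DataFrame({'Account': ['B'], 'Is Laundering': [1]})

# Overwrite labels by aligning on the Account index
# (update may upcast the column to float; the pipeline later casts to
# a float tensor anyway)
accounts.set_index('Account', inplace=True)
accounts.update(suspicious.set_index('Account'))
accounts = accounts.reset_index()
print(accounts['Is Laundering'].tolist())  # only 'B' is flagged
```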
Take a look at the account list:
# Build the account table
accounts = get_all_account(df)
# Print the first few rows of accounts
print(accounts.head())
Account Bank Is Laundering
0 10057_803A115E0 10057 0
1 10057_803AA8E90 10057 0
2 10057_803AAB430 10057 0
3 10057_803AACE20 10057 0
4 10057_803AB4F70 10057 0
For the node features, we aggregate the mean paid and received amount per currency as new features for each node.
# paid_currency_aggregate: mean amount paid per currency for each account
# Args:
#   currency_ls: list of (encoded) currencies
#   paying_df: paying-side DataFrame
#   accounts: account table to extend
def paid_currency_aggregate(currency_ls, paying_df, accounts):
    for i in currency_ls:
        # Transactions paid in currency i
        temp = paying_df[paying_df['Payment Currency'] == i]
        # Per-account mean paid amount, stored as a new column
        accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
    return accounts

# received_currency_aggregate: mean amount received per currency for each account
# Args:
#   currency_ls: list of (encoded) currencies
#   receiving_df: receiving-side DataFrame
#   accounts: account table to extend
def received_currency_aggregate(currency_ls, receiving_df, accounts):
    for i in currency_ls:
        # Transactions received in currency i
        temp = receiving_df[receiving_df['Receiving Currency'] == i]
        # Per-account mean received amount, stored as a new column
        accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
    # Fill missing values with 0
    accounts = accounts.fillna(0)
    return accounts
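Note the use of `transform('mean')`: unlike `agg`, it returns one value per row rather than per group, so the per-account mean aligns with the original index and can be assigned straight into a column. A toy example with invented amounts:

```python
import pandas as pd

paying = pd.DataFrame({'Account': ['A', 'A', 'B'],
                       'Amount Paid': [10.0, 30.0, 5.0]})
# transform('mean') broadcasts each group's mean back to its rows
paying['avg paid'] = paying['Amount Paid'].groupby(paying['Account']).transform('mean')
print(paying['avg paid'].tolist())  # [20.0, 20.0, 5.0]
```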
Now we can define the node attributes from the bank code and the mean paid/received amounts per currency.
# get_node_attr: build the node feature matrix and node labels
def get_node_attr(currency_ls, paying_df, receiving_df, accounts):
    # Aggregate mean paid amounts per currency
    node_df = paid_currency_aggregate(currency_ls, paying_df, accounts)
    # Aggregate mean received amounts per currency
    node_df = received_currency_aggregate(currency_ls, receiving_df, node_df)
    # Node labels as a float tensor
    node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
    # Drop the columns that are not features
    node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
    # Label-encode the 'Bank' column
    node_df = df_label_encoder(node_df, ['Bank'])
    # node_df = torch.from_numpy(node_df.values).to(torch.float)  # kept as a DataFrame here for inspection
    return node_df, node_label
Take a look at node_df:
# Build the node features and labels from the currency list, the paying
# and receiving views, and the account table
node_df, node_label = get_node_attr(currency_ls, paying_df, receiving_df, accounts)
# Print the first five rows of node_df
print(node_df.head())
Bank avg paid 0 avg paid 1 avg paid 2 avg paid 3 avg paid 4 \
0 2 0.0 0.0 0.0 0.0 0.0
1 2 0.0 0.0 0.0 0.0 0.0
2 2 0.0 0.0 0.0 0.0 0.0
3 2 0.0 0.0 0.0 0.0 0.0
4 2 0.0 0.0 0.0 0.0 0.0
avg paid 5 avg paid 6 avg paid 7 avg paid 8 avg paid 9 avg paid 10 \
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
avg paid 11 avg paid 12 avg paid 13 avg paid 14 avg received 0 \
0 0.0 1922.000000 0.0 0.0 0.0
1 0.0 480.223333 0.0 0.0 0.0
2 0.0 14675.570000 0.0 0.0 0.0
3 0.0 37340.843333 0.0 0.0 0.0
4 0.0 49649.409677 0.0 0.0 0.0
avg received 1 avg received 2 avg received 3 avg received 4 \
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
avg received 5 avg received 6 avg received 7 avg received 8 \
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
avg received 9 avg received 10 avg received 11 avg received 12 \
0 0.0 0.0 0.0 330.166429
1 0.0 0.0 0.0 119.992000
2 0.0 0.0 0.0 14675.570000
3 0.0 0.0 0.0 756.486190
4 0.0 0.0 0.0 3120.573333
avg received 13 avg received 14
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
For the edge features, we treat each transaction as an edge.
For the edge index, we replace every account with its integer index and stack them into a tensor of size [2, number of transactions].
For the edge attributes, we use 'Timestamp', 'Amount Received', 'Receiving Currency', 'Amount Paid', 'Payment Currency', and 'Payment Format'.
def get_edge_df(accounts, df):
    # Reset the index and use it as an integer node ID
    accounts = accounts.reset_index(drop=True)
    accounts['ID'] = accounts.index
    # Map each account name to its node ID
    mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
    # Replace payer and payee accounts with node IDs
    df['From'] = df['Account'].map(mapping_dict)
    df['To'] = df['Account.1'].map(mapping_dict)
    df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)
    # Stack source and target IDs into a [2, num_edges] tensor
    edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)
    # The remaining columns are the edge attributes
    df = df.drop(['Is Laundering', 'From', 'To'], axis=1)
    edge_attr = df  # kept as a DataFrame for visualization
    return edge_attr, edge_index
Edge attributes are the information carried by each edge of the graph, which can take the form of numbers, text, vectors, and so on. In a graph neural network they typically describe quantities such as edge weight, distance, or similarity, so that the model can learn the graph's structure and features. How edge attributes are defined and used should be designed for the specific application.
# Build the edge attributes and edge index from accounts and df
edge_attr, edge_index = get_edge_df(accounts, df)
# Print the first few rows of edge_attr
print(edge_attr.head())
Timestamp Amount Received Receiving Currency Amount Paid \
4278714 0.456320 787197.11 13 787197.11
2798190 0.285018 787197.11 13 787197.11
2798191 0.284233 681262.19 13 681262.19
3918769 0.417079 681262.19 13 681262.19
213094 0.000746 146954.27 13 146954.27
Payment Currency Payment Format
4278714 13 3
2798190 13 3
2798191 13 4
3918769 13 4
213094 13 5
`edge_index` is a tensor holding the indices of the graph's edges. It has size $2 \times E$, where $E$ is the number of edges. Each column represents one edge: the first row holds the source-node index and the second row the target-node index.
For example, for an undirected graph with $N$ nodes and $M$ edges, `edge_index` can be written as:
edge_index = torch.tensor([
[0, 0, 1, 1, 2, 3, 4, 4, 5, 6],
[1, 2, 0, 3, 1, 4, 3, 5, 4, 4],
])
Here the first row, [0, 0, 1, 1, 2, 3, 4, 4, 5, 6], holds the source-node indices and the second row, [1, 2, 0, 3, 1, 4, 3, 5, 4, 4], the target-node indices. This tensor represents the following edges:
0 -- 1
0 -- 2
1 -- 0
1 -- 3
2 -- 1
3 -- 4
4 -- 3
4 -- 5
5 -- 4
6 -- 4
# Print the value of edge_index
print(edge_index)
tensor([[ 0, 0, 0, ..., 496997, 496997, 496998],
[299458, 299458, 299458, ..., 496997, 496997, 496998]])
Below we present the final code for model.py, train.py and dataset.py.
In this section we use a Graph Attention Network as our backbone model. It consists of two GATConv layers followed by a linear layer with a sigmoid output for classification.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.nn import GATConv, Linear

class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads):
        super().__init__()
        # First GAT layer: in_channels -> hidden_channels per head, with dropout against overfitting
        self.conv1 = GATConv(in_channels, hidden_channels, heads, dropout=0.6)
        # Second GAT layer: single head, no concatenation, reduced width
        self.conv2 = GATConv(hidden_channels * heads, int(hidden_channels/4), heads=1, concat=False, dropout=0.6)
        # Linear classifier head
        self.lin = Linear(int(hidden_channels/4), out_channels)
        # Sigmoid output for binary classification
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, edge_index, edge_attr):
        x = F.dropout(x, p=0.6, training=self.training)  # input dropout against overfitting
        x = F.elu(self.conv1(x, edge_index, edge_attr))  # first GAT layer with ELU activation
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv2(x, edge_index, edge_attr))  # second GAT layer with ELU activation
        x = self.lin(x)
        x = self.sigmoid(x)
        return x
Finally, we can build the dataset with the functions above.
class AMLtoGraph(InMemoryDataset):
    def __init__(self, root: str, edge_window_size: int = 10,
                 transform: Optional[Callable] = None,
                 pre_transform: Optional[Callable] = None):
        # root: data storage path; edge_window_size: edge window size;
        # transform / pre_transform: optional data transforms
        self.edge_window_size = edge_window_size
        super().__init__(root, transform, pre_transform)
        # Load the processed data
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self) -> str:
        # Name of the raw data file
        return 'HI-Small_Trans.csv'

    @property
    def processed_file_names(self) -> str:
        # Name of the processed data file
        return 'data.pt'

    @property
    def num_nodes(self) -> int:
        # Number of nodes in the graph
        return self._data.edge_index.max().item() + 1

    def df_label_encoder(self, df, columns):
        # Label-encode the given columns
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df

    def preprocess(self, df):
        # Label-encode the categorical columns
        df = self.df_label_encoder(df, ['Payment Format', 'Payment Currency', 'Receiving Currency'])
        # Convert timestamps to numbers and min-max normalize into [0, 1]
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())
        # Prefix account IDs with the bank code to form unique account names
        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        # Sort by payer account
        df = df.sort_values(by=['Account'])
        # Split out the receiving- and paying-side views
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
        # Sorted list of unique receiving currencies
        currency_ls = sorted(df['Receiving Currency'].unique())
        return df, receiving_df, paying_df, currency_ls

    def get_all_account(self, df):
        # Collect every unique account with its bank and laundering label
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        # Collect both sides of each laundering transaction
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()
        # Merge payer and payee accounts into one deduplicated table
        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()
        # Default every label to 0, then overwrite suspicious accounts with 1
        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df

    def paid_currency_aggregate(self, currency_ls, paying_df, accounts):
        # Mean amount paid per currency for each account
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

    def received_currency_aggregate(self, currency_ls, receiving_df, accounts):
        # Mean amount received per currency for each account
        for i in currency_ls:
            temp = receiving_df[receiving_df['Receiving Currency'] == i]
            accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
        accounts = accounts.fillna(0)
        return accounts

    def get_edge_df(self, accounts, df):
        # Build the edge attributes and edge index
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)
        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)
        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)
        edge_attr = torch.from_numpy(df.values).to(torch.float)
        return edge_attr, edge_index

    def get_node_attr(self, currency_ls, paying_df, receiving_df, accounts):
        # Build the node feature matrix and node labels
        node_df = self.paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = self.received_currency_aggregate(currency_ls, receiving_df, node_df)
        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = self.df_label_encoder(node_df, ['Bank'])
        node_df = torch.from_numpy(node_df.values).to(torch.float)
        return node_df, node_label

    def process(self):
        # Read the raw CSV and run the full pipeline
        df = pd.read_csv(self.raw_paths[0])
        df, receiving_df, paying_df, currency_ls = self.preprocess(df)
        accounts = self.get_all_account(df)
        node_attr, node_label = self.get_node_attr(currency_ls, paying_df, receiving_df, accounts)
        edge_attr, edge_index = self.get_edge_df(accounts, df)
        # Assemble the PyG Data object
        data = Data(x=node_attr,
                    edge_index=edge_index,
                    y=node_label,
                    edge_attr=edge_attr
                    )
        data_list = [data]
        # Optional filtering and pre-transforms
        if self.pre_filter is not None:
            data_list = [d for d in data_list if self.pre_filter(d)]
        if self.pre_transform is not None:
            data_list = [self.pre_transform(d) for d in data_list]
        # Collate and save the processed data
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
Before starting training, please follow the instructions at https://github.com/issacchan26/AntiMoneyLaunderingDetectionWithGNN.
import torch
import torch_geometric.transforms as T          # data transforms
from torch_geometric.loader import NeighborLoader  # neighborhood sampling loader

# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the dataset and take its single graph
dataset = AMLtoGraph('/path/to/AntiMoneyLaunderingDetectionWithGNN/data')
data = dataset[0]

epoch = 100  # number of training epochs
# GAT model: data.num_features inputs, 16 hidden channels, 1 output, 8 heads
model = GAT(in_channels=data.num_features, hidden_channels=16, out_channels=1, heads=8)
model = model.to(device)
criterion = torch.nn.BCELoss()  # binary cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)  # SGD with learning rate 0.0001

# Randomly split the nodes: 10% validation, the rest for training
split = T.RandomNodeSplit(split='train_rest', num_val=0.1, num_test=0)
data = split(data)

# Training loader: batches of 256 seed nodes, sampling 30 neighbors per hop over 2 hops
train_loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.train_mask,
)
# Validation loader with the same sampling configuration
test_loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.val_mask,
)

for i in range(epoch):
    total_loss = 0
    model.train()
    for data in train_loader:
        optimizer.zero_grad()                # reset gradients
        data.to(device)                      # move the batch to the device
        pred = model(data.x, data.edge_index, data.edge_attr)  # forward pass
        ground_truth = data.y
        loss = criterion(pred, ground_truth.unsqueeze(1))  # compute the loss
        loss.backward()                      # backpropagation
        optimizer.step()                     # update the parameters
        total_loss += float(loss)
    if i % 10 == 0:                          # report every 10 epochs
        print(f"Epoch: {i:03d}, Loss: {total_loss:.4f}")
        model.eval()
        acc = 0
        total = 0
        for test_data in test_loader:
            test_data.to(device)
            pred = model(test_data.x, test_data.edge_index, test_data.edge_attr)
            ground_truth = test_data.y
            # Threshold the sigmoid output at 0.5 before comparing with the labels
            correct = ((pred > 0.5).float() == ground_truth.unsqueeze(1)).sum().item()
            total += len(ground_truth)
            acc += correct
        acc = acc / total
        print('accuracy:', acc)
Some of the feature engineering in this repository draws on the following paper, which is highly recommended: