In this notebook we'll take a few different approaches to classifying transactions: zero-shot classification with a simple prompt, classification using embeddings with a traditional classifier, and a fine-tuned model.
# Load the autoreload extension so modules are reloaded automatically after edits
%load_ext autoreload
# Enable autoreload
%autoreload
# Install the openai, openai[datalib], openai[embeddings] and transformers packages
%pip install openai 'openai[datalib]' 'openai[embeddings]' transformers
# Import the required modules
import openai
import pandas as pd
import numpy as np
import json
import os

# Set the OpenAI API key from the environment
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define the completions model to use
COMPLETIONS_MODEL = "text-davinci-002"
We're using a public dataset from the National Library of Scotland of transactions over £25,000 in value. The dataset has three features that we'll be using: Supplier, Description and Transaction value (£).
Source:
https://data.nls.uk/data/organisational-data/transactions-over-25k/
# Read the csv file, specifying the encoding as unicode_escape
transactions = pd.read_csv('./data/25000_spend_dataset_current.csv', encoding='unicode_escape')
# Count the number of transactions
len(transactions)
359
# Preview the first few rows of the dataset
transactions.head()
| | Date | Supplier | Description | Transaction value (£) |
---|---|---|---|---|
0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 |
1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 |
2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 |
3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 |
4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 |
# Define request_completion: send a prompt to the Completions API and return the response object.
# The request sets prompt, temperature, max_tokens, top_p, frequency_penalty, presence_penalty and model.
def request_completion(prompt):
    completion_response = openai.completions.create(
        prompt=prompt,
        temperature=0,
        max_tokens=5,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        model=COMPLETIONS_MODEL
    )
    return completion_response
# Define classify_transaction: fill the prompt template with a transaction's fields and classify it
def classify_transaction(transaction, prompt):
    # Substitute the transaction's fields into the prompt template
    prompt = prompt.replace('SUPPLIER_NAME', transaction['Supplier'])
    prompt = prompt.replace('DESCRIPTION_TEXT', transaction['Description'])
    prompt = prompt.replace('TRANSACTION_VALUE', str(transaction['Transaction value (£)']))
    # Request a completion and strip newlines from the returned class label
    classification = request_completion(prompt).choices[0].text.replace('\n', '')
    return classification
# Define check_finetune_classes: verify that the training and validation files cover the same classes
def check_finetune_classes(train_file, valid_file):
    train_classes = set()
    valid_classes = set()

    # Collect the completion classes present in the training file
    with open(train_file, 'r') as json_file:
        json_list = list(json_file)
        print(len(json_list))
    for json_str in json_list:
        result = json.loads(json_str)
        train_classes.add(result['completion'])

    # Collect the completion classes present in the validation file
    with open(valid_file, 'r') as json_file:
        json_list = list(json_file)
        print(len(json_list))
    for json_str in json_list:
        result = json.loads(json_str)
        valid_classes.add(result['completion'])

    # Both files should contain exactly the same set of classes
    if train_classes == valid_classes:
        print('All good')
    else:
        print('Classes do not match, please prepare data again')
First let's assess the performance of the base model at classifying these transactions using a simple prompt. We'll supply the model with the five categories and give it a 'Could not classify' option for transactions it cannot place.
# Define the zero_shot_prompt string holding the task description and template
zero_shot_prompt = '''You are a data expert working for the National Library of Scotland.
You are analysing all transactions over £25,000 in value and classifying them into one of five categories.
The five categories are Building Improvement, Literature & Archive, Utility Bills, Professional Services and Software/IT.
If you can't tell what it is, say Could not classify
Transaction:
Supplier: SUPPLIER_NAME
Description: DESCRIPTION_TEXT
Value: TRANSACTION_VALUE
The classification is:'''
# The template above casts the model as a data expert at the National Library of Scotland, analysing every transaction over £25,000 in value and assigning it to one of five categories: Building Improvement, Literature & Archive, Utility Bills, Professional Services or Software/IT, answering "Could not classify" when unsure. The SUPPLIER_NAME, DESCRIPTION_TEXT and TRANSACTION_VALUE placeholders are filled in with each transaction's details before the classification is requested.
# Get a test transaction from the dataset
transaction = transactions.iloc[0]
# Interpolate the transaction's values into the prompt
prompt = zero_shot_prompt.replace('SUPPLIER_NAME', transaction['Supplier'])
prompt = prompt.replace('DESCRIPTION_TEXT', transaction['Description'])
prompt = prompt.replace('TRANSACTION_VALUE', str(transaction['Transaction value (£)']))
# Use our completion function to return a prediction
completion_response = request_completion(prompt)
# Print the text of the model's first choice
print(completion_response.choices[0].text)
Building Improvement
Our first attempt is correct: M & J Ballantyne Ltd are a house builder, and the work they performed is indeed Building Improvement.
Let's expand the sample size to 25 and see how it performs, again with just a simple prompt to guide it.
# Take the first 25 rows as test data
test_transactions = transactions.iloc[:25]
# Apply classify_transaction to each row and store the result in a new 'Classification' column
test_transactions['Classification'] = test_transactions.apply(lambda x: classify_transaction(x, zero_shot_prompt), axis=1)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
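The warning above is raised because `test_transactions` is a view of a slice of `transactions`. One minimal fix, following the warning's own advice, is to take an explicit copy before assigning the new column:

```python
# Take an explicit copy of the slice so that assigning a new column
# does not trigger pandas' SettingWithCopyWarning
test_transactions = transactions.iloc[:25].copy()
test_transactions['Classification'] = test_transactions.apply(
    lambda x: classify_transaction(x, zero_shot_prompt), axis=1
)
```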
# Count the occurrences of each category in the Classification column
test_transactions['Classification'].value_counts()
Building Improvement 14
Could not classify 5
Literature & Archive 3
Software/IT 2
Utility Bills 1
Name: Classification, dtype: int64
# Display the first 25 rows of the test set
test_transactions.head(25)
| | Date | Supplier | Description | Transaction value (£) | Classification |
---|---|---|---|---|---|
0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Building Improvement |
1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Literature & Archive |
2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Utility Bills |
3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Software/IT |
4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Building Improvement |
5 | 09/05/2016 | A McGillivray | Causewayside Refurbishment | 53690.0 | Building Improvement |
6 | 16/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 365344.0 | Building Improvement |
7 | 23/05/2016 | Computacenter Uk | Kelvin Hall | 26506.0 | Software/IT |
8 | 23/05/2016 | ECG Facilities Service | Facilities Management Charge | 32777.0 | Building Improvement |
9 | 23/05/2016 | ECG Facilities Service | Facilities Management Charge | 32777.0 | Building Improvement |
10 | 30/05/2016 | ALDL | ALDL Charges | 32317.0 | Could not classify |
11 | 10/06/2016 | Wavetek Ltd | Kelvin Hall | 87589.0 | Could not classify |
12 | 10/06/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 381803.0 | Building Improvement |
13 | 28/06/2016 | ECG Facilities Service | Facilities Management Charge | 32832.0 | Building Improvement |
14 | 30/06/2016 | Glasgow City Council | Kelvin Hall | 1700000.0 | Building Improvement |
15 | 11/07/2016 | Wavetek Ltd | Kelvin Hall | 65692.0 | Could not classify |
16 | 11/07/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 139845.0 | Building Improvement |
17 | 15/07/2016 | Sotheby'S | Literary & Archival Items | 28500.0 | Literature & Archive |
18 | 18/07/2016 | Christies | Literary & Archival Items | 33800.0 | Literature & Archive |
19 | 25/07/2016 | A McGillivray | Causewayside Refurbishment | 30113.0 | Building Improvement |
20 | 31/07/2016 | ALDL | ALDL Charges | 32317.0 | Could not classify |
21 | 08/08/2016 | ECG Facilities Service | Facilities Management Charge | 32795.0 | Building Improvement |
22 | 15/08/2016 | Creative Video Productions Ltd | Kelvin Hall | 26866.0 | Could not classify |
23 | 15/08/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 196807.0 | Building Improvement |
24 | 24/08/2016 | ECG Facilities Service | Facilities Management Charge | 32795.0 | Building Improvement |
The initial results are pretty good, even with no labelled examples! The cases it could not classify were tougher, with few clues to their topic, but maybe if we clean up the labelled dataset to give more examples we can get better performance.
Let's create embeddings from the small set we've classified so far - we've made a set of labelled examples by running the zero-shot classifier on our dataset and hand-correcting the 15 'Could not classify' results that we got.
This initial section reuses the approach from the notebook on getting embeddings from a dataset, creating embeddings from the concatenation of all our features.
# Read the labelled transactions into a dataframe
df = pd.read_csv('./data/labelled_transactions.csv')
# Preview the first few rows
df.head()
| | Date | Supplier | Description | Transaction value (£) | Classification |
---|---|---|---|---|---|
0 | 15/08/2016 | Creative Video Productions Ltd | Kelvin Hall | 26866 | Other |
1 | 29/05/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 74806 | Building Improvement |
2 | 29/05/2017 | Morris & Spottiswood Ltd | George IV Bridge Work | 56448 | Building Improvement |
3 | 31/05/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 164691 | Building Improvement |
4 | 24/07/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 27926 | Building Improvement |
# Build a 'combined' column concatenating the Supplier, Description and Transaction value fields
df['combined'] = "Supplier: " + df['Supplier'].str.strip() + "; Description: " + df['Description'].str.strip() + "; Value: " + df['Transaction value (£)'].astype(str).str.strip()
# Preview the first two rows
df.head(2)
| | Date | Supplier | Description | Transaction value (£) | Classification | combined |
---|---|---|---|---|---|---|
0 | 15/08/2016 | Creative Video Productions Ltd | Kelvin Hall | 26866 | Other | Supplier: Creative Video Productions Ltd; Desc... |
1 | 29/05/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 74806 | Building Improvement | Supplier: John Graham Construction Ltd; Descri... |
# Import the GPT-2 tokenizer
from transformers import GPT2TokenizerFast
# Initialise the tokenizer from the pretrained "gpt2" model
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# Store each combined string's token count in a new 'n_tokens' column
df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))
# Check the number of rows
len(df)
101
# Path where the dataframe with embeddings will be cached
embedding_path = './data/transactions_with_embeddings_100.csv'
# Import the get_embedding helper
from utils.embeddings_utils import get_embedding
# Embed each combined string with the similarity model and store the vectors in 'babbage_similarity'
df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, model='text-similarity-babbage-001'))
# Embed each combined string with the search document model and store the vectors in 'babbage_search'
df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, model='text-search-babbage-doc-001'))
# Save the dataframe with embeddings to CSV
df.to_csv(embedding_path)
Now that we have embeddings, let's see whether classifying these into the categories we've named gives us any more success.
For this we'll use the template from the Classification_using_embeddings notebook.
# Import the required libraries
from sklearn.ensemble import RandomForestClassifier  # random forest classifier
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.metrics import classification_report, accuracy_score  # classification report and accuracy
from ast import literal_eval  # parse stringified Python literals

# Read back the dataset with embeddings
fs_df = pd.read_csv(embedding_path)
# Parse the stringified embeddings in babbage_similarity back into numpy arrays
fs_df["babbage_similarity"] = fs_df.babbage_similarity.apply(literal_eval).apply(np.array)
# Preview the first few rows
fs_df.head()
| | Unnamed: 0 | Date | Supplier | Description | Transaction value (£) | Classification | combined | n_tokens | babbage_similarity | babbage_search |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 15/08/2016 | Creative Video Productions Ltd | Kelvin Hall | 26866 | Other | Supplier: Creative Video Productions Ltd; Desc... | 136 | [-0.009802100248634815, 0.022551486268639565, ... | [-0.00232666521333158, 0.019198870286345482, 0... |
1 | 1 | 29/05/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 74806 | Building Improvement | Supplier: John Graham Construction Ltd; Descri... | 140 | [-0.009065819904208183, 0.012094118632376194, ... | [0.005169447045773268, 0.00473341578617692, -0... |
2 | 2 | 29/05/2017 | Morris & Spottiswood Ltd | George IV Bridge Work | 56448 | Building Improvement | Supplier: Morris & Spottiswood Ltd; Descriptio... | 141 | [-0.009000026620924473, 0.02405017428100109, -... | [0.0028343256562948227, 0.021166473627090454, ... |
3 | 3 | 31/05/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 164691 | Building Improvement | Supplier: John Graham Construction Ltd; Descri... | 140 | [-0.009065819904208183, 0.012094118632376194, ... | [0.005169447045773268, 0.00473341578617692, -0... |
4 | 4 | 24/07/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 27926 | Building Improvement | Supplier: John Graham Construction Ltd; Descri... | 140 | [-0.009065819904208183, 0.012094118632376194, ... | [0.005169447045773268, 0.00473341578617692, -0... |
# Split the data into train and test sets: the embedding vectors are the features,
# Classification is the target. test_size=0.2 holds out 20% of the rows and
# random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    list(fs_df.babbage_similarity.values), fs_df.Classification, test_size=0.2, random_state=42
)

# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)
# Train the model on the training set
clf.fit(X_train, y_train)
# Predict labels for the test set
preds = clf.predict(X_test)
# Predict class probabilities for the test set
probas = clf.predict_proba(X_test)

# Generate a classification report comparing the true labels with the predictions
report = classification_report(y_test, preds)
print(report)
precision recall f1-score support
Building Improvement 0.92 1.00 0.96 11
Literature & Archive 1.00 1.00 1.00 3
Other 0.00 0.00 0.00 1
Software/IT 1.00 1.00 1.00 1
Utility Bills 1.00 1.00 1.00 5
accuracy 0.95 21
macro avg 0.78 0.80 0.79 21
weighted avg 0.91 0.95 0.93 21
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
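As the warning notes, the zeroed precision and F-score come from the 'Other' class never being predicted; passing `zero_division=0` to `classification_report` makes that choice explicit and silences the warning:

```python
# Explicitly set ill-defined metrics to 0 for classes with no predicted samples
report = classification_report(y_test, preds, zero_division=0)
print(report)
```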
This model's performance is pretty strong, so creating embeddings and using even a simpler classifier looks like an effective approach as well, with the zero-shot classifier helping us do the initial classification of the unlabelled dataset.
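With only 101 labelled rows, a single 80/20 split can be noisy. As an optional sanity check (a minimal sketch using scikit-learn's cross-validation, not part of the original run), you could average accuracy over several folds:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the same classifier on the embeddings
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100),
    list(fs_df.babbage_similarity.values),
    fs_df.Classification,
    cv=5,
)
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```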
Let's take it one step further and see whether a fine-tuned model trained on the same labelled dataset can give us comparable results.
For this use case we'll try to improve on the few-shot classification above by training a fine-tuned model on the same labelled set of 101 transactions and applying it to a group of unseen transactions.
First we need to do some data preparation to get our data ready. This will take the following steps: mapping each classification to a numeric class id so every completion is a single token; prepending a space to each completion and appending a common separator to each prompt; shuffling the rows so the classes are spread evenly across the training and validation sets; and exporting to JSONL for the CLI's prepare_data tool.
# Copy fs_df into a new dataframe for fine-tune preparation
ft_prep_df = fs_df.copy()
# Check the number of rows
len(ft_prep_df)
101
# Preview the first few rows of the fine-tune prep dataframe
ft_prep_df.head()
| | Unnamed: 0 | Date | Supplier | Description | Transaction value (£) | Classification | combined | n_tokens | babbage_similarity | babbage_search |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 15/08/2016 | Creative Video Productions Ltd | Kelvin Hall | 26866 | Other | Supplier: Creative Video Productions Ltd; Desc... | 12 | [-0.009630300104618073, 0.009887108579277992, ... | [-0.008217384107410908, 0.025170527398586273, ... |
1 | 1 | 29/05/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 74806 | Building Improvement | Supplier: John Graham Construction Ltd; Descri... | 16 | [-0.006144719664007425, -0.0018709596479311585... | [-0.007424891460686922, 0.008475713431835175, ... |
2 | 2 | 29/05/2017 | Morris & Spottiswood Ltd | George IV Bridge Work | 56448 | Building Improvement | Supplier: Morris & Spottiswood Ltd; Descriptio... | 17 | [-0.005225738976150751, 0.015156379900872707, ... | [-0.007611643522977829, 0.030322374776005745, ... |
3 | 3 | 31/05/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 164691 | Building Improvement | Supplier: John Graham Construction Ltd; Descri... | 16 | [-0.006144719664007425, -0.0018709596479311585... | [-0.007424891460686922, 0.008475713431835175, ... |
4 | 4 | 24/07/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 27926 | Building Improvement | Supplier: John Graham Construction Ltd; Descri... | 16 | [-0.006144719664007425, -0.0018709596479311585... | [-0.007424891460686922, 0.008475713431835175, ... |
# Collect the unique classifications
classes = list(set(ft_prep_df['Classification']))
# Build a dataframe assigning each class a unique numeric id; numeric ids keep each
# completion to a single token, which the fine-tuned model can emit with max_tokens=1
class_df = pd.DataFrame(classes).reset_index()
# Rename the columns to class_id and class
class_df.columns = ['class_id', 'class']
# Show the mapping and its length
class_df, len(class_df)
( class_id class
0 0 Literature & Archive
1 1 Utility Bills
2 2 Building Improvement
3 3 Software/IT
4 4 Other,
5)
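A quick optional check with the GPT-2 tokenizer loaded earlier illustrates why numeric ids are convenient: each ' <id>' completion encodes to a single token, which is what lets the fine-tuned model answer with max_tokens=1 later on:

```python
# Each completion such as ' 0' ... ' 4' should encode to exactly one token
for class_id in class_df['class_id']:
    print(class_id, tokenizer.encode(' ' + str(class_id)))
```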
# Join the class ids onto the prepared dataframe
ft_df_with_class = ft_prep_df.merge(class_df, left_on='Classification', right_on='class', how='inner')
# Prepend a space to each completion to help the model
ft_df_with_class['class_id'] = ft_df_with_class.apply(lambda x: ' ' + str(x['class_id']), axis=1)
# Drop the now-redundant class column
ft_df_with_class = ft_df_with_class.drop('class', axis=1)
# Add a common separator to the end of each prompt so the model knows when the prompt ends
ft_df_with_class['prompt'] = ft_df_with_class.apply(lambda x: x['combined'] + '\n\n###\n\n', axis=1)
# Preview the first few rows
ft_df_with_class.head()
| | Unnamed: 0 | Date | Supplier | Description | Transaction value (£) | Classification | combined | n_tokens | babbage_similarity | babbage_search | class_id | prompt |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 15/08/2016 | Creative Video Productions Ltd | Kelvin Hall | 26866 | Other | Supplier: Creative Video Productions Ltd; Desc... | 12 | [-0.009630300104618073, 0.009887108579277992, ... | [-0.008217384107410908, 0.025170527398586273, ... | 4 | Supplier: Creative Video Productions Ltd; Desc... |
1 | 51 | 31/03/2017 | NLS Foundation | Grant Payment | 177500 | Other | Supplier: NLS Foundation; Description: Grant P... | 11 | [-0.022305507212877274, 0.008543581701815128, ... | [-0.020519884303212166, 0.01993306167423725, -... | 4 | Supplier: NLS Foundation; Description: Grant P... |
2 | 70 | 26/06/2017 | British Library | Legal Deposit Services | 50056 | Other | Supplier: British Library; Description: Legal ... | 11 | [-0.01019938476383686, 0.015277703292667866, -... | [-0.01843327097594738, 0.03343546763062477, -0... | 4 | Supplier: British Library; Description: Legal ... |
3 | 71 | 24/07/2017 | ALDL | Legal Deposit Services | 27067 | Other | Supplier: ALDL; Description: Legal Deposit Ser... | 11 | [-0.008471488021314144, 0.004098685923963785, ... | [-0.012966590002179146, 0.01299362163990736, 0... | 4 | Supplier: ALDL; Description: Legal Deposit Ser... |
4 | 100 | 24/07/2017 | AM Phillip | Vehicle Purchase | 26604 | Other | Supplier: AM Phillip; Description: Vehicle Pur... | 10 | [-0.003459023078903556, 0.004626389592885971, ... | [-0.0010945454705506563, 0.008626140654087067,... | 4 | Supplier: AM Phillip; Description: Vehicle Pur... |
# This step is unnecessary if you have a sufficient number of observations in each class.
# In our case we don't, so we shuffle the data to improve our chances of getting an even
# spread of classes across the training and validation sets, which improves the accuracy
# of the model we get. The fine-tune run will error if classes are missing from the
# validation set, so this step is necessary here.
import random

# Pull the labels out of the 'class_id' column
labels = [x for x in ft_df_with_class['class_id']]
# Pull the text out of the 'prompt' column
text = [x for x in ft_df_with_class['prompt']]
# Build a dataframe with the prompts and labels
ft_df = pd.DataFrame(zip(text, labels), columns=['prompt', 'class_id'])  # [:300]
# Rename the columns to 'prompt' and 'completion'
ft_df.columns = ['prompt', 'completion']

# Give each row a random 'ordering' value between 0 and the length of the dataframe
ft_df['ordering'] = ft_df.apply(lambda x: random.randint(0, len(ft_df)), axis=1)
# Use 'ordering' as the index and sort by it, shuffling the rows
ft_df.set_index('ordering', inplace=True)
ft_df_sorted = ft_df.sort_index(ascending=True)
# Preview the shuffled rows
ft_df_sorted.head()
| ordering | prompt | completion |
---|---|---|
0 | Supplier: Sothebys; Description: Literary & Ar... | 0 |
1 | Supplier: Sotheby'S; Description: Literary & A... | 0 |
2 | Supplier: City Of Edinburgh Council; Descripti... | 1 |
2 | Supplier: John Graham Construction Ltd; Descri... | 2 |
3 | Supplier: John Graham Construction Ltd; Descri... | 2 |
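As an aside, an equivalent and arguably more idiomatic way to shuffle, in place of the random ordering column above, is pandas' `sample` with a fixed seed (a sketch of an alternative, not what this notebook ran):

```python
# Shuffle all rows in one step; random_state makes the shuffle reproducible
ft_df_sorted = ft_df.sample(frac=1, random_state=42).reset_index(drop=True)
```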
# Remove any existing train/validation files so we generate fresh ones for this classifier
#!rm transactions_grouped*

# Write the shuffled dataframe to a .jsonl file and run the CLI's prepare_data tool to build our input files
ft_df_sorted.to_json("transactions_grouped.jsonl", orient='records', lines=True)
!openai tools fine_tunes.prepare_data -f transactions_grouped.jsonl -q

# Check that the same classes appear in both the training and validation sets
check_finetune_classes('transactions_grouped_prepared_train.jsonl', 'transactions_grouped_prepared_valid.jsonl')
31
8
All good
# This step creates the fine-tuned model
!openai api fine_tunes.create -t "transactions_grouped_prepared_train.jsonl" -v "transactions_grouped_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 5 -m curie

# You can use the following command to get the fine-tuning job status and model name; replace the job id with your own
#!openai api fine_tunes.get -i ft-YBIc01t4hxYBC7I5qhRF3Qdx

# Congrats, you've got a fine-tuned model!
# Copy/paste the name provided into the variable below and we'll test it out
fine_tuned_model = 'curie:ft-personal-2022-10-20-10-42-56'
Now we'll apply our classifier to see how it performs. We only had 31 unique observations in the training set and 8 in the validation set, so let's see how it does.
# Read the prepared validation set into a dataframe
test_set = pd.read_json('transactions_grouped_prepared_valid.jsonl', lines=True)
# Preview the first few rows
test_set.head()
| | prompt | completion |
---|---|---|
0 | Supplier: Wavetek Ltd; Description: Kelvin Hal... | 2 |
1 | Supplier: ECG Facilities Service; Description:... | 1 |
2 | Supplier: M & J Ballantyne Ltd; Description: G... | 2 |
3 | Supplier: Private Sale; Description: Literary ... | 0 |
4 | Supplier: Ex Libris; Description: IT equipment... | 3 |
# Classify each validation prompt with the fine-tuned model:
#   max_tokens=1   - generate only a single token (the class id)
#   temperature=0  - deterministic output, no sampling randomness
#   logprobs=5     - also return the five most likely tokens and their log-probabilities
test_set['predicted_class'] = test_set.apply(lambda x: openai.completions.create(model=fine_tuned_model, prompt=x['prompt'], max_tokens=1, temperature=0, logprobs=5), axis=1)
# Extract the predicted class token from the first choice of each response
test_set['pred'] = test_set.apply(lambda x: x['predicted_class'].choices[0].text, axis=1)

# Add a 'result' column: True where the stripped prediction matches the stripped completion
test_set['result'] = test_set.apply(lambda x: str(x['pred']).strip() == str(x['completion']).strip(), axis=1)

# Count how many predictions were right and wrong
test_set['result'].value_counts()
True 4
False 4
Name: result, dtype: int64
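To turn those counts into a single figure, the boolean `result` column can simply be averaged:

```python
# Fraction of validation rows where the predicted class id matched the label
accuracy = test_set['result'].mean()
print(f"Validation accuracy: {accuracy:.0%}")
```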
The performance is not great - unfortunately this is expected. With only a few examples of each class, the approach above using embeddings and a traditional classifier worked better.
A fine-tuned model works best with a large number of labelled observations. If we had a few hundred or a few thousand, we might get better results, but let's do one last test on a holdout set to confirm that it doesn't generalise well to a fresh set of observations.
# Copy the transactions dataset and keep the rows from index 101 onwards as a holdout set
holdout_df = transactions.copy().iloc[101:]
# Preview the first few rows
holdout_df.head()
| | Date | Supplier | Description | Transaction value (£) |
---|---|---|---|---|
101 | 23/10/2017 | City Building LLP | Causewayside Refurbishment | 53147.0 |
102 | 30/10/2017 | ECG Facilities Service | Facilities Management Charge | 35758.0 |
103 | 30/10/2017 | ECG Facilities Service | Facilities Management Charge | 35758.0 |
104 | 06/11/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 134208.0 |
105 | 06/11/2017 | ALDL | Legal Deposit Services | 27067.0 |
# Build the combined feature string for each holdout transaction, ending with the common separator
holdout_df['combined'] = "Supplier: " + holdout_df['Supplier'].str.strip() + "; Description: " + holdout_df['Description'].str.strip() + '\n\n###\n\n'
# Classify each holdout prompt with the fine-tuned model (max_tokens=1, temperature=0, logprobs=5, as before)
holdout_df['prediction_result'] = holdout_df.apply(lambda x: openai.completions.create(model=fine_tuned_model, prompt=x['combined'], max_tokens=1, temperature=0, logprobs=5), axis=1)
# Extract the predicted class token from each response
holdout_df['pred'] = holdout_df.apply(lambda x: x['prediction_result'].choices[0].text, axis=1)
# Show the first 10 rows of the holdout set
holdout_df.head(10)
| | Date | Supplier | Description | Transaction value (£) | combined | prediction_result | pred |
---|---|---|---|---|---|---|---|
101 | 23/10/2017 | City Building LLP | Causewayside Refurbishment | 53147.0 | Supplier: City Building LLP; Description: Caus... | {'id': 'cmpl-63YDadbYLo8xKsGY2vReOFCMgTOvG', '... | 2 |
102 | 30/10/2017 | ECG Facilities Service | Facilities Management Charge | 35758.0 | Supplier: ECG Facilities Service; Description:... | {'id': 'cmpl-63YDbNK1D7UikDc3xi5ATihg5kQEt', '... | 2 |
103 | 30/10/2017 | ECG Facilities Service | Facilities Management Charge | 35758.0 | Supplier: ECG Facilities Service; Description:... | {'id': 'cmpl-63YDbwfiHjkjMWsfTKNt6naeqPzOe', '... | 2 |
104 | 06/11/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 134208.0 | Supplier: John Graham Construction Ltd; Descri... | {'id': 'cmpl-63YDbWAndtsRqPTi2ZHZtPodZvOwr', '... | 2 |
105 | 06/11/2017 | ALDL | Legal Deposit Services | 27067.0 | Supplier: ALDL; Description: Legal Deposit Ser... | {'id': 'cmpl-63YDbDu7WM3svYWsRAMdDUKtSFDBu', '... | 2 |
106 | 27/11/2017 | Maggs Bros Ltd | Literary & Archival Items | 26500.0 | Supplier: Maggs Bros Ltd; Description: Literar... | {'id': 'cmpl-63YDbxNNI8ZH5CJJNxQ0IF9Zf925C', '... | 0 |
107 | 30/11/2017 | Glasgow City Council | Kelvin Hall | 42345.0 | Supplier: Glasgow City Council; Description: K... | {'id': 'cmpl-63YDb8R1FWu4bjwM2xE775rouwneV', '... | 2 |
108 | 11/12/2017 | ECG Facilities Service | Facilities Management Charge | 35758.0 | Supplier: ECG Facilities Service; Description:... | {'id': 'cmpl-63YDcAPsp37WhbPs9kwfUX0kBk7Hv', '... | 2 |
109 | 11/12/2017 | John Graham Construction Ltd | Causewayside Refurbishment | 159275.0 | Supplier: John Graham Construction Ltd; Descri... | {'id': 'cmpl-63YDcML2welrC3wF0nuKgcNmVu1oQ', '... | 2 |
110 | 08/01/2018 | ECG Facilities Service | Facilities Management Charge | 35758.0 | Supplier: ECG Facilities Service; Description:... | {'id': 'cmpl-63YDc95SSdOHnIliFB2cjMEEm7Z2u', '... | 2 |
# Count the predicted classes on the holdout set
holdout_df['pred'].value_counts()
2 231
0 27
Name: pred, dtype: int64
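To make those holdout predictions easier to read, the numeric ids can be mapped back to class names with the `class_df` lookup built earlier (a small convenience sketch, not part of the original run):

```python
# Build an id -> class-name lookup and translate the predicted tokens
id_to_class = dict(zip(class_df['class_id'].astype(str), class_df['class']))
holdout_df['pred_label'] = holdout_df['pred'].str.strip().map(id_to_class)
holdout_df['pred_label'].value_counts()
```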
Well, those results were similarly underwhelming - so we've learned that with a dataset containing a small number of labelled observations, either zero-shot classification or traditional classification with embeddings returns better results than a fine-tuned model.
A fine-tuned model is still a great tool, but it is more effective when you have a larger number of labelled examples for each class you want to classify.