LLM Series: OpenAI Tips - Semantic Text Search Using Text Embeddings

Published: December 30, 2023

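We can run semantic search over all of our reviews efficiently and at very low cost by embedding the search query and then finding the reviews whose embeddings are most similar to it. The dataset is created in the Get_embeddings_from_dataset Notebook.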

# Import the required libraries
import pandas as pd
import numpy as np
from ast import literal_eval

# Path to the data file
datafile_path = "data/fine_food_reviews_with_embeddings_1k.csv"

# Read the CSV file into a DataFrame
df = pd.read_csv(datafile_path)

# The embedding column is stored as strings; parse each one into a Python list, then convert it to a NumPy array
df["embedding"] = df.embedding.apply(literal_eval).apply(np.array)

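Here we compare the cosine similarity between the embedding of the query and the embedding of each document, and show the top n best matches.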
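For reference, here is a minimal NumPy sketch of what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The function name `cosine_sim` and the toy vectors are hypothetical and only for illustration; the `cosine_similarity` helper imported from `utils.embeddings_utils` may be implemented differently.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|). Illustrative helper, not the notebook's own."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-dimensional vectors (real embeddings have many more dimensions)
query = np.array([1.0, 0.0])
docs = {
    "same direction": np.array([2.0, 0.0]),
    "orthogonal": np.array([0.0, 3.0]),
    "opposite": np.array([-1.0, 0.0]),
}

for name, vec in docs.items():
    print(name, cosine_sim(query, vec))
```

The score depends only on the angle between the vectors, not on their lengths, which is why a vector pointing the same way as the query scores highest regardless of its magnitude.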

# Import the required helper functions
from utils.embeddings_utils import get_embedding, cosine_similarity

# Search the reviews for those that best match a product description
def search_reviews(df, product_description, n=3, pprint=True):
    # Get the embedding vector of the product description
    product_embedding = get_embedding(
        product_description,
        model="text-embedding-ada-002"
    )

    # Compute the cosine similarity between each review embedding and the query embedding, and store it in a new column
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, product_embedding))

    # Sort by similarity in descending order, keep the top n rows, and tidy up the combined text for display
    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
        .combined.str.replace("Title: ", "")
        .str.replace("; Content:", ": ")
    )

    # If pprint is True, print the first 200 characters of each result
    if pprint:
        for r in results:
            print(r[:200])
            print()

    # Return the results
    return results

# Run a search
results = search_reviews(df, "delicious beans", n=3)
Good Buy:  I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!

Jamaican Blue beans:  Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and befor

Delicious!:  I enjoy this white beans seasoning, it gives a rich flavor to the beans I just love it, my mother in law didn't know about this Zatarain's brand and now she is traying different seasoning
# Search df for the reviews most semantically similar to "whole wheat pasta" and return the top 3
results = search_reviews(df, "whole wheat pasta", n=3)
Tasty and Quick Pasta:  Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara.  I just wish there was more of it.  If you aren't starving or on a 

sooo good:  tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.

Handy:  Love the idea of ready in a minute pasta and for that alone this product gets praise.  The pasta is whole grain so that's a big plus and it actually comes out al dente.  The vegetable marinara

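We can search through these reviews easily. To speed up computation on larger datasets, we can use a specialized algorithm designed for faster search over embeddings, for example an approximate nearest-neighbor index.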
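Before reaching for a specialized index, a simpler speedup is to replace the per-row `apply` with a single matrix operation. The sketch below is still exact brute-force search, just vectorized with NumPy; it is not an approximate nearest-neighbor algorithm. The function name `top_n_by_cosine` and the toy data are hypothetical; with the DataFrame above, the matrix could be built with something like `np.vstack(df.embedding.values)`.

```python
import numpy as np

def top_n_by_cosine(matrix, query, n=3):
    """Return (indices, scores) of the n rows of `matrix` most similar to `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)

    # Normalize the rows and the query so that a dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit

    n = min(n, len(scores))
    # argpartition selects the n largest scores without sorting the whole array
    top = np.argpartition(-scores, n - 1)[:n]
    # Sort only the selected n entries in descending order of similarity
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

# Toy data standing in for the stacked review embeddings
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])

idx, sims = top_n_by_cosine(toy_matrix, toy_query, n=2)
print(idx, sims)
```

The returned indices are row positions in the matrix, so with a DataFrame they can be passed to `df.iloc[...]` to recover the matching reviews.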



# Call search_reviews to find the review most semantically similar to "bad delivery" and return the top result
results = search_reviews(df, "bad delivery", n=1)
great product, poor delivery:  The coffee is excellent and I am a repeat buyer.  Problem this time was with the UPS delivery.  They left the box in front of my garage door in the middle of the drivewa

As we can see, this can deliver a lot of value right away. In this example, we show that we can quickly find examples of delivery failures.

# Search df for the review most semantically similar to "spoilt" and return the top result
results = search_reviews(df, "spoilt", n=1)
Extremely dissapointed:  Hi,<br />I am very disappointed with the past shipment I received of the ONE coconut water. 3 of the boxes were leaking and the coconut water was spoiled.<br /><br />Thanks.<b

# Search df for the reviews most semantically similar to "pet food" and return the top 2
results = search_reviews(df, "pet food", n=2)
Good food:  The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.

The cats like it:  My 7 cats like this food but it is a little yucky for the human. Pieces of mackerel swimming in a dark broth. It is billed as a "complete" food and contains carrots, peas and pasta.
Source: https://blog.csdn.net/wjjc1017/article/details/135310909