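This post compares three similarity measures (L1 distance, L2 distance, and cosine similarity) on three-dimensional text vectors. It assumes that Elasticsearch and Kibana are already set up.
1. Create the index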
PUT my-index-000002
{
  "mappings": {
    "properties": {
      "my_dense_vector": {
        "type": "dense_vector",
        "dims": 3
      },
      "status": {
        "type": "keyword"
      }
    }
  }
}
The index is created successfully:
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my-index-000002"
}
2. Index the data
PUT my-index-000002/_doc/1
{
  "my_dense_vector": [1, 0, 0],
  "status": "published"
}
PUT my-index-000002/_doc/2
{
  "my_dense_vector": [0, 1, 0],
  "status": "published"
}
PUT my-index-000002/_doc/3
{
  "my_dense_vector": [0, 0, 1],
  "status": "published"
}
Each request returns a response similar to the following:
{
  "_index": "my-index-000002",
  "_id": "3",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}
3. View all documents
GET my-index-000002/_search
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 1,
        "_source": {
          "my_dense_vector": [1, 0, 0],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 1,
        "_source": {
          "my_dense_vector": [0, 1, 0],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 1,
        "_source": {
          "my_dense_vector": [0, 0, 1],
          "status": "published"
        }
      }
    ]
  }
}
4. Query with L1 distance
GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "status": "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l1norm(params.queryVector, 'my_dense_vector'))",
        "params": {
          "queryVector": [0, 0, 1]
        }
      }
    }
  }
}
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 1,
        "_source": {
          "my_dense_vector": [0, 0, 1],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 0.33333334,
        "_source": {
          "my_dense_vector": [1, 0, 0],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 0.33333334,
        "_source": {
          "my_dense_vector": [0, 1, 0],
          "status": "published"
        }
      }
    ]
  }
}
In this result, documents 1 and 2 receive the same score, even though they are different vectors in the text vector space.
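The tie can be checked by hand. The following is a minimal Python sketch that runs outside Elasticsearch and assumes `l1norm` computes the plain Manhattan distance between the query vector and the stored vector. The names `docs`, `l1_distance`, and `l1_score` are illustrative and not part of any Elasticsearch API.

```python
# Stored vectors copied from step 2 (document id -> my_dense_vector).
docs = {1: [1, 0, 0], 2: [0, 1, 0], 3: [0, 0, 1]}
query_vector = [0, 0, 1]


def l1_distance(a, b):
    """Manhattan distance: the sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l1_score(query, vector):
    """Mirrors the script source: 1 / (1 + l1norm(query, vector))."""
    return 1 / (1 + l1_distance(query, vector))


scores = {doc_id: l1_score(query_vector, vec) for doc_id, vec in docs.items()}

# Print the documents from highest to lowest score.
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 6))
```

Documents 1 and 2 are both at L1 distance 2 from the query, so both score 1 / (1 + 2), while document 3 is at distance 0 and scores 1.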
5. Query with L2 distance
GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "status": "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, 'my_dense_vector'))",
        "params": {
          "queryVector": [0, 0, 1]
        }
      }
    }
  }
}
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 1,
        "_source": {
          "my_dense_vector": [0, 0, 1],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 0.41421357,
        "_source": {
          "my_dense_vector": [1, 0, 0],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 0.41421357,
        "_source": {
          "my_dense_vector": [0, 1, 0],
          "status": "published"
        }
      }
    ]
  }
}
The same thing happens here: with L2 distance, as with L1 distance, different text vectors receive the same score.
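The L2 scores can be reproduced in the same way. This is a rough Python sketch outside Elasticsearch, assuming `l2norm` computes the Euclidean distance between the two vectors. The helper names are hypothetical and chosen only for illustration.

```python
import math

# Stored vectors copied from step 2 (document id -> my_dense_vector).
docs = {1: [1, 0, 0], 2: [0, 1, 0], 3: [0, 0, 1]}
query_vector = [0, 0, 1]


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_score(query, vector):
    """Mirrors the script source: 1 / (1 + l2norm(query, vector))."""
    return 1 / (1 + l2_distance(query, vector))


scores = {doc_id: l2_score(query_vector, vec) for doc_id, vec in docs.items()}

for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 6))
```

Documents 1 and 2 are both at distance sqrt(2) from the query, so both score 1 / (1 + sqrt(2)), which is about 0.414214 and agrees with the response above up to floating-point rounding.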
6. Query with cosine similarity
GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "status": "published"
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
        "params": {
          "query_vector": [0, 0, 1]
        }
      }
    }
  }
}
Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 2,
        "_source": {
          "my_dense_vector": [0, 0, 1],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 1,
        "_source": {
          "my_dense_vector": [1, 0, 0],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 1,
        "_source": {
          "my_dense_vector": [0, 1, 0],
          "status": "published"
        }
      }
    ]
  }
}
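The cosine scores above can be recomputed with a small Python sketch. It runs outside Elasticsearch and assumes `cosineSimilarity` is the usual dot product divided by the product of the vector lengths; the `+ 1.0` offset is taken from the script source and keeps the score non-negative. The function and variable names are illustrative only.

```python
import math

# Stored vectors copied from step 2 (document id -> my_dense_vector).
docs = {1: [1, 0, 0], 2: [0, 1, 0], 3: [0, 0, 1]}
query_vector = [0, 0, 1]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


def cosine_score(query, vector):
    """Mirrors the script source: cosineSimilarity(query, vector) + 1.0."""
    return cosine_similarity(query, vector) + 1.0


scores = {doc_id: cosine_score(query_vector, vec) for doc_id, vec in docs.items()}

for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, score)
```

Documents 1 and 2 are both orthogonal to the query, so their cosine similarity is 0 and their score is 1, while document 3 points in the same direction as the query and scores 2.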
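All three methods can therefore assign the same score to different vectors. The query below repeats the cosine search with the query vector scaled to [0, 0, 100]: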
GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "status": "published"
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
        "params": {
          "query_vector": [0, 0, 100]
        }
      }
    }
  }
}
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 2,
        "_source": {
          "my_dense_vector": [0, 0, 1],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 1,
        "_source": {
          "my_dense_vector": [1, 0, 0],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 1,
        "_source": {
          "my_dense_vector": [0, 1, 0],
          "status": "published"
        }
      }
    ]
  }
}
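With all three methods, vectors at different positions in space can end up at the same distance from the query vector, and therefore with the same score. Scaling the query vector to [0, 0, 100] leaves the cosine scores unchanged, because cosine similarity depends only on the angle between the vectors and not on their length.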
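The effect of scaling the query can be illustrated with a short Python sketch. It is only an approximation of what the scripts do, written outside Elasticsearch under the assumption that `cosineSimilarity` and `l1norm` follow their textbook definitions. All names here are hypothetical. The L1 comparison is added purely for contrast; the original queries only ran the scaled vector against cosine similarity.

```python
import math

# Stored vectors copied from step 2 (document id -> my_dense_vector).
docs = {1: [1, 0, 0], 2: [0, 1, 0], 3: [0, 0, 1]}


def cosine_score(query, vector):
    """Cosine similarity plus 1.0, as in the cosine script source."""
    dot = sum(q * v for q, v in zip(query, vector))
    lengths = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(v * v for v in vector))
    return dot / lengths + 1.0


def l1_score(query, vector):
    """1 / (1 + Manhattan distance), as in the L1 script source."""
    return 1 / (1 + sum(abs(q - v) for q, v in zip(query, vector)))


def score_all(score_fn, query):
    """Score every stored vector against the query with the given function."""
    return {doc_id: score_fn(query, vec) for doc_id, vec in docs.items()}


unit_query, scaled_query = [0, 0, 1], [0, 0, 100]

cosine_unit = score_all(cosine_score, unit_query)
cosine_scaled = score_all(cosine_score, scaled_query)
l1_unit = score_all(l1_score, unit_query)
l1_scaled = score_all(l1_score, scaled_query)

print("cosine unchanged by scaling:", cosine_unit == cosine_scaled)
print("L1 unchanged by scaling:", l1_unit == l1_scaled)
print("docs 1 and 2 still tie under L1:", l1_scaled[1] == l1_scaled[2])
```

In this sketch the cosine scores are identical for both query vectors, while the L1 scores change with the length of the query. Documents 1 and 2 still tie in every case, because they sit symmetrically with respect to the query direction.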