Note!
The news index and its data are used throughout this article; its mapping does not configure any analyzer. {{domain}} equals http://127.0.0.1:9200. [For default-analyzer usage, refer to the official documentation. How do you set a default analyzer (settings)? How do you set the analysis mode of a text field (mapping)?]
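A minimal sketch answering the two questions above, assuming a hypothetical demo_news index on ES 6.x and using the built-in standard and whitespace analyzers purely for illustration (the pinyin_news index created later in this article follows the same pattern):
curl -X PUT -H 'Content-Type:application/json' -d '{"settings":{"analysis":{"analyzer":{"default":{"type":"standard"}}}},"mappings":{"_doc":{"properties":{"title":{"type":"text","analyzer":"whitespace"}}}}}' http://127.0.0.1:9200/demo_news
# -d parameter explained
{
"settings": {
"analysis": {
"analyzer": {
"default": { "type": "standard" } # index-wide default analyzer (settings)
}
}
},
"mappings": {
"_doc": { # mapping type for ES 6.x; drop this level on ES 7+
"properties": {
"title": { "type": "text", "analyzer": "whitespace" } # per-field analyzer on a text field (mapping)
}
}
}
}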
Built-in analyzers ship with ES and require no installation. They can only tokenize English text and cannot recognize Chinese phrases. If no analyzer is specified when an index is created, ES uses its default analyzer, standard.
standard is the ES default analyzer and splits on words: English text is split into individual words, Chinese text into individual characters.
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"standard","text":"Introduction to Sanmao"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "standard", # built-in analyzer name; the default analyzer
"text": "Introduction to Sanmao" # text to analyze
}
Analysis result for the text:
The standard analyzer lowercases the English text and splits it into individual words.
{
"tokens": [
{
"token": "introduction",
"start_offset": 0,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "to",
"start_offset": 13,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "sanmao",
"start_offset": 16,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"standard","text":"三毛简介"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "standard", # built-in analyzer name; the default analyzer
"text": "三毛简介" # Chinese text to analyze
}
Analysis result for the text:
The standard analyzer splits the Chinese text into individual characters.
If you match Chinese data against the title field of the news index with this tokenization, the outcome is easy to predict: every document whose title contains any one of the characters 三, 毛, 简 or 介 gets returned. Having such seemingly unrelated data come back is maddening!
{
"tokens": [
{
"token": "三",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "毛",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "简",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "介",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
}
]
}
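To see the over-matching in practice, here is a sketch of a match query against the (standard-analyzed) news index; the exact hits depend on your data, and by default the match query ORs the analyzed terms together:
curl -X GET -H 'Content-Type:application/json' -d '{"query":{"match":{"title":"三毛简介"}}}' http://127.0.0.1:9200/news/_search
# -d: request body explained
{
"query": {
"match": {
"title": "三毛简介" # analyzed into 三 / 毛 / 简 / 介, so any title containing any one of these characters can match
}
}
}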
The simple analyzer splits on non-letter characters such as digits, punctuation and special characters, drops those non-letter tokens, and lowercases uppercase letters.
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"simple","text":"Introduction to Sanmao 三毛简介:三毛生平介绍...全文1800字"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "simple", # built-in simple analyzer
"text": "Introduction to Sanmao 三毛简介:三毛生平介绍...全文1800字" # text to analyze
}
Analysis result for the text:
The simple analyzer lowercases the English text, removes tokens that are neither letters nor Chinese characters, and splits the rest.
{
"tokens": [
{
"token": "introduction",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 0
},
{
"token": "to",
"start_offset": 13,
"end_offset": 15,
"type": "word",
"position": 1
},
{
"token": "sanmao",
"start_offset": 16,
"end_offset": 22,
"type": "word",
"position": 2
},
{
"token": "三毛简介",
"start_offset": 23,
"end_offset": 27,
"type": "word",
"position": 3
},
{
"token": "三毛生平介绍",
"start_offset": 28,
"end_offset": 34,
"type": "word",
"position": 4
},
{
"token": "全文",
"start_offset": 37,
"end_offset": 39,
"type": "word",
"position": 5
},
{
"token": "字",
"start_offset": 43,
"end_offset": 44,
"type": "word",
"position": 6
}
]
}
The whitespace analyzer splits on whitespace only, much like splitting the string on (runs of) spaces; uppercase letters are not lowercased.
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"whitespace","text":"Introduction to Sanmao 三毛简介:三毛生平介绍...全文1800字"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "whitespace", # whitespace analyzer
"text": "Introduction to Sanmao 三毛简介:三毛生平介绍...全文1800字"
}
Analysis result for the text:
whitespace splits on spaces only (runs of spaces included) and leaves case untouched.
{
"tokens": [
{
"token": "Introduction",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 0
},
{
"token": "to",
"start_offset": 13,
"end_offset": 15,
"type": "word",
"position": 1
},
{
"token": "Sanmao",
"start_offset": 16,
"end_offset": 22,
"type": "word",
"position": 2
},
{
"token": "三毛简介:三毛生平介绍...全文1800字",
"start_offset": 25,
"end_offset": 46,
"type": "word",
"position": 3
}
]
}
The stop analyzer removes meaningless words and characters such as the, a, an, of, and lowercases uppercase letters. (It does not work for Chinese stop words such as 这 or 那.)
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"stop","text":"This is a Introduction of Sanmao. | 这是一个三毛介绍。"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "stop", # stop analyzer
"text": "This is a Introduction of Sanmao. | 这是一个三毛介绍。"
}
Analysis result for the text:
The stop analyzer removes English stop words, special characters and meaningless words, and splits on them, but Chinese stop words are untouched. In English it drops words comparable to the Chinese 嗯, 啊, 这, 是, 哦 and keeps the remaining words.
{
"tokens": [
{
"token": "introduction",
"start_offset": 10,
"end_offset": 22,
"type": "word",
"position": 3
},
{
"token": "sanmao",
"start_offset": 26,
"end_offset": 32,
"type": "word",
"position": 5
},
{
"token": "这是一个三毛介绍",
"start_offset": 36,
"end_offset": 44,
"type": "word",
"position": 6
}
]
}
The keyword analyzer does not split the text at all; the entire input is treated as a single token.
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"keyword","text":"This is a Introduction of Sanmao. | 这是一个三毛介绍。"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "keyword", # keyword analyzer
"text": "This is a Introduction of Sanmao. | 这是一个三毛介绍。"
}
Analysis result for the text:
keyword performs no splitting; the match is against the full text.
{
"tokens": [
{
"token": "This is a Introduction of Sanmao. | 这是一个三毛介绍。",
"start_offset": 0,
"end_offset": 45,
"type": "word",
"position": 0
}
]
}
[analysis-icu plugin download link (unzip the archive and copy it straight into the ES plugins directory)]
The archive version must match the ES version you are running.
Instead of downloading a released archive, you can also fetch the source from GitHub, build and package it yourself, and place the result in the ES plugin directory.
#on Windows run the .bat
$ bin> elasticsearch-plugin.bat install analysis-icu
#on Linux run
$ bin>./elasticsearch-plugin install analysis-icu
#list the installed analysis plugins
$ bin> elasticsearch-plugin.bat list
A failed installation can prevent ES from starting or leave the analyzer unavailable.
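After restarting ES, you can also confirm at runtime that the plugin actually loaded; a small check via the _cat API (the output simply lists one line per installed plugin with node name, plugin name and version):
curl -X GET 'http://127.0.0.1:9200/_cat/plugins?v'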
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"icu_analyzer","text":"This is a Introduction of Sanmao | 三毛:她把短暂的一生,活成了十世 By 2024 来源百度知道"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "icu_analyzer",
"text": "This is a Introduction of Sanmao | 三毛:她把短暂的一生,活成了十世 By 2024 来源百度知道"
}
Analysis result for the text:
icu_analyzer tokenizes from left to right; it does not discard special letters or Chinese characters, and it does not produce overlapping word groups, it simply segments phrases from left to right.
{
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "introduction",
"start_offset": 10,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "of",
"start_offset": 23,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "sanmao",
"start_offset": 26,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "三毛",
"start_offset": 35,
"end_offset": 37,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "她",
"start_offset": 38,
"end_offset": 39,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "把",
"start_offset": 39,
"end_offset": 40,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "短暂",
"start_offset": 40,
"end_offset": 42,
"type": "<IDEOGRAPHIC>",
"position": 9
},
{
"token": "的",
"start_offset": 42,
"end_offset": 43,
"type": "<IDEOGRAPHIC>",
"position": 10
},
{
"token": "一生",
"start_offset": 43,
"end_offset": 45,
"type": "<IDEOGRAPHIC>",
"position": 11
},
{
"token": "活",
"start_offset": 46,
"end_offset": 47,
"type": "<IDEOGRAPHIC>",
"position": 12
},
{
"token": "成了",
"start_offset": 47,
"end_offset": 49,
"type": "<IDEOGRAPHIC>",
"position": 13
},
{
"token": "十世",
"start_offset": 49,
"end_offset": 51,
"type": "<IDEOGRAPHIC>",
"position": 14
},
{
"token": "by",
"start_offset": 52,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 15
},
{
"token": "2024",
"start_offset": 55,
"end_offset": 59,
"type": "<NUM>",
"position": 16
},
{
"token": "来源",
"start_offset": 60,
"end_offset": 62,
"type": "<IDEOGRAPHIC>",
"position": 17
},
{
"token": "百度",
"start_offset": 62,
"end_offset": 64,
"type": "<IDEOGRAPHIC>",
"position": 18
},
{
"token": "知道",
"start_offset": 64,
"end_offset": 66,
"type": "<IDEOGRAPHIC>",
"position": 19
}
]
}
The ik analyzer is a standard Chinese analyzer. It segments a field according to its configured dictionaries and also lets users provide their own dictionaries, so besides segmenting by common conventions it supports customized segmentation.
[analysis-ik plugin download address (unzip the archive and copy it into the ES plugins directory)]
The archive version must match the ES version you are running.
Instead of downloading a released archive, you can also fetch the source from GitHub, build and package it yourself, and place the result in the ES plugin directory.
Installation:
#on Windows run the .bat
$ bin> elasticsearch-plugin.bat install analysis-ik
#on Linux run
$ bin>./elasticsearch-plugin install analysis-ik
A failed installation can prevent ES from starting or leave the analyzer unavailable.
The ik analyzer has two modes:
ik_smart: coarse-grained segmentation
ik_max_word: fine-grained segmentation
Test the analyzer from Postman and inspect the result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"ik_smart","text":"本词条 | 来源百度知道"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "ik_smart", # ik analyzer mode
"text": "本词条 | 来源百度知道"
}
Analysis result for the text:
ik_smart: the ik analyzer in coarse-grained mode, producing a coarse segmentation of the Chinese words.
{
"tokens": [
{
"token": "本",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "词条",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "来源",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 2
},
{
"token": "百度",
"start_offset": 8,
"end_offset": 10,
"type": "CN_WORD",
"position": 3
},
{
"token": "知道",
"start_offset": 10,
"end_offset": 12,
"type": "CN_WORD",
"position": 4
}
]
}
Test the analyzer from Postman and inspect the result:
Request (analyzing the same text as above):
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"ik_max_word","text":"本词条 | 来源百度知道"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "ik_max_word", # ik analyzer mode
"text": "本词条 | 来源百度知道"
}
Analysis result for the text:
ik_max_word: the ik analyzer in fine-grained mode, producing a finer segmentation of the Chinese words. Compared with ik_smart it yields more tokens, which means an analyzed query may match more results.
{
"tokens": [
{
"token": "本",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "词条",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "来源",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 2
},
{
"token": "百度",
"start_offset": 8,
"end_offset": 10,
"type": "CN_WORD",
"position": 3
},
{
"token": "百",
"start_offset": 8,
"end_offset": 9,
"type": "TYPE_CNUM",
"position": 4
},
{
"token": "度",
"start_offset": 9,
"end_offset": 10,
"type": "COUNT",
"position": 5
},
{
"token": "知道",
"start_offset": 10,
"end_offset": 12,
"type": "CN_WORD",
"position": 6
}
]
}
The pinyin analysis plugin converts between Chinese characters and pinyin and integrates NLP tooling.
If a field in the index is mapped with the pinyin analyzer, then when searching by pinyin the input pinyin is automatically matched against the fields mapped with the pinyin analysis type, and matching documents are returned.
[Download the pinyin analyzer source from GitHub (pick the branch matching your ES version)]
I did not find a ready-made archive of the pinyin analyzer, so I took the source from GitHub, downloaded it, built and packaged it myself, then unzipped the resulting .zip and copied it into the ES plugins directory.
Update: [pinyin analyzer releases (archive download address)]
After pulling the source locally you need to set up the Maven repository configuration; in short, mvn has to run normally inside the project.
Run mvn clean package; a .zip archive is generated under target/releases.
Copy elasticsearch-analysis-pinyin-6.7.0.zip into the Elasticsearch plugin directory and unzip it.
After unzipping, delete the xxx-pinyin-6.7.0.zip archive and edit plugin-descriptor.properties: its version number must match the ES version you are running! (The steps are consolidated in the shell sketch below.)
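A consolidated shell sketch of the steps above, assuming plugin version 6.7.0 and an ES install under /usr/local/elasticsearch (both are illustrative; adjust to your environment):
# build the plugin inside the elasticsearch-analysis-pinyin source directory
mvn clean package
# create a plugin folder and unzip the generated archive into it
mkdir /usr/local/elasticsearch/plugins/analysis-pinyin
cp target/releases/elasticsearch-analysis-pinyin-6.7.0.zip /usr/local/elasticsearch/plugins/analysis-pinyin/
cd /usr/local/elasticsearch/plugins/analysis-pinyin
unzip elasticsearch-analysis-pinyin-6.7.0.zip && rm elasticsearch-analysis-pinyin-6.7.0.zip
# check that elasticsearch.version in plugin-descriptor.properties matches the running ES version, then restart ES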
A failed installation can prevent ES from starting or leave the analyzer unavailable.
[Reference for the pinyin analyzer parameters]
Test the analyzer from Postman and inspect the pinyin result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"pinyin","text":"sanmao"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "pinyin", # pinyin analyzer
"text": "sanmao"
}
Analysis result for the text:
pinyin: the pinyin analyzer splits the input pinyin into single syllables, syllable combinations and initials.
{
"tokens": [
{
"token": "san",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "sanmao",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "mao",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
}
]
}
Test the analyzer from Postman and inspect the Chinese-to-pinyin result:
Request:
curl -X GET -H 'Content-Type:application/json' -d '{"analyzer":"pinyin","text":"三毛"}' http://127.0.0.1:9200/news/_analyze
# -d: request body explained
{
"analyzer": "pinyin", # pinyin analyzer
"text": "三毛"
}
Analysis result for the text:
pinyin: the pinyin analyzer converts the input Chinese characters into single syllables, syllable combinations and initials.
{
"tokens": [
{
"token": "san",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "sm",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "mao",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
}
]
}
The basic pinyin usage above gives a feel for how pinyin and Chinese input are tokenized, but the effect of pinyin tokenization on queries is less obvious than with the icu, ik or default analyzers. A new index is used below to demonstrate the pinyin analyzer in practice.
The news index above has no analyzer configuration at all. Now create a new index, pinyin_news, whose default analyzer is changed to pinyin. The data structure and data stay the same as news.
# create the pinyin_news index
curl -X PUT -d 'see below' -H 'Content-Type:application/json' http://127.0.0.1:9200/pinyin_news
# -d parameter explained
{
# settings applied when the index is created
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "1"
},
# analyzer configuration
"analysis": {
"analyzer": {
"default": {
# change the default analyzer to pinyin; no extra pinyin options are set, the default pinyin analyzer name is enough. If needed, the pinyin defaults can be tuned per the parameter documentation on GitHub, in which case this block would look different
"type": "pinyin"
}
}
}
},
# field mapping configuration
"mappings": {
# document type: newer ES versions have dropped the _doc mapping type.
"_doc": {
"properties": {
"id": {
"type": "long"
},
# analyzers only apply to text fields; besides changing the default analyzer, the analysis type of the relevant fields still has to be set in the mapping (a field may also use a different analyzer)
# here the title field is analyzed with pinyin
"title": {
"type": "text",
"analyzer": "pinyin"
},
"uv": {
"type": "long"
},
"create_date": {
"type": "date"
},
"status": {
"type": "integer"
},
"remark": {
"type": "text"
}
}
}
}
}
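To verify that the default analyzer and the title mapping took effect, the index definition can simply be read back (a quick check; GET pinyin_news/_mapping returns just the mappings):
curl -X GET http://127.0.0.1:9200/pinyin_news
# the response contains the settings (including analysis.analyzer.default) and the mappings of the index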
# the test request is sent from postman
# bulk-insert data into the pinyin_news index
curl -X POST -d 'see below' -H 'Content-Type:application/json' http://127.0.0.1:9200/pinyin_news/_doc/_bulk
# -d parameter explained; each object sits on its own line (\n)
{"index": {"_id": 1}}
{"id":1,"title":"三毛:她把短暂的一生,活成了十世","uv":120,"create_date":"2024-01-15","status":1,"remark":"来源百度搜索"}
{"index": {"_id": 2}}
{"id":2,"title":"我愿一生流浪 | 三毛《撒哈拉的故事","uv":99,"create_date":"2024-01-14","status":1,"remark":"来源知乎搜索"}
{"index": {"_id": 3}}
{"id":3,"title":"离世33年仍是“华语顶流”,三毛“珍贵录音”揭露人生真相:世界是对的,但我也没错!","uv":80,"create_date":"2024-01-15","status":1,"remark":"来源搜狐"}
{"index": {"_id": 4}}
{"id":4,"title":"三毛逝世30周年丨一场与三毛穿越时空的对话","uv":150,"create_date":"2024-01-16","status":1,"remark":"来源澎湃新闻"}
{"index": {"_id": 5}}
{"id":5,"title":"三毛:从自闭少女到天才作家","uv":141,"create_date":"2024-01-18","status":1,"remark":"来源光明网"}
{"index": {"_id": 6}}
{"id":6,"title":"超全整理!三毛最出名的11本著作,没读过的一定要看看","uv":200,"create_date":"2024-01-23","status":1,"remark":"来源知乎搜索"}
{"index": {"_id": 7}}
{"id":7,"title":"三毛的英文名为什么叫Echo?","uv":300,"create_date":"2024-01-21","status":1,"remark":"来源百度知道"}
{"index": {"_id": 8}}
{"id":8,"title":"毛国家统计局发布第三季度贸易数据","uv":50,"create_date":"2024-01-23","status":1,"remark":"来源中华人民共和国商务部"}
{"index": {"_id": 9}}
{"id":9,"title":"网易公布2022年第三季度财报|净收入|毛利润","uv":131,"create_date":"2024-01-22","status":1,"remark":"来源网易科技"}
{"index": {"_id": 10}}
{"id":10,"title":"单季盈利超100亿元!比亚迪三季度毛利率超特斯拉","uv":310,"create_date":"2024-01-23","status":1,"remark":"来源新浪财经"}
# the bulk body must end with a blank line
The inserted documents are shown in the figure:
Screenshot of the Postman request:
Request:
# the test request is sent from postman
# search the pinyin_news index by pinyin
curl -X POST -d 'see below' -H 'Content-Type:application/json' http://127.0.0.1:9200/pinyin_news/_search
# -d parameter explained
{
# es bool query
"query": {
"bool": {
# the query must contain (sanmaochuanyue) 三毛穿越
"must": {
"match": {
"title": "sanmaochuanyue"
}
}
}
},
"from": 0, # starting offset of the hits (not a page number)
"size": 10000, # maximum number of hits to return
"sort": [],
"aggs": {}
}
Query result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 2.3288696,
"hits": [
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "4",
# highest relevance to the input sanmaochuanyue
"_score": 2.3288696,
"_source": {
"id": 4,
# the title happens to contain 三毛穿越; the other hits, which only match 三毛, have progressively lower _score
"title": "三毛逝世30周年丨一场与三毛穿越时空的对话",
"uv": 150,
"create_date": "2024-01-16",
"status": 1,
"remark": "来源澎湃新闻"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "3",
"_score": 0.5825863,
"_source": {
"id": 3,
"title": "离世33年仍是“华语顶流”,三毛“珍贵录音”揭露人生真相:世界是对的,但我也没错!",
"uv": 80,
"create_date": "2024-01-15",
"status": 1,
"remark": "来源搜狐"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "7",
"_score": 0.3807567,
"_source": {
"id": 7,
"title": "三毛的英文名为什么叫Echo?",
"uv": 300,
"create_date": "2024-01-21",
"status": 1,
"remark": "来源百度知道"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "1",
"_score": 0.36986062,
"_source": {
"id": 1,
"title": "三毛:她把短暂的一生,活成了十世",
"uv": 120,
"create_date": "2024-01-15",
"status": 1,
"remark": "来源百度搜索"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "2",
"_score": 0.3044239,
"_source": {
"id": 2,
"title": "我愿一生流浪 | 三毛《撒哈拉的故事",
"uv": 99,
"create_date": "2024-01-14",
"status": 1,
"remark": "来源知乎搜索"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "8",
"_score": 0.25879097,
"_source": {
"id": 8,
"title": "毛国家统计局发布第三季度贸易数据",
"uv": 50,
"create_date": "2024-01-23",
"status": 1,
"remark": "来源中华人民共和国商务部"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "6",
"_score": 0.25162232,
"_source": {
"id": 6,
"title": "超全整理!三毛最出名的11本著作,没读过的一定要看看",
"uv": 200,
"create_date": "2024-01-23",
"status": 1,
"remark": "来源知乎搜索"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "5",
"_score": 0.24291237,
"_source": {
"id": 5,
"title": "三毛:从自闭少女到天才作家",
"uv": 141,
"create_date": "2024-01-18",
"status": 1,
"remark": "来源光明网"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "9",
"_score": 0.20951384,
"_source": {
"id": 9,
"title": "网易公布2022年第三季度财报|净收入|毛利润",
"uv": 131,
"create_date": "2024-01-22",
"status": 1,
"remark": "来源网易科技"
}
},
{
"_index": "pinyin_news",
"_type": "_doc",
"_id": "10",
"_score": 0.1960371,
"_source": {
"id": 10,
"title": "单季盈利超100亿元!比亚迪三季度毛利率超特斯拉",
"uv": 310,
"create_date": "2024-01-23",
"status": 1,
"remark": "来源新浪财经"
}
}
]
}
}
It works in two steps:
Step 1: the input is analyzed by the pinyin analyzer into single syllables, syllable combinations and initials, and those tokens are used to query the title field in ES.
Step 2: the values of the title field were analyzed the same way with pinyin (at index time), also into single syllables, combinations and initials; those tokens are compared against the tokens of the input, and every document that matches is returned in hits, with higher _score ranked first, i.e. the better the match, the earlier the result appears.
In short, the input (Chinese characters or pinyin) is converted to pinyin by the pinyin analyzer, the queried title field is likewise converted to pinyin by the pinyin analyzer, and only then is the comparison done. Because the analysis rules are identical on both sides, either pinyin or Chinese input returns results.
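To observe step 2 directly, the _analyze API can be pointed at the mapped field instead of a named analyzer; a small sketch using the title value of document 4 above:
curl -X GET -H 'Content-Type:application/json' -d '{"field":"title","text":"三毛逝世30周年丨一场与三毛穿越时空的对话"}' http://127.0.0.1:9200/pinyin_news/_analyze
# -d: request body explained
{
"field": "title", # analyze with whatever analyzer the title mapping uses (pinyin here)
"text": "三毛逝世30周年丨一场与三毛穿越时空的对话"
}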
GET {{index_name}}/_analyze has to use the default analyzer names provided by the plugins (custom analyzer names are not covered in this article). The analysis result can also be requested directly, without specifying an index:
GET {{domain}}/_analyze
Body:
{
"analyzer": "pinyin",
"text": "我愿一生流浪 | 三毛《撒哈拉的故事》"
}