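Elasticsearch Analyzers
Normalization
Normalization transforms tokens (for example, reducing tense and lowercasing) and removes stop words, so that documents are standardized and recall improves.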
GET _analyze
{
"text": "Mr. Ma is an excellent teacher",
"analyzer": "english"
}
Character filters
Character filters preprocess the text before tokenization and strip out unwanted characters.
HTML Strip Character Filter
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "html_strip"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": ["my_char_filter"]
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "<p>I'm so <a>happy</a>!</p>"
}
// Result
"token": """
I'm so happy!
"""
Mapping Character Filter
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"滚 => *",
"垃 => *",
"圾 => *"
]
}
},
// ...
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "你就是个垃圾,滚!"
}
// Result
"token": "你就是个**,*!",
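The substitution performed by the mapping character filter can be approximated outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code; `MAPPINGS` and `apply_mapping` are hypothetical names, and the longest-key-first ordering is an assumption meant to mimic greedy matching.

```python
import re

# Hypothetical stand-in for the "mappings" list of the char filter.
MAPPINGS = {"滚": "*", "垃": "*", "圾": "*"}


def apply_mapping(text: str, mappings: dict) -> str:
    """Replace every key found in text with its mapped value."""
    # Try longer keys first so that overlapping keys resolve to the longest match.
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mappings[m.group(0)], text)


print(apply_mapping("你就是个垃圾,滚!", MAPPINGS))
```

Because the replacement happens before tokenization, the tokenizer only ever sees the masked text.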
Pattern Replace Character Filter
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d{3})\\d{4}(\\d{4})",
"replacement": "$1****$2"
}
},
// ...
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "手机号是13812347890"
}
// Result
"token": "手机号是138****7890",
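The regular expression can be checked on its own before it goes into the index settings. A small Python sketch with the same pattern follows; `mask_phone` is a hypothetical helper, and note that Python writes group references as `\1` and `\2` where the filter's Java syntax uses `$1` and `$2`.

```python
import re

# Same pattern as the char filter: keep the first 3 and last 4 digits.
PHONE = re.compile(r"(\d{3})\d{4}(\d{4})")


def mask_phone(text: str) -> str:
    """Mask the middle four digits of every 11-digit run in text."""
    return PHONE.sub(r"\1****\2", text)


print(mask_phone("手机号是13812347890"))  # 手机号是138****7890
```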
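Token filters
Token filters handle stop words, tense normalization, case conversion, synonym conversion, modal particles, and so on.
Installing the IK analyzer
# Run from the Elasticsearch bin directory; replace the version with the one matching your Elasticsearch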
./elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/8.14.3
Using a synonym file:
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym_graph",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "ik_max_word",
"filter": ["my_synonym"]
}
}
}
}
}
GET test_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["大G", "su7", "霸总"]
}
GET test_index/_analyze
{
"analyzer": "ik_max_word",
"text": ["奔驰G级", "小米su7", "霸道总裁"]
}
Using inline synonym mappings:
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym",
"synonyms": [
"大G => 奔驰G级",
"su7 => 小米su7",
"霸总 => 霸道总裁"
]
}
}
}
}
}
Case conversion:
GET test_index/_analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": ["HELLO WORLD"]
}
Combining a filter with a script:
GET test_index/_analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "condition",
"filter": ["uppercase"],
"script": {
"source": "token.getTerm().length() < 5"
}
}
],
"text": ["hello cat"]
}
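The condition filter applies the wrapped filter only to tokens for which the script returns true. The sketch below mirrors that logic in plain Python for the request above; `conditional_upper` is a hypothetical name, and whitespace splitting stands in for the standard tokenizer, which is enough for this input.

```python
def conditional_upper(tokens, max_len=5):
    """Uppercase only the tokens shorter than max_len; leave the rest untouched."""
    return [t.upper() if len(t) < max_len else t for t in tokens]


tokens = "hello cat".split()
print(conditional_upper(tokens))  # ['hello', 'CAT']
```

"hello" has exactly five characters, so it fails the `< 5` test and stays lowercase.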
Stop words:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "standard",
"stopwords": "_english_"
}
}
}
}
}
GET test_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["ChatGPT will choose to search the web based on what you ask, or you can manually choose to search by clicking the web search icon."]
}
// Words such as will, to, the, and on are dropped
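To see which tokens survive, the stop-word step can be imitated in Python. This is a rough sketch under two assumptions: the set below is the default Lucene English stop-word list that `_english_` refers to, and `\w+` matching on lowercased text is close enough to the standard analyzer for this sentence. `analyze` is a hypothetical name.

```python
import re

# Assumed contents of the _english_ stop-word list (Lucene's default English set).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyze(text: str) -> list:
    """Lowercase, split into word tokens, then drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


print(analyze("ChatGPT will choose to search the web based on what you ask"))
```

Words outside the list, such as "what" and "you", are kept even though they carry little meaning; a custom `stopwords` array would be needed to drop them.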
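Tokenizers
standard: the default analyzer; it splits text on word boundaries as defined by Unicode Text Segmentation (not only on whitespace), and splits Chinese text into single characters
ik_max_word: a Chinese tokenizer provided by the IK plugin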
GET test_index/_analyze
{
"tokenizer": "ik_max_word",
"text": ["持之以恒推进全面从严治党"]
}
Custom analyzer
PUT custom_analysis
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"& => and",
"| => or"
]
}
},
"filter": {
"my_stopword": {
"type": "stop",
"stopwords": ["is", "in", "the", "a", "at", "for"]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "[ ,.!?]"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": ["my_char_filter"], // character filtering
"filter": ["my_stopword"], // token filtering: stop words, case and tense conversion, etc.
"tokenizer": "my_tokenizer" // tokenization
}
}
}
}
}
GET custom_analysis/_analyze
{
"analyzer": "my_analyzer",
"text": ["What's up,hey & man!at a moment!|?for? easy!"]
}
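The three stages always run in a fixed order: character filters, then the tokenizer, then token filters, regardless of the order in which they are listed in the settings. The following Python sketch walks the sample text through the same stages; it only approximates the custom analyzer above, and all names in it are hypothetical.

```python
import re

CHAR_MAPPINGS = {"&": "and", "|": "or"}              # my_char_filter
SPLIT_PATTERN = re.compile(r"[ ,.!?]")               # my_tokenizer
STOPWORDS = {"is", "in", "the", "a", "at", "for"}    # my_stopword


def analyze(text: str) -> list:
    # 1. Character filter: rewrite the raw text.
    for src, dst in CHAR_MAPPINGS.items():
        text = text.replace(src, dst)
    # 2. Tokenizer: split on the separator pattern and discard empty pieces.
    tokens = [t for t in SPLIT_PATTERN.split(text) if t]
    # 3. Token filter: drop stop words (case-sensitive match).
    return [t for t in tokens if t not in STOPWORDS]


print(analyze("What's up,hey & man!at a moment!|?for? easy!"))
# ["What's", 'up', 'hey', 'and', 'man', 'moment', 'or', 'easy']
```

Since the character filter runs first, "&" and "|" have already become "and" and "or" by the time the tokenizer splits the text.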
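Chinese analyzers
IK analyzer: https://github.com/infinilabs/analysis-ik
IK analyzer file layout
- IKAnalyzer.cfg.xml: the IK configuration file
- main.dic: the main dictionary
- stopword.dic: English stop words, which are not written to the inverted index
- Special dictionaries
  - quantifier.dic: units of measurement and similar quantifiers
  - suffix.dic: administrative divisions
  - surname.dic: common Chinese surnames
  - preposition.dic: modal particles
- Custom dictionaries: web vocabulary, trending words, coined words, and so on
IK provides two analyzers: ik_max_word and ik_smart (coarser-grained).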
Extension dictionaries:
// IKAnalyzer.cfg.xml
<entry key="ext_dict">custom/msb_extend.dic;custom/msb_extend2.dic</entry>
Hot updates
Using a remote dictionary
<entry key="remote_ext_dict">location</entry>
<entry key="remote_ext_stopwords">location</entry>
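location is a URL. The response must carry the Last-Modified and ETag headers, and IK reloads the dictionary when either value changes. The body is plain text in UTF-8 with one word per line.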
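Any HTTP server that returns those two headers can act as the remote dictionary. Below is a minimal sketch using only the Python standard library; the file name `hot_words.dic`, the port, and all function names are assumptions for illustration, and HEAD is handled alongside GET in case the client polls with it.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Hypothetical word list: UTF-8 text, one word per line.
DICT_PATH = Path("hot_words.dic")


def build_headers(body: bytes, mtime: float) -> dict:
    """Build response headers for a dictionary body last modified at mtime."""
    return {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        # Last-Modified follows the file's mtime; ETag is a hash of the content.
        "Last-Modified": formatdate(mtime, usegmt=True),
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
    }


class DictHandler(BaseHTTPRequestHandler):
    def _respond(self, include_body: bool) -> None:
        body = DICT_PATH.read_bytes()
        self.send_response(200)
        for name, value in build_headers(body, DICT_PATH.stat().st_mtime).items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self) -> None:
        self._respond(include_body=True)

    def do_HEAD(self) -> None:
        self._respond(include_body=False)


def serve(port: int = 8000) -> None:
    """Serve the dictionary file until interrupted."""
    HTTPServer(("", port), DictHandler).serve_forever()
```

Calling `serve()` starts the server; editing `hot_words.dic` then changes both header values on the next request, with no restart of the server itself.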
Using a database
This requires modifying the IK source code: Dictionary#loadMainDict