Elasticsearch Analyzers

normalization

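Normalization transforms words (for example, reducing tense variants to a common form and unifying case) and removes stop words, so that documents are indexed in a consistent form and recall improves.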

GET _analyze
{
"text": "Mr. Ma is an excellent teacher",
"analyzer": "english"
}
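To make the idea concrete, here is a minimal Python sketch of two normalization steps, lowercasing and stop-word removal. It is an illustration only and not the `english` analyzer: it does no stemming, and the `STOP_WORDS` set and the `normalize` helper are assumptions made for this example.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "is", "the"}

def normalize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("Mr. Ma is an excellent teacher"))
# ['mr', 'ma', 'excellent', 'teacher']

# Differently cased input normalizes to the same tokens, which is what lets
# a query match a document regardless of capitalization.
print(normalize("MR. MA IS AN EXCELLENT TEACHER") == normalize("Mr. Ma is an excellent teacher"))
# True
```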

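Character filters

Character filters preprocess the raw text before tokenization, stripping or replacing unwanted characters.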

HTML Strip Character Filter

PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "html_strip"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": ["my_char_filter"]
}
}
}
}
}

GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "<p>I&apos;m so <a>happy</a>!</p>"
}

// Result
"token": """
I'm so happy!
"""

Mapping Character Filter

PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"滚 => *",
"垃 => *",
"圾 => *"
]
}
},
// ...
}
}
}

GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "你就是个垃圾,滚!"
}

// Result
"token": "你就是个**,*!",

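The effect of a mapping character filter can be imitated in a few lines of Python. This is a rough sketch of the idea (replace the longest matching key at each position), not the Elasticsearch implementation; the function name `mapping_char_filter` is made up for the example.

```python
# Same rules as in the mapping example: each key is replaced by its value.
MAPPINGS = {"滚": "*", "垃": "*", "圾": "*"}

def mapping_char_filter(text, mappings):
    """Scan left to right, replacing the longest matching key at each position."""
    # Try longer keys first; ignore empty keys so the scan always advances.
    keys = sorted((k for k in mappings if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # no rule matched; keep the character
            i += 1
    return "".join(out)

print(mapping_char_filter("你就是个垃圾,滚!", MAPPINGS))
# 你就是个**,*!
```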
Pattern Replace Character Filter

PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d{3})\\d{4}(\\d{4})",
"replacement": "$1****$2"
}
},
// ...
}
}
}

GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "手机号是13812347890"
}

// Result
"token": "手机号是138****7890",

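The same masking rule can be tried outside Elasticsearch with Python's `re` module. This is only a sketch: Elasticsearch uses Java regular expressions, and the helper name `mask_phone` is an assumption for the example. For this simple pattern the two regex dialects behave the same way.

```python
import re

# Same pattern as the char filter: keep the first 3 and the last 4 digits
# of an 11-digit number and hide the 4 digits in between.
PATTERN = re.compile(r"(\d{3})\d{4}(\d{4})")

def mask_phone(text):
    # Python writes group references as \g<1>, where the filter uses $1.
    return PATTERN.sub(r"\g<1>****\g<2>", text)

print(mask_phone("手机号是13812347890"))
# 手机号是138****7890
```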
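Token filters

Token filters process the tokens produced by the tokenizer: stop-word removal, tense conversion, case conversion, synonym conversion, handling of modal particles, and so on.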

Installing the IK analyzer plugin

# Replace the version number with the one matching your Elasticsearch installation
./elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/8.14.3

Using a synonyms file:

PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym_graph",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "ik_max_word",
"filter": ["my_synonym"]
}
}
}
}
}

GET test_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["大G", "su7", "霸总"]
}

GET test_index/_analyze
{
"analyzer": "ik_max_word",
"text": ["奔驰G级", "小米su7", "霸道总裁"]
}

Using inline synonym mappings:

PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym",
"synonyms": [
"大G => 奔驰G级",
"su7 => 小米su7",
"霸总 => 霸道总裁"
]
}
}
}
}
}
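Explicit rules of the form `source => target` replace the source term with the target term; a synonyms file holds the same kind of rules, one per line. The Python sketch below imitates that behavior on a list of tokens. It is a simplification under stated assumptions: every input is treated as a single token, multi-word synonyms and comma-only equivalence rules are ignored, and the helper names `parse_rules` and `apply_synonyms` are invented for the example. In Elasticsearch the result also depends on how the tokenizer splits both the rules and the text.

```python
def parse_rules(lines):
    """Parse explicit "a, b => c" rules into a {source: [targets]} table."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue  # skip blanks, comments, and non-explicit rules
        left, right = line.split("=>", 1)
        targets = [t.strip() for t in right.split(",") if t.strip()]
        for source in left.split(","):
            if source.strip():
                table[source.strip()] = targets
    return table

def apply_synonyms(tokens, table):
    """Replace each token that has a rule with its targets; keep the rest."""
    out = []
    for token in tokens:
        out.extend(table.get(token, [token]))
    return out

RULES = ["大G => 奔驰G级", "su7 => 小米su7", "霸总 => 霸道总裁"]
table = parse_rules(RULES)
print(apply_synonyms(["大G", "su7", "霸总", "其他"], table))
# ['奔驰G级', '小米su7', '霸道总裁', '其他']
```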

Case conversion:

GET test_index/_analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": ["HELLO WORLD"]
}

Combined with a script (conditional token filter):

GET test_index/_analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "condition",
"filter": ["uppercase"],
"script": {
"source": "token.getTerm().length() < 5"
}
}
],
"text": ["hello cat"]
}
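The conditional filter applies the wrapped filter only to tokens for which the script returns true. A minimal Python sketch of the same logic, assuming whitespace tokenization and a made-up helper name `conditional_uppercase`:

```python
def conditional_uppercase(tokens, max_len=5):
    """Uppercase tokens shorter than max_len characters; pass others through."""
    return [t.upper() if len(t) < max_len else t for t in tokens]

print(conditional_uppercase("hello cat".split()))
# ['hello', 'CAT']
```

"hello" has exactly 5 characters, so the condition `length < 5` is false and the token is left unchanged; "cat" is shorter and gets uppercased.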

Stop words:

PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "standard",
"stopwords": "_english_"
}
}
}
}
}

GET test_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["ChatGPT will choose to search the web based on what you ask, or you can manually choose to search by clicking the web search icon."]
}

// Stop words such as "will", "to", "the", and "on" are dropped

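Tokenizers

standard: the tokenizer used by the default analyzer. It splits text on word boundaries (for English this mostly means whitespace and punctuation); Chinese text is split into single characters.

ik_max_word: a Chinese tokenizer provided by the IK plugin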

GET test_index/_analyze
{
"tokenizer": "ik_max_word",
"text": ["持之以恒推进全面从严治党"]
}

Custom analyzers

PUT custom_analysis
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"& => and",
"| => or"
]
}
},
"filter": {
"my_stopword": {
"type": "stop",
"stopwords": ["is", "in", "the", "a", "at", "for"]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "[ ,.!?]"
}
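},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": ["my_char_filter"], // character filters: rewrite the raw text
"filter": ["my_stopword"], // token filters: stop words, case and tense conversion, etc.
"tokenizer": "my_tokenizer" // tokenizer: splits the text into tokens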
}
}
}
}
}

GET custom_analysis/_analyze
{
"analyzer": "my_analyzer",
"text": ["What's up,hey & man!at a moment!|?for? easy!"]
}
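The three stages of this custom analyzer run in a fixed order: character filters, then the tokenizer, then token filters. The Python sketch below walks the sample text through the same stages. It is an approximation rather than the Elasticsearch implementation: it assumes that empty tokens produced by adjacent separators are discarded and that stop words are matched case-sensitively, and the `analyze` function is a name chosen for the example.

```python
import re

CHAR_MAPPINGS = {"&": "and", "|": "or"}
STOP_WORDS = {"is", "in", "the", "a", "at", "for"}
SPLIT_PATTERN = re.compile(r"[ ,.!?]")

def analyze(text):
    # 1. Character filter: rewrite the raw text.
    for src, dst in CHAR_MAPPINGS.items():
        text = text.replace(src, dst)
    # 2. Tokenizer: split on the separator pattern, dropping empty strings.
    tokens = [t for t in SPLIT_PATTERN.split(text) if t]
    # 3. Token filter: remove stop words (exact, case-sensitive match).
    return [t for t in tokens if t not in STOP_WORDS]

print(analyze("What's up,hey & man!at a moment!|?for? easy!"))
# ["What's", 'up', 'hey', 'and', 'man', 'moment', 'or', 'easy']
```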

Chinese analyzers

IK analyzer: https://github.com/infinilabs/analysis-ik

IK analyzer file layout

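  • IKAnalyzer.cfg.xml: IK configuration file
  • main.dic: main dictionary
  • stopword.dic: English stop words; these are not added to the inverted index
  • Special dictionaries
    • quantifier.dic: units of measure and similar words
    • suffix.dic: administrative divisions
    • surname.dic: Chinese surnames (the Hundred Family Surnames)
    • preposition.dic: modal particles
  • Custom dictionaries: internet vocabulary, trending words, coined words, etc.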

The plugin provides two modes: ik_max_word (finest-grained splitting) and ik_smart (coarser granularity).

Extension dictionaries:

// IKAnalyzer.cfg.xml
<entry key="ext_dict">custom/msb_extend.dic;custom/msb_extend2.dic</entry>

Hot reloading

Using a remote dictionary

<entry key="remote_ext_dict">location</entry>
<entry key="remote_ext_stopwords">location</entry>

location is a URL. The response must include the Last-Modified and ETag headers; the body is plain text, UTF-8 encoded, with one word per line.
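As an illustration of what such an endpoint has to return, here is a minimal sketch built on Python's standard `http.server`. Everything in it is an assumption made for the example: the word list, the timestamp, the port, and the choice to derive the ETag from a hash of the body. It is not taken from the IK project. The handler answers both HEAD and GET, so a client that only checks the headers for changes is served as well.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory dictionary; a real service would read a file or database.
WORDS = ["大G", "霸总"]
LAST_CHANGED = 1700000000  # Unix timestamp of the last edit (made up)

def build_response(words, last_changed):
    """Return (headers, body) for a one-word-per-line UTF-8 dictionary."""
    body = ("\n".join(words) + "\n").encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        "Last-Modified": formatdate(last_changed, usegmt=True),
        # The ETag changes whenever the dictionary content changes.
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
    }
    return headers, body

class DictHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        headers, body = build_response(WORDS, LAST_CHANGED)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_HEAD(self):  # header-only check for changes
        self._send(include_body=False)

    def do_GET(self):  # full dictionary download
        self._send(include_body=True)

def serve(port=8000):
    """Blocks forever; call it manually to try the sketch."""
    HTTPServer(("127.0.0.1", port), DictHandler).serve_forever()
```

Calling `serve()` starts the server on 127.0.0.1:8000; the resulting URL is what would go into the `remote_ext_dict` entry.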

Using a database

This requires modifying the IK source code: Dictionary#loadMainDict