Elasticsearch Analyzers

normalization

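Normalization converts words to a canonical form (tense, case, and so on) and removes stop words, so that indexed documents and queries match more often and recall improves. For the request below, the english analyzer should return the tokens mr, ma, excel, and teacher: "is" and "an" are dropped as stop words, and "excellent" is stemmed to "excel".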

GET _analyze
{
  "text": "Mr. Ma is an excellent teacher",
  "analyzer": "english"
}

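Character filters

Character filters preprocess the raw text before tokenization, removing or replacing unwanted characters.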

HTML Strip Character Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <a>happy</a>!</p>"
}

// Result
"token": """
I'm so happy!
"""
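To make the effect of html_strip easier to see outside a cluster, here is a rough Python approximation. It is not the Lucene implementation: the list of block-level tags and the function name html_strip are assumptions made for this sketch, which only mimics the three visible effects (block tags become newlines, inline tags vanish, entities are decoded).

```python
import html
import re

# Assumption: a small, hand-picked set of block-level tags; the real filter knows more.
BLOCK_TAG = re.compile(r"</?(?:p|div|br|li|h[1-6])\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def html_strip(text: str) -> str:
    """Approximate the html_strip character filter."""
    text = BLOCK_TAG.sub("\n", text)  # block-level tags turn into line breaks
    text = ANY_TAG.sub("", text)      # remaining (inline) tags are removed
    return html.unescape(text)        # entities such as &apos; are decoded last


print(repr(html_strip("<p>I&apos;m so <a>happy</a>!</p>")))
```

Entities are decoded after the tags are removed so that escaped markup such as &lt;b&gt; is kept as literal text rather than being stripped.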

Mapping Character Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      },
      // ...
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是个垃圾，滚！"
}

// Result
"token": "你就是个**，*！",
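The mapping filter's "key => value" rules can be mimicked in a few lines of Python. This is only a sketch: parse_mappings and apply_mappings are hypothetical helper names, and the longest-key-first matching is written by hand here to imitate the filter's greedy behaviour.

```python
MAPPINGS = ["滚 => *", "垃 => *", "圾 => *"]


def parse_mappings(rules):
    """Turn 'key => value' rule strings into a dict."""
    table = {}
    for rule in rules:
        key, value = (part.strip() for part in rule.split("=>", 1))
        table[key] = value
    return table


def apply_mappings(text, table):
    """Replace keys with their values, trying the longest key first at each position."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No rule matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


masked = apply_mappings("你就是个垃圾，滚！", parse_mappings(MAPPINGS))
print(masked)
```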

Pattern Replace Character Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      // ...
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "手机号是13812347890"
}

// Result
"token": "手机号是138****7890",
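The same masking can be tried locally with Python's re module. This is a sketch rather than what Elasticsearch runs: Elasticsearch uses Java regular expressions, where groups are referenced as $1 and $2, while Python writes them as \1 and \2. The function name mask_phone is made up for the example.

```python
import re

# Same pattern as in the index settings: keep the first 3 and last 4 digits of an 11-digit run.
PHONE = re.compile(r"(\d{3})\d{4}(\d{4})")


def mask_phone(text: str) -> str:
    """Replace the middle four digits of a phone number with asterisks."""
    return PHONE.sub(r"\1****\2", text)


print(mask_phone("手机号是13812347890"))
```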

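Token filters

Token filters operate on the tokens emitted by the tokenizer: stop-word removal, tense normalization (stemming), case conversion, synonym handling, filler-word handling, and so on.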

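Installing the IK analysis plugin

# Replace the version at the end with the one that matches your Elasticsearch version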
./elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/8.14.3

Using a synonym file:

PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym_graph",
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["my_synonym"]
        }
      }
    }
  }
}

GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["大G", "su7", "霸总"]
}

GET test_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": ["奔驰G级", "小米su7", "霸道总裁"]
}

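Using inline synonym rules instead of a file. The request below defines only the filter; it still has to be listed in an analyzer's "filter" array, as my_synonym is in the previous index definition. Delete test_index first, because PUT on an existing index fails: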

PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "大G => 奔驰G级",
            "su7 => 小米su7",
            "霸总 => 霸道总裁"
          ]
        }
      }
    }
  }
}
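The rule syntax used in "synonyms" (and in a synonym file) comes in two shapes: explicit rules with "=>" and comma-separated equivalence lists. The sketch below parses both into a lookup table to show how they differ. It is a simplification under stated assumptions: parse_synonym_rules and rewrite are hypothetical helpers, equivalence lists are treated as fully expanded, and the real filter works on analyzed token streams and supports multi-word synonyms, which this does not.

```python
def parse_synonym_rules(rules):
    """Map each source term to the list of terms that should replace it."""
    table = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit rule: every term on the left is replaced by the terms on the right.
            lhs, rhs = rule.split("=>", 1)
            sources = [s.strip() for s in lhs.split(",")]
            targets = [t.strip() for t in rhs.split(",")]
        else:
            # Equivalence list: every term expands to the whole group.
            sources = targets = [s.strip() for s in rule.split(",")]
        for source in sources:
            bucket = table.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return table


def rewrite(tokens, table):
    """Replace each token by its synonyms, leaving unknown tokens untouched."""
    out = []
    for token in tokens:
        out.extend(table.get(token, [token]))
    return out


RULES = ["大G => 奔驰G级", "su7 => 小米su7", "霸总 => 霸道总裁"]
print(rewrite(["大G", "su7", "霸总", "其他"], parse_synonym_rules(RULES)))
```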

Case conversion:

GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["HELLO WORLD"]
}

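Combined with a script (the condition token filter applies its inner filters only to tokens for which the script returns true):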

GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["uppercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": ["hello cat"]
}
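The logic of the condition filter is simply "apply the inner filter only where the predicate holds". A minimal Python sketch of that idea, with conditional_filter as a hypothetical helper and whitespace splitting standing in for the standard tokenizer:

```python
def conditional_filter(tokens, predicate, transform):
    """Apply transform to the tokens that satisfy predicate; pass the rest through."""
    return [transform(token) if predicate(token) else token for token in tokens]


tokens = "hello cat".split()  # stand-in for the standard tokenizer on this input
result = conditional_filter(tokens, lambda t: len(t) < 5, str.upper)
print(result)
```

Only "cat" is shorter than five characters, so it is the only token that gets uppercased.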

Stop words:

PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["ChatGPT will choose to search the web based on what you ask, or you can manually choose to search by clicking the web search icon."]
}

// Stop words such as "will", "to", "the", "on", "or", and "by" are dropped
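To see which tokens survive, the stop-word step can be approximated in Python. The sketch assumes that _english_ corresponds to Lucene's default English stop set as listed below, and it replaces the standard tokenizer with a simple alphanumeric regex, so it is only an illustration of the filtering step.

```python
import re

# Assumption: the word list behind "_english_" (Lucene's default English stop set).
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def analyze(text: str):
    """Lowercase, split into alphanumeric runs, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


TEXT = (
    "ChatGPT will choose to search the web based on what you ask, or you can "
    "manually choose to search by clicking the web search icon."
)
kept = analyze(TEXT)
print(kept)
```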

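Tokenizers

standard: the default tokenizer. It splits on word boundaries as defined by Unicode Text Segmentation (not only on whitespace) and drops most punctuation; Chinese text is split character by character.

ik_max_word: a Chinese tokenizer provided by the IK plugin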

GET test_index/_analyze
{
  "tokenizer": "ik_max_word",
  "text": ["持之以恒推进全面从严治党"]
}

Custom analyzer

PUT custom_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "my_stopword": {
          "type": "stop",
          "stopwords": ["is", "in", "the", "a", "at", "for"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],  // character filtering
          "filter": ["my_stopword"],  // token filtering: stop words, case and tense conversion, etc.
          "tokenizer": "my_tokenizer"  // tokenization
        }
      }
    }
  }
}

GET custom_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["What's up,hey & man!at a moment!|?for? easy!"]
}
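The three stages of my_analyzer always run in the same order: character filters, then the tokenizer, then token filters. The Python sketch below replays that order on the sample text. It is a hand-written imitation, not Elasticsearch code; the constants mirror the settings above and the function name analyze is an assumption of the example.

```python
import re

CHAR_MAPPINGS = {"&": "and", "|": "or"}             # my_char_filter
SPLIT_PATTERN = re.compile(r"[ ,.!?]")              # my_tokenizer
STOP_WORDS = {"is", "in", "the", "a", "at", "for"}  # my_stopword


def analyze(text: str):
    # 1. Character filter: rewrite the raw string before any splitting.
    for source, target in CHAR_MAPPINGS.items():
        text = text.replace(source, target)
    # 2. Tokenizer: split on the delimiter pattern and drop empty pieces.
    tokens = [piece for piece in SPLIT_PATTERN.split(text) if piece]
    # 3. Token filter: remove stop words (case-sensitive match).
    return [token for token in tokens if token not in STOP_WORDS]


tokens = analyze("What's up,hey & man!at a moment!|?for? easy!")
print(tokens)
```

Because the character filter runs first, "&" and "|" have already become "and" and "or" by the time the tokenizer sees the text, so both survive as ordinary tokens.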

Chinese analyzers

IK analysis plugin: https://github.com/infinilabs/analysis-ik

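IK plugin file layout

IKAnalyzer.cfg.xml: IK configuration file
main.dic: main dictionary
stopword.dic: English stop words; these are not written to the inverted index
Special dictionaries
- quantifier.dic: measure words and units
- suffix.dic: administrative-unit suffixes
- surname.dic: Chinese surnames
- preposition.dic: function words such as prepositions and particles
Custom dictionaries: internet slang, trending terms, coined words, and so on

The plugin provides two analyzers: ik_max_word (finest-grained, emits every plausible word) and ik_smart (coarser-grained, fewer and longer tokens).

Extension dictionaries: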

// IKAnalyzer.cfg.xml
<entry key="ext_dict">custom/msb_extend.dic;custom/msb_extend2.dic</entry>

Hot reloading

From a remote dictionary

<entry key="remote_ext_dict">location</entry>
<entry key="remote_ext_stopwords">location</entry>

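location is a URL. The response must carry the Last-Modified and ETag headers, and the plugin reloads the dictionary when either of them changes. The body is plain text, UTF-8 encoded, with one word per line.

From a database

This requires modifying the IK source code: Dictionary#loadMainDict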
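As an illustration of what such a remote dictionary endpoint could look like, here is a minimal sketch built on Python's standard http.server. Everything about it is an assumption for demonstration: the helper names, the choice of a SHA-256 digest as the ETag, the local bind address, and the decision to answer both HEAD and GET in case the plugin probes with either. A production setup would more likely be a static file behind a regular web server.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path


def dictionary_headers(body: bytes, mtime: float) -> dict:
    """Build the response headers for a dictionary file."""
    return {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        "Last-Modified": formatdate(mtime, usegmt=True),      # changes when the file is touched
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',  # changes when the content changes
    }


def make_handler(dict_path: Path):
    """Create a request handler class that serves one dictionary file."""

    class DictHandler(BaseHTTPRequestHandler):
        def _send(self, include_body: bool) -> None:
            body = dict_path.read_bytes()
            headers = dictionary_headers(body, dict_path.stat().st_mtime)
            self.send_response(200)
            for name, value in headers.items():
                self.send_header(name, value)
            self.end_headers()
            if include_body:
                self.wfile.write(body)

        def do_HEAD(self):
            self._send(include_body=False)

        def do_GET(self):
            self._send(include_body=True)

    return DictHandler


def make_server(dict_path: str, port: int = 8000) -> HTTPServer:
    """Return a server bound to localhost; call serve_forever() on it to start."""
    return HTTPServer(("127.0.0.1", port), make_handler(Path(dict_path)))
```

With a UTF-8 file containing one word per line, make_server("remote_ext.dic").serve_forever() would expose it at http://127.0.0.1:8000/, which could then be used as the location value. The file name remote_ext.dic is hypothetical.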
