Elasticsearch's built-in default analyzer for text analysis is the standard analyzer, which works out of the box with no extra configuration;
if the standard analyzer does not meet your needs, ES provides several other built-in analyzers that likewise require no extra configuration (note that different analyzers support different parameters); for example, the standard analyzer can also be configured with stop words;
if none of the built-in analyzers fit, you can define a custom analyzer. A custom analyzer is assembled from individual analysis components, which gives you finer control over the whole analysis process;
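As a sketch of the stop-word configuration mentioned above, the built-in standard analyzer can be instantiated with stop words in an index's settings (the index name my-index and the analyzer name my_standard are placeholders chosen for illustration):

```
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```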
1. Testing an analyzer
ES provides an endpoint for testing analyzers; you can specify a built-in analyzer in the request:
(1a) Request:
POST _analyze
{
  "analyzer": "whitespace",
  "text": ["transmission control protocol", "world wide web"]
}
(1b) Response:
{
  "tokens" : [
    {
      "token" : "transmission",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "control",
      "start_offset" : 13,
      "end_offset" : 20,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "protocol",
      "start_offset" : 21,
      "end_offset" : 29,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "world",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "wide",
      "start_offset" : 36,
      "end_offset" : 40,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "web",
      "start_offset" : 41,
      "end_offset" : 44,
      "type" : "word",
      "position" : 5
    }
  ]
}
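The start_offset, end_offset, and position fields in the response above can be reproduced with a short sketch (a simplified model, assuming tokens are split on whitespace and that, for an array input, offsets continue across elements with a one-character gap, as the output suggests):

```python
import re

def whitespace_analyze(texts):
    """Mimic the whitespace analyzer's output: one token per
    whitespace-separated word, with character offsets and positions.
    For an array of strings, offsets continue across elements with a
    one-character gap (a simplification of ES's offset handling)."""
    tokens = []
    position = 0   # token position, incremented per token
    offset = 0     # character offset of the current array element
    for text in texts:
        for m in re.finditer(r"\S+", text):
            tokens.append({
                "token": m.group(),
                "start_offset": offset + m.start(),
                "end_offset": offset + m.end(),
                "type": "word",
                "position": position,
            })
            position += 1
        offset += len(text) + 1
    return tokens
```

Running it on the request's text array yields the same six tokens as the response, e.g. "world" at offsets 30-35 with position 3.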
Because an analyzer consists of three kinds of components, you can also combine a character filter, a tokenizer, and token filters in the request to analyze text:
(2a) Request:
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": ["<p>Transmission Control Protocol is a transport layer</p>"]
}
(2b) Response:
{
"tokens" : [
{