Elasticsearch's built-in default analyzer for text analysis is the standard analyzer, which works out of the box with no extra configuration;
if the standard analyzer does not meet your needs, ES provides several other built-in analyzers that likewise require no extra configuration (note that different analyzers support different parameters); for example, the standard analyzer can also be configured with stop words;
if none of the built-in analyzers fit, you can define a custom analyzer. A custom analyzer is assembled from individual analysis components, which gives you finer control over the whole analysis process;
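As a sketch of the stop-word configuration mentioned above, the built-in standard analyzer can be instantiated with stop words in an index's settings (the index name my-index and the analyzer name my_standard are placeholders chosen for illustration):

```
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```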
1. Testing an analyzer
ES provides an endpoint for testing analyzers; you can specify a built-in analyzer in the request:
(1a) Request:
POST _analyze
{
  "analyzer": "whitespace",
  "text": ["transmission control protocol", "world wide web"]
}
(1b) Response:
{
  "tokens" : [
    {
      "token" : "transmission",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "control",
      "start_offset" : 13,
      "end_offset" : 20,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "protocol",
      "start_offset" : 21,
      "end_offset" : 29,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "world",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "wide",
      "start_offset" : 36,
      "end_offset" : 40,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "web",
      "start_offset" : 41,
      "end_offset" : 44,
      "type" : "word",
      "position" : 5
    }
  ]
}
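The start_offset, end_offset, and position fields in the response above can be reproduced with a short sketch (a simplified model, assuming tokens are split on whitespace and that, for an array input, offsets continue across elements with a one-character gap, as the output suggests):

```python
import re

def whitespace_analyze(texts):
    """Mimic the whitespace analyzer's output: one token per
    whitespace-separated word, with character offsets and positions.
    For an array of strings, offsets continue across elements with a
    one-character gap (a simplification of ES's offset handling)."""
    tokens = []
    position = 0   # token position, incremented per token
    offset = 0     # character offset of the current array element
    for text in texts:
        for m in re.finditer(r"\S+", text):
            tokens.append({
                "token": m.group(),
                "start_offset": offset + m.start(),
                "end_offset": offset + m.end(),
                "type": "word",
                "position": position,
            })
            position += 1
        offset += len(text) + 1
    return tokens
```

Running it on the request's text array yields the same six tokens as the response, e.g. "world" at offsets 30-35 with position 3.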
Because an analyzer consists of three kinds of components, you can also combine a character filter, a tokenizer, and token filters in the request to analyze text:
(2a) Request:
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": ["<p>Transmission Control Protocol is a transport layer</p>"]
}
(2b) Response:
{
"tokens" : [
{