Elasticsearch: The Definitive Guide: Improving Proximity Matching with Shingles

Article link: https://blog.csdn.net/gitblog_01085/article/details/148523884

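Understanding Shingles: Concept and Use Cases

In Elasticsearch proximity matching, phrase queries and proximity queries are useful, but they have two main limitations:

Too strict: every term must be present for a document to match
Loss of context: even with the flexibility gained from the slop parameter, the associations between words are not preserved

Shingles are designed to address both problems. Instead of indexing only individual words, they index combinations of adjacent words, which preserves more of the surrounding context.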

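How Shingles Work

Basic Concepts

Shingles are essentially word-level n-grams, which makes them well suited to text search:

Unigram: a single word, e.g. ["sue", "ate", "the", "alligator"]
Bigram: a pair of adjacent words, e.g. ["sue ate", "ate the", "the alligator"]
Trigram: a run of three adjacent words, e.g. ["sue ate the", "ate the alligator"]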
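
The three lists above can be reproduced with a few lines of Python. This is only a minimal sketch of word-level n-grams, not Elasticsearch code; the helper name word_ngrams is made up for this illustration.

```python
def word_ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["sue", "ate", "the", "alligator"]

unigrams = word_ngrams(tokens, 1)   # single words
bigrams = word_ngrams(tokens, 2)    # pairs of adjacent words
trigrams = word_ngrams(tokens, 3)   # runs of three adjacent words

print(unigrams)
print(bigrams)
print(trigrams)
```

A list with fewer than n tokens simply yields no n-grams, because the range in the comprehension is empty.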

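Why Shingles Work

Users tend to phrase their searches much like the original documents. Shingles do not solve every problem (a non-contiguous query such as "sue alligator" yields no matching bigram), but in most practical scenarios they noticeably improve the relevance of search results.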
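
The limitation mentioned above can be made concrete with a small Python sketch. It assumes whitespace tokenization and lowercasing, which is simpler than what Elasticsearch does, and the helper name bigram_set is hypothetical.

```python
def bigram_set(text):
    """Lowercase, split on whitespace, and collect adjacent word pairs."""
    words = text.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}

doc = "Sue ate the alligator"

# A query that reuses the document's word order shares bigrams with it.
contiguous_hits = bigram_set("ate the alligator") & bigram_set(doc)

# A query whose words are not adjacent in the document shares none.
scattered_hits = bigram_set("sue alligator") & bigram_set(doc)

print(sorted(contiguous_hits))
print(sorted(scattered_hits))
```

The second query still matches the document on its individual words, which is why the unigram field is kept alongside the shingles rather than replaced by them.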

A Complete Shingles Implementation

1. Create a custom analyzer

First, configure a custom analyzer that includes a shingle token filter:

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": false
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter"
                    ]
                }
            }
        }
    }
}

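Key configuration notes:

output_unigrams: false, combined with min_shingle_size and max_shingle_size both set to 2, ensures that only bigrams are emitted
The analyzer first applies the standard tokenizer, then lowercases the tokens, and finally applies the shingle filter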
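
To see what this chain produces, the following Python sketch imitates it outside Elasticsearch. It is an approximation under stated assumptions: a regular expression stands in for the standard tokenizer, the function name shingle_analyze is invented for this example, and only the three filter options used above are modeled.

```python
import re

def shingle_analyze(text, min_size=2, max_size=2, output_unigrams=False):
    """Approximate the tokenizer -> lowercase -> shingle chain."""
    # Assumption: splitting on non-word characters is close enough to the
    # standard tokenizer for plain English text like the examples here.
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Emit every shingle of the allowed sizes that starts at position i.
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out

bigrams_only = shingle_analyze("Sue ate the alligator")
with_unigrams = shingle_analyze("Sue ate the alligator", output_unigrams=True)

print(bigrams_only)
print(with_unigrams)
```

On a running cluster, the analyze API of the index can be used with my_shingle_analyzer to inspect the real tokens instead of relying on this approximation.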

2. Use a multi-field mapping

The recommended practice is to index unigrams and bigrams separately by using a multi-field mapping:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "shingles": {
                        "type": "text",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}

With this design:

The main field (title) indexes the unigrams
The subfield (title.shingles) indexes the bigrams
The two fields can be queried independently

3. Index sample documents

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }
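
The bulk request body is newline-delimited JSON: one action line followed by one source line per document, with a newline after the last line. When the body is generated by a program rather than typed by hand, it could be built roughly as follows. This is a sketch using only the Python standard library; the helper name bulk_body is hypothetical and the request itself is not sent.

```python
import json

docs = {
    1: "Sue ate the alligator",
    2: "The alligator ate Sue",
    3: "Sue never goes anywhere without her alligator skin purse",
}

def bulk_body(docs):
    """Build an NDJSON body: an action line, then a source line, per document."""
    lines = []
    for doc_id, title in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps({"title": title}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

body = bulk_body(docs)
print(body, end="")
```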

Search Strategies and Results Compared

Basic match query

GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

Analysis of results:

Documents 1 and 2 receive the same score (both contain the, alligator, ate, and sue)
The query cannot tell "Sue ate" apart from "alligator ate"
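
The tie between documents 1 and 2 can be illustrated by listing which query words each title contains. The Python sketch below only counts term overlap with whitespace tokenization; it is an assumption-laden simplification and does not reproduce Elasticsearch relevance scoring, which also weighs term frequency and field length.

```python
docs = {
    1: "Sue ate the alligator",
    2: "The alligator ate Sue",
    3: "Sue never goes anywhere without her alligator skin purse",
}
query = "the hungry alligator ate sue"

def unigram_set(text):
    """Lowercase and split on whitespace."""
    return set(text.lower().split())

# Query words found in each title, ignoring word order entirely.
matched = {
    doc_id: sorted(unigram_set(query) & unigram_set(title))
    for doc_id, title in docs.items()
}

for doc_id, terms in matched.items():
    print(doc_id, terms)
```

Documents 1 and 2 match exactly the same four words, so a bag-of-words comparison has nothing with which to separate them.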

Enhanced query with shingles

GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}
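
What the should clause contributes can be previewed by comparing bigrams instead of single words. As before, this Python sketch assumes whitespace tokenization and only counts overlapping shingles; the helper name bigram_set is hypothetical and no actual scores are computed.

```python
docs = {
    1: "Sue ate the alligator",
    2: "The alligator ate Sue",
    3: "Sue never goes anywhere without her alligator skin purse",
}
query = "the hungry alligator ate sue"

def bigram_set(text):
    """Lowercase, split on whitespace, and collect adjacent word pairs."""
    words = text.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}

# Query bigrams that also occur in each title's shingles.
shared = {
    doc_id: sorted(bigram_set(query) & bigram_set(title))
    for doc_id, title in docs.items()
}

for doc_id, pairs in shared.items():
    print(doc_id, pairs)
```

Only document 2 shares bigrams with the query, so it is the only one that can receive extra score from the title.shingles clause, while the must clause on title still lets all three documents match.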

Improvements: