[ES in Practice] Fuzzy-Matching Queries in Elasticsearch

Original article, last modified 2025-05-27 14:12:01


License: CC 4.0 BY-SA

Tags: #elasticsearch #search-engine

First published 2025-04-24 18:12:56


Fuzzy-Matching Queries in Elasticsearch

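Elasticsearch offers the following fuzzy-matching-style queries and techniques:

Wildcard query: matches terms against a wildcard pattern
Prefix query: matches terms that start with a given prefix
Regexp query: matches terms against a regular expression
Fuzzy query: matches terms within an edit distance of the search term
Mapping a field as both text and keyword, which supports full-text search and exact matching at the same time

Validate the performance of these queries on your own data before relying on them.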

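Wildcard query

The wildcard query supports * and ?. The * matches any character sequence, including the empty one; the ? matches any single character.

Example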

GET /_search
{
    "query": {
        "wildcard" : { "user" : "ki*y" }
    }
}
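To make the meaning of * and ? concrete, the following minimal Python sketch translates a wildcard pattern into a regular expression and tests it against single terms. It only emulates the matching rule and says nothing about how Elasticsearch executes the query; `wildcard_to_regex` and `wildcard_match` are hypothetical helper names introduced for this illustration.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Translate a wildcard pattern (* and ?) into a Python regex string."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any character sequence, including empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def wildcard_match(pattern: str, term: str) -> bool:
    """Return True if the whole term matches the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return re.fullmatch(regex, term, flags=re.DOTALL) is not None


if __name__ == "__main__":
    for term in ["kiy", "kitty", "kimchy", "ki", "ky"]:
        print(term, wildcard_match("ki*y", term))
```

With the pattern "ki*y" from the example, a term must start with "ki" and end with "y"; whatever lies in between, even nothing, is accepted.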

Prefix query

Finds documents in which the field contains a term that starts with the given prefix.

Example

GET gudong20250423001/_search
{ "query": {
    "prefix" : { "message" : "中国" }
  }
}
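The prefix is compared with indexed terms, not with the original text. As a conceptual sketch, assume the terms of a field are available as a sorted Python list (a simplification of the term dictionary); all terms sharing a prefix then sit next to each other. `terms_with_prefix` is a hypothetical helper, not an Elasticsearch API.

```python
from bisect import bisect_left
from typing import List


def terms_with_prefix(sorted_terms: List[str], prefix: str) -> List[str]:
    """Return all terms that start with prefix from a sorted term list."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        matches.append(term)
    return matches


if __name__ == "__main__":
    terms = sorted(["中国", "中国江苏省南京市江北新区", "中文", "江苏", "美国"])
    print(terms_with_prefix(terms, "中国"))
```

Which terms exist depends on the mapping: a keyword field holds the whole value as one term, while an analyzed text field holds the tokens produced by its analyzer.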

Regexp query

Example

GET gudong20250423001/_search
{ "query": {
    "prefix" : { "message" : "中*北" }
  }
}
GET gudong20250423001/_search
{
    "query": {
        "regexp":{
            "message": {
                "value": "中*北",
                "flags" : "INTERSECTION|COMPLEMENT|EMPTY"
            }
        }
    }
}

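The supported regular expression syntax is listed below. (The Lucene regular expression engine is not Perl-compatible; it supports a smaller range of operators.)

Standard operators

Anchoring

Most regular expression engines let a pattern match any part of a string, so you anchor it explicitly with ^ for the start or $ for the end. Lucene patterns are always anchored: the pattern must match the entire string, and ^ and $ are not used as anchors.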
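Lucene patterns are implicitly anchored, which corresponds to full-string matching. The sketch below uses Python's `re` module as a stand-in to show the difference between anchored and unanchored matching. Python's engine is not Lucene's, so only the anchoring behavior is illustrated, and the helper names are hypothetical.

```python
import re


def lucene_style_match(pattern: str, term: str) -> bool:
    """Emulate implicit anchoring: the pattern must cover the whole term."""
    return re.fullmatch(pattern, term) is not None


def unanchored_match(pattern: str, term: str) -> bool:
    """Typical unanchored behavior: the pattern may match any part of the term."""
    return re.search(pattern, term) is not None


if __name__ == "__main__":
    print(lucene_style_match("ab.*", "abcde"))  # whole term is covered
    print(lucene_style_match("abcd", "abcde"))  # trailing "e" is not covered
    print(unanchored_match("abcd", "abcde"))    # substring match succeeds
    # "中*北" means zero or more "中" followed by "北"; it does not span
    # arbitrary text between the two characters, whereas "中.*北" does.
    print(lucene_style_match("中*北", "中国江北"))
    print(lucene_style_match("中.*北", "中国江北"))
```

One consequence is that a pattern such as "中*北" behaves differently from the wildcard of the same spelling: in a regular expression, * repeats the preceding character rather than standing for arbitrary text.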
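Allowed characters

Any Unicode character may be used in a pattern, but some characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \

If the optional operators are enabled (see below), these characters may also be reserved:

# @ & < > ~

Any reserved character can be escaped with a backslash, for example \*, including a literal backslash character: \\

In addition, any characters (except double quotes) are interpreted literally when surrounded by double quotes: john"@smith.com".
Match any character

The period . matches any single character.

For string "abcde":

ab... # match

a.c.e # match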
One-or-more

The plus sign + repeats the preceding shortest pattern one or more times.

For string "aaabbb":

a+b+ # match

aa+bb+ # match

a+.+ # match

aa+bbb+ # match
Zero-or-more

The asterisk * matches the preceding shortest pattern zero or more times.

For string "aaabbb":

a*b* # match

a*b*c* # match

.*bbb.* # match

aaa*bbb* # match
Zero-or-one

The question mark ? matches the preceding shortest pattern zero or one time.

For string "aaabbb":

aaa?bbb? # match

aaaa?bbbb? # match

.....?.? # match

aa?bb? # no match
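Min-to-max

Curly braces {} specify the minimum and, optionally, the maximum number of times the preceding shortest pattern may repeat.

{x,y}: x is the minimum number of repetitions and y is the maximum; y is optional.

{5} repeat exactly 5 times

{2,5} repeat at least 2 and at most 5 times

{2,} repeat at least 2 times

For string "aaabbb":

a{3}b{3} # match

a{2,4}b{2,4} # match

a{2,}b{2,} # match

.{3}.{3} # match

a{4}b{4} # no match

a{4,6}b{4,6} # no match

a{4,}b{4,} # no match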
Grouping

Parentheses () group a sequence into a sub-pattern.

For string "ababab":

(ab)+ # match

ab(ab)+ # match

(..)+ # match

(...)+ # match

(ab)* # match

abab(ab)? # match

ab(ab)? # no match

(ab){3} # match

(ab){1,2} # no match
Alternation

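The pipe symbol | acts as an OR operator: the match succeeds if the pattern on either the left or the right side matches. Alternation applies to the longest pattern, not the shortest.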

For string "aabb":

aabb|bbaa # match

aacc|bb # no match

aa(cc|bb) # match

a+|b+ # no match

a+b+|b+a+ # match

a+(b|c)+ # match
Character classes

A range of candidate characters can be expressed as a character class by enclosing them in square brackets [].

A leading ^ negates the character class.

A - denotes a character range unless it is the first character or is escaped with a backslash.

The allowed forms are:

[abc] # 'a' or 'b' or 'c'

[a-c] # 'a' or 'b' or 'c'

[-abc] # '-' or 'a' or 'b' or 'c'

[abc\-] # '-' or 'a' or 'b' or 'c'

[^abc] # any character except 'a' or 'b' or 'c'

[^a-c] # any character except 'a' or 'b' or 'c'

[^-abc] # any character except '-' or 'a' or 'b' or 'c'

[^abc\-] # any character except '-' or 'a' or 'b' or 'c'

For string "abcd":

ab[cd]+ # match

[a-d]+ # match

[^a-d]+ # no match
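A few of the match and no-match examples from the standard operators can be spot-checked with Python's `re` module, using `re.fullmatch` to emulate the implicit anchoring. This assumes that Python and Lucene agree on these basic operators; that assumption does not extend to the optional operators described next.

```python
import re

# (pattern, string, expected) triples taken from the examples in this section.
CASES = [
    ("ab...", "abcde", True),
    ("a.c.e", "abcde", True),
    ("a+b+", "aaabbb", True),
    ("aa+bbb+", "aaabbb", True),
    ("a*b*c*", "aaabbb", True),
    ("aa?bb?", "aaabbb", False),
    ("a{2,4}b{2,4}", "aaabbb", True),
    ("a{4,}b{4,}", "aaabbb", False),
    ("ab(ab)?", "ababab", False),
    ("(ab){1,2}", "ababab", False),
    ("a+|b+", "aabb", False),
    ("aa(cc|bb)", "aabb", True),
    ("[^a-d]+", "abcd", False),
]


def full_match(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


def check_cases(cases):
    """Return the cases whose actual result differs from the expected one."""
    return [(p, s, e) for p, s, e in cases if full_match(p, s) != e]


if __name__ == "__main__":
    print(check_cases(CASES))  # an empty list means every case agreed
```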

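Optional operators

The flags parameter selects which optional operators the regular expression supports. Possible values include ALL, COMPLEMENT, INTERSECTION, INTERVAL, and ANYSTRING.

flags defaults to ALL. Different flag combinations (joined with |) can be used to enable or disable specific operators: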

{
    "regexp": {
        "username": {
            "value": "john~athon<1-5>",
            "flags": "COMPLEMENT|INTERVAL"
        }
    }
}

Complement

With COMPLEMENT enabled, the shortest expression that follows a tilde ~ is negated.

For instance, "ab~cd" means:
- Starts with a
- Followed by b
- Followed by a string of any length that is anything but c
- Ends with d
For the string "abcdef":

ab~df # match

ab~cf # match

ab~cdef # no match

a~(cb)def # match

a~(bc)def # no match

Enabled with the COMPLEMENT or ALL flags.
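Python's `re` module has no complement operator, so the examples can only be emulated. The sketch below handles one restricted shape, a literal prefix, a negated sub-pattern, and a literal suffix, which is enough to reproduce the five examples for "abcdef". `complement_match` is a hypothetical helper and not a general implementation of the operator.

```python
import re


def complement_match(prefix: str, negated: str, suffix: str, term: str) -> bool:
    """Emulate the pattern  prefix~(negated)suffix  for literal prefix/suffix.

    The term matches when it starts with prefix, ends with suffix, and the
    part in between is any string that does NOT fully match negated.
    """
    if len(term) < len(prefix) + len(suffix):
        return False
    if not (term.startswith(prefix) and term.endswith(suffix)):
        return False
    middle = term[len(prefix):len(term) - len(suffix)]
    return re.fullmatch(negated, middle) is None


if __name__ == "__main__":
    term = "abcdef"
    print(complement_match("ab", "d", "f", term))    # ab~df
    print(complement_match("ab", "c", "f", term))    # ab~cf
    print(complement_match("ab", "c", "def", term))  # ab~cdef
    print(complement_match("a", "cb", "def", term))  # a~(cb)def
    print(complement_match("a", "bc", "def", term))  # a~(bc)def
```

In "ab~cdef" the part between "ab" and "def" is exactly "c", which is the one string the complement excludes, so the term is rejected.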
Interval

With INTERVAL enabled, angle brackets <> denote a numeric range.

For string: "foo80":

foo<1-100> # match

foo<01-100> # match

foo<001-100> # no match

Enabled with the INTERVAL or ALL flags.
Intersection

With INTERSECTION enabled, the ampersand & joins two patterns, and both of them must match.

For string "aaabbb":

aaa.+&.+bbb # match

aaa&bbb # no match

Using this feature usually means that you should rewrite your regular expression.

Enabled with the INTERSECTION or ALL flags.
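Intersection can be emulated in Python by requiring that every sub-pattern match the whole term. This is a sketch of the semantics only, with `intersection_match` as a hypothetical helper name.

```python
import re


def intersection_match(patterns, term: str) -> bool:
    """Emulate  p1&p2&... : the term must fully match every pattern."""
    return all(re.fullmatch(p, term) is not None for p in patterns)


if __name__ == "__main__":
    print(intersection_match(["aaa.+", ".+bbb"], "aaabbb"))  # both sides match
    print(intersection_match(["aaa", "bbb"], "aaabbb"))      # neither side covers the term
```

Each side has to cover the entire term on its own, which is why "aaa&bbb" fails: "aaabbb" is neither exactly "aaa" nor exactly "bbb".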
Any string

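The at sign @ matches any entire string. Combined with the intersection and complement operators above, it can express "everything except" a given pattern.

For instance:

@&~(foo.+) # anything except string beginning with "foo"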

Enabled with the ANYSTRING or ALL flags.
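The "everything except" idiom reduces to a negated full match. A minimal emulation in Python, with `anything_except` as a hypothetical helper name:

```python
import re


def anything_except(negated: str, term: str) -> bool:
    """Emulate  @&~(negated) : accept every string except those matching negated."""
    return re.fullmatch(negated, term, flags=re.DOTALL) is None


if __name__ == "__main__":
    for term in ["foobar", "bar", "foo"]:
        print(term, anything_except("foo.+", term))
```

Because .+ needs at least one character, the bare string "foo" does not match "foo.+" and is therefore accepted by this pattern.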

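Fuzzy query

The fuzzy query generates all terms that lie within the maximum edit distance specified by fuzziness, then checks the term dictionary to find which of those generated terms actually exist in the index. The final query uses at most max_expansions matching terms.

Example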

GET /_search
{
    "query": {
       "fuzzy" : { "user" : "ki" }
    }
}

GET /_search
{
    "query": {
        "fuzzy" : {
            "user" : {
                "value": "ki",
                "boost": 1.0,
                "fuzziness": 2,
                "prefix_length": 0,
                "max_expansions": 100
            }
        }
    }
}

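Parameter	Description
`fuzziness`	The maximum edit distance. Defaults to `AUTO`. See the Fuzziness section for details.
`prefix_length`	The number of initial characters that will not be "fuzzified". This helps reduce the number of terms that must be examined. Defaults to `0`.
`max_expansions`	The maximum number of terms the fuzzy query will expand to. Defaults to `50`.
`transpositions`	Whether fuzzy transpositions (`ab` → `ba`) are supported. Default is `false`.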

Fuzziness

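When querying text or keyword fields, fuzziness is interpreted as a Levenshtein edit distance: the number of single-character changes needed to make one string the same as another.

The Levenshtein edit distance between two strings is the minimum number of edit operations needed to turn one string into the other, where an edit operation is the insertion, deletion, or substitution of a single character.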
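For reference, the standard Levenshtein distance, which counts single-character insertions, deletions, and substitutions, can be computed with a short dynamic program. This is the textbook algorithm written as a plain Python sketch; it is not the code Elasticsearch or Lucene runs.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("ki", "kid"))
    print(levenshtein("kitten", "sitting"))
```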

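The fuzziness parameter can be specified as:

Value	Description
`0`, `1`, `2`	A number: the maximum allowed Levenshtein edit distance (or number of edits).
`AUTO`	Derives the edit distance from the length of the term. Low and high distance arguments may optionally be given as `AUTO:[low],[high]`. If they are not specified, the defaults are 3 and 6, equivalent to `AUTO:3,6`, which gives:
Length `0..2`: must match exactly
Length `3..5`: one edit allowed
Length `>5`: two edits allowed
`AUTO` is generally the preferred value for `fuzziness`.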
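The length thresholds for `AUTO` can be written out as a small function. This sketch follows the rule stated above and assumes that term length is counted in characters; `auto_fuzziness` is a hypothetical name, not an Elasticsearch API.

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Edit distance allowed for term under AUTO:low,high (defaults 3 and 6)."""
    length = len(term)
    if length < low:
        return 0  # very short terms must match exactly
    if length < high:
        return 1  # medium-length terms allow one edit
    return 2      # longer terms allow two edits


if __name__ == "__main__":
    for term in ["ki", "kim", "kimch", "kimchy"]:
        print(term, auto_fuzziness(term))
```

Under this rule the two-character term "ki" used in the earlier fuzzy examples must match exactly when `fuzziness` is `AUTO`.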

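Mapping a `text` field with a `keyword` sub-field

A field that is mapped as both keyword and text supports full-text search and exact matching at the same time.

Example

Create the indices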

PUT gudong20250423001
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

PUT gudong20250423002
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "message": {
          "type": "keyword"
        }
      }
    }
  }
}

Index the data

POST gudong20250423001/doc
{"message":"中国江苏省南京市江北新区"}

POST gudong20250423002/doc
{"message":"中国江苏省南京市江北新区"}

Query results

GET gudong20250423001/_search
{
  "query": {
    "bool": {
      "filter": [{ "match": { "message": "中国" } }]
    }
  }
}


{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0,
        "hits": [
            {
                "_index": "gudong20250423001",
                "_type": "doc",
                "_id": "AZZhpkVj-8S4QQ2j15fP",
                "_score": 0,
                "_source": {
                    "message": "中国江苏省南京市江北新区"
                }
            }
        ]
    }
}

GET gudong20250423002/_search
{
  "query": {
    "bool": {
      "filter": [{ "match": { "message": "中国" } }]
    }
  }
}

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

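A field mapped only as keyword does not support this kind of partial matching. (In the text case, the value was presumably tokenized first, and the match then ran against the tokens.)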
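One way to picture the difference is to compare the terms each mapping would index. The sketch below assumes the default standard analyzer, which emits one token per Chinese character, and a match query with its default OR operator; the three helpers are hypothetical stand-ins, and real analysis is more involved.

```python
def analyze_text(value: str) -> list:
    """Rough stand-in for the standard analyzer on Chinese text:
    one token per character (an assumption for this sketch)."""
    return [ch for ch in value if not ch.isspace()]


def analyze_keyword(value: str) -> list:
    """A keyword field indexes the whole value as a single term."""
    return [value]


def match(query: str, indexed_terms: list, query_analyzer) -> bool:
    """match query with the default OR operator: any query token may hit."""
    return any(token in indexed_terms for token in query_analyzer(query))


if __name__ == "__main__":
    doc = "中国江苏省南京市江北新区"
    print(match("中国", analyze_text(doc), analyze_text))        # text field
    print(match("中国", analyze_keyword(doc), analyze_keyword))  # keyword field
    print(match(doc, analyze_keyword(doc), analyze_keyword))     # exact value
```

Under these assumptions the keyword field only hits when the query equals the complete stored value, which is consistent with the zero hits returned for gudong20250423002.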

Bonus

Adding formats to the format of an existing date field is supported in 5.4 but not in 6.x.

PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  }
}

PUT my_index/_mapping/doc
{
  "properties": {
    "date": {
      "type":   "date",
      "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd HH:mm:ss.S"
    }
  }
}
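The || separator lists alternative formats for parsing a date value. The following Python sketch mimics that idea on the client side by trying each format in turn. The `strptime` patterns are hand-written equivalents of some formats in the mapping, values are assumed to be UTC, and `parse_date` is a hypothetical helper rather than anything Elasticsearch provides.

```python
from datetime import datetime, timezone

# Python strptime equivalents of some formats from the mapping above
# (a hand-written mapping for this sketch, not an Elasticsearch API).
FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "epoch_millis",
    "%Y-%m-%d %H:%M:%S.%f",
]


def parse_date(value: str) -> datetime:
    """Try each format in order, like the ||-separated list, and return UTC."""
    for fmt in FORMATS:
        try:
            if fmt == "epoch_millis":
                return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # this format does not fit, try the next one
    raise ValueError(f"no configured format matches {value!r}")


if __name__ == "__main__":
    for raw in ["2025-04-24 18:12:56", "2025-04-24", "86400000"]:
        print(raw, "->", parse_date(raw).isoformat())
```

A helper like this can be handy for checking sample values against the intended formats before indexing them.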