Elasticsearch使用篇 - 词项聚合、稀有词项聚合、多词项聚合

等後那场雪

已于 2023-03-10 16:56:23 修改

阅读量830

点赞数

CC 4.0 BY-SA版权

分类专栏： Elasticsearch 文章标签： elasticsearch 搜索引擎

于 2023-02-23 15:40:40 首次发布

本文链接：https://blog.csdn.net/qq_34561892/article/details/129183161

terms aggregation

即词项分桶聚合。它是 Elasticsearch 最常用的聚合，类同于关系型数据库依据关键字段做 group。

size：返回的词项分桶数量，默认 10。阈值 65535。默认情况下，协调节点向每个分片请求 top size 数量的词项桶，并且在所有分片都响应后，将结果进行缩减并返回给客户端。如果实际词项桶的数量大于 size，则返回的词项桶可能会存在偏差并且不准确。
shard_size：每个分片返回的词项分桶数量，默认为 size * 1.5 + 10。shard_size 越大，结果越准确，但是最终计算结果的成本也越高（每个分片上的优先级队列会更大，并且节点与客户端之间的数据传输量也更大）。shard_size 不能比 size 小，否则 Elasticsearch 会覆盖该属性值，将其设置为和 size 一样的值。
min_doc_count：限制返回的词项分桶中对应文档的命中数量必须满足的最小值，默认 1。
shard_min_doc_count：限制每个分片返回的词项分桶中对应文档的命中数量必须满足的最小值，默认 0。必须小于 min_doc_count。建议为 min_doc_count / 分片数。
show_term_doc_count_error：显示词项分桶统计数据的错误信息，默认 false。用来显示聚合返回的每个词项的错误值，错误值表示文档计数在最差情况下的错误。这个错误值在决定 shard_size 参数值时非常有用。
order：指定排序规则。默认按照 doc_count 逆序排序。
missing：指定字段的值不存在时，给予文档的缺省值。默认会忽略。

修改返回的词项分桶的数量的阈值。

PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 10
  }
}

查询航班目的地最多的 Top 10 国家。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10
      }
    }
  }
}

结果输出如下：

{
  "took" : 38,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "DestCountry" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 3187,
      "buckets" : [
        {
          "key" : "IT",
          "doc_count" : 2371
        },
        {
          "key" : "US",
          "doc_count" : 1987
        },
        {
          "key" : "CN",
          "doc_count" : 1096
        },
        {
          "key" : "CA",
          "doc_count" : 944
        },
        {
          "key" : "JP",
          "doc_count" : 774
        },
        {
          "key" : "RU",
          "doc_count" : 739
        },
        {
          "key" : "CH",
          "doc_count" : 691
        },
        {
          "key" : "GB",
          "doc_count" : 449
        },
        {
          "key" : "AU",
          "doc_count" : 416
        },
        {
          "key" : "PL",
          "doc_count" : 405
        }
      ]
    }
  }
}

词项分桶聚合返回的 doc_count 值是近似的。可以使用参考如下两个指标来判断 doc_count 值是否是准确的。

sum_other_doc_count：查询结果中没有出现的所有词项的文档总数。
doc_count_error_upper_bound：查询结果中没有出现的词项的最大可能文档数。它的值是所有分片返回的最后一个词项的文档数量的和。当 show_term_doc_count_error 参数设置为 true，可以查看每个词项对应的文档数量的误差。这些误差只有当聚合结果按照文档数降序排序时才会被统计。此外，如果按照词项值本身排序、按照文档数升序或者按照子聚合结果排序都无法统计误差，doc_count_error_upper_bound 会返回 -1。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "show_term_doc_count_error": true
      }
    }
  }
}

结果输出如下：

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "DestCountry" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 3187,
      "buckets" : [
        {
          "key" : "IT",
          "doc_count" : 2371,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "US",
          "doc_count" : 1987,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "CN",
          "doc_count" : 1096,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "CA",
          "doc_count" : 944,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "JP",
          "doc_count" : 774,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "RU",
          "doc_count" : 739,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "CH",
          "doc_count" : 691,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "GB",
          "doc_count" : 449,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "AU",
          "doc_count" : 416,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "PL",
          "doc_count" : 405,
          "doc_count_error_upper_bound" : 0
        }
      ]
    }
  }
}

排序

默认按照 doc_count 逆序排序。不推荐按照 doc_count 升序排序或者在子聚合中排序，这会增加统计文档数量的错误。但是如果在单一分片或者聚合使用的字段在索引时用做路由键，这两种情况下却是准确的。

_count：按照数量排序，默认的排序方式。

_key：按照词项排序。

_term：按照词项排序。

按照文档数量的升序排序。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "show_term_doc_count_error": true,
        "order": {
          "_count": "asc"
        }
      }
    }
  }
}

按照 key 的字母顺序升序排序。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "show_term_doc_count_error": true,
        "order": {
          "_key": "asc"
        }
      }
    }
  }
}

查询航班飞行最短的 Top 10 国家。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "order": {
          "min_FlightTimeMin": "desc"
        }
      },
      "aggs": {
        "min_FlightTimeMin": {
          "min": {
            "field": "FlightTimeMin"
          }
        }
      }
    }
  }
}

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "order": {
          "stats_FlightTimeMin.max": "desc"
        }
      },
      "aggs": {
        "stats_FlightTimeMin": {
          "stats": {
            "field": "FlightTimeMin"
          }
        }
      }
    }
  }
}

词项嵌套分桶

嵌套深度建议不要超过三层。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "show_term_doc_count_error": true
      },
      "aggs": {
        "OriginCountry": {
          "terms": {
            "field": "OriginCountry",
            "size": 10
          }
        }
      }
    }
  }
}

脚本方式

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "script": {
          "source": """
            doc['DestCountry'].value + "-DEMO";
          """,
          "lang": "painless"
        },
        "size": 10
      }
    }
  }
}

结果输出如下：

{
  "took" : 56,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "DestCountry" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 3187,
      "buckets" : [
        {
          "key" : "IT-DEMO",
          "doc_count" : 2371
        },
        {
          "key" : "US-DEMO",
          "doc_count" : 1987
        },
        {
          "key" : "CN-DEMO",
          "doc_count" : 1096
        },
        {
          "key" : "CA-DEMO",
          "doc_count" : 944
        },
        {
          "key" : "JP-DEMO",
          "doc_count" : 774
        },
        {
          "key" : "RU-DEMO",
          "doc_count" : 739
        },
        {
          "key" : "CH-DEMO",
          "doc_count" : 691
        },
        {
          "key" : "GB-DEMO",
          "doc_count" : 449
        },
        {
          "key" : "AU-DEMO",
          "doc_count" : 416
        },
        {
          "key" : "PL-DEMO",
          "doc_count" : 405
        }
      ]
    }
  }
}

词项分桶键值过滤

支持精确键值过滤与模糊匹配方式（通过正则表达式）过滤。个人发现 Elasticsearch 7.14 版本模糊匹配方式实际操作中不生效。

include：包括指定的词项分桶值。
exclude：不包括指定的词项分桶值。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "show_term_doc_count_error": true,
        "include": ["US", "IT", "CN"]
      }
    }
  }
}

结果输出如下：

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "DestCountry" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "IT",
          "doc_count" : 2371,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "US",
          "doc_count" : 1987,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "CN",
          "doc_count" : 1096,
          "doc_count_error_upper_bound" : 0
        }
      ]
    }
  }
}

词项分桶分区聚合

很多时候需要统计的分桶数量太多，导致一次运行很慢。可以借助分区机制，在客户端进行合并。对所有可能的分桶进行分区。

partition：分区编号。从 0 开始。

num_partitions：分区数量。

GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "DestCountry": {
      "terms": {
        "field": "DestCountry",
        "size": 10,
        "show_term_doc_count_error": true,
        "include": {
          "partition": 2,
          "num_partitions": 5
        }
      }
    }
  }
}