Spark AQE Configuration and Source Code Notes

lilyjoke

Last modified: 2023-07-13 17:52:03

License: CC 4.0 BY-SA

Tags: spark, big data, distributed systems

First published: 2023-07-05 22:02:16

Article link: https://blog.csdn.net/lilyjoke/article/details/131480204

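This article looks at the Adaptive Query Execution (AQE) mechanism in Spark 3.2.0. It covers AQE configuration options such as `spark.sql.adaptive.enabled`, and the call flow through key steps such as `toRdd`, `executedPlan`, and `prepareForExecution`. It also outlines the optimization rules AQE applies before and after stage creation, and where they are used.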

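Quoted from the web: AQE (Adaptive Query Execution) is a mechanism introduced in Spark 3.0. Using statistics collected from the shuffle map stage, it dynamically adjusts and corrects the parts of the logical and physical plans that have not yet executed, according to preset rules, and thereby optimizes the original query at runtime.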

This article is based on Spark 3.2.0 and covers AQE configuration and source code. Most of the implementation lives in the following package:

org.apache.spark.sql.execution.adaptive

1. AQE Flow Overview

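To start with a conclusion drawn from the actual code: once the physical plan (sparkPlan) has been generated, AQE and the other preparation rules are applied to it to produce the final physical plan that will be executed (executedPlan). In QueryExecution this looks roughly as follows: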

sparkPlan -> AQE -> executedPlan

  lazy val sparkPlan: SparkPlan = withCteMap {
    // We need to materialize the optimizedPlan here because sparkPlan is also tracked under
    // the planning phase
    assertOptimized()
    executePhase(QueryPlanningTracker.PLANNING) {
      // Clone the logical plan here, in case the planner rules change the states of the logical
      // plan.
      QueryExecution.createSparkPlan(sparkSession, planner, optimizedPlan.clone())
    }
  }

  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = withCteMap {
    // We need to materialize the optimizedPlan here, before tracking the planning phase, to ensure
    // that the optimization time is not counted as part of the planning phase.
    assertOptimized()
    executePhase(QueryPlanningTracker.PLANNING) {
      // clone the plan to avoid sharing the plan instance between different stages like analyzing,
      // optimizing and planning.
      QueryExecution.prepareForExecution(preparations, sparkPlan.clone())
    }
  }
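
The difference between the two lazy vals can be observed from user code. The following is a minimal sketch, assuming a local Spark 3.2.x session with spark-sql on the classpath; the names `AqePlanInspection`, `buildSession`, and `describePlans` are hypothetical and only serve the illustration. It runs a small aggregation (which needs a shuffle) and compares the root node of `sparkPlan` with that of `executedPlan`. With AQE enabled, the root of `executedPlan` is expected to be an `AdaptiveSparkPlanExec` wrapper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Hypothetical helper object, not part of Spark.
object AqePlanInspection {

  // Local session with AQE explicitly enabled.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName("aqe-plan-inspection")
      .config("spark.sql.adaptive.enabled", "true")
      .getOrCreate()

  // Returns (root node class name of sparkPlan,
  //          whether executedPlan is wrapped by AdaptiveSparkPlanExec).
  def describePlans(spark: SparkSession): (String, Boolean) = {
    import spark.implicits._

    // groupBy requires a shuffle, so AQE has something to work with.
    val df = Seq((1, "a"), (2, "b"), (1, "c"))
      .toDF("id", "v")
      .groupBy("id")
      .count()

    val qe = df.queryExecution
    val sparkPlanRoot = qe.sparkPlan.getClass.getSimpleName
    val wrappedByAqe = qe.executedPlan.isInstanceOf[AdaptiveSparkPlanExec]
    (sparkPlanRoot, wrappedByAqe)
  }

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    val (root, adaptive) = describePlans(spark)
    println(s"sparkPlan root: $root")
    println(s"executedPlan is adaptive: $adaptive")
    spark.stop()
  }
}
```

Both values are read through `df.queryExecution`, so no action has to run for the plans to be built.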

1.1 AQE Configuration

spark.sql.adaptive.enabled is the switch that turns AQE on or off. In Spark 3.2.0 it is enabled by default.

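spark.sql.adaptive.forceApply is the switch that forces AQE to be applied. By default it is off, and AQE skips plans that contain neither a shuffle (exchange) nor a subquery, because AQE brings no performance benefit in those cases. Setting it to true (together with spark.sql.adaptive.enabled) applies AQE to all supported queries. The config is marked internal, which means it is left out of the public configuration documentation; it can still be set explicitly.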
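
As a rough illustration of the two switches, the sketch below sets both keys when building a session and reads them back. It assumes a local Spark 3.2.x session; `AqeConfigExample` and `buildSession` are hypothetical names. Although `spark.sql.adaptive.forceApply` is marked internal, to my knowledge the key can still be set on a session like any other SQL config, which is handy when debugging AQE on plans without a shuffle.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object AqeConfigExample {

  def buildSession(): SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName("aqe-config-example")
      // Main AQE switch (already true by default in Spark 3.2.0).
      .config("spark.sql.adaptive.enabled", "true")
      // Internal switch: apply AQE even when the plan has no shuffle or subquery.
      .config("spark.sql.adaptive.forceApply", "true")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    println(spark.conf.get("spark.sql.adaptive.enabled"))
    println(spark.conf.get("spark.sql.adaptive.forceApply"))
    spark.stop()
  }
}
```

The same keys can also be passed on the command line with `--conf`, for example `--conf spark.sql.adaptive.forceApply=true`.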

// Whether AQE is enabled
val ADAPTIVE_EXECUTION_ENABLED = buildConf("spark.sql.adaptive.enabled")
    .doc("When true, enable adaptive query execution, which re-optimizes the query plan in the " +
      "middle of query execution, based on accurate runtime statistics.")
    .version("1.6.0")
    .booleanConf
    .createWithDefault(true)


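// AQE skips queries that have neither a shuffle nor a subquery. This config is off by default, so such plans bypass AQE, since it would not improve their performance. The config is internal, i.e. not part of the publicly documented settings.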
val ADAPTIVE_EXECUTION_FORCE_APPLY = buildConf("spark.sql.adaptive.forceApply")