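Excerpt from the web: AQE (Adaptive Query Execution) is a mechanism introduced in Spark 3.0. It uses the statistics collected from shuffle map stages, together with a set of predefined rules, to dynamically adjust and correct the parts of the logical and physical plans that have not yet been executed, which optimizes the original query at runtime.
This article is based on Spark 3.2.0 and covers AQE configuration and source code. Most of the implementation lives in the following package: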
org.apache.spark.sql.execution.adaptive
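1. AQE workflow overview
Starting with a conclusion drawn from the actual code: after the physical plan is generated, AQE and the other preparation rules are applied to it to produce the final physical plan to be executed. In QueryExecution the flow looks roughly like this: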
sparkPlan -> AQE -> executedPlan
lazy val sparkPlan: SparkPlan = withCteMap {
  // We need to materialize the optimizedPlan here because sparkPlan is also tracked under
  // the planning phase
  assertOptimized()
  executePhase(QueryPlanningTracker.PLANNING) {
    // Clone the logical plan here, in case the planner rules change the states of the logical
    // plan.
    QueryExecution.createSparkPlan(sparkSession, planner, optimizedPlan.clone())
  }
}
// executedPlan should not be used to initialize any SparkPlan. It should be
// only used for execution.
lazy val executedPlan: SparkPlan = withCteMap {
  // We need to materialize the optimizedPlan here, before tracking the planning phase, to ensure
  // that the optimization time is not counted as part of the planning phase.
  assertOptimized()
  executePhase(QueryPlanningTracker.PLANNING) {
    // clone the plan to avoid sharing the plan instance between different stages like analyzing,
    // optimizing and planning.
    QueryExecution.prepareForExecution(preparations, sparkPlan.clone())
  }
}
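The sparkPlan -> AQE -> executedPlan flow can be observed from user code through Dataset.queryExecution. The following is a minimal sketch, assuming a local Spark 3.2.x session; the object name AqePlanInspection and the sample query are made up for illustration. It compares the plan before the preparation rules (sparkPlan) with the plan after them (executedPlan) and checks whether the latter has been wrapped in AdaptiveSparkPlanExec.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Hypothetical demo object, not part of Spark.
object AqePlanInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("aqe-plan-inspection")
      .config("spark.sql.adaptive.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    // The aggregation requires a shuffle, so the query is a candidate for AQE.
    val df = spark.range(0, 1000).groupBy(($"id" % 10).as("k")).count()
    val qe = df.queryExecution

    // sparkPlan: the physical plan chosen by the planner, before preparation rules.
    println(s"sparkPlan root:    ${qe.sparkPlan.getClass.getSimpleName}")
    // executedPlan: the plan after prepareForExecution, which includes the AQE rule.
    println(s"executedPlan root: ${qe.executedPlan.getClass.getSimpleName}")
    println(s"wrapped by AQE:    ${qe.executedPlan.isInstanceOf[AdaptiveSparkPlanExec]}")

    spark.stop()
  }
}
```

If AQE applies to the query, the root of executedPlan is the adaptive wrapper node, while sparkPlan still shows the ordinary operator tree.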
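1.1 AQE configuration
spark.sql.adaptive.enabled is the switch that turns AQE on or off. It is enabled by default in the version discussed here (Spark 3.2.0).
spark.sql.adaptive.forceApply controls whether AQE is applied unconditionally. It defaults to false, so AQE skips queries that contain neither a shuffle nor a subquery, because AQE brings no performance benefit in those cases. Setting it to true (with spark.sql.adaptive.enabled also true) forces AQE on all supported queries. The option is marked internal, which means it is left out of the public configuration documentation and is not meant for everyday tuning.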
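The effect of the two switches can be checked on a query that has no shuffle. The sketch below assumes a local Spark 3.2.x session and assumes that an internal option such as spark.sql.adaptive.forceApply can still be set through spark.conf.set (my understanding is that internal only hides the option from the documentation). The object name ForceApplyDemo and the helper usesAqe are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Hypothetical demo object, not part of Spark.
object ForceApplyDemo {
  // Builds a fresh shuffle-free query and reports whether its executed plan
  // is wrapped in AdaptiveSparkPlanExec under the current session config.
  def usesAqe(spark: SparkSession): Boolean = {
    val df = spark.range(0, 100).filter("id % 2 = 0")
    df.queryExecution.executedPlan.isInstanceOf[AdaptiveSparkPlanExec]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("aqe-force-apply")
      .getOrCreate()

    spark.conf.set("spark.sql.adaptive.enabled", "true")

    // Default behaviour: no shuffle and no subquery, so AQE should be skipped.
    spark.conf.set("spark.sql.adaptive.forceApply", "false")
    println(s"forceApply=false, AQE applied: ${usesAqe(spark)}")

    // Assumption: the internal option is settable at runtime like other SQL confs.
    spark.conf.set("spark.sql.adaptive.forceApply", "true")
    println(s"forceApply=true,  AQE applied: ${usesAqe(spark)}")

    spark.stop()
  }
}
```

The query is rebuilt after each configuration change because executedPlan is a lazy val and is computed only once per QueryExecution.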
// Whether AQE is enabled
val ADAPTIVE_EXECUTION_ENABLED = buildConf("spark.sql.adaptive.enabled")
  .doc("When true, enable adaptive query execution, which re-optimizes the query plan in the " +
    "middle of query execution, based on accurate runtime statistics.")
  .version("1.6.0")
  .booleanConf
  .createWithDefault(true)
// AQE skips queries that have neither a shuffle nor a subquery. This option is off by default,
// so such plans bypass AQE entirely, because AQE would not improve their performance.
// The option is internal, i.e. it is not part of the documented public configuration.
val ADAPTIVE_EXECUTION_FORCE_APPLY = buildConf("spark.sql.adaptive.forceApply")