有时候,一个 select 语句中包含多个窗口函数,它们的窗口定义(OVER 子句)可能相同、也可能不同。
对于相同的窗口,完全没必要再做一次分区和排序,我们可以将它们合并成一个 Window 算子。
select
id,
sq,
cell_type,
rank,
row_number() over(partition by id order by rank ) naturl_rank,
rank() over(partition by id order by rank) as r,
dense_rank() over(partition by cell_type order by id) as dr
from window_test_table
group by id,sq,cell_type,rank;
row_number() , rank() 的窗口一样,可以放在一次分区和排序中完成,这一块hive sql与spark sql的表现是一致的。
但对于另外一种情况:
select
id,
rank,
row_number() over(partition by id order by rank ) naturl_rank,
sum(rank) over(partition by id) as snum
from window_test_table
虽然这 2 个窗口并非完全一致,但是 sum(rank) 不关心分区内的顺序,完全可以复用 row_number() 的窗口。
从下面执行计划可以看出,spark sql sum(rank) 和row_number() 复用了同一个窗口,而hive sql没有。
spark sql的执行计划:
spark-sql> explain select id,rank,row_number() over(partition by id order by rank ) naturl_rank,sum(rank) over(partition by id) as snum from window_test_table;
== Physical Plan ==
*(3) Project [id#13, rank#16, naturl_rank#8, snum#9L]
+- Window [row_number() windowspecdefinition(id#13, rank#16 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS naturl_rank#8], [id#13], [rank#16 ASC NULLS FIRST]
+- *(2) Sort [