author     Tom Lane  2022-10-03 14:56:16 +0000
committer  Tom Lane  2022-10-03 14:56:16 +0000
commit     f4c7c410ee4a7baa06f51ebb8d5333c169691dd3
tree       8b0811e2be7edf69c6e2216af085112335374b1b  /src/test/regress/expected/partition_aggregate.out
parent     f60eb3f2827db292edf71bb7296fbdf5958ace3d
Revert "Optimize order of GROUP BY keys".
This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and several follow-on fixes. The idea of making a cost-based choice of the order of the sorting columns is not fundamentally unsound, but it requires cost information and data statistics that we don't really have. For example, relying on procost to distinguish the relative costs of different sort comparators is pretty pointless so long as most such comparator functions are labeled with cost 1.0. Moreover, estimating the number of comparisons done by Quicksort requires more than just an estimate of the number of distinct values in the input: you also need some idea of the sizes of the larger groups, if you want an estimate that's good to better than a factor of three or so. That's data that's often unknown or not very reliable. Worse, to arrive at estimates of the number of calls made to the lower-order-column comparison functions, the code needs to make estimates of the numbers of distinct values of multiple columns, which are necessarily even less trustworthy than per-column stats. Even if all the inputs are perfectly reliable, the cost algorithm as-implemented cannot offer useful information about how to order sorting columns beyond the point at which the average group size is estimated to drop to 1.

Close inspection of the code added by db0d67db2 shows that there are also multiple small bugs. These could have been fixed, but there's not much point if we don't trust the estimates to be accurate in principle.

Finally, the changes in cost_sort's behavior made for very large changes (often a factor of 2 or so) in the cost estimates for all sorting operations, not only those for multi-column GROUP BY. That naturally changes plan choices in many situations, and there's precious little evidence to show that the changes are for the better. Given the above doubts about whether the new estimates are really trustworthy, it's hard to summon much confidence that these changes are better on the average.

Since we're hard up against the release deadline for v15, let's revert these changes for now. We can always try again later.

Note: in v15, I left T_PathKeyInfo in place in nodes.h even though it's unreferenced. Removing it would be an ABI break, and it seems a bit late in the release cycle for that.

Discussion: https://postgr.es/m/TYAPR01MB586665EB5FB2C3807E893941F5579@TYAPR01MB5866.jpnprd01.prod.outlook.com
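
As a rough illustration of the behavior being reverted (the table t and its columns below are hypothetical, not taken from the regression tests): with a multi-column GROUP BY, the rows can be sorted on the grouping columns in any order and still yield the same groups, so a cost-based planner could in principle pick whichever key order it estimates to be the cheapest sort. The reverted code tried to make that choice; the diff below only reflects the resulting plan changes in the partitionwise-aggregate tests.

    -- Hypothetical sketch: both queries compute identical groups; only the
    -- sort key order a sort-based grouping plan might use differs.
    CREATE TABLE t (a int, b int);
    EXPLAIN SELECT a, b, count(*) FROM t GROUP BY a, b;  -- a sort-based plan would sort on (a, b)
    EXPLAIN SELECT a, b, count(*) FROM t GROUP BY b, a;  -- ... or on (b, a)
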
Diffstat (limited to 'src/test/regress/expected/partition_aggregate.out')
-rw-r--r--  src/test/regress/expected/partition_aggregate.out  |  44
1 file changed, 23 insertions(+), 21 deletions(-)
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index db36e3a150a..a82b8fb8fb7 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -949,12 +949,12 @@ SET parallel_setup_cost = 0;
-- is not partial agg safe.
EXPLAIN (COSTS OFF)
SELECT a, sum(b), array_agg(distinct c), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3;
- QUERY PLAN
---------------------------------------------------------------------------------------------
- Gather Merge
- Workers Planned: 2
- -> Sort
- Sort Key: pagg_tab_ml.a, (sum(pagg_tab_ml.b)), (array_agg(DISTINCT pagg_tab_ml.c))
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Sort Key: pagg_tab_ml.a, (sum(pagg_tab_ml.b)), (array_agg(DISTINCT pagg_tab_ml.c))
+ -> Gather
+ Workers Planned: 2
-> Parallel Append
-> GroupAggregate
Group Key: pagg_tab_ml.a
@@ -1381,26 +1381,28 @@ SELECT x, sum(y), avg(y), count(*) FROM pagg_tab_para GROUP BY x HAVING avg(y) <
-- When GROUP BY clause does not match; partial aggregation is performed for each partition.
EXPLAIN (COSTS OFF)
SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
- QUERY PLAN
--------------------------------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------------------------------
Sort
Sort Key: pagg_tab_para.y, (sum(pagg_tab_para.x)), (avg(pagg_tab_para.x))
- -> Finalize HashAggregate
+ -> Finalize GroupAggregate
Group Key: pagg_tab_para.y
Filter: (avg(pagg_tab_para.x) < '12'::numeric)
- -> Gather
+ -> Gather Merge
Workers Planned: 2
- -> Parallel Append
- -> Partial HashAggregate
- Group Key: pagg_tab_para.y
- -> Parallel Seq Scan on pagg_tab_para_p1 pagg_tab_para
- -> Partial HashAggregate
- Group Key: pagg_tab_para_1.y
- -> Parallel Seq Scan on pagg_tab_para_p2 pagg_tab_para_1
- -> Partial HashAggregate
- Group Key: pagg_tab_para_2.y
- -> Parallel Seq Scan on pagg_tab_para_p3 pagg_tab_para_2
-(17 rows)
+ -> Sort
+ Sort Key: pagg_tab_para.y
+ -> Parallel Append
+ -> Partial HashAggregate
+ Group Key: pagg_tab_para.y
+ -> Parallel Seq Scan on pagg_tab_para_p1 pagg_tab_para
+ -> Partial HashAggregate
+ Group Key: pagg_tab_para_1.y
+ -> Parallel Seq Scan on pagg_tab_para_p2 pagg_tab_para_1
+ -> Partial HashAggregate
+ Group Key: pagg_tab_para_2.y
+ -> Parallel Seq Scan on pagg_tab_para_p3 pagg_tab_para_2
+(19 rows)
SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y HAVING avg(x) < 12 ORDER BY 1, 2, 3;
y | sum | avg | count