From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-10 18:00:15
Message-ID: CAJVSvF6s1LgXF6KB2Cz68sHzk+v+O_vmwEkaon=H8O9VcOr-tQ@mail.gmail.com
Lists: pgsql-hackers
I want customers to be able to run large OLAP queries on PostgreSQL,
using as much memory as possible, to avoid spilling — without running
out of memory.
There are other ways to run out of memory, but the fastest and easiest
way, on an OLAP query, is to use a lot of work_mem. (This is true for
any SQL database: SQL operators are “usually” streaming operators...
except for those that use work_mem.) PostgreSQL already supports the
work_mem GUC, and every PostgreSQL operator tries very hard to spill
to disk rather than exceed its work_mem limit. For now, I’m not
concerned about other ways for queries to run out of memory — just
work_mem.
I like the way PostgreSQL operators respect work_mem, but I can’t find
a good way to set the work_mem GUC. Oracle apparently had the same
problem, with their RDBMS, 20 years ago [1]:
“In releases earlier than Oracle Database 10g, the database
administrator controlled the maximum size of SQL work areas by setting
the following parameters: SORT_AREA_SIZE, HASH_AREA_SIZE, ... Setting
these parameters is difficult, because the maximum work area size is
ideally selected from the data input size and the total number of work
areas active in the system. These two factors vary greatly from one
work area to another and from one time to another. Thus, the various
*_AREA_SIZE parameters are difficult to tune under the best of
circumstances.
“For this reason, Oracle strongly recommends that you leave automatic
PGA memory management enabled.”
It’s difficult to tune PostgreSQL’s work_mem and hash_mem_multiplier
GUCs, under the best of circumstances, yeah. The work_mem and
hash_mem_multiplier GUCs apply to all operators of a given type, even
though two operators of the same type, even in the same query, might
need vastly different amounts of work_mem.
I would like a “query_work_mem” GUC, similar to what’s proposed in
[2]. This GUC would be useful on its own, because it would be much
easier to tune than the existing work_mem + hash_mem_multiplier GUCs;
and it would serve as a milestone on a path to my ultimate goal of
something like Oracle’s “automatic PGA memory management.”
I call it “query_work_mem,” rather than “max_total_backend_memory,”
because (a) for now, I care only about limiting work_mem, I’ll deal
with other types of memory separately; and (b) “query” instead of
“backend” avoids ambiguity over how much memory a recursively-compiled
query gets.
(Re (b), see “crosstab()”, [3]. The “sql text” executed by crosstab()
would get its own query_work_mem allocation, separate from the query
that called the crosstab() function.)
The main problem I have with the “max_total_backend_memory” proposal,
however, is that it “enforces” its limit by killing the offending
query. This seems an overreaction to me, especially since PostgreSQL
operators already know how to spill to disk. If a customer’s OLAP
query exceeds its memory limit by 1%, I would rather spill 1% of their
data to disk, instead of cancelling their entire query.
(And if their OLAP query exceeds its memory limit by 1,000x... I still
don’t want PostgreSQL to preemptively cancel it, because either the
customer ends up OK with the overall performance, even with the
spilling; or else they decide the query is taking too long, and cancel
it themselves. I don’t want to be in the business of preemptively
cancelling customer queries.)
So, I want a query_work_mem GUC, and I want PostgreSQL to distribute
that total query_work_mem to the query’s individual SQL operators, so
that each operator will spill rather than exceed its per-operator
limit.
Making query_work_mem a session GUC makes it feasible for a DBA or an
automated system to distribute memory from a global memory pool among
individual queries, e.g. via pg_hint_plan(). So (as mentioned above),
“query_work_mem” is useful to a DBA, and also a step toward a
fully-automated memory-management system.
How should “query_work_mem” work? Let’s start with an example: suppose
we have an OLAP query that has 2 Hash Joins, and no other operators
that use work_mem. Suppose we’re pretty sure that one of the Hash
Joins will use 10 KB of work_mem, while the other will use 1 GB. And
suppose we know that the PostgreSQL instance has 1 GB of memory
available, for use by our OLAP query. (Perhaps we reserve 1 GB for
OLAP queries, and we allow only 1 OLAP query at a time; or perhaps we
have some sort of dynamic memory manager.)
How should we configure PostgreSQL so that our OLAP query spills as
little as possible, without running out of memory?
-- First, we could just say: 2 operators, total available working
memory is 1 GB — give each operator 512 MB. Then the large Hash Join
would spill around 512 MB, while around 512 MB reserved for the small
Hash Join goes unused. We’re undersubscribing, to be safe, but our
performance suffers. That’s bad! We’re basically wasting memory that
the query would like to use.
-- Second, we could say, instead: the small Hash Join is *highly
unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
expecting that the small Hash Join won’t use more than 1 MB of its
1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
oversubscribing, betting that the small Hash Join will just stay
within some smaller, “unenforced” memory limit.
In this example, this bet is probably fine — but it won’t work in
general. I don’t want to be in the business of gambling with customer
resources: if the small Hash Join is unlikely to use more than 1 MB,
then let’s just assign it 1 MB of work_mem. That way, if I’m wrong,
the customer’s query will just spill, instead of running out of
memory. I am very strongly opposed to cancelling queries if/when we
can just spill to disk.
-- Third, we could just rewrite the existing “work_mem” logic so that
all of the query’s operators draw, at runtime, from a single,
“query_work_mem” pool. So, an operator won’t spill until all of
“query_work_mem” is exhausted — by the operator itself, or by some
other operator in the same query.
But doing that runs into starvation/fairness problems, where an
unlucky operator, executing later in the query, can’t get any
query_work_mem, because earlier, greedy operators used up all of it.
The solution I propose here is just to distribute the “query_work_mem”
into individual, per-operator, work_mem limits.
**Proposal:**
I propose that we add a “query_work_mem” GUC, which works by
distributing (using some algorithm to be described in a follow-up
email) the entire “query_work_mem” to the query’s operators. And then
each operator will spill when it exceeds its own work_mem limit. So
we’ll preserve the existing “spill” logic as much as possible.
To enable this to-be-described algorithm, I would add an “nbytes”
field to the Path struct, and display this (and related info) in
EXPLAIN PLAN. So the customer will be able to see how much work_mem
the SQL compiler thinks they’ll need, per operator; and so will the
algorithm.
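
For concreteness, the shape of that addition might look something like
the following (a sketch only: the real Path struct lives in
src/include/nodes/pathnodes.h and has many more fields, and "nbytes" is
the proposed name, not an existing field):

    typedef struct Path
    {
        /* ... existing fields: type, pathtype, parent, pathtarget,
         *     rows, startup_cost, total_cost, pathkeys, ... */

        double      nbytes;     /* proposed: working memory (bytes) this
                                 * Path is estimated to need, to avoid
                                 * spilling; reported by EXPLAIN */
    } Path;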
I wouldn’t change the existing planning logic (at least not in the
initial implementation). So the existing planning logic would choose
between different SQL operators, still on the assumption that every
operator that needs working memory will get work_mem [*
hash_mem_multiplier]. This assumption might understate or overstate
the actual working memory we’ll give the operator, at runtime. If it
understates, the planner will be biased in favor of operators that
don’t use much working memory. If it overstates, the planner will be
biased in favor of operators that use too much working memory.
(We could add a feedback loop to the planner, or even something simple
like generating multiple paths, at different “work_mem” limits, but
everything I can think of here adds complexity without much potential
benefit. So I would defer any changes to the planner behavior until
later, if ever.)
The to-be-described algorithm would look at a query’s Paths’ “nbytes”
fields, as well as the session “work_mem” GUC (which would, now, serve
as a hint to the SQL compiler), and decide how much of
“query_work_mem” to assign to the corresponding Plan node.
It would assign that limit to a new “work_mem” field, on the Plan
node. And this limit would also be exposed, of course, in EXPLAIN
ANALYZE, along with the actual work_mem usage, which might very well
exceed the limit. This will let the customer know when a query spills,
and why.
I would write the algorithm to maintain the existing work_mem
behavior, as much as possible. (Backward compatibility is good!) Most
likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a
*minimum* work_mem. Then, so long as query_work_mem exceeds the sum of
work_mem [* hash_mem_multiplier], for all operators in the query,
all operators would be assigned at least work_mem, which would make my
proposal a Pareto improvement.
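
To make the shape of such an algorithm concrete, here is a minimal,
standalone sketch of one possible rule (an illustration only, not the
to-be-described algorithm; all names are made up): every operator keeps
its work_mem [* hash_mem_multiplier] floor, and whatever is left of
query_work_mem is handed out in proportion to how much more each
operator's "nbytes" estimate says it wants.

    /* Toy sketch, plain C.  All sizes in kB.
     *   floor_kb[i]  - work_mem (or work_mem * hash_mem_multiplier) for op i
     *   nbytes_kb[i] - planner's estimate of what op i would like to have
     *   limit_kb[i]  - resulting per-operator limit (output)
     * Assumes query_work_mem_kb >= sum(floor_kb), i.e. the "Pareto
     * improvement" case described above. */
    #include <stddef.h>

    static void
    distribute_query_work_mem(int query_work_mem_kb,
                              const int floor_kb[], const int nbytes_kb[],
                              int limit_kb[], size_t n)
    {
        long        spare = query_work_mem_kb;
        long        extra_wanted = 0;

        for (size_t i = 0; i < n; i++)
        {
            limit_kb[i] = floor_kb[i];      /* everyone gets at least the floor */
            spare -= floor_kb[i];
            if (nbytes_kb[i] > floor_kb[i])
                extra_wanted += nbytes_kb[i] - floor_kb[i];
        }

        if (spare <= 0 || extra_wanted == 0)
            return;                         /* nothing (or no one) to top up */

        /* Hand out the remainder in proportion to each operator's unmet need. */
        for (size_t i = 0; i < n; i++)
        {
            if (nbytes_kb[i] > floor_kb[i])
                limit_kb[i] += (int) (spare * (nbytes_kb[i] - floor_kb[i]) /
                                      extra_wanted);
        }
    }

In the two-Hash-Join example above, essentially all of the spare memory
would flow to the 1 GB join, while the small join keeps only its floor.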
Last, at runtime, each PlanState would check its plan -> work_mem
field, rather than the global work_mem GUC. Execution would otherwise
be the same as today.
What do you think?
James
[1] https://docs.oracle.com/en//database/oracle/oracle-database/23/admin/managing-memory.html#GUID-8D7FC70A-56D8-4CA1-9F74-592F04172EA7
[2] https://www.postgresql.org/message-id/flat/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel%40crunchydata.com
[3] https://www.postgresql.org/docs/current/tablefunc.html

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-21 21:26:52
Message-ID: [email protected]
Lists: pgsql-hackers
On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:
> How should “query_work_mem” work? Let’s start with an example:
> suppose
> we have an OLAP query that has 2 Hash Joins, and no other operators
> that use work_mem.
So we plan first, and then assign available memory afterward? If we do
it that way, then the costing will be inaccurate, because the original
costs are based on the original work_mem.
It may be better than killing the query, but not ideal.
> -- Second, we could say, instead: the small Hash Join is *highly
> unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
> expecting that the small Hash Join won’t use more than 1 MB of its
> 1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
> oversubscribing, betting that the small Hash Join will just stay
> within some smaller, “unenforced” memory limit.
>
> In this example, this bet is probably fine — but it won’t work in
> general. I don’t want to be in the business of gambling with customer
> resources: if the small Hash Join is unlikely to use more than 1 MB,
> then let’s just assign it 1 MB of work_mem.
I like this idea. Operators that either know they don't need much
memory, or estimate that they don't need much memory, can constrain
themselves. That would protect against misestimations and advertise to
the higher levels of the planner how much memory the operator actually
wants. Right now, the planner doesn't know which operators need a lot
of memory and which ones don't need any significant amount at all.
The challenge, of course, is what the higher levels of the planner
would do with that information, which goes to the rest of your
proposal. But tracking the information seems very reasonable to me.
> I propose that we add a “query_work_mem” GUC, which works by
> distributing (using some algorithm to be described in a follow-up
> email) the entire “query_work_mem” to the query’s operators. And then
> each operator will spill when it exceeds its own work_mem limit. So
> we’ll preserve the existing “spill” logic as much as possible.
The description above sounds too "top-down" to me. That may work, but
has the disadvantage that costing has already happened. We should also
consider:
* Reusing the path generation infrastructure so that both "high memory"
and "low memory" paths can be considered, and if a path requires too
much memory in aggregate, then it would be rejected in favor of a path
that uses less memory. This feels like it fits within the planner
architecture the best, but it also might lead to a path explosion, so
we may need additional controls.
* Some kind of negotiation where the top level of the planner finds
that the plan uses too much memory, and replans some or all of it. (I
think this is similar to what you described as the "feedback loop" later in
your email.) I agree that this is complex and may not have enough
benefit to justify.
Regards,
Jeff Davis

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-22 20:48:51
Message-ID: [email protected]
Lists: pgsql-hackers
On 1/21/25 22:26, Jeff Davis wrote:
> On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:
>> How should “query_work_mem” work? Let’s start with an example:
>> suppose
>> we have an OLAP query that has 2 Hash Joins, and no other operators
>> that use work_mem.
>
> So we plan first, and then assign available memory afterward? If we do
> it that way, then the costing will be inaccurate, because the original
> costs are based on the original work_mem.
>
> It may be better than killing the query, but not ideal.
>
>> -- Second, we could say, instead: the small Hash Join is *highly
>> unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
>> expecting that the small Hash Join won’t use more than 1 MB of its
>> 1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
>> oversubscribing, betting that the small Hash Join will just stay
>> within some smaller, “unenforced” memory limit.
>>
>> In this example, this bet is probably fine — but it won’t work in
>> general. I don’t want to be in the business of gambling with customer
>> resources: if the small Hash Join is unlikely to use more than 1 MB,
>> then let’s just assign it 1 MB of work_mem.
>
> I like this idea. Operators that either know they don't need much
> memory, or estimate that they don't need much memory, can constrain
> themselves. That would protect against misestimations and advertise to
> the higher levels of the planner how much memory the operator actually
> wants. Right now, the planner doesn't know which operators need a lot
> of memory and which ones don't need any significant amount at all.
>
I'm not sure I like the idea that much.
At first restricting the operator to the amount the optimizer predicts
will be needed seems reasonable, because that's generally the best idea
of memory usage we have without running the query.
But these estimates are often pretty fundamentally unreliable - maybe
not for simple examples, but once you put an aggregate on top of a join,
the errors can be pretty wild. And allowing the operator to still use
more work_mem makes this more adaptive. I suspect forcing the operator
to adhere to the estimated work_mem might make this much worse (but I
haven't tried, maybe spilling to temp files is not that bad).
> The challenge, of course, is what the higher levels of the planner
> would do with that information, which goes to the rest of your
> proposal. But tracking the information seems very reasonable to me.
>
I agree. Tracking additional information seems like a good idea, but
it's not clear to me how the planner would use this. I can imagine
various approaches - e.g. we might do the planning as usual and then
distribute the query_work_mem between the nodes in proportion to the
estimated amount of memory. But it all seems like very ad hoc
heuristics, and easy to confuse / make the wrong decision.
>> I propose that we add a “query_work_mem” GUC, which works by
>> distributing (using some algorithm to be described in a follow-up
>> email) the entire “query_work_mem” to the query’s operators. And then
>> each operator will spill when it exceeds its own work_mem limit. So
>> we’ll preserve the existing “spill” logic as much as possible.
>
> The description above sounds too "top-down" to me. That may work, but
> has the disadvantage that costing has already happened. We should also
> consider:
>
> * Reusing the path generation infrastructure so that both "high memory"
> and "low memory" paths can be considered, and if a path requires too
> much memory in aggregate, then it would be rejected in favor of a path
> that uses less memory. This feels like it fits within the planner
> architecture the best, but it also might lead to a path explosion, so
> we may need additional controls.
>
> * Some kind of negotiation where the top level of the planner finds
> that the plan uses too much memory, and replans some or all of it. (I
> think is similar to what you described as the "feedback loop" later in
> your email.) I agree that this is complex and may not have enough
> benefit to justify.
>
Right, it seems rather at odds with the bottom-up construction of paths.
The amount of memory an operator may use seems like pretty fundamental
information, but if it's available only after the whole plan is built,
that seems ... not great.
I don't know if generating (and keeping) low/high-memory paths is quite
feasible. Isn't that really a continuum for many paths? A hash join may
need very little memory (with batching) or a lot of memory (if keeping
everything in memory), so how would this work? Would we generate paths
for a range of work_mem values (with different costs)?
regards
--
Tomas Vondra

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-22 21:13:24
Message-ID: [email protected]
Lists: pgsql-hackers
On 1/10/25 19:00, James Hunter wrote:
> ...
>
> **Proposal:**
>
> I propose that we add a “query_work_mem” GUC, which works by
> distributing (using some algorithm to be described in a follow-up
> email) the entire “query_work_mem” to the query’s operators. And then
> each operator will spill when it exceeds its own work_mem limit. So
> we’ll preserve the existing “spill” logic as much as possible.
>
> To enable this to-be-described algorithm, I would add an “nbytes”
> field to the Path struct, and display this (and related info) in
> EXPLAIN PLAN. So the customer will be able to see how much work_mem
> the SQL compiler thinks they’ll need, per operator; and so will the
> algorithm.
>
> I wouldn’t change the existing planning logic (at least not in the
> initial implementaton). So the existing planning logic would choose
> between different SQL operators, still on the assumption that every
> operator that needs working memory will get work_mem [*
> hash_mem_multiplier].
All this seems generally feasible, but it hinges on whether the
to-be-described algorithm can distribute the memory in a sensible way
that doesn't affect the costing too much. If we plan a hash join with
nbatch=1, and then come back and make it use nbatch=1024, maybe we
wouldn't have used hash join at all. Not sure.
The fundamental issue seems to be not having any information about how
much memory might be available to the operator. And in principle we
can't have that during the bottom-up part of the planning, until after
we construct the whole plan. Only at that point do we know how many
operators will need work_mem.
Could we get at least some initial estimate of how many such operators
the query *might* end up using? Maybe that'd be just enough a priori
information to set the effective work_mem for the planning part to make
this practical.
> This assumption might understate or overstate
> the actual working memory we’ll give the operator, at runtime. If it
> understates, the planner will be biased in favor of operators that
> don’t use much working memory. If it overstates, the planner will be
> biased in favor of operators that use too much working memory.
>
I'm not quite sure I understand this part. Could you elaborate?
> (We could add a feedback loop to the planner, or even something simple
> like generating multiple path, at different “work_mem” limits, but
> everything I can think of here adds complexity without much potential
> benefit. So I would defer any changes to the planner behavior until
> later, if ever.)
>
What would be the feedback? I can imagine improving the estimate of how
much memory a given operator needs during the bottom-up phase, but it
doesn't quite help with knowing what will happen above the current node.
> The to-be-described algorithm would look at a query’s Paths’ “nbytes”
> fields, as well as the session “work_mem” GUC (which would, now, serve
> as a hint to the SQL compiler), and decide how much of
> “query_work_mem” to assign to the corresponding Plan node.
>
> It would assign that limit to a new “work_mem” field, on the Plan
> node. And this limit would also be exposed, of course, in EXPLAIN
> ANALYZE, along with the actual work_mem usage, which might very well
> exceed the limit. This will let the customer know when a query spills,
> and why.
>
> I would write the algorithm to maintain the existing work_mem
> behavior, as much as possible. (Backward compatibility is good!) Most
> likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a
> *minimum* work_mem. Then, so long as query_work_mem exceeds the sum of
> work_mem [* hash _mem_multiplier] , for all operators in the query,
> all operators would be assigned at least work_mem, which would make my
> proposal a Pareto improvement.
>
> Last, at runtime, each PlanState would check its plan -> work_mem
> field, rather than the global work_mem GUC. Execution would otherwise
> be the same as today.
>
> What do you think?
>
I find it a bit hard to discuss an abstract proposal, without knowing
the really crucial ingredient. It might be helpful to implement some
sort of PoC of this approach; I'm sure that'd give us a lot of insights
and means to experiment with it (instead of just speculating about what
might happen).
regards
--
Tomas Vondra

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-22 22:42:54
Message-ID: [email protected]
Lists: pgsql-hackers
On Wed, 2025-01-22 at 21:48 +0100, Tomas Vondra wrote:
> But these estimates are often pretty fundamentally unreliable - maybe
> not for simple examples, but once you put an aggregate on top of a
> join,
> the errors can be pretty wild.
It would be conditional on whether there's some kind of memory
constraint or not. Setting aside the difficulty of implementing a new
memory constraint, if we assume there is one, then it would be good to
know how much memory an operator estimates that it needs.
(Also, if extra memory is available, spill files will be able to use
the OS filesystem cache, which mitigates the spilling cost.)
Another thing that would be good to know is about concurrent memory
usage. That is, if it's a blocking executor node, then it can release
all the memory from child nodes when it completes. Therefore the
concurrent memory usage might be less than just the sum of memory used
by all operators in the plan.
> I don't know if generating (and keeping) low/high-memory paths is
> quite
> feasible. Isn't that really a continuum for many paths? A hash join
> may
> need very little memory (with batching) or a lot of memory (if
> keeping
> everything in memory), so how would this work? Would we generate
> paths
> for a range of work_mem values (with different costs)?
A range might cause too much of an explosion. Let's do something simple
like define "low" to mean 1/16th, or have a separate low_work_mem GUC
(that could be an absolute number or a fraction).
There are a few ways we could pass the information down. We could just
have every operator generate twice as many paths (at least those
operators that want to use as much memory as they can get). Or we could
pass down the query_work_mem by subtracting the current operator's
memory needs and dividing what's left among its input paths.
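
For illustration, a toy model of that second, top-down option, under a
deliberately simplified node shape ("BudgetNode", "own_need_kb", and
"inputs" are invented names, not planner structures): each node keeps
what it needs, capped at its budget, and splits the remainder evenly
among its inputs.

    #include <stddef.h>

    typedef struct BudgetNode
    {
        int                 own_need_kb;   /* memory this operator wants itself */
        int                 budget_kb;     /* budget assigned by the pass below */
        struct BudgetNode  *inputs[2];     /* at most two inputs, for simplicity */
        int                 ninputs;
    } BudgetNode;

    static void
    pass_down_budget(BudgetNode *node, int budget_kb)
    {
        if (node == NULL)
            return;

        /* Keep what this operator needs, capped at what it was given. */
        node->budget_kb = (node->own_need_kb < budget_kb) ?
                          node->own_need_kb : budget_kb;

        if (node->ninputs > 0)
        {
            int     share = (budget_kb - node->budget_kb) / node->ninputs;

            for (int i = 0; i < node->ninputs; i++)
                pass_down_budget(node->inputs[i], share);
        }
    }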
We may have to track extra information to make sure that high-memory
paths don't dominate low-memory paths that are still useful (similar to
path keys).
Regards,
Jeff Davis

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-25 00:48:59
Message-ID: CAJVSvF6DG-KDdRwMZrK4pg1ehRAhgAGYZS2_=FhUYso+cLGsDw@mail.gmail.com
Lists: pgsql-hackers
On Wed, Jan 22, 2025 at 1:13 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 1/10/25 19:00, James Hunter wrote:
> > ...
> > I wouldn’t change the existing planning logic (at least not in the
> > initial implementaton). So the existing planning logic would choose
> > between different SQL operators, still on the assumption that every
> > operator that needs working memory will get work_mem [*
> > hash_mem_multiplier].
>
> All this seems generally feasible, but it hinges on the to-be-described
> algorithm can distribute the memory in a sensible way that doesn't
> affect the costing too much. If we plan a hash join with nbatch=1, and
> then come back and make it to use nbatch=1024, maybe we wouldn't have
> used hash join at all. Not sure.
I see two problems:
(a) For a given query plan, minimizing spilling without exceeding
memory limits (and thus crashing);
(b) For a given amount of available memory, choosing the optimal plan.
Both problems exist today, and PostgreSQL offers the same tools to
address both: work_mem and hash_mem_multiplier. I argue that these
tools are inadequate for (a), but I think they work reasonably well
for (b).
I propose to solve problem (a), but not (b).
In your example, the reason PostgreSQL plans a Hash Join, with nbatch
= 1, is because the planner's "nbytes" estimate for working memory is
< (hash_mem_multiplier * work_mem). This implicitly assumes that the
PostgreSQL *instance* has at least that memory available.
If it turns out that the instance doesn't have that much memory
available, then the Hash Join will crash. That's the current behavior.
It would be better if, instead, we used nbatch=1024 for the Hash Join,
so we *didn't* crash. (Note that your example implies that work_mem is
set to 1,024x available memory!) This is problem (a). But then, as you
point out, it might be *even better* if we gave up on Hash Join
altogether, and just went with Nested Loop. This is problem (b).
Today, "work_mem" can be set too high or too low. I argue that there's
no way to avoid one or the other, and:
1. If "work_mem" is set too high -- PostgreSQL currently crashes. With
my proposal, it (a) would not crash, but (b) would possibly execute a
sub-optimal plan.
2. If "work_mem" is set too low -- PostgreSQL currently (a) spills
unnecessarily, or (b) chooses a sub-optimal plan. With my proposal, it
would (a) not spill unnecessarily, but would (b) still execute a
sub-optimal plan (if chosen).
I am not proposing to solve (b), the generation of optimal plans, when
memory constraints are a priori unknown. I see that as a separate,
lower-priority problem -- one that would require multi-pass
compilation, which I would like to avoid.
> The fundamental issue seems to be not having any information about how
> much memory might be available to the operator. And in principle we
> can't have that during the bottom-up part of the planning, until after
> we construct the whole plan. Only at that point we know how many
> operators will need work_mem.
It's not just bottom-up vs. top-down: it's multi-pass. I think you
would need something beyond Oracle's Cost-Based Query Transformation
[1], where PostgreSQL generates all possible query path trees, up
front, to get an estimate for the total working-memory requested for
each tree. Then we could distribute "query_work_mem" to each tree, and
then compute the costs once we knew how much working memory each path
node on the tree would actually get.
The algorithm described in the previous paragraph is certainly
*possible*, but it's not remotely *feasible*, in general, because it
requires generating way too many states to cost. For one thing,
instead of costing path nodes individually, it has to cost entire
trees; and there are far more trees than there are tree nodes. For
another: today, PostgreSQL's optimizer goes out of its way not to
generate obviously-bad path nodes, but in the above algorithm, there's
no way to know whether a path is bad until after you've generated the
entire tree.
Anyway, to come up with something feasible, we'd have to apply
heuristics to prune trees, etc. And it's not clear to me that, after
all of that code complexity and run-time CPU cost, the result would be
any better than just leaving the optimizer as it is. "Better the devil
you know..." etc.
My proposal doesn't try to solve (b), instead relying on the customer
to provide a reasonable "work_mem" estimate, for use by the optimizer.
> Could we get at least some initial estimate how many such operators the
> query *might* end up using? Maybe that'd be just enough a priori
> information to set the effective work_mem for the planning part to make
> this practical.
You could use the # of base relations, as a proxy for # of joins, but
I am not convinced this would improve the optimizer's decision.
> > This assumption might understate or overstate
> > the actual working memory we’ll give the operator, at runtime. If it
> > understates, the planner will be biased in favor of operators that
> > don’t use much working memory. If it overstates, the planner will be
> > biased in favor of operators that use too much working memory.
> >
>
> I'm not quite sure I understand this part. Could you elaborate?
I was just trying to express what you more clearly restated: if
work_mem is too high, vs. the actual memory available on the instance,
then the optimizer will choose Hash Join, even though the optimal
choice might be Nested Loop.
> > (We could add a feedback loop to the planner, or even something simple
> > like generating multiple path, at different “work_mem” limits, but
> > everything I can think of here adds complexity without much potential
> > benefit. So I would defer any changes to the planner behavior until
> > later, if ever.)
> >
>
> What would be the feedback? I can imagine improving the estimate of how
> much memory a given operator needs during the bottom-up phase, but it
> doesn't quite help with knowing what will happen above the current node.
Something like the algorithm I sketched above, in this email, might
work. Of course, it would have to be modified with heuristics, etc.,
to reduce the state space to something manageable...
But my point is just that any feedback loop, or running the optimizer
at different "work_mem" limits, is a bad idea. Leaving the optimizer
as it is seems the least bad choice.
Thanks for your comments,
James
[1] https://dl.acm.org/doi/10.5555/1182635.1164215

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-25 01:04:40
Message-ID: CAJVSvF7B+z0i=jMvvcWnq=Vh8AHZwe02BSBmOfmGjiu1Nf4_wQ@mail.gmail.com
Lists: pgsql-hackers
On Tue, Jan 21, 2025 at 1:26 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:
> > How should “query_work_mem” work? Let’s start with an example:
> > suppose
> > we have an OLAP query that has 2 Hash Joins, and no other operators
> > that use work_mem.
>
> So we plan first, and then assign available memory afterward? If we do
> it that way, then the costing will be inaccurate, because the original
> costs are based on the original work_mem.
>
> It may be better than killing the query, but not ideal.
As you point out, the outcome is better, but not ideal. My intuition
is that an "ideal" solution would increase query compilation times
beyond what customers would accept...
But at least the outcome, if not ideal, is better than killing the
query! So it is a net improvement.
> > I propose that we add a “query_work_mem” GUC, which works by
> > distributing (using some algorithm to be described in a follow-up
> > email) the entire “query_work_mem” to the query’s operators. And then
> > each operator will spill when it exceeds its own work_mem limit. So
> > we’ll preserve the existing “spill” logic as much as possible.
>
> The description above sounds too "top-down" to me. That may work, but
> has the disadvantage that costing has already happened. We should also
> consider:
>
> * Reusing the path generation infrastructure so that both "high memory"
> and "low memory" paths can be considered, and if a path requires too
> much memory in aggregate, then it would be rejected in favor of a path
> that uses less memory. This feels like it fits within the planner
> architecture the best, but it also might lead to a path explosion, so
> we may need additional controls.
>
> * Some kind of negotiation where the top level of the planner finds
> that the plan uses too much memory, and replans some or all of it. (I
> think is similar to what you described as the "feedback loop" later in
> your email.) I agree that this is complex and may not have enough
> benefit to justify.
Generating "high memory" vs. "low memory" paths would be tricky,
because the definition of "high" vs. "low" depends on the entire path
tree, not just on a single path node. So I think it would quickly lead
to a state-space explosion, as you mention.
And I think negotiation has the same problem: it's based on the entire
tree, not just an individual path node. I think the general problem is
not so much "top-down" vs. "bottom-up", as "individual path node" vs.
"entire path tree." Today, PostgreSQL costs each path node
individually, by referring to the static "work_mem" GUC. In any
attempt to improve the optimizer's choice, I think we'd have to cost
the entire path tree. And there are many more trees than there are
tree nodes.
For example, the decision whether to prefer a Nested Loop vs. a Hash
Join that takes 2 MB of working memory, depends on what the query's
other joins are doing.
At any rate, I think we can solve the problem of "killing the query"
now; and then worry, in the future, about the ideal solution of how to
pick the optimal execution plan.
> Regards,
> Jeff Davis
Thanks for your comments!
James

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-25 01:48:44
Message-ID: [email protected]
Lists: pgsql-hackers
On Fri, 2025-01-24 at 17:04 -0800, James Hunter wrote:
> Generating "high memory" vs. "low memory" paths would be tricky,
> because the definition of "high" vs. "low" depends on the entire path
> tree, not just on a single path node. So I think it would quickly
> lead
> to a state-space explosion, as you mention.
At first, it appears to lead to an explosion, but there are a lot of
ways to prune early. Many operators, like an index scan, don't even
need to track memory, so they'd just have the one path. Other operators
can just generate a low memory path because estimates show that it's
unlikely to need more than that. And if there's a blocking operator,
then that resets the memory requirement, pruning the space further.
And I assume you are talking about analytic queries with reasonably
large values of work_mem anyway. That justifies a bit more planning
time -- no need to generate extra paths for cheap queries.
Maybe my idea doesn't work out, but I think it's too early to dismiss
it.
Regards,
Jeff Davis

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-11 03:09:25
Message-ID: CAJVSvF68TGsyrvMpjmPviNvR48UYkD6AN-HLQXOu2uj3HCHHGA@mail.gmail.com
Lists: pgsql-hackers
On Fri, Jan 24, 2025 at 5:48 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Fri, 2025-01-24 at 17:04 -0800, James Hunter wrote:
> > Generating "high memory" vs. "low memory" paths would be tricky,
> > because the definition of "high" vs. "low" depends on the entire path
> > tree, not just on a single path node. So I think it would quickly
> > lead
> > to a state-space explosion, as you mention.
>
> At first, it appears to lead to an explosion, but there are a lot of
> ways to prune early. ...
>
> Maybe my idea doesn't work out, but I think it's too early to dismiss
> it.
I think it makes sense to split the work into two parts: one part that
improves SQL execution, and a second part that improves the optimizer,
to reflect the improvements to execution.
It seems better to me to wait until we have the ability to enforce
memory limits, before worrying about ways to generate different paths
with different memory limits. Then we would be able to tune the
optimizer heuristics based on the actual executor, instead of
extrapolating how the executor would behave under different memory
limits.
James

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, "pg(at)bowt(dot)ie" <pg(at)bowt(dot)ie>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-11 03:12:50
Message-ID: CAJVSvF54WHx=zBj7Q-7XW5-qHP7L+PoZ8cZvuOPKic=rZTWYCQ@mail.gmail.com
Lists: pgsql-hackers
I hope to have an initial patch-set for a prototype, within the next
couple of weeks. But I wanted to add some design comments to this
thread, first, to solicit feedback, etc. —
First, some bookkeeping: Peter Geoghegan pointed me, offline, to
Oracle’s 2002 paper [1] on how they managed SQL execution memory in
9i. I found it helpful to compare my proposal to what Oracle did. The
main difference I see is that Oracle modified their SQL operators to
“give back” memory, at runtime, when the resource manager reduces the
per-operator memory limit. Doing this causes its own problems, but it
allows Oracle to maintain a single “per-operator” memory limit that
applies to *all* operators; see Figure 6.
I am reluctant to make a similar change to PostgreSQL, because (1) it
would involve a lot of code churn, and (2) it’s not clear to me that
this is a good path to take. Note that the Oracle design allows total
memory to exceed the global memory limit, temporarily, while the
system waits for running operators to give their memory back. So, the
paper describes how Oracle tries to anticipate this situation and
reduce the per-operator memory limit in advance... but I have not had
a good experience with that sort of strategy, in the cloud.
The Oracle design necessarily overprovisions some operators, because
it assigns the same limit to all operators. (See, again, Figure 6,
which makes all of this clearer than anything I could write.) It
relies on detecting when an overprovisioned operator starts to use
more of the memory it was provisioned... and then quickly reducing the
per-operator limit, so that other operators give up their memory for
use by the previously-overprovisioned operator. In this way, the
Oracle design is very fair.
However, while waiting for the other operators to give up their memory
(since they are now oversubscribed), the system temporarily exceeds
the global memory limit. This opens up a can of worms, but it seems
like the Oracle paper deals with this situation by letting the
excessive memory swap to disk (see Figures 10 and 11).
I don’t want to modify PostgreSQL operators so they can give up memory
at runtime. So this forces my solution to do two things: (1) provide
different operators different memory limits, since I can’t take memory
away from an operator after it has started running; and (2) give each
operator (at least) an initial memory reservation, before it starts
running. Hence, the approach I described earlier in this thread.
Second, some motivation: the cloud makes the resource management
problem worse than it is on-premise. I would refer to page 2 of the
Oracle doc (too long to quote here), as justification for moving away
from the “work_mem” GUC, but note that these arguments apply more
strongly to cloud databases, for two reasons. First reason: swap can
be prohibitively expensive in the cloud, and spilling very expensive.
This is because cloud instances frequently lack attached, ephemeral
storage. Cloud remote storage can be extremely slow [2]: “For example,
a gp2 volume under 1,000 GiB with burst credits available has ... a
volume throughput limit of 250 MiB/s.”
Second reason: any cloud provider has an effectively infinite number
of customer instances. I mean that this number is large enough that
the cloud provider cannot afford to manage these instances, except via
automated tools. So, when the Oracle paper says, “Generally, the DBA
tries to avoid over-allocation by assuming the worst workload in order
to avoid paging (with dramatic degradation in performance) or query
failure,” the point applies with extra force here: the degradation is
more dramatic in the cloud, and the cost of under-utilization is higher.
Also: “In most commercial systems the burden has been put on the
DBA to provide an optimal setting for configuration parameters that
are internally used to decide how much memory to allocate to a given
database operator. This is a challenging task for the DBA...” This is
an impossible task for the cloud provider!
Thanks,
James
[1] https://www.vldb.org/conf/2002/S29P03.pdf
[2] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-io-characteristics.html#ebs-io-size-throughput-limits
On Mon, Feb 10, 2025 at 7:09 PM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
>
> On Fri, Jan 24, 2025 at 5:48 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> >
> > On Fri, 2025-01-24 at 17:04 -0800, James Hunter wrote:
> > > Generating "high memory" vs. "low memory" paths would be tricky,
> > > because the definition of "high" vs. "low" depends on the entire path
> > > tree, not just on a single path node. So I think it would quickly
> > > lead
> > > to a state-space explosion, as you mention.
> >
> > At first, it appears to lead to an explosion, but there are a lot of
> > ways to prune early. ...
> >
> > Maybe my idea doesn't work out, but I think it's too early to dismiss
> > it.
>
> I think it makes sense to split the work into two parts: one part that
> improves SQL execution, and a second part that improves the optimizer,
> to reflect the improvements to execution.
>
> It seems better to me to wait until we have the ability to enforce
> memory limits, before worrying about ways to generate different paths
> with different memory limits. Then we would be able to tune the
> optimizer heuristics based on the actual executor, instead of
> extrapolating how the executor would behave under different memory
> limits.
>
> James

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-11 18:00:14
Message-ID: [email protected]
Lists: pgsql-hackers
On Mon, 2025-02-10 at 19:09 -0800, James Hunter wrote:
> I think it makes sense to split the work into two parts: one part
> that
> improves SQL execution, and a second part that improves the
> optimizer,
> to reflect the improvements to execution.
I like the idea to store the value of work_mem in the
path/plan/executor nodes, and use that at execution time rather than
the GUC directly.
IIUC, that would allow an extension to do what you want, right? A
planner hook could just walk the tree and edit those values for
individual nodes, and the executor would enforce them.
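
Sketching what such an extension could look like, assuming the per-node
field ("workmem_limit" below) discussed in this thread has been added to
Plan; the policy here is deliberately trivial, and a real walker would
also have to visit node-specific child lists (Append, SubqueryScan, the
subplans list, and so on):

    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/params.h"
    #include "nodes/plannodes.h"
    #include "optimizer/planner.h"

    PG_MODULE_MAGIC;

    static planner_hook_type prev_planner = NULL;

    /* Walk the finished plan tree and stamp a per-node limit (kB).
     * "workmem_limit" is the field proposed in this thread, not an
     * existing PostgreSQL field. */
    static void
    stamp_workmem(Plan *plan, int limit_kb)
    {
        if (plan == NULL)
            return;
        plan->workmem_limit = limit_kb;     /* hypothetical field */
        stamp_workmem(plan->lefttree, limit_kb);
        stamp_workmem(plan->righttree, limit_kb);
    }

    static PlannedStmt *
    workmem_planner(Query *parse, const char *query_string,
                    int cursorOptions, ParamListInfo boundParams)
    {
        PlannedStmt *stmt;

        stmt = prev_planner
            ? prev_planner(parse, query_string, cursorOptions, boundParams)
            : standard_planner(parse, query_string, cursorOptions, boundParams);

        /* Trivial policy, for illustration only: give every node 4 MB. */
        stamp_workmem(stmt->planTree, 4096);

        return stmt;
    }

    void
    _PG_init(void)
    {
        prev_planner = planner_hook;
        planner_hook = workmem_planner;
    }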
Regards,
Jeff Davis

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-11 18:39:51
Message-ID: CAJVSvF7x_DLj7-JrXvMB4_j+jzuvjG_7iXNjx5KmLBTXHPNdGA@mail.gmail.com
Lists: pgsql-hackers
On Tue, Feb 11, 2025 at 10:00 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Mon, 2025-02-10 at 19:09 -0800, James Hunter wrote:
> > I think it makes sense to split the work into two parts: one part
> > that
> > improves SQL execution, and a second part that improves the
> > optimizer,
> > to reflect the improvements to execution.
>
> I like the idea to store the value of work_mem in the
> path/plan/executor nodes, and use that at execution time rather than
> the GUC directly.
>
> IIUC, that would allow an extension to do what you want, right? A
> planner hook could just walk the tree and edit those values for
> individual nodes, and the executor would enforce them.
Yes, exactly!
* The Path would store "nbytes" (= the optimizer's estimate of how
much working memory a given Path will use), to allow for future
optimizer logic to consider memory usage when choosing the best Path.
* The Plan would store a copy of "nbytes," along with "work_mem," and
the executor would enforce work_mem. A "(work_mem on)" option to the
"EXPLAIN" command would display both "nbytes" and "work_mem", per Plan
node.
* Either built-in logic or an extensibility hook would set "work_mem"
on each individual Plan node, based on whatever heuristic or rule it
chooses.
Right now, my prototype sets "work_mem" inside ExecInitNode().
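
The executor-side lookup could then be as small as this (again assuming
the proposed per-node field; "plan_work_mem_kb" is just an illustrative
helper name):

    #include "postgres.h"
    #include "miscadmin.h"              /* work_mem GUC */
    #include "nodes/plannodes.h"

    /* What an exec node would consult, during ExecInitNode(), instead of
     * reading the work_mem GUC directly.  Falls back to the GUC when no
     * per-node limit has been assigned. */
    static int
    plan_work_mem_kb(const Plan *plan)
    {
        return (plan->workmem_limit > 0) ? plan->workmem_limit : work_mem;
    }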
Thanks,
James

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-11 22:04:45
Message-ID: [email protected]
Lists: pgsql-hackers
On Tue, 2025-02-11 at 10:39 -0800, James Hunter wrote:
> * The Path would store "nbytes" (= the optimizer's estimate of how
> much working memory a given Path will use), to allow for future
> optimizer logic to consider memory usage when choosing the best Path.
>
> * The Plan would store a copy of "nbytes," along with "work_mem," and
> the executor would enforce work_mem. A "(work_mem on)" option to the
> "EXPLAIN" command would display both "nbytes" and "work_mem", per
> Plan
> node.
Storing work_mem in each Plan node, and using that to enforce the
memory limit (rather than using the GUC directly), seems
uncontroversial to me. I'd suggest a standalone patch.
Storing the optimizer's estimate of the memory wanted also sounds like
a good idea. Let's pick a better name than "nbytes" though; maybe
"requested_mem" or something? This change would make it a lot easier
for an extension to adjust the per-node-work_mem, and also seems like
good infrastructure for anything we build into the planner later. I
suggest a standalone patch for this, as well.
Can you write a useful extension with just the above two core patches?
Regards,
Jeff Davis

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-24 20:32:32
Message-ID: CAJVSvF5kMi1-fwBDSv-9bvUjm83zUNEnL95B0s+i08sKDL5-mA@mail.gmail.com
Lists: pgsql-hackers
On Tue, Feb 11, 2025 at 2:04 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
...
> Storing work_mem in each Plan node, and using that to enforce the
> memory limit (rather than using the GUC directly), seems
> uncontroversial to me. I'd suggest a standalone patch.
I will submit a patch for this, thanks. (This will be "Patch 3".)
> Storing the optimizer's estimate of the memory wanted also sounds like
> a good idea. Let's pick a better name than "nbytes" though; maybe
> "requested_mem" or something? This change would make it a lot easier
> for an extension to adjust the per-node-work_mem, and also seems like
> good infrastructure for anything we build into the planner later. I
> suggest a standalone patch for this, as well.
I will submit a patch for this as well. (This will be "Patch 1".) I
went with "workmem" instead of "nbytes," for the estimate; and
"workmem_limit" for the limit. By omitting the underscore character
between "work" and "mem", this makes it a bit easier to distinguish
between the "work_mem" GUC, the "workmem" estimate, and the
"workmem_limit" limit.
> Can you write a useful extension with just the above two core patches?
I think so; I will attach a patch for that as well. (This will be
"Patch 4"; note that "Patch 2" is a prerequisite for "Patch 3".)
> Regards,
> Jeff Davis
Thanks,
James Hunter

From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-24 20:46:23
Message-ID: CAJVSvF5S89XNaH1Pg40DffXD3HEFKxEe8Rew3HviyymOYp5X+w@mail.gmail.com
Lists: pgsql-hackers
Attached please find the patch set I mentioned, above, in [1]. It
consists of 4 patches that serve as the building blocks for and a
prototype of the "query_work_mem" GUC I proposed:
* Patch 1 captures the optimizer’s estimate of how much working memory
a particular Plan node would need, to avoid spilling, and stores this
on the Plan, next to cost, etc. It also adds a new “work_mem on”
option to the EXPLAIN command, to display this working-memory
estimate. This “work_mem on” estimate gives the customer a sense of
how much working memory a particular query will actually use, and also
enables an extension (e.g., Patch 4), to assign working-memory limits,
per exec node, intelligently.
Patch 1 doesn't change any user-visible behavior, except for
displaying workmem estimates via EXPLAIN, when the new "work_mem on"
option is specified.
* Patch 2 is a prerequisite for Patches 3 and 4. It maintains a
subPlan list on the Plan node, next to the existing initPlan list, to
store (pointers to) regular SubPlans.
The existing initPlan list is needed, because otherwise there’s no way
to find the particular SubPlan; but this new subPlan list hasn’t been
needed before now, because every SubPlan on the list appears somewhere
inside the Plan node’s expressions. The subPlan list is needed now,
however, because a SubPlan can use working memory (if it maintains one
or two hash tables). So, we need a way to find this SubPlan, so we can
set its working-memory limit; and it doesn’t make sense to walk
through all the Plan node’s expressions, a second time, after we’ve
finalized the plan.
Instead, Patch 2 copies regular SubPlans to this new list, inside
setrefs.c, so we can find them and assign working memory to them,
later.
Patch 2 doesn't change any user-visible behavior -- it just adds some
internal bookkeeping.
* Patch 3 modifies all existing exec nodes to read their working-memory
limit off their Plan, rather than off the GUC. It adds a new function,
ExecAssignWorkMem(), which gets called from InitPlan(), immediately
before we start calling ExecInitNode(). This way, the executor could
assign different working-memory limits, based on global memory
conditions; but this patch just preserves existing behavior, and
copies these limits from the GUCs.
Patch 3 also extends the new “work_mem on” EXPLAIN option, to show the
working-memory limit. This is the limit already imposed by
PostgreSQL's work_mem and hash_mem_multiplier GUCs. Patch 3 copies
this limit from these GUCs, onto a new field stored on the Plan
object. It then modifies "EXPLAIN (work_mem on)" to read this limit
off the Plan object and display it.
Other than this change to EXPLAIN, Patch 3 doesn't change any
user-visible behavior.
* Patch 4, finally!, adds a hook to allow extensions to override
ExecAssignWorkMem(). It also adds an extension, “workmem,” that
implements this hook and assigns working memory to individual
execution nodes, based on a new workmem.query_work_mem GUC. This
extension prevents queries from exceeding workmem.query_work_mem,
while also handing out extra memory in the case where the query limit,
from Patch 3, is < workmem.query_work_mem.
In this way, Patch 4 avoids either undersubscribing or oversubscribing
working memory for queries, which is the goal of my proposal.
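
One simple way a hook like that could both cap and top up the per-node
limits (a hedged sketch, not necessarily what the "workmem" extension
actually does) is to rescale every node's Patch-3 limit by the same
factor, so that the total lands on workmem.query_work_mem:

    /* Toy sketch: rescale per-node limits (kB) so their total matches the
     * query budget.  May land slightly over budget when the 64 kB floor
     * kicks in; a real implementation would redistribute that remainder. */
    #include <stddef.h>

    static void
    rescale_limits(int limits_kb[], size_t n, int query_budget_kb)
    {
        long        total = 0;

        for (size_t i = 0; i < n; i++)
            total += limits_kb[i];

        if (total <= 0)
            return;

        for (size_t i = 0; i < n; i++)
        {
            long        scaled = (long) limits_kb[i] * query_budget_kb / total;

            limits_kb[i] = (scaled < 64) ? 64 : (int) scaled;
        }
    }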
Discussion follows--
A few operators currently do not honor their working-memory limits by
spilling; these operators use tuple hash tables — which don’t spill —
without implementing their own “spill” logic. I would address these
operators in a subsequent release. Note that Hash Agg and Hash Join
both spill, as expected, so the major cases already work.
I store the working-memory estimate on both Path and Plan objects.
Keeping with PostgreSQL convention that a Path is an abstraction of
one or more Plan nodes, the Path’s working-memory estimate is “total,”
while the Plan’s is “per data structure.” So, if a SQL operator
requires 2 sort buffers, the Path’s working-memory estimate will be 2x
the Plan’s.
The Plan’s estimate is “per data structure,” because it will be used
to determine the data structure’s working-memory limit. Note that
every operator (save one) currently treats work_mem [*
hash_mem_multiplier] as a per-structure limit, rather than a
per-operator limit. (The exception is Hash Agg, where all of the
node’s hash tables share the same memory limit; and so I have
preserved this behavior in the Hash Agg’s workmem and workmem_limit
fields.)
The Plan’s workmem estimate logically belongs on the Plan object (just
as the Path’s workmem logically belongs on the Path), while the
workmem_limit logically belongs on the PlanState. This is why
workmem_limit is set inside InitPlan() — it’s an execution-time limit,
not a plan-time limit.
However, the workmem_limit is stored, physically, on the Plan object,
not the PlanState. This is to avoid a chicken-and-egg problem: (a) The
PlanState is not created until ExecInitNode(); but, (b) ExecInitNode()
also typically creates the node’s data structures, sized to
workmem_limit.
So we need a way to set workmem_limit after the Plan has been
finalized, but before any exec nodes are initialized. Accordingly, we
set this field on the Plan object, with the understanding that it
doesn’t “really” belong there.
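In outline (a simplified sketch, not the actual execMain.c code; the
argument to ExecAssignWorkMem() is a guess, and unrelated setup is
elided):

static void
InitPlan(QueryDesc *queryDesc, int eflags)
{
    /* ... existing setup: EState, range table, param info, etc. ... */

    /*
     * The Plan tree is final here, but no PlanState exists yet, so this
     * is the last point at which plan->workmem_limit can be set before
     * ExecInitNode() sizes hash tables and sort buffers from it.
     */
    ExecAssignWorkMem(queryDesc->plannedstmt);

    queryDesc->planstate = ExecInitNode(queryDesc->plannedstmt->planTree,
                                        queryDesc->estate, eflags);
}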
A nice consequence of storing workmem_limit on the Plan object, rather
than the PlanState, is that the limit automatically gets
serialized+deserialized to parallel workers. This simplifies Patch 3 a
little bit, since we can avoid executing ExecAssignWorkMem() on parallel
workers; but it really benefits Patch 4, because it allows the
ExecAssignWorkMem_hook to set a memory limit on the query, regardless
of the number of parallel workers that get spawned at runtime.
Notes about individual SQL operators follow--
Patch 1 reuses existing optimizer logic, as much as possible, to
calculate “workmem” — rounded up to the nearest KB, and with a minimum
of 64 KB. (The 64 KB minimum is because that’s the smallest a customer
can set the work_mem GUC, so it seems possible that some SQL operators
rely on the assumption that they’ll always get >= 64 KB of working
memory.)
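Stated as code (a trivial sketch of the rule just described, not the
patch's actual helper):

#include <math.h>

/* Round a byte estimate up to whole kilobytes, with a 64 KB floor --
 * the smallest value the work_mem GUC accepts. */
static int
normalize_workmem_kb(double nbytes)
{
    int     kb = (int) ceil(nbytes / 1024.0);

    return (kb < 64) ? 64 : kb;
}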
The PostgreSQL operators that use working memory can be placed into
three categories:
1. Operators that use working memory, and which also cost the
possibility of spilling. For these operators, Patch 1 just reports the
“nbytes” estimate that the optimizer already produces.
1a. Sort and IncrementalSort (via cost_tuplesort()).
1b. HashJoin (via ExecChooseHashTableSize()).
1c. Material (via cost_material()).
1d. Unique (via either cost_sort() or cost_agg()).
1e. Grouping Sets (via create_groupingsets_path(), which calls
cost_sort() and cost_agg()).
NOTE: Grouping Sets can end up creating a chain of Agg plan nodes,
each of which gets its own working-memory budget. Discussed below.
1f. Agg (via cost_agg()).
NOTE: Discussed below.
1g. SetOp (via create_setop_path()).
NOTE: A SetOp never spills; however, existing logic disables the SetOp
“if it doesn't look like the hashtable will fit into hash_mem.” It
assumes the hash entry size is: MAXALIGN(leftpath->pathtarget->width)
+ MAXALIGN(SizeofMinimalTupleHeader).
2. Operators that use working memory, but which do not currently cost
the possibility of spilling, because the existing estimate is assumed
to be unreliable. For these operators, Patch 1 just reports an
“unreliable” estimate.
2a. FunctionScan.
2b. TableFuncScan.
3. Remaining operators that use working memory, but for whatever
reason do not currently cost the possibility of spilling. For these
operators, Patch 1 just computes and reports an estimate, based on
logic appearing elsewhere in the code.
3a. RecursiveUnion. (Uses two Tuplestores, and possibly a
TupleHashTable.) Patch 1 uses nrterm to estimate one of the
Tuplestores; rterm to estimate the second Tuplestore; and (if
relevant) numGroups to estimate # of hash buckets.
3b. CteScan (but *not* WorkTableScan). (Relies on cost_ctescan().)
Patch 1 just uses rows * width, since the output is materialized into
a Tuplestore.
3c. Memoize. Patch 1 uses ndistinct to estimate # of hash buckets.
3d. WindowAgg. Patch 1 uses startup_tuples to estimate # of tuples
materialized in the Tuplestore.
3e. BitmapHeapScan. Although the TID bitmaps created by the
bitmapqual’s BitmapIndexScan nodes are limited to work_mem, these
bitmaps lossify rather than spill. Patch 1 applies the inverse of
tbm_calculate_entries() to the expected number of heap rows, produced
by the optimizer.
3f. SubPlan, if it requires a hash table (and possibly a hash-NULL
table). Patch 1 uses rows and rows / 16, respectively, copying the
existing logic in nodeSubplan.c and subselect.c.
NOTE: Since we don't display SubPlans directly in EXPLAIN, Patch 1
includes this working-memory estimate along with the SubPlan’s parent
Plan node.
Final comments --
I think the attached patch-set is useful, by itself; but it also
serves as a necessary building block for future work to manage query
working-memory dynamically. For example, the optimizer could be
enhanced to trade off between a high-memory + low cost plan, and a
low-memory + high cost plan. The execution-time extension could be
enhanced to adjust its query-working-memory limit based on current,
global memory usage.
And individual exec nodes could be enhanced to request additional
working-memory, via hook, if they discover they need to spill
unexpectedly. (For example, this would be useful for serial Hash
Joins.)
Question / comments / suggestions / issues / complaints?
Thanks,
James Hunter
Attachment | Content-Type | Size |
---|---|---|
v01_0001-EXPLAIN-now-takes-work_mem-option-to-display-estimat.patch | application/octet-stream | 93.5 KB |
v01_0002-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch | application/octet-stream | 36.2 KB |
v01_0003-EXPLAIN-WORK_MEM-ON-now-shows-working-memory-limit.patch | application/octet-stream | 68.4 KB |
v01_0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch | application/octet-stream | 54.3 KB |
From: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
---|---|
To: | |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-02-24 21:46:49 |
Message-ID: | CAJVSvF63zkNJPgT=5SygMtP7dQ5ANFm0C0mhXvLr0Rh-ihtOjQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Feb 24, 2025 at 12:46 PM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
>
> Attached please find the patch set I mentioned, above, in [1]. It
> consists of 4 patches that serve as the building blocks for and a
> prototype of the "query_work_mem" GUC I proposed:
Only change in revision 2 is to Patch 3: adding 'execWorkmem.c' to
meson.build. As I use gcc "Makefile" on my dev machine, I did not
notice this omission until CFBot complained.
I bumped rev numbers on all other patches, even though they have not
changed, because I am unfamiliar with CFBot and am trying not to
confuse it (to minimize unnecessary email churn...)
Anyway, the patch set Works On My PC, and with any luck it will work
on CFBot as well now.
James
Attachment | Content-Type | Size |
---|---|---|
v02_0001-EXPLAIN-now-takes-work_mem-option-to-display-estimat.patch | application/octet-stream | 93.5 KB |
v02_0002-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch | application/octet-stream | 36.2 KB |
v02_0003-EXPLAIN-WORK_MEM-ON-now-shows-working-memory-limit.patch | application/octet-stream | 68.8 KB |
v02_0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch | application/octet-stream | 54.3 KB |
From: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
---|---|
To: | |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-02-24 22:07:35 |
Message-ID: | CAJVSvF4_Q7d+-UPTGsWTevf0K3YE0ktk7hryxFzF+mZgsRP4GA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Feb 24, 2025 at 1:46 PM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
>
> On Mon, Feb 24, 2025 at 12:46 PM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
> >
> > Attached please find the patch set I mentioned, above, in [1]. It
> > consists of 4 patches that serve as the building blocks for and a
> > prototype of the "query_work_mem" GUC I proposed:
>
> Only change in revision 2 is to Patch 3: adding 'execWorkmem.c' to
> meson.build. As I use gcc "Makefile" on my dev machine, I did not
> notice this omission until CFBot complained.
>
> I bumped rev numbers on all other patches, even though they have not
> changed, because I am unfamiliar with CFBot and am trying not to
> confuse it (to minimize unnecessary email churn...)
>
> Anyway, the patch set Works On My PC, and with any luck it will work
> on CFBot as well now.
Apologies for email churn. The attached patch set, "v03," Works On My
PC. Only change from "v02" is correcting a missing #include in my new
extension, in Patch 4. (Patches 1-3 remain unchanged from v02.)
James
Attachment | Content-Type | Size |
---|---|---|
v03_0001-EXPLAIN-now-takes-work_mem-option-to-display-estimat.patch | application/octet-stream | 93.5 KB |
v03_0002-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch | application/octet-stream | 36.2 KB |
v03_0003-EXPLAIN-WORK_MEM-ON-now-shows-working-memory-limit.patch | application/octet-stream | 68.8 KB |
v03_0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch | application/octet-stream | 54.3 KB |
From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-02-25 02:54:25 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Mon, 2025-02-24 at 12:46 -0800, James Hunter wrote:
> Attached please find the patch set I mentioned, above, in [1]. It
> consists of 4 patches that serve as the building blocks for and a
> prototype of the "query_work_mem" GUC I proposed:
I didn't look at the details yet. But from:
https://www.postgresql.org/message-id/CAJVSvF7x_DLj7-JrXvMB4_j%2BjzuvjG_7iXNjx5KmLBTXHPNdGA%40mail.gmail.com
I expected something much smaller in scope, where we just add a
"plan_work_mem" field to the Plan struct, copy the work_mem global GUC
to that field when we construct a Plan node, and then reference the
plan_work_mem instead of the GUC directly.
Can you give a bit more context about why we need so many changes,
including test changes?
Regards,
Jeff Davis
From: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-02-25 05:55:38 |
Message-ID: | CAJVSvF4KD-jBZxXFsWEZH9XL4bEUXXDjQ==VvzCKAxrNBevhHg@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Feb 24, 2025 at 6:54 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Mon, 2025-02-24 at 12:46 -0800, James Hunter wrote:
> > Attached please find the patch set I mentioned, above, in [1]. It
> > consists of 4 patches that serve as the building blocks for and a
> > prototype of the "query_work_mem" GUC I proposed:
>
> I didn't look at the details yet. But from:
>
> https://www.postgresql.org/message-id/CAJVSvF7x_DLj7-JrXvMB4_j%2BjzuvjG_7iXNjx5KmLBTXHPNdGA%40mail.gmail.com
>
> I expected something much smaller in scope, where we just add a
> "plan_work_mem" field to the Plan struct, copy the work_mem global GUC
> to that field when we construct a Plan node, and then reference the
> plan_work_mem instead of the GUC directly.
What you describe is basically Patch 3: it copies the work_mem and/or
work_mem * hash_mem_multiplier global GUCs onto a "workmem_limit"
field on the Plan struct, and then references that field instead of
the GUC.
Patch 3 basically consists of a new file that copies these GUCs to the
new field, along with small changes to all relevant execution nodes to
reach that field instead of the global GUC. Excluding test changes, it
adds 281 lines to new file "execWorkmem.c", and modifies 263 other
lines across 29 other files; most files have < 5 lines modified.
> Can you give a bit more context about why we need so many changes,
> including test changes?
So Patch 3 is what you describe, above. By itself, this does very
little, so Patch 4 serves as a PoC or demo showing how a cloud service
provider might use Patch 3's framework to provide better memory
management.
I don't think Patch 4 needs to go into core PostgreSQL, but I find it
helpful in demonstrating how the "workmem" framework could be used. It
adds a hook to allow an extension to override the "copy" function
added in Patch 3. The hook stuff itself is pretty small. And then, to
show how a useful extension could be written using that hook, Patch 4
includes a basic extension that implements the hook.
However, the ability to override a field via hook is useful only if
that hook has enough information to make an intelligent decision! So,
we need Patch 1, which just copies the existing "workmem" *estimates*,
from existing planner logic, onto a second Path / Plan field, this one
just called "workmem". It could be renamed "workmem_estimate," or
anything else -- the important thing is that this field is on the
Plan, so the hook can look at it when deciding how much working memory
to assign to that Plan node. The "workmem" field added by Patch 1 is
analogous to the "cost" field the planner already exposes.
Patch 1 adds ~200 lines to explain.c, to display workmem stats; and
modifies around 600 lines in costsize.c and createplan.c, to copy the
existing "workmem" estimate onto the Path and Plan fields. I could
omit populating the Path field, and copy only to the Plan field, but
it seemed like a good time to fill in the Path as well, in case future
logic wants to make use of it. (None of the other 3 patches use the
Path's "workmem" field.) So, Patch 1 is around 900 lines of code,
total, but none of the changes are very serious, since they just copy
existing estimates onto a field.
So that's Patches 1, 3, and 4; and Patch 2 is just some local
infrastructure so that Patches 3 and 4 can find all the query's
SubPlans, so they can assign working memory to them.
In summary:
* Patch 1 copies the planner's "workmem" *estimate* to a new field on the Plan;
* Patch 2 keeps track of SubPlans, so Patches 3 and 4 can assign
workmem to them;
* Patch 3 copies the "work_mem" *limit* GUC to a new field on the Plan; and
* Patch 4 is a demo / PoC / example of how an extension can override
the Plan's "workmem_limit" field, so it doesn't need to go into core
PostgreSQL.
We don't need test changes, but the code is pretty easy to test, so,
for completeness, I added a new unit test, which is basically just a
collection of queries copied from existing unit tests, displayed using
the new "EXPLAIN (work_mem on)" option. And the "test changes" in
Patch 3 just add "work mem limit" to the output -- it's the same test
as Patch 1, just showing more information thanks to Patch 3.
Overall, the changes made to PostgreSQL are minimal, but they are
spread across multiple files... because (a) costing is spread across
multiple files (costsize.c, pathnode.c, and createplan.c -- and also
subselect.c, etc.); and (b) query execution is spread across multiple
files (e.g., nodeSort.c, nodeHash.c, etc.).
Every time we cost a Path or Plan that uses workmem, Patch 1 needs to
copy the Path/Plan's workmem estimate onto the new field. And every
time an exec node uses workmem, Patch 3 needs to read the workmem
limit off the new field. So looking at the number of files touched
overstates the size of the patches.
Thanks,
James
From: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-02-26 21:27:17 |
Message-ID: | CAJVSvF6ckAvcFkCT-cWAqYpVn2MM5zOEiTT-ubBB4mjFfCkAWQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Feb 24, 2025 at 9:55 PM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
>
> On Mon, Feb 24, 2025 at 6:54 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> >
> > On Mon, 2025-02-24 at 12:46 -0800, James Hunter wrote:
> > > Attached please find the patch set I mentioned, above, in [1]. It
> > > consists of 4 patches that serve as the building blocks for and a
> > > prototype of the "query_work_mem" GUC I proposed:
> >
> > I didn't look at the details yet. But from:
> >
> > https://www.postgresql.org/message-id/CAJVSvF7x_DLj7-JrXvMB4_j%2BjzuvjG_7iXNjx5KmLBTXHPNdGA%40mail.gmail.com
> >
> > I expected something much smaller in scope, where we just add a
> > "plan_work_mem" field to the Plan struct, copy the work_mem global GUC
> > to that field when we construct a Plan node, and then reference the
> > plan_work_mem instead of the GUC directly.
Attaching a new refactoring, which splits the code changes into
patches by functionality. This refactoring yields 5 patches, each of
which is relatively localized. I hope that the result will be more
focused and more feasible to review.
* Patch 1: modifies file setrefs.c, to track all (regular) SubPlan
objects that occur inside of Plan node (qual) expressions, on a new
Plan.subPlan list (parallel to the existing Plan.initPlan list, which
is for SubPlans that have been turned into init plans).
[Patch 1 has no visible side effects, since it just populates a list
on the Plan object.]
* Patch 2: copies the work_mem [* hash_mem_multiplier] GUC(s) to a new
Plan field, Plan.workmem_limit; and modifies existing exec nodes to
read the limit from this field instead of the GUCs. Adds a new file,
"execWorkmem.c", that does the GUC copying, and modifies existing exec
nodes to read the new field(s).
[Patch 2 has no visible side effects, since it just refactors code, to
store the GUCs on a field and then read those fields instead of the
GUCs.]
* Patch 3: stores the optimizer's estimate of how much working memory
a given Path / Plan node will use, on the Path / Plan, in a new field,
"workmem". (I used "workmem" for the estimate, vs. "workmem_limit," in
Patch 2, for the limit. This is to try to be parallel with the
existing "rows" and "cost" estimates.) Involves a significant amount
of code in costsize.c and createplan.c, because sometimes this
estimate is not readily available.
(What I mean is: while Patch 2 just reads the workmem_limit from a
GUC, Patch 3 has to estimate the actual workmem by basically
multiplying (width * rows). But not all Paths / Plans cost the
possibility of spilling, so sometimes I have to copy this formula from
the corresponding exec node, etc. The logical changes in Patch 3 are
simple, but the physical LoC is larger.)
[Patch 3 has no visible side effect, since it just stores an estimate
on the Plan object.]
* Patch 4: modifies file explain.c to implement a "work_mem on" option
to the EXPLAIN command. Also adds a unit test that shows that this
"work_mem on" option works as expected.
[Patch 4 is pure visible side effect -- all it does is add a new
option to display workmem stats to the customer. But it doesn't change
any existing behavior: it just adds a new EXPLAIN option.]
* Patch 5: adds a sample extension / hook that shows how Patches 2 and
3 can be used -- without much effort! -- to implement a per-query
working-memory limit, that gives more working memory to exec nodes
that (are estimated to) need it, while taking working memory away, if
necessary, from exec nodes that (we estimate) don't need it.
The refactored patch set should be more feasible to review, since each
patch is now localized to a single piece of functionality.
Note that Patch 5 isn't essential to merge into core PostgreSQL, since
it's mostly a proof-of-concept for how a "work_mem.query_work_mem" GUC
could be implemented. But Patches 2 and 3 are needed, since they
expose the limit and estimate, on the Plan, on which Patch 5 (or any
similar working-memory extension) relies.
Thanks again,
James Hunter
Attachment | Content-Type | Size |
---|---|---|
0001-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch | application/octet-stream | 36.2 KB |
0002-Store-working-memory-limit-on-Plan-field-rather-than.patch | application/octet-stream | 43.0 KB |
0003-Add-workmem-estimate-to-Path-and-Plan-nodes.patch | application/octet-stream | 52.8 KB |
0004-Add-EXPLAIN-work_mem-on-command-option.patch | application/octet-stream | 48.4 KB |
0005-Add-workmem_hook-to-allow-extensions-to-override-per.patch | application/octet-stream | 54.2 KB |
From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-02-27 00:08:59 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Wed, 2025-02-26 at 13:27 -0800, James Hunter wrote:
> Attaching a new refactoring, which splits the code changes into
> patches by functionality. This refactoring yields 5 patches, each of
> which is relatively localized. I hope that the result will be more
> focused and more feasible to review.
Thank you, yes, this is helpful.
Taking a step back:
My idea was that we'd attach work_mem to each Path node and Plan node
at create time. For example, in create_agg_path() it could do:
pathnode->path.node_work_mem = work_mem;
And then add to copy_generic_path_info():
dest->node_work_mem = src->node_work_mem;
(and similarly for other nodes; at least those that care about
work_mem)
Then, at execution time, use node->ps.ss.plan.node_work_mem instead of
work_mem.
Similarly, we could track the node_mem_wanted, which would be the
estimated amount of memory the node would use if unlimited memory were
available. I believe that's already known (or a simple calculation) at
costing time, and we can propagate it from the Path to the Plan the
same way.
(A variant of this approach could carry the values into the PlanState
nodes as well, and the executor would use that value instead.)
Extensions like yours could have a GUC like ext.query_work_mem and use
planner_hook to modify plan after standard_planner is done, walking the
tree and adjusting each Plan node's node_work_mem to obey
ext.query_work_mem. Another extension might hook in at path generation
time with set_rel_pathlist_hook or set_join_pathlist_hook to create
paths with lower node_work_mem settings that total up to
ext.query_work_mem. Either way, the node_work_mem settings would end up
being enforced by the executor by tracking memory usage and spilling
when it exceeds node->ps.ss.plan.node_work_mem.
Your patch 0001 looks like it makes it easier to find all the relevant
Plan nodes, so that seems like a reasonable step (didn't look at the
details).
But your patch 0002 and 0003 are entirely different from what I
expected. I just don't understand why we need anything as complex or
specific as ExecAssignWorkMem(). If we just add it at the time the Path
is created, and then propagate it to the plan with
copy_generic_path_info(), that would be a lot less code. What am I
missing?
Regards,
Jeff Davis
From: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-02-27 18:01:17 |
Message-ID: | CAJVSvF6bDfS1_MoqH+RNHFXTo4ZNkcHLj+fgZ+C0RSL2sVeQzQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Feb 26, 2025 at 4:09 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> My idea was that we'd attach work_mem to each Path node and Plan node
> at create time. For example, in create_agg_path() it could do:
>
> pathnode->path.node_work_mem = work_mem;
>
> And then add to copy_generic_path_info():
>
> dest->node_work_mem = src->node_work_mem;
>
> (and similarly for other nodes; at least those that care about
> work_mem)
>
> Then, at execution time, use node->ps.ss.plan.node_work_mem instead of
> work_mem.
This is essentially what patches 2 and 3 do. Some comments:
First, there's no need to set the workmem_limit at Path time, since
it's not needed until the Plan is init-ted into a PlanState. So I set
it on the Plan but not on the Path.
Second, the logic to assign a workmem_limit to the Agg node is a bit
more complicated than in your example, because the Agg could be either
a Hash or a Sort. If it's a Hash, it gets work_mem * hash_mem_limit;
and if it's a Sort it gets either 0 or work_mem.
We can adjust the logic so that it gets work_mem instead of 0, by
pushing the complexity out of the original workmem_limit assignment
and into later code blocks -- e.g., in an extension -- but we still
need to decide whether the Agg is a Hash or a Sort. This is why Patch
2 does:
switch (agg->aggstrategy)
{
case AGG_HASHED:
case AGG_MIXED:
/*
* Because nodeAgg.c will combine all AGG_HASHED nodes into a
* single phase, it's easier to store the hash working-memory
* limit on the first AGG_{HASHED,MIXED} node, and set it to zero
* for all subsequent AGG_HASHED nodes.
*/
agg->plan.workmem_limit = is_first ?
get_hash_memory_limit() / 1024 : 0;
break;
case AGG_SORTED:
/*
* Also store the sort-output working-memory limit on the first
* AGG_SORTED node, and set it to zero for all subsequent
* AGG_SORTED nodes.
*
* We'll need working-memory to hold the "sort_out" only if this
* isn't the last Agg node (in which case there's no one to sort
* our output).
*/
agg->plan.workmem_limit = *is_first_sort && !is_last ?
work_mem : 0;
*is_first_sort = false;
break;
default:
break;
}
Notice that the logic also sets the limit to 0 on certain Agg nodes --
this can be avoided, at the cost of added complexity later. The added
complexity is because, for example, for Hash Aggs, all Hash tables
share the same overall workmem_limit. So any workmem_limit set on
subsequent hash Agg nodes would be ignored, which means that setting
that limit adds complexity by obscuring the underlying logic.
> Similarly, we could track the node_mem_wanted, which would be the
> estimated amount of memory the node would use if unlimited memory were
> available. I believe that's already known (or a simple calculation) at
> costing time, and we can propagate it from the Path to the Plan the
> same way.
And this is what Patch 3 does. As you point out, this is already
known, or, if not, it's a simple calculation. This is all that Patch 3
does.
> (A variant of this approach could carry the values into the PlanState
> nodes as well, and the executor would use that value instead.)
That's not needed, though, and would violate existing PG conventions:
we don't copy anything from Plan to PlanState, because the assumption
is that the PlanState always has access to its corresponding Plan.
(The reason we copy from Path to Plan, I believe, is that we drop all
Paths, to save memory; because we generally have many more Paths than
Plans.)
> Extensions like yours could have a GUC like ext.query_work_mem and use
> planner_hook to modify plan after standard_planner is done, walking the
> tree and adjusting each Plan node's node_work_mem to obey
> ext.query_work_mem. Another extension might hook in at path generation
> time with set_rel_pathlist_hook or set_join_pathlist_hook to create
> paths with lower node_work_mem settings that total up to
> ext.query_work_mem.
I don't set workmem_limit at Path time, because it's not needed
there; but I do set the workmem (estimate) at Path time, exactly so
that future optimizer hooks could make use of a Path's workmem
(estimate) to decide between different Paths.
Patch 3 sets workmem (estimate) on the Path and copies it to the Plan.
As you know, there's deliberately not a 1-1 correspondence between
Path and Plan (the way there is generally a 1-1 correspondence between
Plan and PlanState), so Patch 3 has to do some additional work to
propagate the workmem (estimate) from Path to Plan. You can see
existing examples of similar work inside file createplan.c. Creating a
Plan from a Path is not generally as simple as just copying the Path's
fields over; there are lots of special cases.
Although Patch 3 sets workmem (estimate) on the Plan, inside
createplan.c, Patch 2 doesn't set workmem_limit inside createplan.c.
An earlier draft of the patchset *did* set it there, but because of
all the special casing in createplan.c, this ended up becoming
difficult to understand and maintain.
> Either way, the node_work_mem settings would end up
> being enforced by the executor by tracking memory usage and spilling
> when it exceeds node->ps.ss.plan.node_work_mem.
More precisely: the settings are enforced by the executor, by having
each PlanState's ExecInitNode() override refer to the
Plan.workmem_limit field, rather than the corresponding GUC(s). This means
that the final workmem_limit needs to be set before ExecInitNode() is
called.
> But your patch 0002 and 0003 are entirely different from what I
> expected. I just don't understand why we need anything as complex or
> specific as ExecAssignWorkMem(). If we just add it at the time the Path
> is created, and then propagate it to the plan with
> copy_generic_path_info(), that would be a lot less code. What am I
> missing?
Patches 2 and 3 are as you described above. I have been trying to
understand what you mean by "a lot less code," and I think two things
about these patches stand out to you:
1. Patch 2 performs its own Plan tree traversal, in
ExecAssignWorkMem(), instead of relying on the existing traversal in
function create_plan(). I outlined the reasons for this decision
above. Because of the point immediately below this, embedding Patch
2's logic into create_plan() ended up making the code much harder to
follow, so I broke out the traversal into its own (very simple)
ExecAssignWorkMem() function.
2. Patches 2 and 3 necessarily contain logic for various special
cases, where the workmem (estimate) and workmem_limit are not as
simple as in your example above. (But your example is actually not as
simple as you make it out to be, as discussed above and below...)
To understand (2), it helps to have a sense for how, for example, file
createplan.c has already been extended to handle special cases. We
already have functions like copy_plan_costsize(),
make_unique_from_sortclauses(), make_sort_from_groupcols(), etc. Just
from the function names, it's clear that we routinely generate new
Plan nodes that don't have corresponding Paths. Existing function
create_groupingsets_plan() is 150 Lines of Code, because it turns a
single GroupingSetsPath into a chain of Agg nodes. And, of course,
there's the whole logic around turning AlternativeSubPlans into
SubPlans, inside file setrefs.c!
So, generally, Patches 2 and 3 do exactly what you expect, but they
handle (existing) special cases, which ends up requiring more code. If
PostgreSQL didn't have these special cases, I think the patches would
be as short as you expect. For example, if Agg nodes behaved as in
your example, quoted at the top of this email, then we wouldn't need
Patch 2's additional logic to assign workmem_limit to Aggs (and we
wouldn't need the corresponding logic in Patch 3, to assign workmem
(estimate) to Aggs, either).
But Aggs aren't as simple as in your example -- they have Hash limits
and Sort limits; they have a side-chain of Agg nodes; they have input
sets they need to Sort; etc. And so we need a couple dozen lines of
code to handle them.
Thanks for the feedback,
James Hunter
From: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-03-05 01:47:16 |
Message-ID: | CAJVSvF5n3_uEGW5GZSRehDuTfz7XVDohbn7tVJ+2ZnweQEVFrQ@mail.gmail.com |
Lists: | pgsql-hackers |
Attaching a new revision, which substantially reworks the previous revision --
For the previous revision, I ran into problems (exposed by CI tests)
when trying to get my "subPlan" list to work, because this means we
have two pointers into a single SubPlan, which breaks both
serialization and copyObject().
This led to a new approach. The former Patch 1 is no longer needed,
because that "subPlan" logic never worked anyway.
Now, I store the workmem info in Lists, first on the PlannerGlobal and
then transferred to the PlannedStmt. Every [Sub]Plan that needs
working memory gets a "workmem_id" index into these Lists. Since it's
just an index, it survives serialization and copyObject().
So the workmem info can now be successfully round-tripped. It also
makes it easier (and faster) for an extension to adjust workmem limits
for an entire query, since all of the query's workmem info is
available directly from the PlannedStmt -- without requiring us to
traverse the Plan + Expr trees. (My example hook/extension dropped by
a couple hundred LoC, since the previous revision, because now it can
just loop over a List, instead of needing to walk a Plan tree.)
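To illustrate the indirection (sketch only: the PlannedStmt list field
name below is an assumption, while plan->workmem_id is as described
above):

/* An exec node looks up its limit by index, instead of consulting a GUC: */
int         limit_kb = list_nth_int(stmt->workMemLimits, plan->workmem_id);

/* And an extension can adjust the whole query's limits without walking
 * the Plan tree: */
ListCell   *lc;

foreach(lc, stmt->workMemLimits)
{
    /* inspect or rewrite lfirst_int(lc) against the query budget here */
}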
So, now we have:
- Patch 1: adds a workmem limit to the PlannerGlobal, inside
createplan.c, and stores the corresponding workmem_id on the Plan or
SubPlan. The List is copied from the PlannerGlobal to the PlannedStmt,
as normal. We trivially set the workmem limit inside
ExecAssignWorkMem(), called from InitPlan.
This patch is a no-op, since it just copies existing GUC values to the
workmem limit, and then applies that limit inside ExecInitNode().
- Patch 2: copies the planner's workmem estimate to the PlannerGlobal
/ PlannedStmt, to allow an extension to set the workmem limit
intelligently (without needing to traverse to the Plan or SubPlan).
This patch is a no-op, since it just records an estimate on the
PlannerGlobal / PlannedStmt, but doesn't do anything with it (yet).
- Patch 3: displays the workmem info we set in Patches 1 and 2, to a
new EXPLAIN (work_mem on) option. Also adds a unit test.
- Patch 4: adds a hook and extension that show how to override the
default workmem limits, to implement a query_work_mem GUC.
I think this version is pretty close to a finished design proposal:
* top-level list(s) of workmem info;
* Plans and SubPlans that need workmem "registering" themselves
during createplan.c;
* exec nodes reading their workmem limits from the PlannedStmt, via
plan->workmem_id (or variants, in cases where a [Sub]Plan has multiple
data structures of *different* sizes);
* InitPlan() calls a function or hook to fill in the actual workmem limits;
* Workmem info copied / serialized to PQ workers, and stored in Plan
cache (but the limit is always overwritten inside InitPlan()); and
* Hook / extension reads the workmem info and sets a sensible limit,
based on its own heuristic.
Patch 4 shows that we can pretty easily (400 lines, including
comments) propagate a per-query workmem limit to individual
[Sub]Plans' data structures, in a reasonable way.
Compared to the previous revision, this patch set:
- eliminates the Plan traversal in execWorkMem.c and workmem.c;
- removes the "SubPlan" logic from setrefs.c, leaving setrefs unchanged; and
- sets the estimate and reserves a slot for the limit, inside createplan.c.
So, now, the logic to assign workmem limits is just a for-loop in
execWorkMem.c; and it's just two for-loops plus one sort, in the
workmem extension.
Questions, comments?
Thanks,
James
Attachment | Content-Type | Size |
---|---|---|
0001-Store-working-memory-limit-per-Plan-SubPlan-rather-t.patch | application/octet-stream | 54.1 KB |
0002-Add-workmem-estimates-to-Path-node-and-PlannedStmt.patch | application/octet-stream | 59.2 KB |
0003-Add-EXPLAIN-work_mem-on-command-option.patch | application/octet-stream | 47.5 KB |
0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch | application/octet-stream | 47.7 KB |
From: | James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date: | 2025-04-07 22:36:59 |
Message-ID: | CAJVSvF7HaSz=-b1g5BCUML17=SdjbYV+pSFHH9WuRfRCGScGRw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Mar 4, 2025 at 5:47 PM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
>
> Attaching a new revision, which substantially reworks the previous revision --
>
Attaching a rebased revision, with some minor changes.
Also, some context for why this change is especially useful for cloud
variants of PostgreSQL -- if you compare PostgreSQL guidance for
buffer pool size [1] to Amazon Aurora's [2], PostgreSQL recommends the
buffer pool to be sized to 25% of system memory, while Aurora
recommends it to be sized to ~ 70%. PostgreSQL explicitly relies on
the OS filesystem cache, effectively to extend the buffer pool; while
Aurora docs don't mention this at all.
Accordingly, Aurora PostgreSQL queries have less memory to work with
than ordinary PostgreSQL queries, making per-Node memory limits more
important.
Questions, comments?
Thanks,
James
[1] https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS
[2] https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-postgresql-parameters/shared-buffers.html
Attachment | Content-Type | Size |
---|---|---|
0002-Add-workmem-estimates-to-Path-node-and-PlannedStmt.patch | application/octet-stream | 60.3 KB |
0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch | application/octet-stream | 48.4 KB |
0001-Store-working-memory-limit-per-Plan-SubPlan-rather-t.patch | application/octet-stream | 54.1 KB |
0003-Add-EXPLAIN-work_mem-on-command-option.patch | application/octet-stream | 48.0 KB |