From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-23 04:57:43
Message-ID: [email protected]
Hi all,
Attached is a patch series that adds two features to logical replication:
the ability to define a memory limit for the reorderbuffer (responsible
for building the decoded transactions), and the ability to stream large
in-progress transactions (exceeding the memory limit).
I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.
PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------
Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:
* The value is hard-coded, so it's not possible to customize it.
* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.
* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.
So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer.
Secondly, it adds simple memory accounting, tracking the amount of memory
used in total (for the whole reorder buffer, to compare against
logical_work_mem) and per transaction (so that we can quickly pick a
transaction to spill to disk).
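To make that concrete, here is a minimal self-contained sketch of the
accounting idea (all names here are made up for illustration, not the
actual patch code):

#include <stdbool.h>
#include <stddef.h>

typedef struct TxnSketch { size_t size; } TxnSketch;        /* per-xact tally */
typedef struct BufSketch { size_t total_size; } BufSketch;  /* whole buffer */

/* Called whenever a change is added to (or removed from) a transaction. */
static void
account_change(BufSketch *rb, TxnSketch *txn, size_t change_size, bool add)
{
    if (add)
    {
        txn->size += change_size;       /* used to pick a spill victim */
        rb->total_size += change_size;  /* compared against logical_work_mem */
    }
    else
    {
        txn->size -= change_size;
        rb->total_size -= change_size;
    }
}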
The one wrinkle in the patch is that the memory limit can't be enforced
when reading changes spilled to disk - with multiple subtransactions, we
can't easily predict how many changes to pre-read for each of them. At
that point we still use the existing max_changes_in_memory limit.
Luckily, changes introduced in the other parts of the patch should allow
addressing this deficiency.
PART 2: streaming of large in-progress transactions (0002-0006)
---------------------------------------------------------------
Note: This part is split into multiple smaller chunks, addressing
different parts of the logical decoding infrastructure. That's mostly to
allow easier reviews, though. Ultimately, it's just one patch.
Processing large transactions often results in significant apply lag,
for a couple of reasons. One reason is network bandwidth - while we do
decode the changes incrementally (as we read the WAL), we keep them
locally, either in memory, or spilled to files. Then at commit time, all
the changes get sent to the downstream (and applied) at the same time.
For large transactions the time to do the network transfer may be
significant, causing apply lag.
This patch extends the logical replication infrastructure (output plugin
API, reorder buffer, pgoutput, replication protocol, etc.) to allow
streaming of in-progress transactions instead of spilling them to local
files.
The extensions to the API are pretty straightforward. Aside from adding
methods to stream changes/messages and commit a streamed transaction,
the API needs a function to abort a streamed (sub)transaction, and
functions to demarcate a block of streamed changes.
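For illustration, the shape of that API extension might look roughly like
this (a sketch only - the callback names and signatures here are made up,
not the ones defined by the patch):

#include <stdint.h>

typedef struct StreamCallbacksSketch
{
    /* demarcate a block of streamed changes */
    void (*stream_start)(void *ctx, uint32_t xid);
    void (*stream_stop)(void *ctx, uint32_t xid);

    /* stream changes/messages belonging to an in-progress transaction */
    void (*stream_change)(void *ctx, uint32_t xid, void *change);
    void (*stream_message)(void *ctx, uint32_t xid, const char *message);

    /* abort a streamed (sub)transaction, commit a streamed transaction */
    void (*stream_abort)(void *ctx, uint32_t xid);
    void (*stream_commit)(void *ctx, uint32_t xid);
} StreamCallbacksSketch;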
To decode a transaction, we need to know all its subtransactions and
invalidations. Currently, those are only known at commit time: some
assignments may be known earlier, but invalidations are only ever written
in the commit record.
So far that was fine, because we only decode/replay transactions at
commit time, when all of this is known (because it's either in commit
record, or written before it).
But for in-progress transactions (i.e. the subject of interest here),
that is not the case. So the patch modifies WAL-logging to ensure those
two bits of information are written immediately (for wal_level=logical).
For assignments this was fairly simple, thanks to existing caching. For
invalidations, it required a new WAL record type and a couple of changes
in inval.c.
On the apply side, we simply receive the streamed changes and write them
into a file (one file per toplevel transaction, which is possible thanks
to the assignments being known immediately). Then at commit time the
changes are replayed locally, without having to copy a large chunk of
data over the network.
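Roughly, the apply-side flow could be sketched like this (simplified,
with made-up names and file paths - the real worker uses the replication
protocol and proper temporary files):

#include <stdint.h>
#include <stdio.h>

/* Append a block of streamed changes to the file of its toplevel xact. */
static void
apply_handle_stream_block(uint32_t toplevel_xid, const char *data, size_t len)
{
    char path[64];

    snprintf(path, sizeof(path), "stream-%u.changes", toplevel_xid);

    FILE *f = fopen(path, "ab");    /* one file per toplevel transaction */
    if (f)
    {
        fwrite(data, 1, len, f);
        fclose(f);
    }
}

/* At commit time, replay the accumulated changes locally. */
static void
apply_handle_stream_commit(uint32_t toplevel_xid)
{
    char path[64];

    snprintf(path, sizeof(path), "stream-%u.changes", toplevel_xid);
    /* ... read the file back and apply the changes in order ... */
    remove(path);                   /* the spill file is no longer needed */
}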
WAL overhead
------------
Of course, these changes to WAL logging are not for free - logging
assignments individually (instead of multiple subtransactions at once)
means higher xlog record overhead. Similarly, (sub)transactions doing a
lot of DDL may result in a lot of invalidations written to WAL (again,
with full xlog record overhead per invalidation).
I've done a number of tests to measure the impact, and for extreme
corner cases the additional amount of WAL is about 40% in both cases.
By an "extreme corner case" I mean a workloads intentionally triggering
many assignments/invalidations, without doing a lot of meaningful work.
For assignments, imagine a single-row table (no indexes), and a
transaction like this one:
BEGIN;
UPDATE t SET v = v + 1;
SAVEPOINT s1;
UPDATE t SET v = v + 1;
SAVEPOINT s2;
UPDATE t SET v = v + 1;
SAVEPOINT s3;
...
UPDATE t SET v = v + 1;
SAVEPOINT s10;
UPDATE t SET v = v + 1;
COMMIT;
For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction.
For more realistic workloads (large table with indexes, runs long enough
to generate FPIs, etc.) the overhead drops below 5%. Which is much more
acceptable, of course, although not perfect.
In both cases, there was pretty much no measurable impact on performance
(as measured by tps).
I do not think there's a way around this requirement (having assignments
and invalidations), if we want to decode in-progress transactions. But
perhaps it would be possible to do some sort of caching (say, at command
level), to reduce the xlog record overhead? Not sure.
All ideas are welcome, of course. In the worst case, I think we can add
a GUC enabling this additional logging - when disabled, streaming of
in-progress transactions would not be possible.
Simplifying ReorderBuffer
-------------------------
One interesting consequence of having the assignments is that we could
get rid of the ReorderBuffer iterator used to merge changes from subxacts.
The assignments allow us to keep the changes for each toplevel transaction
in a single list, in LSN order, and just walk it. An abort can be
performed by remembering the position of the first change in each subxact
and simply discarding the tail (see the sketch below).
This is what the apply worker does with the streamed changes and aborts.
It would also allow us to enforce the memory limit while restoring
transactions spilled to disk, because we would not have the problem with
restoring changes for many subtransactions.
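A minimal sketch of that tail-discarding abort (illustrative only; the
patch's actual change list is a different structure, but the idea is the
same):

#include <stdint.h>
#include <stdlib.h>

typedef struct ChangeNode
{
    uint64_t           lsn;     /* changes are kept in LSN order */
    struct ChangeNode *next;
} ChangeNode;

/*
 * On subxact abort, everything from the subxact's first change to the end
 * of the toplevel list belongs to the subxact (or its children), so we
 * detach that tail and free it.
 */
static void
discard_tail(ChangeNode **head, ChangeNode *first_subxact_change)
{
    ChangeNode **link = head;

    while (*link && *link != first_subxact_change)
        link = &(*link)->next;

    *link = NULL;               /* cut the list before the subxact */

    while (first_subxact_change)
    {
        ChangeNode *next = first_subxact_change->next;

        free(first_subxact_change);
        first_subxact_change = next;
    }
}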
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size
---|---|---
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-me.patch.gz | application/gzip | 6.8 KB
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical.patch.gz | application/gzip | 2.7 KB
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz | application/gzip | 4.8 KB
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz | application/gzip | 5.3 KB
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz | application/gzip | 10.6 KB
0006-Add-support-for-streaming-to-built-in-replication.patch.gz | application/gzip | 13.3 KB
From: Erikjan Rijkers <er(at)xs4all(dot)nl>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-23 14:03:02
Message-ID: [email protected]
On 2017-12-23 05:57, Tomas Vondra wrote:
> Hi all,
>
> Attached is a patch series that adds two features to logical replication:
> the ability to define a memory limit for the reorderbuffer (responsible
> for building the decoded transactions), and the ability to stream large
> in-progress transactions (exceeding the memory limit).
>
logical replication of 2 instances is OK but 3 and up fail with:
TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)
I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Erikjan Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-23 20:06:05
Message-ID: [email protected]
On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
> On 2017-12-23 05:57, Tomas Vondra wrote:
>> Hi all,
>>
>> Attached is a patch series that adds two features to logical replication:
>> the ability to define a memory limit for the reorderbuffer (responsible
>> for building the decoded transactions), and the ability to stream large
>> in-progress transactions (exceeding the memory limit).
>>
>
> logical replication of 2 instances is OK but 3 and up fail with:
>
> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
> "reorderbuffer.c", Line: 1773)
>
> I can cobble up a script but I hope you have enough from the assertion
> to see what's going wrong...
The assertion says that the iterator produces changes in an order that
does not correlate with LSN. But I have a hard time understanding how that
could happen, particularly because according to the line number this
happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
So instructions to reproduce the issue would be very helpful.
Attached is v2 of the patch series, fixing two bugs I discovered today.
I don't think either of these is related to your issue, though.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size
---|---|---
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch.gz | application/gzip | 6.8 KB
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch.gz | application/gzip | 2.7 KB
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch.gz | application/gzip | 4.8 KB
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch.gz | application/gzip | 5.3 KB
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch.gz | application/gzip | 10.6 KB
0006-Add-support-for-streaming-to-built-in-replication-v2.patch.gz | application/gzip | 13.9 KB
From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-23 22:23:57
Message-ID: [email protected]
On 2017-12-23 21:06, Tomas Vondra wrote:
> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>> Hi all,
>>>
>>> Attached is a patch series that adds two features to logical replication:
>>> the ability to define a memory limit for the reorderbuffer (responsible
>>> for building the decoded transactions), and the ability to stream large
>>> in-progress transactions (exceeding the memory limit).
>>>
>>
>> logical replication of 2 instances is OK but 3 and up fail with:
>>
>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>> "reorderbuffer.c", Line: 1773)
>>
>> I can cobble up a script but I hope you have enough from the assertion
>> to see what's going wrong...
>
> The assertion says that the iterator produces changes in an order that
> does not correlate with LSN. But I have a hard time understanding how that
> could happen, particularly because according to the line number this
> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>
> So instructions to reproduce the issue would be very helpful.
Using:
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch
As you expected, the problem is the same with these new patches.
I have now tested more, and seen that it does not always fail. I'd guess
it fails here 3 times out of 4. But the laptop I'm using at the moment is
old and slow -- it may well be a factor, as we've seen before [1].
Attached is the bash script that I put together. I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
often. The same program run with HEAD never seems to fail (I tried a few
dozen times).
thanks,
Erik Rijkers
Attachment | Content-Type | Size
---|---|---
test.sh | text/x-shellscript | 7.3 KB
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-24 00:42:01
Message-ID: [email protected]
On 12/23/2017 11:23 PM, Erik Rijkers wrote:
> On 2017-12-23 21:06, Tomas Vondra wrote:
>> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>>> Hi all,
>>>>
>>>> Attached is a patch series that adds two features to logical replication:
>>>> the ability to define a memory limit for the reorderbuffer (responsible
>>>> for building the decoded transactions), and the ability to stream large
>>>> in-progress transactions (exceeding the memory limit).
>>>>
>>>
>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>
>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>> "reorderbuffer.c", Line: 1773)
>>>
>>> I can cobble up a script but I hope you have enough from the assertion
>>> to see what's going wrong...
>>
>> The assertion says that the iterator produces changes in an order that does
>> not correlate with LSN. But I have a hard time understanding how that
>> could happen, particularly because according to the line number this
>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>
>> So instructions to reproduce the issue would be very helpful.
>
> Using:
>
> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>
> As you expected, the problem is the same with these new patches.
>
> I have now tested more, and seen that it does not always fail. I'd guess
> it fails here 3 times out of 4. But the laptop I'm using at the moment is
> old and slow -- it may well be a factor, as we've seen before [1].
>
> Attached is the bash script that I put together. I tested with
> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
> often. The same program run with HEAD never seems to fail (I tried a few
> dozen times).
>
Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-24 04:51:52
Message-ID: CAMsr+YFbRTeX1_A+HcHP52NW5R7G8NDEvOwGzYHyXCMdD3sQkA@mail.gmail.com
On 23 December 2017 at 12:57, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> Hi all,
>
> Attached is a patch series that adds two features to logical replication:
> the ability to define a memory limit for the reorderbuffer (responsible
> for building the decoded transactions), and the ability to stream large
> in-progress transactions (exceeding the memory limit).
>
> I'm submitting those two changes together, because one builds on the
> other, and it's beneficial to discuss them together.
>
>
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
>
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:
>
> * The value is hard-coded, so it's not possible to customize it.
>
> * The amount of decoded changes to keep in memory is restricted by the
> number of changes. It's not very clear how this relates to memory
> consumption, as the change size depends on table structure, etc.
>
> * The number is "per (sub)transaction", so a transaction with many
> subtransactions may easily consume a significant amount of memory without
> actually hitting the limit.
>
Also, even without subtransactions, we assemble a ReorderBufferTXN per
transaction. Since transactions usually occur concurrently, systems with
many concurrent txns can face lots of memory use.
We can't exclude tables that won't actually be replicated at the reorder
buffering phase either. So txns use memory whether or not they do anything
interesting as far as a given logical decoding session is concerned. Even
if we'll throw all the data away we must buffer and assemble it first so we
can make that decision.
Because logical decoding considers snapshots and cid increments even from
other DBs (at least when the txn makes catalog changes) the memory use can
get BIG too. I was recently working with a system that had accumulated 2GB
of snapshots ... on each slot. With 7 slots, one for each DB.
So there's lots of room for difficulty with unpredictable memory use.
> So the patch does two things. Firstly, it introduces logical_work_mem, a
> GUC restricting memory consumed by all transactions currently kept in
> the reorder buffer
>
Does this consider the (currently high, IIRC) overhead of tracking
serialized changes?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-24 09:00:00
Message-ID: [email protected]
>>>>
>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>
>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>> "reorderbuffer.c", Line: 1773)
>>>>
>>>> I can cobble up a script but I hope you have enough from the assertion
>>>> to see what's going wrong...
>>>
>>> The assertion says that the iterator produces changes in an order that
>>> does not correlate with LSN. But I have a hard time understanding how that
>>> could happen, particularly because according to the line number this
>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>>
>>> So instructions to reproduce the issue would be very helpful.
>>
>> Using:
>>
>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>
>> As you expected, the problem is the same with these new patches.
>>
>> I have now tested more, and seen that it does not always fail. I'd guess
>> it fails here 3 times out of 4. But the laptop I'm using at the moment is
>> old and slow -- it may well be a factor, as we've seen before [1].
>>
>> Attached is the bash script that I put together. I tested with
>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
>> often. The same program run with HEAD never seems to fail (I tried a few
>> dozen times).
>>
>
> Thanks. Unfortunately I still can't reproduce the issue. I even tried
> running it in valgrind, to see if there are some memory access issues
> (which should also slow it down significantly).
One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)
Another Good Thing would be if there were a provision in the buildfarm
to test patches like these. But I'm probably not the first one to suggest
that; no doubt it'll be possible someday. In the meantime I'll try to
repeat this crash on other machines (but that will be after the holidays).
Erik Rijkers
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-24 13:43:49
Message-ID: [email protected]
On 12/24/2017 05:51 AM, Craig Ringer wrote:
> On 23 December 2017 at 12:57, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Hi all,
>>
>> Attached is a patch series that adds two features to logical replication:
>> the ability to define a memory limit for the reorderbuffer (responsible
>> for building the decoded transactions), and the ability to stream large
>> in-progress transactions (exceeding the memory limit).
>>
>> I'm submitting those two changes together, because one builds on the
>> other, and it's beneficial to discuss them together.
>>
>>
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
>>
>> * The value is hard-coded, so it's not possible to customize it.
>>
>> * The amount of decoded changes to keep in memory is restricted by the
>> number of changes. It's not very clear how this relates to memory
>> consumption, as the change size depends on table structure, etc.
>>
>> * The number is "per (sub)transaction", so a transaction with many
>> subtransactions may easily consume a significant amount of memory without
>> actually hitting the limit.
>
>
> Also, even without subtransactions, we assemble a ReorderBufferTXN
> per transaction. Since transactions usually occur concurrently,
> systems with many concurrent txns can face lots of memory use.
>
I don't see how that could be a problem, considering the number of
toplevel transactions is rather limited (to max_connections or so).
> We can't exclude tables that won't actually be replicated at the reorder
> buffering phase either. So txns use memory whether or not they do
> anything interesting as far as a given logical decoding session is
> concerned. Even if we'll throw all the data away we must buffer and
> assemble it first so we can make that decision.
Yep.
> Because logical decoding considers snapshots and cid increments even
> from other DBs (at least when the txn makes catalog changes) the memory
> use can get BIG too. I was recently working with a system that had
> accumulated 2GB of snapshots ... on each slot. With 7 slots, one for
> each DB.
>
> So there's lots of room for difficulty with unpredictable memory use.
>
Yep.
>> So the patch does two things. Firstly, it introduces logical_work_mem, a
>> GUC restricting memory consumed by all transactions currently kept in
>> the reorder buffer
>
>
> Does this consider the (currently high, IIRC) overhead of tracking
> serialized changes?
>
Consider in what sense?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-25 17:40:56
Message-ID: [email protected]
On 12/24/2017 10:00 AM, Erik Rijkers wrote:
>>>>>
>>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>>
>>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>>> "reorderbuffer.c", Line: 1773)
>>>>>
>>>>> I can cobble up a script but I hope you have enough from the assertion
>>>>> to see what's going wrong...
>>>>
>>>> The assertion says that the iterator produces changes in an order that
>>>> does not correlate with LSN. But I have a hard time understanding how that
>>>> could happen, particularly because according to the line number this
>>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>>>
>>>> So instructions to reproduce the issue would be very helpful.
>>>
>>> Using:
>>>
>>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>>
>>> As you expected, the problem is the same with these new patches.
>>>
>>> I have now tested more, and seen that it does not always fail. I'd guess
>>> it fails here 3 times out of 4. But the laptop I'm using at the moment is
>>> old and slow -- it may well be a factor, as we've seen before [1].
>>>
>>> Attached is the bash script that I put together. I tested with
>>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
>>> often. The same program run with HEAD never seems to fail (I tried a few
>>> dozen times).
>>>
>>
>> Thanks. Unfortunately I still can't reproduce the issue. I even tried
>> running it in valgrind, to see if there are some memory access issues
>> (which should also slow it down significantly).
>
> One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)
>
Well, I've done tests on various machines, including some really slow
ones, and I still haven't managed to reproduce the failures using your
script. So I don't think that would really help. But I have reproduced
it by using a custom stress test script.
Turns out the asserts are overly strict - instead of
Assert(prev_lsn < current_lsn);
it should have been
Assert(prev_lsn <= current_lsn);
because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).
The attached v3 fixes this issue, and also a couple of other thinkos:
1) The AssertChangeLsnOrder assert check was somewhat broken.
2) We've been sending aborts for all subtransactions, even those not yet
streamed. So downstream got confused and fell over because of an assert.
3) The streamed transactions were written to /tmp, with filenames based
on the subscription OID and the XID of the toplevel transaction. That's
fine as long as there's just a single replica running - if there are more,
the filenames will clash, causing really strange failures. So the files
were moved to base/pgsql_tmp, where regular temporary files are written.
I'm not claiming this is perfect; perhaps we need to invent another
location.
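To illustrate the point of the move (a hypothetical helper, not the
actual patch code): base/pgsql_tmp lives inside each cluster's data
directory, so two replicas on the same host can no longer collide the way
they could in /tmp:

#include <stdint.h>
#include <stdio.h>

static void
stream_file_path(char *buf, size_t buflen, uint32_t subid, uint32_t xid)
{
    /* the path is relative to this cluster's own data directory */
    snprintf(buf, buflen, "base/pgsql_tmp/pgsql_tmp-stream-%u-%u.changes",
             subid, xid);
}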
FWIW I believe the relation sync cache is somewhat broken by the
streaming. I thought resetting it would be good enough, but it's more
complicated (and trickier) than that. I'm aware of it, and I'll look
into that next - but probably not before 2018.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size
---|---|---
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v3.patch.gz | application/gzip | 6.8 KB
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v3.patch.gz | application/gzip | 2.7 KB
0003-Issue-individual-invalidations-with-wal_level-log-v3.patch.gz | application/gzip | 4.8 KB
0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch.gz | application/gzip | 5.3 KB
0005-Implement-streaming-mode-in-ReorderBuffer-v3.patch.gz | application/gzip | 10.7 KB
0006-Add-support-for-streaming-to-built-in-replication-v3.patch.gz | application/gzip | 14.0 KB
From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-25 22:08:43
Message-ID: [email protected]
That indeed fixed the problem: running that same pgbench test, I see no
crashes anymore (on any of 3 different machines, and with several
pgbench parameters).
Thank you,
Erik Rijkers
From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-26 14:50:45
Message-ID: CA+q6zcUUpU+bj_PmxzVw81Qga1dWMJM0-bvv80e7KAWvWjd2dQ@mail.gmail.com
> On 25 December 2017 at 18:40, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> The attached v3 fixes this issue, and also a couple of other thinkos
Thank you for the patch, it looks quite interesting. After a quick look
at it (mostly the first one so far, but I'm going to continue) I have a
few questions:
> + * XXX With many subtransactions this might be quite slow, because we'll have
> + * to walk through all of them. There are some options how we could improve
> + * that: (a) maintain some secondary structure with transactions sorted by
> + * amount of changes, (b) not looking for the entirely largest transaction,
> + * but e.g. for transaction using at least some fraction of the memory limit,
> + * and (c) evicting multiple transactions at once, e.g. to free a given portion
> + * of the memory limit (e.g. 50%).
Do you want to address these possible alternatives somehow in this patch,
or leave that outside it? Maybe it makes sense to apply some combination
of them, e.g. maintain a secondary structure with relatively large
transactions, and then start evicting them. If that's somehow not enough,
then start to evict multiple transactions at once (option "c").
> + /*
> +  * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> +  * uses a higher minimum value (1MB), so this is OK.
> +  */
> + if (*newval < 64)
> +     *newval = 64;
> +
I'm not sure what's recommended practice here, but maybe it makes sense
to emit a warning when the value gets clamped to 64kB? Otherwise it can
be unexpected.
From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-02 15:07:54
Message-ID: [email protected]
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
The documentation in this patch contains some references to later
features (streaming). Perhaps that could be separated so that the
patches can be applied independently.
I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could then
have undesirable side effects on this use.
Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem. Maybe we could think of another name.
I think we need a way to report on how much memory is actually used, so
the setting can be tuned. Something analogous to log_temp_files perhaps.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-03 19:53:59
Message-ID: [email protected]
On 01/02/2018 04:07 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>
> The documentation in this patch contains some references to later
> features (streaming). Perhaps that could be separated so that the
> patches can be applied independently.
>
Yeah, that's probably a good idea. But now that you mention it, I wonder
if "streaming" is really a good term. We already use it for "streaming
replication" and it may be quite confusing to use it for another feature
(particularly when it's streaming within logical streaming replication).
But I can't really think of a better name ...
> I don't see the need to tie this setting to maintenance_work_mem.
> maintenance_work_mem is often set to very large values, which could
> then have undesirable side effects on this use.
>
Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.
Pretty much any default value can have undesirable side effects.
> Moreover, the name logical_work_mem makes it sound like it's a logical
> version of work_mem. Maybe we could think of another name.
>
I won't object to a better name, of course. Any proposals?
> I think we need a way to report on how much memory is actually used,
> so the setting can be tuned. Something analogous to log_temp_files
> perhaps.
>
Yes, I agree. I'm just about to submit an updated version of the patch
series that also introduces a few columns to pg_stat_replication, tracking
this type of stats (amount of data spilled to disk or streamed, etc.).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-03 20:06:59
Message-ID: [email protected]
Hi,
attached is v4 of the patch series, with a couple of changes:
1) Fixes a bunch of bugs I discovered during stress testing.
I'm not going to go into details, but the main fixes are related to
properly updating progress from the worker, and not streaming when
creating the logical replication slot.
2) Introduces columns into pg_stat_replication.
The new columns track various kinds of statistics (number of xacts,
bytes, ...) about spill-to-disk/streaming. This will be useful when
tuning the GUC memory limit.
3) Two temporary bugfixes that make the patch series work.
The first one (0008) makes sure is_known_subxact is set properly for all
subtransactions, and there's a separate fix in the CF. So this will
eventually go away.
The second one (0009) fixes an issue that is specific to streaming. It
does fix the issue, but I need a bit more time to think about it before
merging it into 0005.
This does pass extensive stress testing with a workload mixing DML, DDL,
subtransactions, aborts, etc. under valgrind. I'm working on extending
the test coverage, and introducing various error conditions (e.g.
walsender/walreceiver timeouts, failures on both ends, etc.).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size
---|---|---
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v4.patch.gz | application/gzip | 9.9 KB
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v4.patch.gz | application/gzip | 2.7 KB
0003-Issue-individual-invalidations-with-wal_level-log-v4.patch.gz | application/gzip | 4.8 KB
0004-Extend-the-output-plugin-API-with-stream-methods-v4.patch.gz | application/gzip | 5.3 KB
0005-Implement-streaming-mode-in-ReorderBuffer-v4.patch.gz | application/gzip | 10.8 KB
0006-Add-support-for-streaming-to-built-in-replication-v4.patch.gz | application/gzip | 18.2 KB
0007-Track-statistics-for-streaming-spilling-v4.patch.gz | application/gzip | 4.3 KB
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v4.patch.gz | application/gzip | 571 bytes
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v4.patch.gz | application/gzip | 629 bytes
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-03 20:13:29
Message-ID: [email protected]
On 01/03/2018 09:06 PM, Tomas Vondra wrote:
> Hi,
>
> attached is v4 of the patch series, with a couple of changes:
>
> 1) Fixes a bunch of bugs I discovered during stress testing.
>
> I'm not going to go into details, but the main fixes are related to
> properly updating progress from the worker, and not streaming when
> creating the logical replication slot.
>
> 2) Introduces columns into pg_stat_replication.
>
> The new columns track various kinds of statistics (number of xacts,
> bytes, ...) about spill-to-disk/streaming. This will be useful when
> tuning the GUC memory limit.
>
> 3) Two temporary bugfixes that make the patch series work.
>
Forgot to mention that v4 also extends CREATE SUBSCRIPTION to allow
customizing the streaming and memory limit. So you can do
CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)
and this subscription will allow streaming, and logical_work_mem (on the
provider) will be set to 1MB.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-09 02:24:55
Message-ID: [email protected]
On 1/3/18 14:53, Tomas Vondra wrote:
>> I don't see the need to tie this setting to maintenance_work_mem.
>> maintenance_work_mem is often set to very large values, which could
>> then have undesirable side effects on this use.
>
> Well, we need to pick some default value, and we can either use a fixed
> value (not sure what would be a good default) or tie it to an existing
> GUC. We only really have work_mem and maintenance_work_mem, and the
> walsender process will never use more than one such buffer. Which seems
> to be closer to maintenance_work_mem.
>
> Pretty much any default value can have undesirable side effects.
Let's just make it an independent setting unless we know any better. We
don't have a lot of settings that depend on other settings, and the ones
we do have have a very specific relationship.
>> Moreover, the name logical_work_mem makes it sound like it's a logical
>> version of work_mem. Maybe we could think of another name.
>
> I won't object to a better name, of course. Any proposals?
logical_decoding_[work_]mem?
>> I think we need a way to report on how much memory is actually used,
>> so the setting can be tuned. Something analogous to log_temp_files
>> perhaps.
>
> Yes, I agree. I'm just about to submit an updated version of the patch
> series that also introduces a few columns to pg_stat_replication, tracking
> this type of stats (amount of data spilled to disk or streamed, etc.).
That seems OK. Perhaps we could bring forward the part of that patch
that applies to this feature.
That would also help test *this* feature and determine what appropriate
settings are.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-09 02:36:59
Message-ID: [email protected]
On 1/3/18 15:13, Tomas Vondra wrote:
> Forgot to mention that v4 also extends CREATE SUBSCRIPTION to allow
> customizing the streaming and memory limit. So you can do
>
> CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)
>
> and this subscription will allow streaming, and logical_work_mem (on the
> provider) will be set to 1MB.
I was wondering already during PG10 development whether we should give
subscriptions a generic configuration array, like databases and roles
have, so we don't have to hardcode a bunch of similar stuff every time
we add an option like this. At the time we only had synchronous_commit,
but now we're adding more.
Also, instead of sticking this into the START_REPLICATION command, could
we just run a SET command? That should work over replication
connections as well.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-11 19:41:31
Message-ID: [email protected]
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
>
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:
I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here. This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.
The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together. One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction. The top-level data structure for that is the infamous
ReorderBuffer. So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange). When the transaction commits, the
accumulated changes are passed to the output plugin and then freed. If
the transaction aborts, then changes are just thrown away.
So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).
More precisely, the above happens for each subtransaction. When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.
All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk. The way this currently works is
hardcoded, and this patch proposes to change that.
Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk. When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.
This all works mostly fine, but you can construct some more extreme
cases where this can blow up.
Here is a mundane example. Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example). If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.
Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender). And so on. If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB. And if you don't have those 8 GB, then you're stuck at this point.
That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time. That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).
So, this patch proposes to improve this as follows:
- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.
- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096. The configuration setting is also in
terms of memory, not in number of changes.
- The configuration setting is for the total memory usage per decoding
session, not per subtransaction. (So we also keep a running tally for
the entire ReorderBuffer.)
There are two open issues with this patch:
One, this mechanism only applies when recording changes. The processing
at commit time still uses the previous hardcoded mechanism. The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing. There are some
proposals to change this as well, but they are more involved. Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.
Two, what to do when the memory limit is reached. With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit. Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way. The proposed patch searches
through the entire list of transactions to find the largest one. But as
the patch says:
"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."
(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive. Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
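As a concrete starting point, one way to combine (b) and (c) might look
like this sketch (hypothetical structures and arbitrary placeholder
thresholds, not proposed patch code):

#include <stddef.h>

typedef struct XactSketch
{
    size_t             size;    /* memory used by this transaction */
    struct XactSketch *next;
} XactSketch;

/* Stand-in for serializing a transaction's changes to disk. */
static void
spill_to_disk(XactSketch *txn)
{
    txn->size = 0;
}

static void
evict_to_target(XactSketch *txns, size_t *total, size_t limit)
{
    size_t threshold = limit / 20;  /* (b): evict anything using >= 5% */
    size_t target = limit / 2;      /* (c): keep going until 50% is free */

    for (XactSketch *t = txns; t != NULL && *total > target; t = t->next)
    {
        if (t->size >= threshold)
        {
            *total -= t->size;
            spill_to_disk(t);
        }
    }
}

The 5% and 50% numbers above are just placeholders; picking the right
thresholds is exactly the concrete part that needs discussion.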
Thoughts?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Greg Stark <stark(at)mit(dot)edu>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-11 23:23:11
Message-ID: CAM-w4HNWQMnE8j9=Fa1QdwfKYs=XQanx03PL23yULbh1hPYSGQ@mail.gmail.com
On 11 January 2018 at 19:41, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> Two, what to do when the memory limit is reached. With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit. Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way. The proposed patch searches
> through the entire list of transactions to find the largest one. But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
AIUI spilling to disk doesn't affect absorbing future updates, we
would just keep accumulating them in memory right? We won't need to
unspill until it comes time to commit.
Is there any actual advantage to picking the largest transaction? It
means fewer spills and fewer unspills at commit time, but that's just a
bigger spike of i/o and more of a chance of spilling more than necessary
to get by. In the end it'll be more or less the same amount of data read
back, just all in one big spike when spilling and one big spike when
committing. If you spilled smaller transactions you would have a small
amount of i/o more frequently and have to read back small amounts for
many commits. But it would add up to the same amount of i/o (or less, if
you avoid spilling more than necessary).
The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.
--
greg
From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-12 16:35:41
Message-ID: [email protected]
On 1/11/18 18:23, Greg Stark wrote:
> AIUI spilling to disk doesn't affect absorbing future updates, we
> would just keep accumulating them in memory right? We won't need to
> unspill until it comes time to commit.
Once a transaction has been serialized, future updates keep accumulating
in memory, until perhaps it gets serialized again. But then at commit
time, if a transaction has been partially serialized at all, all the
remaining changes are also serialized before the whole thing is read
back in (see reorderbuffer.c line 855).
So one optimization would be to specially keep track of all transactions
that have been serialized already and pick those first for further
serialization, because it will be done eventually anyway.
But this is only a secondary optimization, because it doesn't help in
the extreme cases where either no (or few) transactions have been
serialized or all (or most) transactions have been serialized.
> The real aim should be to try to pick the transaction that will be
> committed furthest in the future. That gives you the most memory to
> use for live transactions for the longest time and could let you
> process the maximum amount of transactions without spilling them. So
> either the oldest transaction (in the expectation that it's been open
> a while and appears to be a long-lived batch job that will stay open
> for a long time) or the youngest transaction (in the expectation that
> all transactions are more or less equally long-lived) might make
> sense.
Yes, that makes sense. We'd still need to keep a separate ordered list
of transactions somewhere, but that might be easier if we just order
them in the order we see them.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-13 04:01:38
Message-ID: [email protected]
On 01/11/2018 08:41 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
>
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here. This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> ... snip ...
Thanks for a comprehensive summary of the patch!
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive. Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
>
Yeah, when writing that comment I was worried that (a) might get rather
expensive. I was thinking about maintaining a dlist of transactions
sorted by size (ReorderBuffer now only has a hash table), so that we
could evict transactions from the beginning of the list.
But while that speeds up the choice of transactions to evict, the added
cost is rather high, particularly when most transactions are roughly
the same size, because in that case we would probably have to move the
nodes around in the list quite often. So it seems wiser to just walk
the list once when looking for a victim.
What I'm thinking about instead is tracking just some approximate
version of this - it does not really matter whether we evict the truly
largest transaction or one that is a couple of kilobytes smaller. What
we care about is an answer to this question:
Is there some very large transaction that we could evict to free
a lot of memory, or are all transactions fairly small?
So perhaps we can define some "size classes" and track to which of them
each transaction belongs. For example, we could split the memory limit
into 100 buckets, each representing a 1% size increment.
A transaction would not switch classes very often, and it would be
trivial to pick the largest transaction. When all the transactions are
squashed into the smallest classes, we may need to switch to some
alternative strategy. Not sure.
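For illustration, a rough sketch of such bucketing - the size,
size_class and class_node fields, the size_classes array and the GUC
units are all assumptions here, not actual patch code:

    #define RB_SIZE_CLASSES 100

    /* Which 1% increment of the memory limit does this txn fall into? */
    static int
    ReorderBufferSizeClass(ReorderBufferTXN *txn)
    {
        Size    limit = logical_work_mem * 1024L;   /* assuming kB units */
        int     class = (int) ((txn->size * RB_SIZE_CLASSES) / limit);

        return Min(class, RB_SIZE_CLASSES - 1);
    }

    /* Called whenever txn->size changes. Assumes the txn was linked
     * into some class when it was created. Moves between classes
     * should be rare - a txn has to grow by ~1% of the limit to move. */
    static void
    ReorderBufferUpdateSizeClass(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        int     class = ReorderBufferSizeClass(txn);

        if (class != txn->size_class)
        {
            dlist_delete(&txn->class_node);
            dlist_push_tail(&rb->size_classes[class], &txn->class_node);
            txn->size_class = class;
        }
    }

Picking a victim would then mean scanning rb->size_classes from the
top, instead of walking all transactions.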
In any case, I don't really know how expensive the selection actually
is, or whether it's an issue. I'll do some measurements.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-01-13 04:19:27 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 01/12/2018 05:35 PM, Peter Eisentraut wrote:
> On 1/11/18 18:23, Greg Stark wrote:
>> AIUI spilling to disk doesn't affect absorbing future updates, we
>> would just keep accumulating them in memory right? We won't need to
>> unspill until it comes time to commit.
>
> Once a transaction has been serialized, future updates keep accumulating
> in memory, until perhaps it gets serialized again. But then at commit
> time, if a transaction has been partially serialized at all, all the
> remaining changes are also serialized before the whole thing is read
> back in (see reorderbuffer.c line 855).
>
> So one optimization would be to specially keep track of all transactions
> that have been serialized already and pick those first for further
> serialization, because it will be done eventually anyway.
>
> But this is only a secondary optimization, because it doesn't help in
> the extreme cases that either no (or few) transactions have been
> serialized or all (or most) transactions have been serialized.
>
>> The real aim should be to try to pick the transaction that will be
>> committed furthest in the future. That gives you the most memory to
>> use for live transactions for the longest time and could let you
>> process the maximum amount of transactions without spilling them. So
>> either the oldest transaction (in the expectation that it's been open
>> a while and appears to be a long-lived batch job that will stay open
>> for a long time) or the youngest transaction (in the expectation that
>> all transactions are more or less equally long-lived) might make
>> sense.
>
> Yes, that makes sense. We'd still need to keep a separate ordered list
> of transactions somewhere, but that might be easier if we just order
> them in the order we see them.
>
Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
don't really commit independently, but as part of the toplevel xact. And
that list is ordered by LSN, which is pretty much exactly the order in
which we see the transactions.
I feel somewhat uncomfortable about evicting the oldest (or youngest)
transactions based on some assumed correlation with the commit order.
I'm pretty sure that will bite us badly for some workloads.
Another somewhat non-intuitive detail is that because ReorderBuffer
switched to Generation allocator for changes (which usually represent
99% of the memory used during decoding), it does not reuse memory the
way AllocSet does. Actually, it does not reuse memory at all, aiming to
eventually give the memory back to libc (which AllocSet can't do).
Because of this evicting the youngest transactions seems like a quite
bad idea, because those chunks will not be reused and there may be other
chunks on the blocks, preventing their release.
Yeah, complicated stuff.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-01-18 13:24:06 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 1/12/18 23:19, Tomas Vondra wrote:
> Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
> don't really commit independently, but as part of the toplevel xact. And
> that list is ordered by LSN, which is pretty much exactly the order in
> which we see the transactions.
Yes indeed. There is even ReorderBufferGetOldestTXN().
> Another somewhat non-intuitive detail is that because ReorderBuffer
> switched to Generation allocator for changes (which usually represent
> 99% of the memory used during decoding), it does not reuse memory the
> way AllocSet does. Actually, it does not reuse memory at all, aiming to
> eventually give the memory back to libc (which AllocSet can't do).
>
> Because of this evicting the youngest transactions seems like a quite
> bad idea, because those chunks will not be reused and there may be other
> chunks on the blocks, preventing their release.
Right. But this raises the question whether we are doing the memory
accounting on the right level. If we are doing all this tracking based
on ReorderBufferChanges, but then serializing changes possibly doesn't
actually free any memory in the operating system, that's no good. Can
we get some usage statistics out of the memory context? It seems like
we need to keep serializing transactions until we actually see the
memory context size drop.
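For illustration, the eviction loop might then look like this sketch,
where MemoryContextSize() is an assumed accessor for the bytes a
context holds from the OS (today MemoryContextStats() only prints such
numbers to stderr), and ReorderBufferLargestTXN() /
ReorderBufferSerializeTXN() are the patch's eviction primitives:

    static void
    ReorderBufferEnforceLimit(ReorderBuffer *rb, Size limit)
    {
        /* Keep spilling until the context has actually shrunk, rather
         * than trusting the per-change accounting alone. */
        while (MemoryContextSize(rb->context) > limit)
        {
            ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

            if (txn == NULL)
                break;          /* nothing left that we can spill */

            ReorderBufferSerializeTXN(rb, txn);
        }
    }

Though, per the Generation allocator point above, serializing a
transaction only releases blocks whose chunks all belonged to it.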
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-01-19 14:34:05 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Attached is v5, fixing a silly bug in part 0006 that caused a segfault
when creating a subscription.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v5.patch.gz | application/gzip | 9.9 KB |
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v5.patch.gz | application/gzip | 2.7 KB |
0003-Issue-individual-invalidations-with-wal_level-log-v5.patch.gz | application/gzip | 4.8 KB |
0004-Extend-the-output-plugin-API-with-stream-methods-v5.patch.gz | application/gzip | 5.3 KB |
0005-Implement-streaming-mode-in-ReorderBuffer-v5.patch.gz | application/gzip | 10.8 KB |
0006-Add-support-for-streaming-to-built-in-replication-v5.patch.gz | application/gzip | 18.3 KB |
0007-Track-statistics-for-streaming-spilling-v5.patch.gz | application/gzip | 4.3 KB |
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v5.patch.gz | application/gzip | 570 bytes |
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v5.patch.gz | application/gzip | 630 bytes |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-01-19 22:08:02 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 01/19/2018 03:34 PM, Tomas Vondra wrote:
> Attached is v5, fixing a silly bug in part 0006 that caused a segfault
> when creating a subscription.
>
Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
causing another failure. Hopefully v6 will pass the CI build; it does
pass a build with the same parameters on my system.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v6.patch.gz | application/gzip | 9.9 KB |
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v6.patch.gz | application/gzip | 2.7 KB |
0003-Issue-individual-invalidations-with-wal_level-log-v6.patch.gz | application/gzip | 4.8 KB |
0004-Extend-the-output-plugin-API-with-stream-methods-v6.patch.gz | application/gzip | 5.3 KB |
0005-Implement-streaming-mode-in-ReorderBuffer-v6.patch.gz | application/gzip | 10.8 KB |
0006-Add-support-for-streaming-to-built-in-replication-v6.patch.gz | application/gzip | 18.3 KB |
0007-Track-statistics-for-streaming-spilling-v6.patch.gz | application/gzip | 4.3 KB |
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v6.patch.gz | application/gzip | 570 bytes |
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v6.patch.gz | application/gzip | 631 bytes |
From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-01-31 06:53:06 |
Message-ID: | CAD21AoBa2mhbhZW6hnDRKs1ggycJOBE69eHk3KgCJ5Cwj9-ZkA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 01/19/2018 03:34 PM, Tomas Vondra wrote:
>> Attached is v5, fixing a silly bug in part 0006 that caused a segfault
>> when creating a subscription.
>>
>
> Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
> causing another failure. Hopefully v6 will pass the CI build; it does
> pass a build with the same parameters on my system.
Thank you for working on this. This patch would be helpful for
synchronous replication.
I haven't looked at the code deeply yet, but I've reviewed the v6
patch set, especially on the subscriber side. All of the patches apply
cleanly to current HEAD. Here are my review comments.
----
CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an
ERROR on the publisher side when starting replication. Probably we
should check the value on the subscriber side as well.
----
When streaming = on, if we drop the subscription in the middle of
receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
(the .changes file and the .subxacts file). It also happens when a
transaction on the upstream aborts without an abort record.
----
Since we can change both the streaming option and the work_mem option
with ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs
to be updated.
----
If we create a subscription without any options, both
pg_subscription.substream and pg_subscription.subworkmem are set to
null. However, since GetSubscription isn't aware of NULL, we start the
replication with invalid options like the following.
LOG: received replication command: START_REPLICATION SLOT "hoge_sub"
LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
publication_names '"hoge_pub"')
I think we can set substream to false and subworkmem to -1 instead of
null, and then make libpqrcv_startstreaming not set the streaming
option if the value is -1.
----
Some WARNING messages appeared. Maybe these are for debugging purposes?
WARNING: updating stream stats 0x1c12ef8 4 3 65604
WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-02-01 14:51:18 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed. Feel free to move them to the next commit fest when you
think they are ready to be continued.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-02-01 22:50:25 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 01/31/2018 07:53 AM, Masahiko Sawada wrote:
> On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 01/19/2018 03:34 PM, Tomas Vondra wrote:
>>> Attached is v5, fixing a silly bug in part 0006 that caused a segfault
>>> when creating a subscription.
>>>
>>
>> Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
>> causing another failure. Hopefully v6 will pass the CI build; it does
>> pass a build with the same parameters on my system.
>
> Thank you for working on this. This patch would be helpful for
> synchronous replication.
>
> I haven't looked at the code deeply yet, but I've reviewed the v6
> patch set, especially on the subscriber side. All of the patches apply
> cleanly to current HEAD. Here are my review comments.
>
> ----
> CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an
> ERROR on the publisher side when starting replication. Probably we
> should check the value on the subscriber side as well.
>
> ----
> When streaming = on, if we drop the subscription in the middle of
> receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
> (the .changes file and the .subxacts file). It also happens when a
> transaction on the upstream aborts without an abort record.
>
> ----
> Since we can change both the streaming option and the work_mem option
> with ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs
> to be updated.
>
> ----
> If we create a subscription without any options, both
> pg_subscription.substream and pg_subscription.subworkmem are set to
> null. However, since GetSubscription isn't aware of NULL, we start the
> replication with invalid options like the following.
> LOG: received replication command: START_REPLICATION SLOT "hoge_sub"
> LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
> publication_names '"hoge_pub"')
>
> I think we can set substream to false and subworkmem to -1 instead of
> null, and then make libpqrcv_startstreaming not set the streaming
> option if the value is -1.
>
> ----
> Some WARNING messages appeared. Maybe these are for debugging purposes?
>
> WARNING: updating stream stats 0x1c12ef8 4 3 65604
> WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080
>
> Regards,
>
Thanks for the review! I'll address the issues in the next version of
the patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-02-01 22:51:55 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
> To close out this commit fest, I'm setting both of these patches as
> returned with feedback, as there are apparently significant issues to be
> addressed. Feel free to move them to the next commit fest when you
> think they are ready to be continued.
>
Will do. Thanks for the feedback.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-02 01:12:35 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
> > To close out this commit fest, I'm setting both of these patches as
> > returned with feedback, as there are apparently significant issues to be
> > addressed. Feel free to move them to the next commit fest when you
> > think they are ready to be continued.
> >
>
> Will do. Thanks for the feedback.
Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?
Greetings,
Andres Freund
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-02 02:33:16 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 03/02/2018 02:12 AM, Andres Freund wrote:
> On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
>> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
>>> To close out this commit fest, I'm setting both of these patches as
>>> returned with feedback, as there are apparently significant issues to be
>>> addressed. Feel free to move them to the next commit fest when you
>>> think they are ready to be continued.
>>>
>>
>> Will do. Thanks for the feedback.
>
> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
> but I don't see a newer version posted?
>
Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly, please mark it as WOA until then.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-02 02:39:36 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Tomas.
On 3/1/18 9:33 PM, Tomas Vondra wrote:
> On 03/02/2018 02:12 AM, Andres Freund wrote:
>> On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
>>> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
>>>> To close out this commit fest, I'm setting both of these patches as
>>>> returned with feedback, as there are apparently significant issues to be
>>>> addressed. Feel free to move them to the next commit fest when you
>>>> think they are ready to be continued.
>>>>
>>>
>>> Will do. Thanks for the feedback.
>>
>> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
>> but I don't see a newer version posted?
>>
>
> Ah, apologies - that's due to moving the patch from the last CF (it was
> marked as RWF so I had to reopen it before moving it). I'll submit a new
> version of the patch shortly, please mark it as WOA until then.
Marked as Waiting on Author.
--
-David
david(at)pgmasters(dot)net
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-02 20:05:29 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2018-03-01 21:39:36 -0500, David Steele wrote:
> On 3/1/18 9:33 PM, Tomas Vondra wrote:
> > On 03/02/2018 02:12 AM, Andres Freund wrote:
> > > Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
> > > but I don't see a newer version posted?
> > >
> >
> > Ah, apologies - that's due to moving the patch from the last CF (it was
> > marked as RWF so I had to reopen it before moving it). I'll submit a new
> > version of the patch shortly, please mark it as WOA until then.
>
> Marked as Waiting on Author.
Sorry to be the hard-ass, but given this patch hasn't been moved forward
since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
first place?
Greetings,
Andres Freund
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-02 20:06:05 |
Message-ID: | CA+TgmoaWMn2SaMqjmYQxV2FBUtuJ3uce28W=zS5n4T058u_Tow@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Ah, apologies - that's due to moving the patch from the last CF (it was
> marked as RWF so I had to reopen it before moving it). I'll submit a new
> version of the patch shortly, please mark it as WOA until then.
So, the way it's supposed to work is you resubmit the patch first and
then re-activate the CF entry. If you get to re-activate the CF entry
without actually updating the patch, and then submit the patch
afterwards, then the CF deadline becomes largely meaningless. I think
a new patch should be rejected as untimely.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-02 20:21:19 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/2/18 3:06 PM, Robert Haas wrote:
> On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Ah, apologies - that's due to moving the patch from the last CF (it was
>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>> version of the patch shortly, please mark it as WOA until then.
>
> So, the way it's supposed to work is you resubmit the patch first and
> then re-activate the CF entry. If you get to re-activate the CF entry
> without actually updating the patch, and then submit the patch
> afterwards, then the CF deadline becomes largely meaningless. I think
> a new patch should be rejected as untimely.
Hmmm, I missed that implication last night. I'll mark this Returned
with Feedback.
Tomas, please move to the next CF once you have an updated patch.
Thanks,
--
-David
david(at)pgmasters(dot)net
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-02 21:06:32 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 03/02/2018 09:21 PM, David Steele wrote:
> On 3/2/18 3:06 PM, Robert Haas wrote:
>> On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> Ah, apologies - that's due to moving the patch from the last CF (it was
>>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>>> version of the patch shortly, please mark it as WOA until then.
>>
>> So, the way it's supposed to work is you resubmit the patch first and
>> then re-activate the CF entry. If you get to re-activate the CF entry
>> without actually updating the patch, and then submit the patch
>> afterwards, then the CF deadline becomes largely meaningless. I think
>> a new patch should be rejected as untimely.
>
> Hmmm, I missed that implication last night. I'll mark this Returned
> with Feedback.
>
> Tomas, please move to the next CF once you have an updated patch.
>
Can you guys please point me to the CF rules that say this? Because my
understanding (and not just mine, AFAICS) was obviously different.
Clearly there's a disconnect somewhere.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 00:55:39 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi there,
attached is an updated patch fixing all the reported issues (a bit more
about those below).
The main change in this patch version is reworked logging of subxact
assignments, which needs to be done immediately for incremental decoding
to work properly.
The previous patch versions did that by logging a separate xlog record,
which however had rather noticeable space overhead (~40% on a worst-case
test - tiny table, no FPWs, ...). While in practice the overhead would
be much closer to 0%, it still seemed unacceptable.
Andres proposed doing something like we do with replication origins in
XLogRecordAssemble, i.e. inventing a special block, and embedding the
assignment info into that (in the next xlog record). This turned out to
work quite well, and the worst-case space overhead dropped to ~5%.
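For comparison, the replication origin is embedded in
XLogRecordAssemble() as a special block id followed by its payload.
The sketch below assumes an analogous XLR_BLOCK_ID_XACT_ASSIGNMENT
block for the assignment info (the constant and the exact condition
are made up for illustration):

    /* Fragment inside XLogRecordAssemble(), next to the existing
     * XLR_BLOCK_ID_ORIGIN handling; 'scratch' points into the
     * record-assembly buffer.  A real implementation would only do
     * this for the first record of each subxact. */
    TransactionId xid = GetCurrentTransactionIdIfAny();
    TransactionId top_xid = GetTopTransactionIdIfAny();

    if (TransactionIdIsValid(xid) && xid != top_xid)
    {
        /* embed the subxact -> toplevel assignment into this record */
        *(scratch++) = (char) XLR_BLOCK_ID_XACT_ASSIGNMENT;
        memcpy(scratch, &top_xid, sizeof(TransactionId));
        scratch += sizeof(TransactionId);
    }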
I have attempted to do something like that with the invalidations, which
is the other thing that needs to be logged immediately for incremental
decoding to work correctly. The plan was to use the same approach as for
assignments, i.e. embed the invalidations into the next xlog record and
stop sending them in the commit message. That however turned out to be
much more complicated - the embedding is fairly trivial, of course, but
unlike assignments the invalidations are needed for hot standbys. If we
only send them incrementally, I think the standby would have to collect
them from the WAL records and store them in a way that survives restarts.
So for invalidations the patch uses the original approach with a new
xlog record type (ignored by the standby), and still logs the
invalidations in the commit record (which is what the standby relies on).
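Roughly, the new record is just a dump of the accumulated invalidation
messages - a simplified sketch of how it would be emitted, where
msgs/nmsgs are the pending SharedInvalidationMessage array and
XLOG_XACT_INVALIDATIONS is the record type this patch adds:

    XLogBeginInsert();
    XLogRegisterData((char *) msgs,
                     nmsgs * sizeof(SharedInvalidationMessage));
    XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);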
On 02/01/2018 11:50 PM, Tomas Vondra wrote:
> On 01/31/2018 07:53 AM, Masahiko Sawada wrote:
> ...
>> ----
>> CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an
>> ERROR on the publisher side when starting replication. Probably we
>> should check the value on the subscriber side as well.
>>
Added.
>> ----
>> When streaming = on, if we drop the subscription in the middle of
>> receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
>> (the .changes file and the .subxacts file). It also happens when a
>> transaction on the upstream aborts without an abort record.
>>
Right. The files would get cleaned up eventually during restart (just
like other temporary files), but leaking them after DROP SUBSCRIPTION is
not cool. So I've added simple tracking of the files (or rather of the
streamed XIDs) in the worker, and they get cleaned up explicitly on exit.
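A sketch of that tracking in the apply worker - the list, the filename
helpers and the callback are illustrative, not the exact patch code:

    /* XIDs for which this worker created .changes/.subxacts files */
    static List *streamed_xids = NIL;

    static void
    stream_cleanup_files_cb(int code, Datum arg)
    {
        ListCell   *lc;

        foreach(lc, streamed_xids)
        {
            TransactionId xid = (TransactionId) lfirst_int(lc);
            char        path[MAXPGPATH];

            /* changes_filename()/subxact_filename() are assumed helpers
             * that build the tmp file paths for this subscription */
            changes_filename(path, MyLogicalRepWorker->subid, xid);
            unlink(path);
            subxact_filename(path, MyLogicalRepWorker->subid, xid);
            unlink(path);
        }
    }

    /* at worker startup:
     *     before_shmem_exit(stream_cleanup_files_cb, (Datum) 0);
     */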
>> ----
>> Since we can change both the streaming option and the work_mem option
>> with ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs
>> to be updated.
>>
Yep, I've added a note that work_mem and streaming can also be changed.
Those changes won't be applied to an already running worker, though.
>> ----
>> If we create a subscription without any options, both
>> pg_subscription.substream and pg_subscription.subworkmem are set to
>> null. However, since GetSubscription isn't aware of NULL, we start the
>> replication with invalid options like the following.
>> LOG: received replication command: START_REPLICATION SLOT "hoge_sub"
>> LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
>> publication_names '"hoge_pub"')
>>
>> I think we can set substream to false and subworkmem to -1 instead of
>> null, and then make libpqrcv_startstreaming not set the streaming
>> option if the value is -1.
>>
Good catch! I've done pretty much what you suggested here, i.e. store
-1/false instead and then handle that in libpqrcv_startstreaming.
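Concretely, in libpqrcv_startstreaming() that means only emitting the
options that carry a real value - roughly like this fragment, where
work_mem and streaming are the patch's additions to WalRcvStreamOptions
and pubnames is the already-quoted publication list:

    appendStringInfo(&cmd, "proto_version '%u'",
                     options->proto.logical.proto_version);

    /* -1 / false mean "not set in the catalog", so omit them */
    if (options->proto.logical.work_mem != -1)
        appendStringInfo(&cmd, ", work_mem '%d'",
                         options->proto.logical.work_mem);
    if (options->proto.logical.streaming)
        appendStringInfoString(&cmd, ", streaming 'on'");

    appendStringInfo(&cmd, ", publication_names %s", pubnames);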
>> ----
>> Some WARNING messages appeared. Maybe these are for debugging purposes?
>>
>> WARNING: updating stream stats 0x1c12ef8 4 3 65604
>> WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080
>>
Yeah, those should be removed.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gz | application/gzip | 9.9 KB |
0002-Immediatel-WAL-log-assignments.patch.gz | application/gzip | 6.7 KB |
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz | application/gzip | 4.8 KB |
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz | application/gzip | 5.3 KB |
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz | application/gzip | 10.8 KB |
0006-Add-support-for-streaming-to-built-in-replication.patch.gz | application/gzip | 19.2 KB |
0007-Track-statistics-for-streaming-spilling.patch.gz | application/gzip | 4.3 KB |
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gz | application/gzip | 569 bytes |
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gz | application/gzip | 627 bytes |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, David Steele <david(at)pgmasters(dot)net> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 01:00:46 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 03/02/2018 09:05 PM, Andres Freund wrote:
> Hi,
>
> On 2018-03-01 21:39:36 -0500, David Steele wrote:
>> On 3/1/18 9:33 PM, Tomas Vondra wrote:
>>> On 03/02/2018 02:12 AM, Andres Freund wrote:
>>>> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
>>>> but I don't see a newer version posted?
>>>>
>>>
>>> Ah, apologies - that's due to moving the patch from the last CF (it was
>>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>>> version of the patch shortly, please mark it as WOA until then.
>>
>> Marked as Waiting on Author.
>
> Sorry to be the hard-ass, but given this patch hasn't been moved forward
> since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
> first place?
>
That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 01:01:53 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
> That is somewhat misleading, I think. You're right the last version was
> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
> right at the end of the CF. So it's not like the patch was sitting there
> with unresolved issues. Based on that review the patch was marked as RWF
> and thus not moved to 2018-03 automatically.
I don't see how this changes anything.
- Andres
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 01:34:06 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 03/03/2018 02:01 AM, Andres Freund wrote:
> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>> That is somewhat misleading, I think. You're right the last version
>> was submitted on 2018-01-19, but the next review arrived on
>> 2018-01-31, i.e. right at the end of the CF. So it's not like the
>> patch was sitting there with unresolved issues. Based on that
>> review the patch was marked as RWF and thus not moved to 2018-03
>> automatically.
>
> I don't see how this changes anything.
>
You've used "The patch hasn't moved forward since 2018-01-19," as an
argument why the patch is not eligible for 2018-03. I suggest that
argument is misleading, because patches generally do not move without
reviews, and it's difficult to respond to a review that arrives on the
last day of a commitfest.
Consider that without the review, the patch would end up with NR status,
and would be moved to the next CF automatically. Isn't that a bit weird?
kind regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 01:36:10 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2018-03-03 02:34:06 +0100, Tomas Vondra wrote:
> On 03/03/2018 02:01 AM, Andres Freund wrote:
> > On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
> >> That is somewhat misleading, I think. You're right the last version
> >> was submitted on 2018-01-19, but the next review arrived on
> >> 2018-01-31, i.e. right at the end of the CF. So it's not like the
> >> patch was sitting there with unresolved issues. Based on that
> >> review the patch was marked as RWF and thus not moved to 2018-03
> >> automatically.
> >
> > I don't see how this changes anything.
> >
>
> You've used "The patch hasn't moved forward since 2018-01-19," as an
> argument why the patch is not eligible for 2018-03. I suggest that
> argument is misleading, because patches generally do not move without
> reviews, and it's difficult to respond to a review that arrives on the
> last day of a commitfest.
>
> Consider that without the review, the patch would end up with NR status,
> and would be moved to the next CF automatically. Isn't that a bit weird?
Not sure I follow. The point is that nobody would have complained if
you'd moved the patch into this fest if you'd updated it *before* it
started?
Greetings,
Andres Freund
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 01:37:26 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/2/18 8:01 PM, Andres Freund wrote:
> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>> That is somewhat misleading, I think. You're right the last version was
>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>> right at the end of the CF. So it's not like the patch was sitting there
>> with unresolved issues. Based on that review the patch was marked as RWF
>> and thus not moved to 2018-03 automatically.
>
> I don't see how this changes anything.
I agree that things could be clearer, and Andres has produced a great
document that we can build on. The old one had gotten a bit stale.
However, I think it's pretty obvious that a CF entry should be
accompanied with a patch. It sounds like the timing was awkward but you
still had 28 days to produce a new patch.
I also notice that you submitted 7 patches in this CF but are reviewing
zero.
--
-David
david(at)pgmasters(dot)net
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 01:54:41 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 03/03/2018 02:37 AM, David Steele wrote:
> On 3/2/18 8:01 PM, Andres Freund wrote:
>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>> That is somewhat misleading, I think. You're right the last version was
>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>> right at the end of the CF. So it's not like the patch was sitting there
>>> with unresolved issues. Based on that review the patch was marked as RWF
>>> and thus not moved to 2018-03 automatically.
>>
>> I don't see how this changes anything.
>
> I agree that things could be clearer, and Andres has produced a great
> document that we can build on. The old one had gotten a bit stale.
>
> However, I think it's pretty obvious that a CF entry should be
> accompanied with a patch. It sounds like the timing was awkward but
> you still had 28 days to produce a new patch.
>
Based on internal discussion I'm not so sure about the "pretty obvious"
part. It certainly wasn't that obvious to me, otherwise I'd have
submitted the revised patch earlier - hindsight is 20/20.
> I also notice that you submitted 7 patches in this CF but are
> reviewing zero.
>
I've volunteered to review a couple of patches at the FOSDEM Developer
Meeting - I thought Stephen was entering that into the CF app, not sure
where it got lost.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 02:08:22 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/2/18 8:54 PM, Tomas Vondra wrote:
> On 03/03/2018 02:37 AM, David Steele wrote:
>> On 3/2/18 8:01 PM, Andres Freund wrote:
>>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>>> That is somewhat misleading, I think. You're right the last version was
>>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>>> right at the end of the CF. So it's not like the patch was sitting there
>>>> with unresolved issues. Based on that review the patch was marked as RWF
>>>> and thus not moved to 2018-03 automatically.
>>>
>>> I don't see how this changes anything.
>>
>> I agree that things could be clearer, and Andres has produced a great
>> document that we can build on. The old one had gotten a bit stale.
>>
>> However, I think it's pretty obvious that a CF entry should be
>> accompanied with a patch. It sounds like the timing was awkward but
>> you still had 28 days to produce a new patch.
>
> Based on internal discussion I'm not so sure about the "pretty obvious"
> part. It certainly wasn't that obvious to me, otherwise I'd submit the
> revised patch earlier - hindsight is 20/20.
Indeed it is. Be assured that nobody takes pleasure in pushing patches,
but we have limited resources and must make some choices.
>> I also notice that you submitted 7 patches in this CF but are
>> reviewing zero.
>
> I've volunteered to review a couple of patches at the FOSDEM Developer
> Meeting - I thought Stephen was entering that into the CF app, not sure
> where it got lost.
There are plenty of patches that need review, so go for it.
Regards,
--
-David
david(at)pgmasters(dot)net
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 05:19:56 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2018-03-03 01:55, Tomas Vondra wrote:
> Hi there,
>
> attached is an updated patch fixing all the reported issues (a bit more
> about those below).
Hi,
0007-Track-statistics-for-streaming-spilling.patch won't apply. All
the other patches apply ok.
patch complains with:
patching file doc/src/sgml/monitoring.sgml
patching file src/backend/catalog/system_views.sql
Hunk #1 succeeded at 734 (offset 2 lines).
patching file src/backend/replication/logical/reorderbuffer.c
patching file src/backend/replication/walsender.c
patching file src/include/catalog/pg_proc.h
Hunk #1 FAILED at 2903.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/catalog/pg_proc.h.rej
patching file src/include/replication/reorderbuffer.h
patching file src/include/replication/walsender_private.h
patching file src/test/regress/expected/rules.out
Hunk #1 succeeded at 1861 (offset 2 lines).
Attached the produced reject file.
thanks,
Erik Rijkers
Attachment | Content-Type | Size |
---|---|---|
pg_proc.h.rej | text/x-diff | 1.9 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-03 14:52:40 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 03/03/2018 06:19 AM, Erik Rijkers wrote:
> On 2018-03-03 01:55, Tomas Vondra wrote:
>> Hi there,
>>
>> attached is an updated patch fixing all the reported issues (a bit more
>> about those below).
>
> Hi,
>
> 0007-Track-statistics-for-streaming-spilling.patch won't apply. All
> the other patches apply ok.
>
> patch complains with:
>
> patching file doc/src/sgml/monitoring.sgml
> patching file src/backend/catalog/system_views.sql
> Hunk #1 succeeded at 734 (offset 2 lines).
> patching file src/backend/replication/logical/reorderbuffer.c
> patching file src/backend/replication/walsender.c
> patching file src/include/catalog/pg_proc.h
> Hunk #1 FAILED at 2903.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/include/catalog/pg_proc.h.rej
> patching file src/include/replication/reorderbuffer.h
> patching file src/include/replication/walsender_private.h
> patching file src/test/regress/expected/rules.out
> Hunk #1 succeeded at 1861 (offset 2 lines).
>
> Attached the produced reject file.
>
Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
Attached is a rebased patch, fixing this.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gz | application/gzip | 9.9 KB |
0002-Immediatel-WAL-log-assignments.patch.gz | application/gzip | 6.7 KB |
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz | application/gzip | 4.8 KB |
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz | application/gzip | 5.3 KB |
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz | application/gzip | 10.8 KB |
0006-Add-support-for-streaming-to-built-in-replication.patch.gz | application/gzip | 19.6 KB |
0007-Track-statistics-for-streaming-spilling.patch.gz | application/gzip | 4.3 KB |
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gz | application/gzip | 570 bytes |
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gz | application/gzip | 626 bytes |
From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-09 16:07:55 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
I think this patch is not going to be ready for PG11.
- It depends on some work in the thread "logical decoding of two-phase
transactions", which is still in progress.
- Various details in the logical_work_mem patch (0001) are unresolved.
- This being partially a performance feature, we haven't seen any
performance tests (e.g., which settings result in which latencies under
which workloads).
That said, the feature seems useful and desirable, and the
implementation makes sense. There are documentation and tests. But
there is a significant amount of design and coding work still necessary.
Attached is a fixup patch that I needed to make it compile.
The last two patches in your series (0008, 0009) are labeled as bug
fixes. Would you like to argue that they should be applied
independently of the rest of the feature?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-fixup-Track-statistics-for-streaming-spilling.patch | text/plain | 2.5 KB |
From: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-03-29 16:34:58 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 11.01.2018 22:41, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here. This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> The data in the WAL is written as it happens, so the changes belonging
> to different transactions are all mixed together. One of the jobs of
> logical decoding is to reassemble the changes belonging to each
> transaction. The top-level data structure for that is the infamous
> ReorderBuffer. So as it reads the WAL and sees something about a
> transaction, it keeps a copy of that change in memory, indexed by
> transaction ID (ReorderBufferChange). When the transaction commits, the
> accumulated changes are passed to the output plugin and then freed. If
> the transaction aborts, then changes are just thrown away.
>
> So when logical decoding is active, a copy of the changes for each
> active transaction is kept in memory (once per walsender).
>
> More precisely, the above happens for each subtransaction. When the
> top-level transaction commits, it finds all its subtransactions in the
> ReorderBuffer, reassembles everything in the right order, then invokes
> the output plugin.
>
> All this could end up using an unbounded amount of memory, so there is a
> mechanism to spill changes to disk. The way this currently works is
> hardcoded, and this patch proposes to change that.
>
> Currently, when a transaction or subtransaction has accumulated 4096
> changes, it is spilled to disk. When the top-level transaction commits,
> things are read back from disk to do the final processing mentioned above.
>
> This all works mostly fine, but you can construct some more extreme
> cases where this can blow up.
>
> Here is a mundane example. Let's say a change entry takes 100 bytes (it
> might contain a new row, or an update key and some new column values,
> for example). If you have 100 concurrent active sessions and no
> subtransactions, then logical decoding memory is bounded by 4096 * 100 *
> 100 = 40 MB (per walsender) before things spill to disk.
>
> Now let's say you are using a lot of subtransactions, because you are
> using PL functions, exception handling, triggers, doing batch updates.
> If you have 200 subtransactions on average per concurrent session, the
> memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
> (per walsender). And so on. If you have more concurrent sessions or
> larger changes or more subtransactions, you'll use much more than those
> 8 GB. And if you don't have those 8 GB, then you're stuck at this point.
>
> That is the consideration when we record changes, but we also need
> memory when we do the final processing at commit time. That is slightly
> less problematic because we only process one top-level transaction at a
> time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
> (without the concurrent sessions factor).
>
> So, this patch proposes to improve this as follows:
>
> - We compute the actual size of each ReorderBufferChange and keep a
> running tally for each transaction, instead of just counting the number
> of changes.
>
> - We have a configuration setting that allows us to change the limit
> instead of the hardcoded 4096. The configuration setting is also in
> terms of memory, not in number of changes.
>
> - The configuration setting is for the total memory usage per decoding
> session, not per subtransaction. (So we also keep a running tally for
> the entire ReorderBuffer.)
>
> There are two open issues with this patch:
>
> One, this mechanism only applies when recording changes. The processing
> at commit time still uses the previous hardcoded mechanism. The reason
> for this is, AFAIU, that as things currently work, you have to have all
> subtransactions in memory to do the final processing. There are some
> proposals to change this as well, but they are more involved. Arguably,
> per my explanation above, memory use at commit time is less likely to be
> a problem.
>
> Two, what to do when the memory limit is reached. With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit. Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way. The proposed patch searches
> through the entire list of transactions to find the largest one. But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive. Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
>
> Thoughts?
>
I am very sorry that I did not notice this thread before.
Spilling to files in the reorder buffer is the main factor limiting the
speed of importing data in multimaster and shardman (sharding based on
FDW, with redundancy provided by LR). This is why we have thought a lot
about possible ways of addressing this issue.
Right now the data of a huge transaction is written to disk three times
before it is applied at the replica, and obviously it is also read three
times: first it is saved in WAL, then spilled to disk by the reorder
buffer, and finally spilled to disk once more at the replica before
being assigned to a particular apply worker (that last step is specific
to multimaster, which can apply received transactions concurrently).
We considered three different approaches:
1. Streaming. It is similar to the proposed patch; the main difference
is that we do not want to spill the transaction to a temporary file at
the replica, but rather apply it immediately in a separate backend and
abort that transaction if it is aborted at the master. Certainly this
will work only with 2PC.
2. Eliminating spilling by rescanning WAL.
3. Bypassing WAL: add hooks to heapam to buffer changes and propagate
them immediately to the replica, applying them in a dedicated backend.
I have implemented a prototype of such replication. With one replica it
shows about a 1.5x slowdown compared with standalone/async LR, and a
2-3x improvement compared with sync LR. For two replicas the result is
2x slower than async LR and 2-8x faster than sync LR (depending on the
number of concurrent connections).
Approach 3 seems to be specific to multimaster/shardman, so most likely
it cannot be considered for general LR.
So I want to compare 1 and 2. Did you ever think about something like 2?
Right now the proposed patch just moves spilling to a file from the
master to the replica. That can still make sense, to avoid memory
overflow and reduce disk I/O at the master. But if we have just one
huge transaction (COPY) importing gigabytes of data into the database,
then performance will be almost the same with or without your patch;
the only difference is where we serialize the transaction: at the
master or at the replica side. In this sense the patch doesn't solve
the problem of slow loading of large bulks of data through LR.
Alternatively (approach 2), we can keep a small in-memory buffer for
the transaction being decoded and remember the LSN and snapshot of that
transaction's start. In case of buffer overflow we just continue the
WAL traversal until we reach the end of the transaction. After that we
restart scanning WAL from the beginning of this transaction and, on
this second pass, send the changes directly to the output plugin. So we
have to scan WAL several times, but we do not need to spill anything to
disk at either the publisher or the subscriber side.
Certainly this approach will be inefficient if we have several long
interleaving transactions, but in most customer use cases we have
observed so far there is just one huge transaction performing a bulk
load.
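To make this more concrete, here is a minimal sketch of the control
flow I have in mind. All names (LargeXactState, buffer_change,
rescan_and_stream and so on) are invented for illustration; this is not
code from our prototype:

/* Hypothetical sketch of approach 2 (WAL rescan), illustration only. */
typedef struct LargeXactState
{
    TransactionId xid;        /* transaction being decoded */
    XLogRecPtr    start_lsn;  /* LSN of its first WAL record */
    Snapshot      snapshot;   /* snapshot taken at start_lsn */
    Size          buffered;   /* bytes currently kept in memory */
} LargeXactState;

static void
decode_change(LargeXactState *state, XLogReaderState *record)
{
    if (state->buffered + XLogRecGetTotalLen(record) <= in_memory_limit)
    {
        /* first pass: small transactions never leave memory */
        buffer_change(state, record);
        state->buffered += XLogRecGetTotalLen(record);
    }
    else
    {
        /*
         * Buffer overflow: drop what we have buffered, keep scanning WAL
         * until the commit record of this transaction, then restart the
         * scan at start_lsn and, on the second pass, send every change of
         * this xid straight to the output plugin.
         */
        discard_buffered_changes(state);
        skip_until_commit(state->xid);
        rescan_and_stream(state->start_lsn, state->snapshot, state->xid);
    }
}

The price is reading WAL twice (or more, with several large
transactions), but no change is ever written to disk a second time.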
Maybe I have missed something, but this approach seems to be easier to
implement than transaction streaming, and it doesn't require any
changes to the output plugin API.
I realize that it is a little bit late to ask this question when your
patch is almost ready, but what do you think about it? Are there any
pitfalls with this approach?
There is one more aspect and performance problem with LR that we have
faced in shardman: if there are several publications for different
subsets of tables on one instance, then the WAL senders have to do a
lot of useless work. They decode transactions which have no relation to
their publication, but a WAL sender doesn't know that until it reaches
the end of the transaction. What is worse: if the transaction is huge,
then all WAL senders will spill it to disk, even though only one of
them actually needs it. So the data of a huge transaction is written
not three times, but N times, where N is the number of publications.
The only solution to this problem we can imagine is to let the backend
somehow inform the WAL sender (through a shared message queue?) about
the LSNs it should consider. In this case the WAL sender can skip large
portions of WAL without decoding them. We would also like to know
2ndQuadrant's opinion on this idea.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-07-01 11:43:50 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
This patch set was not updated for the 2018-07 commitfest, so moved to -09.
On 09.03.18 17:07, Peter Eisentraut wrote:
> I think this patch is not going to be ready for PG11.
>
> - It depends on some work in the thread "logical decoding of two-phase
> transactions", which is still in progress.
>
> - Various details in the logical_work_mem patch (0001) are unresolved.
>
> - This being partially a performance feature, we haven't seen any
> performance tests (e.g., which settings result in which latencies under
> which workloads).
>
> That said, the feature seems useful and desirable, and the
> implementation makes sense. There are documentation and tests. But
> there is a significant amount of design and coding work still necessary.
>
> Attached is a fixup patch that I needed to make it compile.
>
> The last two patches in your series (0008, 0009) are labeled as bug
> fixes. Would you like to argue that they should be applied
> independently of the rest of the feature?
>
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-10-02 04:59:02 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Sat, Mar 03, 2018 at 03:52:40PM +0100, Tomas Vondra wrote:
> Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
> Attached is a rebased patch, fixing this.
The latest patch set does not apply anymore, and had no activity for the
last two months, so I am marking it as returned with feedback.
--
Michael
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-12-16 14:31:52 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi,
Attached is an updated version of this patch series. It's meant to be
applied on top of the 2pc decoding patch [1], because streaming of
in-progress transactions requires handling of concurrent aborts. So it
may or may not apply directly to master, I'm not sure - unfortunately
that's likely to confuse the cputube thing, but I don't want to include
the 2pc decoding bits here because that would be just confusing.
If needed, the part introducing logical_work_mem limit for ReorderBuffer
can be separated and committed independently, but I do expect this to be
committed after the 2pc decoding patch so I've left it like this.
This new version is mostly just a rebase to current master (or almost,
because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
but it also addresses the new stuff committed since last version (most
importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
subxact assignments, where the assignment was included in records with
XID=0, essentially failing to track the subxact properly.
For the logical_work_mem part, I think this is quite solid. The main
question is how to pick transactions for eviction. For now it uses the
same approach as master (i.e. picking the largest top-level transaction,
although measured by amount of memory and not just number of changes).
But I've realized that may not work that great with the Generation
context, because unlike AllocSet it does not reuse memory. That's nice
as it allows freeing old blocks (which AllocSet can't), but it means a
small transaction can have a change on an old block, preventing it from
being freed. That is something we have in pg11 already, because that's
where the Generation context got introduced - I haven't seen this issue
in practice, but we might need to do something about it.
In any case, I'm thinking we may need to pick a different eviction
algorithm - say using the transaction with the oldest change (and
looping until we release at least one block in the Generation context),
or maybe looking for blocks mixing changes from the smallest number of
transactions, or something like that. Other ideas are welcome. I don't
think the exact algorithm is particularly critical, because it's meant
to be triggered only very rarely (i.e. pick logical_work_mem high
enough).
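For reference, the eviction currently works roughly like this (a
simplified sketch of what the patch does, details elided; it assumes
logical_work_mem is a GUC in kilobytes):

/* Pick the transaction (toplevel or subxact) using the most memory. */
static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)
{
    HASH_SEQ_STATUS hash_seq;
    ReorderBufferTXNByIdEnt *ent;
    ReorderBufferTXN *largest = NULL;

    hash_seq_init(&hash_seq, rb->by_txn);
    while ((ent = hash_seq_search(&hash_seq)) != NULL)
    {
        ReorderBufferTXN *txn = ent->txn;

        if (largest == NULL || txn->size > largest->size)
            largest = txn;
    }

    return largest;
}

/* Called after queuing a change; spill until we're under the limit. */
static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    while (rb->size >= logical_work_mem * 1024L)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        ReorderBufferSerializeTXN(rb, txn); /* spill its changes to disk */
    }
}

An oldest-change or fewest-transactions-per-block policy would only
need to change the comparison in that loop, so experimenting with the
policy should be cheap.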
The in-progress streaming is mostly a mechanical extension of existing
functionality (new methods in various APIs, ...) and a refactoring of
ReorderBuffer to handle incremental decoding. I'm sure it'd benefit
from reviews, of course.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181216.patch.gz | application/gzip | 9.9 KB |
0002-Immediately-WAL-log-assignments-20181216.patch.gz | application/gzip | 6.3 KB |
0003-Issue-individual-invalidations-with-wal_lev-20181216.patch.gz | application/gzip | 4.9 KB |
0004-Extend-the-output-plugin-API-with-stream-me-20181216.patch.gz | application/gzip | 5.7 KB |
0005-Implement-streaming-mode-in-ReorderBuffer-20181216.patch.gz | application/gzip | 11.2 KB |
0006-Add-support-for-streaming-to-built-in-repli-20181216.patch.gz | application/gzip | 19.7 KB |
0007-Track-statistics-for-streaming-spilling-20181216.patch.gz | application/gzip | 4.2 KB |
0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181216.patch.gz | application/gzip | 637 bytes |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-12-16 17:54:47 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
FWIW the original CF entry in 2018-07 [1] was marked as RWF. I'm not
sure what's the right way to resubmit such patches, so I've created a
new entry in 2019-01 [2] referencing the same hackers thread (and with
the same authors/reviewers metadata).
[1] https://commitfest.postgresql.org/19/1429/
[2] https://commitfest.postgresql.org/21/1927/
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-12-17 16:23:44 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi Tomas,
> This new version is mostly just a rebase to current master (or almost,
> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
> but it also addresses the new stuff committed since last version (most
> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
> subxact assignments, where the assignment was included in records with
> XID=0, essentially failing to track the subxact properly.
I started reviewing your patch about a month ago and tried to do an
in-depth review, since I am very interested in this patch too. The new
version does not apply to master at 29180e5d78, but everything is OK
after applying the 2pc patch first. Anyway, I guess it may complicate
further testing and review, since any potential reviewer has to take
both patches into account at once. The previous version applied to
master and was working fine for me separately (except for a few
patch-specific issues, which I try to explain below).
Patch review
========
First of all, I want to say thank you for such a huge amount of work
done. Here are some problems which I have found and hopefully fixed
with my additional patch (please find it attached; it should be
applicable to the last commit of your newest patch version):
1) The most important issue is that your TAP tests were broken: the
option "WITH (streaming=true)" was missing from the subscription
creation statement. Therefore, the spilling mechanism was being tested
rather than streaming.
2) After fixing the tests, the first one with simple streaming
immediately fails because of a logical replication worker segmentation
fault. It happens because the worker tries to call stream_cleanup_files
inside stream_open_file at stream start while nxids is zero, so the
counter goes negative and everything crashes. Something similar may
happen with the xids array, so I added two checks there (see the sketch
after this list).
3) The next problem is much more critical and concerns the historic
MVCC visibility rules. Previously, the walsender started decoding a
transaction at commit, so we were able to resolve all xmin, xmax and
combocids to cmin/cmax, build the tuplecids hash and so on; but now we
start doing all of that on the fly.
Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
tries to validate catalog tuples which are currently in the future
relative to the decoder's current position inside the transaction.
E.g., we may want to resolve the cmin/cmax of a tuple which was created
with cid 3 and deleted with cid 5 while we are currently at cid 4, so
our tuplecids hash is not complete enough to handle such a case.
I have updated the HeapTupleSatisfiesHistoricMVCC visibility rules in
two places:

    /*
     * If we accidentally see a tuple from our transaction but cannot
     * resolve its cmin, it is probably from the future, so drop it.
     */
    if (!resolved)
        return false;

and

    /*
     * If we accidentally see a tuple from our transaction but cannot
     * resolve its cmax, or cmax == InvalidCommandId, it is probably
     * still valid, so accept it.
     */
    if (!resolved || cmax == InvalidCommandId)
        return true;
4) There was a problem with marking the top-level transaction as
having catalog changes if one of its subtransactions has them. It was
causing a problem with DDL statements just after a subtransaction start
(savepoint), so data from new columns was not replicated.
5) A similar issue exists with sending the schema. You send the schema
only once per (sub)transaction (IIRC), while we have to update the
schema on each catalog change: invalidation execution, snapshot
rebuild, adding new tuple cids. So I ended up adding an is_schema_sent
flag to ReorderBufferTXN, since it is easy to set inside the RB and
read in the output plugin. Probably we should choose a better place for
this flag.
6) To better handle all these tricky cases I added a new TAP test,
014_stream_tough_ddl.pl, which consists of a really tough combination
of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.
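For item 2, the added guards are essentially of the following shape
(simplified; the actual checks in my patch live in the worker's
stream-file bookkeeping, and the names here are illustrative):

/* guard the cleanup path against running before anything was streamed */
static void
stream_cleanup_files(Oid subid, TransactionId xid)
{
    if (nxids == 0)
        return;     /* nothing was streamed yet, nothing to clean up */

    /* ... unlink the per-xid changes and subxacts files ... */

    nxids--;        /* now guaranteed not to go negative */
}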
I marked all my fixes and every questionable place with a comment and
a "TOCHECK:" label for easy search. Removing pretty much any of these
fixes makes the tests fail with a segmentation fault or a replication
mismatch. Though I mostly read and tested the old version of the patch,
after a quick look it seems that all these fixes are applicable to the
new version as well.
Performance
========
I have also performed a series of performance tests, and found that
the patch adds a huge overhead in the case of a large transaction
consisting of many small rows, e.g.:
CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
precision);
EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);
Execution Time: 2407.709 ms
Total Time: 11494,238 ms (00:11,494)
With synchronous_standby_names and 64 MB of logical_work_mem it takes
up to 5x longer, while without the patch it is about 2x. Thus, logical
replication streaming is approximately 4x slower for such transactions.
However, dealing with large transactions consisting of a small number
of large rows works much better:
CREATE TABLE large_text (t TEXT);
EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 125);
Execution Time: 3545.642 ms
Total Time: 7678,617 ms (00:07,679)
It is around the same 2x as without the patch. In case someone is
interested, I have also attached flame graphs of the walsender and the
logical replication worker taken during processing of the first
(numeric) transaction.
Regards
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
logical_repl_worker_new_perf.svg.zip | application/zip | 56.0 KB |
walsender_new_perf.svg.zip | application/zip | 34.7 KB |
0009-Fix-worker-historic-MVCC-visibility-rules-subxacts-s.patch | text/x-patch | 20.4 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-12-17 22:28:06 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi Alexey,
Thanks for the thorough and extremely valuable review!
On 12/17/18 5:23 PM, Alexey Kondratov wrote:
> Hi Tomas,
>
>> This new version is mostly just a rebase to current master (or almost,
>> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
>> but it also addresses the new stuff committed since last version (most
>> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
>> subxact assignments, where the assignment was included in records with
>> XID=0, essentially failing to track the subxact properly.
>
> I started reviewing your patch about a month ago and tried to do an
> in-depth review, since I am very interested in this patch too. The new
> version does not apply to master at 29180e5d78, but everything is OK
> after applying the 2pc patch first. Anyway, I guess it may complicate
> further testing and review, since any potential reviewer has to take
> both patches into account at once. The previous version applied to
> master and was working fine for me separately (except for a few
> patch-specific issues, which I try to explain below).
>
I agree it's somewhat annoying, but I don't think there's a better way,
unfortunately. Decoding in-progress transactions does require safe
handling of concurrent aborts, so it has to be committed after the 2pc
decoding patch (which makes that possible). But the 2pc patch also
touches the same places as this patch series (it reworks the reorder
buffer for example).
>
> Patch review
> ========
>
> First of all, I want to say thank you for such a huge amount of work
> done. Here are some problems which I have found and hopefully fixed
> with my additional patch (please find it attached; it should be
> applicable to the last commit of your newest patch version):
>
> 1) The most important issue is that your TAP tests were broken: the
> option "WITH (streaming=true)" was missing from the subscription
> creation statement. Therefore, the spilling mechanism was being tested
> rather than streaming.
>
D'oh!
> 2) After fixing the tests, the first one with simple streaming
> immediately fails because of a logical replication worker segmentation
> fault. It happens because the worker tries to call stream_cleanup_files
> inside stream_open_file at stream start while nxids is zero, so the
> counter goes negative and everything crashes. Something similar may
> happen with the xids array, so I added two checks there.
>
> 3) The next problem is much more critical and concerns the historic
> MVCC visibility rules. Previously, the walsender started decoding a
> transaction at commit, so we were able to resolve all xmin, xmax and
> combocids to cmin/cmax, build the tuplecids hash and so on; but now we
> start doing all of that on the fly.
>
> Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
> tries to validate catalog tuples which are currently in the future
> relative to the decoder's current position inside the transaction.
> E.g., we may want to resolve the cmin/cmax of a tuple which was created
> with cid 3 and deleted with cid 5 while we are currently at cid 4, so
> our tuplecids hash is not complete enough to handle such a case.
>
Damn it! I ran into those two issues some time ago and fixed them, but
I forgot to merge that fix into the patch. I'll merge those fixes,
compare them to your proposed fix, and send a new version tomorrow.
>
> 4) There was a problem with marking the top-level transaction as
> having catalog changes if one of its subtransactions has them. It was
> causing a problem with DDL statements just after a subtransaction start
> (savepoint), so data from new columns was not replicated.
>
> 5) A similar issue exists with sending the schema. You send the schema
> only once per (sub)transaction (IIRC), while we have to update the
> schema on each catalog change: invalidation execution, snapshot
> rebuild, adding new tuple cids. So I ended up adding an is_schema_sent
> flag to ReorderBufferTXN, since it is easy to set inside the RB and
> read in the output plugin. Probably we should choose a better place for
> this flag.
>
Hmm. Can you share an example of how to trigger these issues?
> 6) To better handle all these tricky cases I added a new TAP test,
> 014_stream_tough_ddl.pl, which consists of a really tough combination
> of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.
>
Thanks!
> I marked all my fixes and every questionable place with a comment and
> a "TOCHECK:" label for easy search. Removing pretty much any of these
> fixes makes the tests fail with a segmentation fault or a replication
> mismatch. Though I mostly read and tested the old version of the patch,
> after a quick look it seems that all these fixes are applicable to the
> new version as well.
>
Thanks. I'll go through your patch tomorrow.
>
> Performance
> ========
>
> I have also performed a series of performance tests, and found that
> the patch adds a huge overhead in the case of a large transaction
> consisting of many small rows, e.g.:
>
> CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
> precision);
>
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
> SELECT round(random()*10), random(), random()*142
> FROM generate_series(1, 1000000) s(i);
>
> Execution Time: 2407.709 ms
> Total Time: 11494,238 ms (00:11,494)
>
> With synchronous_standby_names and 64 MB of logical_work_mem it takes
> up to 5x longer, while without the patch it is about 2x. Thus, logical
> replication streaming is approximately 4x slower for such transactions.
>
> However, dealing with large transactions consisting of a small number
> of large rows works much better:
>
> CREATE TABLE large_text (t TEXT);
>
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 125);
>
> Execution Time: 3545.642 ms
> Total Time: 7678,617 ms (00:07,679)
>
> It is around the same 2x as without the patch. In case someone is
> interested, I have also attached flame graphs of the walsender and the
> logical replication worker taken during processing of the first
> (numeric) transaction.
>
Interesting. Any idea where the extra overhead comes from in this
particular case? It's hard to deduce that from a single flame graph
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).
I'll investigate this (probably not this week), but in general it's good
to keep in mind a couple of things:
1) Some overhead is expected, due to doing things incrementally.
2) The memory limit should be set to a sufficiently high value, so it
is hit only infrequently.
3) And when the limit is actually hit, it's an alternative to spilling
large amounts of data locally (to disk) or incurring significant
replication lag later.
So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-12-18 14:07:08 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 18.12.2018 1:28, Tomas Vondra wrote:
>> 4) There was a problem with marking top-level transaction as having
>> catalog changes if one of its subtransactions has. It was causing a
>> problem with DDL statements just after subtransaction start (savepoint),
>> so data from new columns is not replicated.
>>
>> 5) Similar issue with schema send. You send schema only once per each
>> sub/transaction (IIRC), while we have to update schema on each catalog
>> change: invalidation execution, snapshot rebuild, adding new tuple cids.
>> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
>> it is easy to set it inside RB and read in the output plugin. Probably,
>> we have to choose a better place for this flag.
>>
> Hmm. Can you share an example how to trigger these issues?
Test cases inside 014_stream_tough_ddl.pl, and the old ones with the
streaming=true option added, should reproduce all these issues. In
general, it happens in a transaction like:
INSERT
SAVEPOINT
ALTER TABLE ... ADD COLUMN
INSERT
where the second INSERT may see an old version of the catalog.
> Interesting. Any idea where the extra overhead comes from in this
> particular case? It's hard to deduce that from a single flame graph
> when I don't have anything to compare it with (i.e. the flame graph for
> the "normal" case).
I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)
where disk IO is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
the tests.
Therefore, you could probably write changes on the receiver in bigger
chunks, rather than each change separately.
> So I'm not particularly worried, but I'll look into that. I'd be much
> more worried if there was measurable overhead in cases when there's no
> streaming happening (either because it's disabled or the memory limit
> was not hit).
What I have also just found is that if a table row is large enough to
be TOASTed, e.g.:
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
then the logical_work_mem limit is not hit and we neither stream nor
spill this transaction to disk, while it is still large. In contrast,
the transaction above (with 1000000 smaller rows), being comparable in
size, is streamed. I am not sure that it is easy to add proper
accounting of TOAST-able columns, but it is worth it.
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
logical_repl_worker_text_stream_perf.zip | application/zip | 42.1 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-12-18 23:56:07 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi Alexey,
Attached is an updated version of the patches, with all the fixes I've
done in the past. I believe it should fix at least some of the issues
you reported - certainly the problem with stream_cleanup_files, but
perhaps some of the other issues too.
I'm a bit confused by the changes to the TAP tests. Per the patch
summary, some .pl files get renamed (not sure why), a new one is added,
etc. So I've instead enabled streaming subscriptions in all tests,
which with this patch produces two failures:
Test Summary Report
-------------------
t/004_sync.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 7 tests but ran 1.
t/011_stream_ddl.pl (Wstat: 256 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 1
So yeah, there's more stuff to fix. But I can't directly apply your
fixes because the updated patches are somewhat different.
On 12/18/18 3:07 PM, Alexey Kondratov wrote:
> On 18.12.2018 1:28, Tomas Vondra wrote:
>>> 4) There was a problem with marking the top-level transaction as
>>> having catalog changes if one of its subtransactions has them. It was
>>> causing a problem with DDL statements just after a subtransaction start
>>> (savepoint), so data from new columns was not replicated.
>>>
>>> 5) A similar issue exists with sending the schema. You send the schema
>>> only once per (sub)transaction (IIRC), while we have to update the
>>> schema on each catalog change: invalidation execution, snapshot
>>> rebuild, adding new tuple cids. So I ended up adding an is_schema_sent
>>> flag to ReorderBufferTXN, since it is easy to set inside the RB and
>>> read in the output plugin. Probably we should choose a better place for
>>> this flag.
>> Hmm. Can you share an example of how to trigger these issues?
>
> Test cases inside 014_stream_tough_ddl.pl, and the old ones with the
> streaming=true option added, should reproduce all these issues. In
> general, it happens in a transaction like:
>
> INSERT
> SAVEPOINT
> ALTER TABLE ... ADD COLUMN
> INSERT
>
> where the second INSERT may see an old version of the catalog.
>
Yeah, that's the issue I discovered before and thought had been fixed.
>> Interesting. Any idea where the extra overhead comes from in this
>> particular case? It's hard to deduce that from a single flame graph
>> when I don't have anything to compare it with (i.e. the flame graph for
>> the "normal" case).
>
> I guess the bottleneck is in disk operations. You can check the
> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
> writes (~26%) take around 35% of CPU time in total. For comparison,
> please see the attached flame graph for the following transaction:
>
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>
> Execution Time: 44519.816 ms
> Time: 98333,642 ms (01:38,334)
>
> where disk IO is only ~7-8% in total. So we get very roughly the same
> ~4-5x performance drop here. JFYI, I am using a machine with an SSD for
> the tests.
>
> Therefore, you could probably write changes on the receiver in bigger
> chunks, rather than each change separately.
>
Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm not
sure why it would be cheaper to do the writes in batches.
BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?
>> So I'm not particularly worried, but I'll look into that. I'd be much
>> more worried if there was measurable overhead in cases when there's no
>> streaming happening (either because it's disabled or the memory limit
>> was not hit).
>
> What I have also just found, is that if a table row is large enough to
> be TOASTed, e.g.:
>
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>
> then logical_work_mem limit is not hit and we neither stream, nor spill
> to disk this transaction, while it is still large. In contrast, the
> transaction above (with 1000000 smaller rows) being comparable in size
> is streamed. Not sure, that it is easy to add proper accounting of
> TOAST-able columns, but it worth it.
>
That's certainly strange, and possibly a bug in the memory accounting
code. I'm not sure why that would happen, though, because TOAST data
looks just like regular INSERT changes. Interesting. I wonder if it's
already fixed in this updated version, but it's a bit too late to
investigate that today.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181219.patch.gz | application/gzip | 9.9 KB |
0002-Immediately-WAL-log-assignments-20181219.patch.gz | application/gzip | 6.4 KB |
0003-Issue-individual-invalidations-with-wal_lev-20181219.patch.gz | application/gzip | 4.9 KB |
0004-Extend-the-output-plugin-API-with-stream-me-20181219.patch.gz | application/gzip | 6.1 KB |
0005-Implement-streaming-mode-in-ReorderBuffer-20181219.patch.gz | application/gzip | 12.3 KB |
0006-Add-support-for-streaming-to-built-in-repli-20181219.patch.gz | application/gzip | 20.1 KB |
0007-Track-statistics-for-streaming-spilling-20181219.patch.gz | application/gzip | 4.2 KB |
0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181219.patch.gz | application/gzip | 635 bytes |
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2018-12-19 09:58:58 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi Tomas,
> I'm a bit confused by the changes to the TAP tests. Per the patch
> summary, some .pl files get renamed (not sure why), a new one is added, etc.
I added a new TAP test case, added the streaming=true option inside
the old stream_* ones, and incremented the streaming test numbers (+2)
because of the collision between 009_matviews.pl / 009_stream_simple.pl
and 010_truncate.pl / 010_stream_subxact.pl. At least in the previous
version of the patch they were under the same numbers. Nothing special,
but for simplicity please find my new TAP test attached separately.
> So
> I've instead enabled streaming subscriptions in all tests, which with
> this patch produces two failures:
>
> Test Summary Report
> -------------------
> t/004_sync.pl (Wstat: 7424 Tests: 1 Failed: 0)
> Non-zero exit status: 29
> Parse errors: Bad plan. You planned 7 tests but ran 1.
> t/011_stream_ddl.pl (Wstat: 256 Tests: 2 Failed: 1)
> Failed test: 2
> Non-zero exit status: 1
>
> So yeah, there's more stuff to fix. But I can't directly apply your
> fixes because the updated patches are somewhat different.
The fixes should apply cleanly to the previous version of your patch.
Also, I am not sure that it is a good idea to simply enable streaming
subscriptions in all tests (e.g. the pre-streaming t/004_sync.pl),
since then they do not exercise the non-streaming code.
>>> Interesting. Any idea where the extra overhead comes from in this
>>> particular case? It's hard to deduce that from a single flame graph
>>> when I don't have anything to compare it with (i.e. the flame graph for
>>> the "normal" case).
>> I guess the bottleneck is in disk operations. You can check the
>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>> writes (~26%) take around 35% of CPU time in total. For comparison,
>> please see the attached flame graph for the following transaction:
>>
>> INSERT INTO large_text
>> SELECT (SELECT string_agg('x', ',')
>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>
>> Execution Time: 44519.816 ms
>> Time: 98333,642 ms (01:38,334)
>>
>> where disk IO is only ~7-8% in total. So we get very roughly the same
>> ~4-5x performance drop here. JFYI, I am using a machine with an SSD for
>> the tests.
>>
>> Therefore, you could probably write changes on the receiver in bigger
>> chunks, rather than each change separately.
>>
> Possibly. I/O is certainly a possible culprit, although we should be
> using buffered I/O and there certainly are no fsyncs here. So I'm not
> sure why it would be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side? Or are you
> running this on a single machine, and it's difficult to decide?
I ran this on a single machine, but the walsender and the worker are
each utilizing almost 100% of a CPU all the time, and at the apply side
I/O syscalls take about 1/3 of the CPU time. Though I am still not
sure, for me this result somehow links the performance drop to problems
at the receiver side.
Writing in batches was just a hypothesis, and to validate it I
performed a test with a large transaction consisting of a smaller
number of wide rows. That test does not exhibit any significant
performance drop, while it was streamed too. So the hypothesis seems to
be valid. Anyway, I do not have other reasonable ideas besides that
right now.
Regards
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0xx_stream_tough_ddl.pl | application/x-perl | 3.7 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-01-14 18:23:31 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi,
Attached is an updated patch series, merging fixes and changes to TAP
tests proposed by Alexey. I've merged the fixes into the appropriate
patches, and I've kept the TAP changes / new tests as separate patches
towards the end of the series.
I'm a bit unhappy with two aspects of the current patch series:
1) We now track schema changes in two ways - using the pre-existing
schema_sent flag in RelationSyncEntry, and the (newly added) flag in
ReorderBuffer. While those flags are used for regular vs. streamed
transactions, fundamentally it's the same thing, so having two
competing mechanisms seems like a bad idea. I'm not sure what's the
best way to resolve this, though.
2) We've removed quite a few asserts, particularly those ensuring the
sanity of cmin/cmax values. To some extent that's expected, because
allowing decoding of in-progress transactions relaxes some of those
rules. But I'd be much happier if some of those asserts could be
reinstated, even if only in a weaker form.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190114.patch.gz | application/gzip | 9.9 KB |
0002-Immediately-WAL-log-assignments-20190114.patch.gz | application/gzip | 6.4 KB |
0003-Issue-individual-invalidations-with-wal_lev-20190114.patch.gz | application/gzip | 4.9 KB |
0004-Extend-the-output-plugin-API-with-stream-me-20190114.patch.gz | application/gzip | 6.1 KB |
0005-Implement-streaming-mode-in-ReorderBuffer-20190114.patch.gz | application/gzip | 12.8 KB |
0006-Add-support-for-streaming-to-built-in-repli-20190114.patch.gz | application/gzip | 19.8 KB |
0007-Track-statistics-for-streaming-spilling-20190114.patch.gz | application/gzip | 4.2 KB |
0008-Enable-streaming-for-all-subscription-TAP-t-20190114.patch.gz | application/gzip | 1.8 KB |
0009-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190114.patch.gz | application/gzip | 639 bytes |
0010-Add-TAP-test-for-streaming-vs.-DDL-20190114.patch.gz | application/gzip | 1.6 KB |
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-02-04 05:51:31 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On Mon, Jan 14, 2019 at 07:23:31PM +0100, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.
Patch 4 of the latest set fails to apply, so I have moved the patch to
the next CF, waiting on author.
--
Michael
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-02-04 13:49:08 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi Tomas,
On 14.01.2019 21:23, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.
I had problems applying this patch along with the 2pc streaming one to
current master, but everything applied well on 97c39498e5, and the
regression tests pass. What I personally do not like about the current
TAP test set is that you have added "WITH (streaming=on)" to all tests,
including the old non-streaming ones. It is unclear which mechanism is
being tested there: streaming, but those transactions probably do not
hit the memory limit, so it depends on the default server parameters;
or non-streaming, but then what is the need for (streaming=on)? I would
prefer to add (streaming=on) only to the new tests, where it is clearly
necessary.
> I'm a bit unhappy with two aspects of the current patch series:
>
> 1) We now track schema changes in two ways - using the pre-existing
> schema_sent flag in RelationSyncEntry, and the (newly added) flag in
> ReorderBuffer. While those flags are used for regular vs. streamed
> transactions, fundamentally it's the same thing, so having two
> competing mechanisms seems like a bad idea. I'm not sure what's the
> best way to resolve this, though.
Yes, sure. When I found problems with streaming of extensive DDL, I
added the new flag in the simplest way, and it worked. Now, the old
schema_sent flag is per-relation, while the new one - is_schema_sent -
is per top-level transaction. If I get it correctly, the former seems
to be more economical, since the new schema is sent only if we are
streaming a change for a relation whose schema is outdated. In
contrast, in the latter case we will send the new schema even if there
are no new changes belonging to that relation.
I guess it would be better to stick to the old behavior, and I will try
to investigate how to use it in streaming mode as well.
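To summarize the difference, the two flags look roughly like this
(simplified sketches with only the relevant fields; the surrounding
structs of course contain much more):

/* pgoutput's existing per-relation cache entry */
typedef struct RelationSyncEntry
{
    Oid     relid;          /* hash key */
    bool    schema_sent;    /* schema already sent for this relation? */
    /* ... */
} RelationSyncEntry;

/* the flag I added, tracked per top-level transaction */
struct ReorderBufferTXN
{
    /* ... */
    bool    is_schema_sent; /* schema sent since last catalog change? */
};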
> 2) We've removed quite a few asserts, particularly those ensuring the
> sanity of cmin/cmax values. To some extent that's expected, because
> allowing decoding of in-progress transactions relaxes some of those
> rules. But I'd be much happier if some of those asserts could be
> reinstated, even if only in a weaker form.
Asserts have been removed from two places: (1)
HeapTupleSatisfiesHistoricMVCC, which seems inevitable, since we are
touching the essence of the MVCC visibility rules when trying to decode
an in-progress transaction, and (2) ReorderBufferBuildTupleCidHash,
which is probably not directly related to the topic of this patch,
since Arseny Sher recently faced the same issue with simple repetitive
DDL decoding [1].
Not many, but I agree that replacing them with some softer asserts
would be better than just removing them, especially for point (1).
Regards
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-08-28 17:17:47 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
Hi Tomas,
>>>> Interesting. Any idea where the extra overhead comes from in this
>>>> particular case? It's hard to deduce that from a single flame graph
>>>> when I don't have anything to compare it with (i.e. the flame graph
>>>> for the "normal" case).
>>> I guess the bottleneck is in disk operations. You can check the
>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>> writes (~26%) take around 35% of CPU time in total. For comparison,
>>> please see the attached flame graph for the following transaction:
>>>
>>> INSERT INTO large_text
>>> SELECT (SELECT string_agg('x', ',')
>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>
>>> Execution Time: 44519.816 ms
>>> Time: 98333,642 ms (01:38,334)
>>>
>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>> ~4-5x performance drop here. JFYI, I am using a machine with an SSD
>>> for the tests.
>>>
>>> Therefore, you could probably write changes on the receiver in bigger
>>> chunks, rather than each change separately.
>>>
>> Possibly. I/O is certainly a possible culprit, although we should be
>> using buffered I/O and there certainly are no fsyncs here. So I'm not
>> sure why it would be cheaper to do the writes in batches.
>>
>> BTW does this mean you see the overhead on the apply side? Or are you
>> running this on a single machine, and it's difficult to decide?
>
> I ran this on a single machine, but the walsender and the worker are
> each utilizing almost 100% of a CPU all the time, and at the apply side
> I/O syscalls take about 1/3 of the CPU time. Though I am still not
> sure, for me this result somehow links the performance drop to problems
> at the receiver side.
>
> Writing in batches was just a hypothesis, and to validate it I
> performed a test with a large transaction consisting of a smaller
> number of wide rows. That test does not exhibit any significant
> performance drop, while it was streamed too. So the hypothesis seems to
> be valid. Anyway, I do not have other reasonable ideas besides that
> right now.
I've recently checked this patch again and tried to improve it in
terms of performance. As a result I've implemented a new POC version of
the applier (attached). Almost everything in the streaming logic stayed
intact, but the apply worker is significantly different.
As I wrote earlier, I still claim that spilling changes to disk at the
applier side adds additional overhead, but it is possible to get rid of
it. In my additional patch I do the following:
1) Maintain a pool of additional background workers (bgworkers) that
are connected with the main logical apply worker via shm_mq's. Each
worker is dedicated to processing a specific streamed transaction.
2) When we receive a streamed change for some transaction, we check
whether there is an existing dedicated bgworker in a HTAB (xid ->
bgworker), or there is one in the idle list, or we spawn a new one (see
the sketch after this list).
3) We pass all changes (between STREAM START/STOP) to that bgworker via
shm_mq_send without intermediate waiting. However, we wait for the
bgworker to apply the entire chunk of changes at STREAM STOP, since we
don't want transaction reordering.
4) When a transaction is committed/aborted, its worker is added to the
idle list and waits for a reassignment message.
5) I have used the same machinery with apply_dispatch in the bgworkers,
since most of the actions are practically very similar.
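The dispatch step (points 2 and 3) looks roughly like this; it is
condensed from the POC, and StreamWorkerEntry, xid_to_worker,
idle_list_pop and launch_stream_apply_worker are illustrative names,
not the exact ones from the patch:

static void
dispatch_streamed_change(TransactionId xid, const char *msg, Size len)
{
    bool        found;
    StreamWorkerEntry *entry;

    /* xid -> bgworker lookup, creating a new mapping on first use */
    entry = hash_search(xid_to_worker, &xid, HASH_ENTER, &found);
    if (!found)
    {
        entry->worker = idle_list_pop();    /* reuse an idle worker */
        if (entry->worker == NULL)
            entry->worker = launch_stream_apply_worker();
    }

    /* forward the raw change; the bgworker applies it like a backend */
    if (shm_mq_send(entry->worker->mqh, len, msg, false) != SHM_MQ_SUCCESS)
        elog(ERROR, "could not forward streamed change to bgworker");
}

At STREAM STOP the main worker then blocks until the bgworker reports
that the whole chunk has been applied, which is what keeps the apply
order strictly serial.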
Thus, we do not spill anything at the applier side, and transaction
changes are processed by the bgworkers just as normal backends do. At
the same time, the changes are processed strictly serially, which
prevents transaction reordering and possible conflicts/anomalies. Even
though we trade off performance in favor of stability, the result is
rather impressive. I have used a similar query for testing as before:
EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);
with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and
synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is
as follows:
CREATE TABLE large_test (
id serial primary key,
num1 bigint,
num2 double precision,
num3 double precision
);
Here are the results:
-------------------------------------------------------------------
| N | Time on master, sec | Total xact time, sec | Ratio |
-------------------------------------------------------------------
| On commit (master, v13) |
-------------------------------------------------------------------
| 1kk | 6.5 | 17.6 | x2.74 |
-------------------------------------------------------------------
| 3kk | 21 | 55.4 | x2.64 |
-------------------------------------------------------------------
| 5kk | 38.3 | 91.5 | x2.39 |
-------------------------------------------------------------------
| Stream + spill |
-------------------------------------------------------------------
| 1kk | 5.9 | 18 | x3 |
-------------------------------------------------------------------
| 3kk | 19.5 | 52.4 | x2.7 |
-------------------------------------------------------------------
| 5kk | 33.3 | 86.7 | x2.86 |
-------------------------------------------------------------------
| Stream + BGW pool |
-------------------------------------------------------------------
| 1kk | 6 | 12 | x2 |
-------------------------------------------------------------------
| 3kk | 18.5 | 30.5 | x1.65 |
-------------------------------------------------------------------
| 5kk | 35.6 | 53.9 | x1.51 |
-------------------------------------------------------------------
It seems that the overhead added by the synchronous replica is 2-3
times lower compared with Postgres master and streaming with spilling.
Thus, the original patch eliminates the delay before the sender starts
processing a large transaction, while this additional patch speeds up
the applier side.
Although the overall speed-up is surely measurable, there is room for
further improvements:
1) Currently bgworkers are only spawned on demand, without an initial
pool, and they are never stopped. Maybe we should create a small pool
at replication start and offload some of the idle bgworkers if they
exceed some limit?
2) Probably we can somehow track that an incoming change conflicts with
some of the xacts being processed, so we would have to wait for
specific bgworkers only in that case?
3) Since the communication between the main logical apply worker and
each bgworker from the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags without
locks, using just atomics (a rough sketch follows below).
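For 3), I am thinking of something along these lines. This is just a
sketch using PostgreSQL's atomics; a real version would have to pick
the right memory barriers and a proper wait loop:

#include "port/atomics.h"

typedef struct WorkerSlot
{
    pg_atomic_uint32 chunk_pending; /* 0 = idle, 1 = chunk published */
} WorkerSlot;

/* producer (main apply worker): hand a chunk to the bgworker */
static void
publish_chunk(WorkerSlot *slot)
{
    pg_atomic_write_u32(&slot->chunk_pending, 1);
}

/* consumer (bgworker): check for work without taking any lock */
static bool
has_pending_chunk(WorkerSlot *slot)
{
    return pg_atomic_read_u32(&slot->chunk_pending) == 1;
}

/* consumer: acknowledge completion so the producer can stop waiting */
static void
ack_chunk(WorkerSlot *slot)
{
    pg_atomic_write_u32(&slot->chunk_pending, 0);
}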
What do you think about this concept in general? Any concerns and
criticism are welcome!
Regards
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
P.S. This patch should be applicable to your last patch set. I would rebase it against master, but it depends on the 2pc patch, which I don't know well enough.
Attachment | Content-Type | Size |
---|---|---|
0011-BGWorkers-pool-for-streamed-transactions-apply-witho.patch | text/x-patch | 59.9 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-08-28 19:06:46 |
Message-ID: | 20190828190646.s3bjs22urff32k56@development |
Lists: | pgsql-hackers |
On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:
>Hi Tomas,
>
>>>>>Interesting. Any idea where the extra overhead comes from in this
>>>>>particular case? It's hard to deduce that from a single flame graph
>>>>>when I don't have anything to compare it with (i.e. the flame graph
>>>>>for the "normal" case).
>>>>I guess the bottleneck is in disk operations. You can check the
>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>writes (~26%) take around 35% of CPU time in total. For comparison,
>>>>please see the attached flame graph for the following transaction:
>>>>
>>>>INSERT INTO large_text
>>>>SELECT (SELECT string_agg('x', ',')
>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>
>>>>Execution Time: 44519.816 ms
>>>>Time: 98333,642 ms (01:38,334)
>>>>
>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>~4-5x performance drop here. JFYI, I am using a machine with an SSD
>>>>for the tests.
>>>>
>>>>Therefore, you could probably write changes on the receiver in bigger
>>>>chunks, rather than each change separately.
>>>>
>>>Possibly. I/O is certainly a possible culprit, although we should be
>>>using buffered I/O and there certainly are no fsyncs here. So I'm not
>>>sure why it would be cheaper to do the writes in batches.
>>>
>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>running this on a single machine, and it's difficult to decide?
>>
>>I ran this on a single machine, but the walsender and the worker are
>>each utilizing almost 100% of a CPU all the time, and at the apply side
>>I/O syscalls take about 1/3 of the CPU time. Though I am still not
>>sure, for me this result somehow links the performance drop to problems
>>at the receiver side.
>>
>>Writing in batches was just a hypothesis, and to validate it I
>>performed a test with a large transaction consisting of a smaller
>>number of wide rows. That test does not exhibit any significant
>>performance drop, while it was streamed too. So the hypothesis seems to
>>be valid. Anyway, I do not have other reasonable ideas besides that
>>right now.
>
>I've recently checked this patch again and tried to improve it in
>terms of performance. As a result I've implemented a new POC version
>of the applier (attached). Almost everything in the streaming logic
>stayed intact, but the apply worker is significantly different.
>
>As I wrote earlier, I still claim that spilling changes to disk at the
>applier side adds additional overhead, but it is possible to get rid
>of it. In my additional patch I do the following:
>
>1) Maintain a pool of additional background workers (bgworkers) that
>are connected with the main logical apply worker via shm_mq's. Each
>worker is dedicated to processing a specific streamed transaction.
>
>2) When we receive a streamed change for some transaction, we check
>whether there is an existing dedicated bgworker in a HTAB (xid ->
>bgworker), or there is one in the idle list, or we spawn a new one.
>
>3) We pass all changes (between STREAM START/STOP) to that bgworker
>via shm_mq_send without intermediate waiting. However, we wait for the
>bgworker to apply the entire chunk of changes at STREAM STOP, since we
>don't want transaction reordering.
>
>4) When a transaction is committed/aborted, its worker is added to the
>idle list and waits for a reassignment message.
>
>5) I have used the same machinery with apply_dispatch in the bgworkers,
>since most of the actions are practically very similar.
>
>Thus, we do not spill anything at the applier side, and transaction
>changes are processed by the bgworkers just as normal backends do. At
>the same time, the changes are processed strictly serially, which
>prevents transaction reordering and possible conflicts/anomalies. Even
>though we trade off performance in favor of stability, the result is rather
>impressive. I have used a similar query for testing as before:
>
>EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
> SELECT round(random()*10), random(), random()*142
> FROM generate_series(1, 1000000) s(i);
>
>with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and
>synchronous_standby_names = 'FIRST 1 (large_sub)'. Table schema is
>following:
>
>CREATE TABLE large_test (
> id serial primary key,
> num1 bigint,
> num2 double precision,
> num3 double precision
>);
>
>Here are the results:
>
>-------------------------------------------------------------------
>| N | Time on master, sec | Total xact time, sec | Ratio |
>-------------------------------------------------------------------
>| On commit (master, v13) |
>-------------------------------------------------------------------
>| 1kk | 6.5 | 17.6 | x2.74 |
>-------------------------------------------------------------------
>| 3kk | 21 | 55.4 | x2.64 |
>-------------------------------------------------------------------
>| 5kk | 38.3 | 91.5 | x2.39 |
>-------------------------------------------------------------------
>| Stream + spill |
>-------------------------------------------------------------------
>| 1kk | 5.9 | 18 | x3 |
>-------------------------------------------------------------------
>| 3kk | 19.5 | 52.4 | x2.7 |
>-------------------------------------------------------------------
>| 5kk | 33.3 | 86.7 | x2.86 |
>-------------------------------------------------------------------
>| Stream + BGW pool |
>-------------------------------------------------------------------
>| 1kk | 6 | 12 | x2 |
>-------------------------------------------------------------------
>| 3kk | 18.5 | 30.5 | x1.65 |
>-------------------------------------------------------------------
>| 5kk | 35.6 | 53.9 | x1.51 |
>-------------------------------------------------------------------
>
>It seems that overhead added by synchronous replica is lower by 2-3
>times compared with Postgres master and streaming with spilling.
>Therefore, the original patch eliminated delay before large
>transaction processing start by sender, while this additional patch
>speeds up the applier side.
>
>Although the overall speed up is surely measurable, there is a room
>for improvements yet:
>
>1) Currently bgworkers are only spawned on demand without some initial
>pool and never stopped. Maybe we should create a small pool on
>replication start and offload some of idle bgworkers if they exceed
>some limit?
>
>2) Probably we can track somehow that incoming change has conflicts
>with some of being processed xacts, so we can wait for specific
>bgworkers only in that case?
>
>3) Since the communication between main logical apply worker and each
>bgworker from the pool is a 'single producer --- single consumer'
>problem, then probably it is possible to wait and set/check flags
>without locks, but using just atomics.
>
>What do you think about this concept in general? Any concerns and
>criticism are welcome!
>
Hi Alexey,
I'm unable to do any in-depth review of the patch over the next two weeks
or so, but I think the idea of having a pool of apply workers is sound and
can be quite beneficial for some workloads.
I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.
There's one huge challenge, however, that I don't see mentioned in your
message or in the patch (after a cursory read) - ensuring the same commit
order, and the risk of introducing deadlocks that would not exist in
single-process apply.
Surely we want to end up with the same commit order as on the upstream,
otherwise we might easily get different data on the subscriber. So when
we pass a large transaction to a separate process, that process has to
wait for the other processes applying transactions that committed before
it, and similarly the other processes have to wait for this one,
depending on the commit order. I might have missed something, but I
don't see anything like that in your patch.
Essentially, this means there needs to be some sort of wait between those
apply processes, enforcing the commit order.
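For illustration, such a wait might be built around a shared
commit-sequence counter, roughly like this (a sketch only -
CommitOrderGate and both functions are invented names, and note that a
latch-poll wait like this would not be visible to the deadlock
detector):

#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/latch.h"

typedef struct CommitOrderGate
{
    pg_atomic_uint64 next_seq;  /* upstream commit sequence allowed next */
} CommitOrderGate;

/* called by an apply process right before committing its transaction */
static void
wait_for_commit_turn(CommitOrderGate *gate, uint64 my_seq)
{
    while (pg_atomic_read_u64(&gate->next_seq) != my_seq)
    {
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 1L,
                         PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}

/* called right after a successful commit, to unblock the next process */
static void
release_commit_turn(CommitOrderGate *gate)
{
    pg_atomic_fetch_add_u64(&gate->next_seq, 1);
    /* a real implementation would set the waiters' latches here */
}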
That however means we can easily introduce deadlocks into workloads where
the serial-apply would not have that issue - imagine multiple large
transactions, touching the same set of rows. We may ship them to different
bgworkers, and those processes may deadlock.
Of course, the deadlock detector will come around (assuming the wait is
done in a way visible to the detector) and will abort one of the
processes. But we don't know it'll abort the right one - it may easily
abort the apply process that needs to commit first, while everyone else
is waiting for it. Which stalls the apply forever.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-08-29 14:37:45 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 28.08.2019 22:06, Tomas Vondra wrote:
>
>>
>>>>>> Interesting. Any idea where does the extra overhead in this
>>>>>> particular
>>>>>> case come from? It's hard to deduce that from the single flame
>>>>>> graph,
>>>>>> when I don't have anything to compare it with (i.e. the flame
>>>>>> graph for
>>>>>> the "normal" case).
>>>>> I guess that bottleneck is in disk operations. You can check
>>>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>> writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>> please, see attached flame graph for the following transaction:
>>>>>
>>>>> INSERT INTO large_text
>>>>> SELECT (SELECT string_agg('x', ',')
>>>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>
>>>>> Execution Time: 44519.816 ms
>>>>> Time: 98333,642 ms (01:38,334)
>>>>>
>>>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD
>>>>> for tests.
>>>>>
>>>>> Therefore, probably you may write changes on receiver in bigger
>>>>> chunks,
>>>>> not each change separately.
>>>>>
>>>> Possibly, I/O is certainly a possible culprit, although we should be
>>>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>> not sure why would it be cheaper to do the writes in batches.
>>>>
>>>> BTW does this mean you see the overhead on the apply side? Or are you
>>>> running this on a single machine, and it's difficult to decide?
>>>
>>> I run this on a single machine, but walsender and worker are
>>> utilizing almost 100% of CPU per each process all the time, and at
>>> apply side I/O syscalls take about 1/3 of CPU time. Though I am
>>> still not sure, but for me this result somehow links performance
>>> drop with problems at receiver side.
>>>
>>> Writing in batches was just a hypothesis and to validate it I have
>>> performed test with large txn, but consisting of a smaller number of
>>> wide rows. This test does not exhibit any significant performance
>>> drop, while it was streamed too. So it seems to be valid. Anyway, I
>>> do not have other reasonable ideas beside that right now.
>>
>> It seems that overhead added by synchronous replica is lower by 2-3
>> times compared with Postgres master and streaming with spilling.
>> Therefore, the original patch eliminated delay before large
>> transaction processing start by sender, while this additional patch
>> speeds up the applier side.
>>
>> Although the overall speed up is surely measurable, there is a room
>> for improvements yet:
>>
>> 1) Currently bgworkers are only spawned on demand without some
>> initial pool and never stopped. Maybe we should create a small pool
>> on replication start and offload some of idle bgworkers if they
>> exceed some limit?
>>
>> 2) Probably we can track somehow that incoming change has conflicts
>> with some of being processed xacts, so we can wait for specific
>> bgworkers only in that case?
>>
>> 3) Since the communication between main logical apply worker and each
>> bgworker from the pool is a 'single producer --- single consumer'
>> problem, then probably it is possible to wait and set/check flags
>> without locks, but using just atomics.
>>
>> What do you think about this concept in general? Any concerns and
>> criticism are welcome!
>>
>
Hi Tomas,
Thank you for the quick response.
> I don't think it matters very much whether the workers are started at the
> beginning or allocated ad hoc, that's IMO a minor implementation detail.
OK, I had the same vision about this point. Any minor differences here
will be negligible for a sufficiently large transaction.
>
> There's one huge challenge that I however don't see mentioned in your
> message or in the patch (after cursory reading) - ensuring the same
> commit
> order, and introducing deadlocks that would not exist in single-process
> apply.
Probably I haven't explained this part well, sorry for that. In my patch
I don't use the worker pool for concurrent transaction apply, but rather
for fast context switching between long-lived streamed transactions. In
other words, we apply all changes arriving from the sender in a
completely serial manner. Written out step by step, it looks like this:
1) Read the STREAM START message and figure out the target worker by xid.
2) Pass all changes belonging to this xact to the selected worker, one by
one via shm_mq_send (see the sketch below).
3) Read the STREAM STOP message and wait until our worker has applied all
changes in the queue.
4) Process all other chunks of streamed xacts in the same manner.
5) Process all non-streamed xacts immediately in the main apply worker loop.
6) If we read STREAM COMMIT/ABORT, we again wait until the selected worker
either commits or aborts.
Thus, it automatically guarantees the same commit order on the replica
as on the master. Yes, we lose some performance here, since we don't
apply transactions concurrently, but doing so would bring all those
problems you have described.
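To make step 2 a bit more concrete, the per-change forwarding boils down
to a blocking shm_mq_send, roughly like this (a sketch with assumed
names, not the actual patch code):

#include "postgres.h"
#include "storage/shm_mq.h"

static void
send_change_to_bgworker(shm_mq_handle *mqh, const char *data, Size len)
{
    shm_mq_result res;

    /* blocking send: the main apply worker never outruns the queue */
    res = shm_mq_send(mqh, len, data, false);
    if (res != SHM_MQ_SUCCESS)
        ereport(ERROR,
                (errmsg("could not forward streamed change to bgworker")));
}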
However, you helped me to figure out another point I had forgotten.
Although we ensure commit order automatically, the beginning of streamed
xacts may be reordered. This happens if some small xacts have been
committed on master after the streamed one started, because we do not
start streaming immediately, but only after the logical_work_mem limit
is hit. I have performed some tests with conflicting xacts and it seems
not to be a problem, since the locking mechanism in Postgres guarantees
that if there were any deadlocks, they would have happened earlier on
master. So if some records hit the WAL, it is safe to apply them
sequentially. Am I wrong?
Anyway, I'm going to double check the safety of this part later.
Regards
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-08-29 18:48:24 |
Message-ID: | 20190829184824.kmrbchrk2ged6vjw@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
>On 28.08.2019 22:06, Tomas Vondra wrote:
>>
>>>
>>>>>>>Interesting. Any idea where does the extra overhead in
>>>>>>>this particular
>>>>>>>case come from? It's hard to deduce that from the single
>>>>>>>flame graph,
>>>>>>>when I don't have anything to compare it with (i.e. the
>>>>>>>flame graph for
>>>>>>>the "normal" case).
>>>>>>I guess that bottleneck is in disk operations. You can check
>>>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>>>writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>>>please, see attached flame graph for the following transaction:
>>>>>>
>>>>>>INSERT INTO large_text
>>>>>>SELECT (SELECT string_agg('x', ',')
>>>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>>
>>>>>>Execution Time: 44519.816 ms
>>>>>>Time: 98333,642 ms (01:38,334)
>>>>>>
>>>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>>>~x4-5 performance drop here. JFYI, I am using a machine with
>>>>>>SSD for tests.
>>>>>>
>>>>>>Therefore, probably you may write changes on receiver in
>>>>>>bigger chunks,
>>>>>>not each change separately.
>>>>>>
>>>>>Possibly, I/O is certainly a possible culprit, although we should be
>>>>>using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>>>not sure why would it be cheaper to do the writes in batches.
>>>>>
>>>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>>>running this on a single machine, and it's difficult to decide?
>>>>
>>>>I run this on a single machine, but walsender and worker are
>>>>utilizing almost 100% of CPU per each process all the time, and
>>>>at apply side I/O syscalls take about 1/3 of CPU time. Though I
>>>>am still not sure, but for me this result somehow links
>>>>performance drop with problems at receiver side.
>>>>
>>>>Writing in batches was just a hypothesis and to validate it I
>>>>have performed test with large txn, but consisting of a smaller
>>>>number of wide rows. This test does not exhibit any significant
>>>>performance drop, while it was streamed too. So it seems to be
>>>>valid. Anyway, I do not have other reasonable ideas beside that
>>>>right now.
>>>
>>>It seems that overhead added by synchronous replica is lower by
>>>2-3 times compared with Postgres master and streaming with
>>>spilling. Therefore, the original patch eliminated delay before
>>>large transaction processing start by sender, while this
>>>additional patch speeds up the applier side.
>>>
>>>Although the overall speed up is surely measurable, there is a
>>>room for improvements yet:
>>>
>>>1) Currently bgworkers are only spawned on demand without some
>>>initial pool and never stopped. Maybe we should create a small
>>>pool on replication start and offload some of idle bgworkers if
>>>they exceed some limit?
>>>
>>>2) Probably we can track somehow that incoming change has
>>>conflicts with some of being processed xacts, so we can wait for
>>>specific bgworkers only in that case?
>>>
>>>3) Since the communication between main logical apply worker and
>>>each bgworker from the pool is a 'single producer --- single
>>>consumer' problem, then probably it is possible to wait and
>>>set/check flags without locks, but using just atomics.
>>>
>>>What do you think about this concept in general? Any concerns and
>>>criticism are welcome!
>>>
>>
>
>Hi Tomas,
>
>Thank you for a quick response.
>
>>I don't think it matters very much whether the workers are started at the
>>beginning or allocated ad hoc, that's IMO a minor implementation detail.
>
>OK, I had the same vision about this point. Any minor differences here
>will be neglectable for a sufficiently large transaction.
>
>>
>>There's one huge challenge that I however don't see mentioned in your
>>message or in the patch (after cursory reading) - ensuring the same
>>commit
>>order, and introducing deadlocks that would not exist in single-process
>>apply.
>
>Probably I haven't explained well this part, sorry for that. In my
>patch I don't use workers pool for a concurrent transaction apply, but
>rather for a fast context switch between long-lived streamed
>transactions. In other words we apply all changes arrived from the
>sender in a completely serial manner. Being written step-by-step it
>looks like:
>
>1) Read STREAM START message and figure out the target worker by xid.
>
>2) Put all changes, which belongs to this xact to the selected worker
>one by one via shm_mq_send.
>
>3) Read STREAM STOP message and wait until our worker will apply all
>changes in the queue.
>
>4) Process all other chunks of streamed xacts in the same manner.
>
>5) Process all non-streamed xacts immediately in the main apply worker loop.
>
>6) If we read STREAMED COMMIT/ABORT we again wait until selected
>worker either commits or aborts.
>
>Thus, it automatically guaranties the same commit order on replica as
>on master. Yes, we loose some performance here, since we don't apply
>transactions concurrently, but it would bring all those problems you
>have described.
>
OK, so it's apply in multiple processes, but at any moment only a single
apply process is active.
>However, you helped me to figure out another point I have forgotten.
>Although we ensure commit order automatically, the beginning of
>streamed xacts may reorder. It happens if some small xacts have been
>commited on master since the streamed one started, because we do not
>start streaming immediately, but only after logical_work_mem hit. I
>have performed some tests with conflicting xacts and it seems that
>it's not a problem, since locking mechanism in Postgres guarantees
>that if there would some deadlocks, they will happen earlier on
>master. So if some records hit the WAL, it is safe to apply the
>sequentially. Am I wrong?
>
I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. And I
don't think reordering the blocks of streamed transactions matters, as
long as the commit order is preserved.
>Anyway, I'm going to double check the safety of this part later.
>
OK.
FWIW my understanding is that the speedup comes mostly from eliminating
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-08-30 15:59:32 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
>
> FWIW my understanding is that the speedup comes mostly from
> elimination of
> the serialization to a file. That however requires savepoints to handle
> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
> workload where this will be much slower (with many aborts of large
> subtransactions).
>
>
I think that instead of defining savepoints it is simpler and more
efficient to use
BeginInternalSubTransaction +
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-02 22:06:50 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
In the interest of moving things forward, how far are we from making
0001 committable? If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-03 10:39:09 |
Message-ID: | 20190903103909.saxp6on62wvh5rqt@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>In the interest of moving things forward, how far are we from making
>0001 committable? If I understand correctly, the rest of this patchset
>depends on https://commitfest.postgresql.org/24/944/ which seems to be
>moving at a glacial pace (or, actually, slower, because glaciers do
>move, which cannot be said of that other patch.)
>
I think 0001 is mostly there. I think there's one bug in this patch
version, but I need to check and I'll post an updated version shortly if
needed.
FWIW maybe we should stop comparing things to glaciers. 50 years from now
people won't know what a glacier is, and it'll be just like the floppy
icon on the save button.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-16 16:54:32 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
>>
>> FWIW my understanding is that the speedup comes mostly from
>> elimination of
>> the serialization to a file. That however requires savepoints to handle
>> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
>> workload where this will be much slower (with many aborts of large
>> subtransactions).
>>
Yes, and that was my main motivation for eliminating the extra
serialization to file. I've experimented a bit with large transactions +
savepoints + aborts and ended up with the following query (the same
schema as before, with 600k rows):
BEGIN;
SAVEPOINT s1;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s2;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s3;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
ROLLBACK TO SAVEPOINT s3;
ROLLBACK TO SAVEPOINT s2;
ROLLBACK TO SAVEPOINT s1;
END;
It looks like the worst-case scenario, as we do a lot of work and then
abort all subxacts one by one. As expected, it takes much longer (up to
x30) to process using a background worker instead of spilling to file.
Surely, it is much easier to truncate a file than to apply all changes
and then abort. However, I guess that this kind of load pattern is not
the most typical for real-life applications.
This test also helped me find a bug in my current savepoints routine, so
a new patch is attached.
On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>
> I think that instead of defining savepoints it is simpler and more
> efficient to use
>
> BeginInternalSubTransaction +
> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>
> as it is done in PL/pgSQL (pl_exec.c).
> Not sure if it can pr
>
Both BeginInternalSubTransaction and DefineSavepoint use
PushTransaction() internally for a normal subtransaction start, so they
seem to be identical from the performance perspective, which is also
stated in the comment section:
/*
 * BeginInternalSubTransaction
 *     This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
 *     TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
 *     and therefore it can safely be used in functions that might be called
 *     when not inside a BEGIN block or when running deferred triggers at
 *     COMMIT/PREPARE time.  Also, it automatically does
 *     CommitTransactionCommand/StartTransactionCommand instead of expecting
 *     the caller to do it.
 */
Please, correct me if I'm wrong.
Anyway, I've profiled my apply worker (flamegraph is attached) and it
spends the vast majority of its time (>90%) applying changes. So the
problem is not in the savepoints themselves, but in the fact that we
first apply all the changes and then abort all the work. Not sure it is
possible to do anything about that.
Regards
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
worker_aborts_perf.zip | application/zip | 69.8 KB |
v2-0011-BGWorkers-pool-for-streamed-transactions-apply-wi.patch | text/x-patch | 60.1 KB |
From: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
---|---|
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-16 19:29:18 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 16.09.2019 19:54, Alexey Kondratov wrote:
> On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>>
>> I think that instead of defining savepoints it is simpler and more
>> efficient to use
>>
>> BeginInternalSubTransaction +
>> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>>
>> as it is done in PL/pgSQL (pl_exec.c).
>> Not sure if it can pr
>>
>
> Both BeginInternalSubTransaction and DefineSavepoint use
> PushTransaction() internally for a normal subtransaction start. So
> they seems to be identical from the performance perspective, which is
> also stated in the comment section:
Yes, they definitely use the same mechanism and most likely provide
similar performance. But BeginInternalSubTransaction does not require
generating a savepoint name, which seems redundant in this case.
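For reference, the pl_exec.c-style pattern looks roughly like this (a
sketch; apply_changes() is just a placeholder for the real apply loop):

#include "postgres.h"
#include "access/xact.h"
#include "utils/memutils.h"
#include "utils/resowner.h"

extern void apply_changes(void);    /* placeholder for the apply loop */

static void
apply_chunk_in_subxact(void)
{
    MemoryContext oldcontext = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;

    /* no savepoint name needed, unlike DefineSavepoint() */
    BeginInternalSubTransaction(NULL);
    MemoryContextSwitchTo(oldcontext);

    PG_TRY();
    {
        apply_changes();

        /* success: the equivalent of RELEASE SAVEPOINT */
        ReleaseCurrentSubTransaction();
        MemoryContextSwitchTo(oldcontext);
        CurrentResourceOwner = oldowner;
    }
    PG_CATCH();
    {
        /* failure: the equivalent of ROLLBACK TO SAVEPOINT + RELEASE */
        MemoryContextSwitchTo(oldcontext);
        RollbackAndReleaseCurrentSubTransaction();
        MemoryContextSwitchTo(oldcontext);
        CurrentResourceOwner = oldowner;
        /* a real implementation must also FlushErrorState() or re-throw */
    }
    PG_END_TRY();
}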
>
> Anyway, I've performed a profiling of my apply worker (flamegraph is
> attached) and it spends the vast amount of time (>90%) applying
> changes. So the problem is not in the savepoints their-self, but in
> the fact that we first apply all changes and then abort all the work.
> Not sure, that it is possible to do something in this case.
>
Looks like the only way to increase apply speed is to do it in parallel:
make it possible to concurrently execute non-conflicting transactions.
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-16 20:57:19 |
Message-ID: | 20190916205719.mnyytrsredcuf2or@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Sep 16, 2019 at 10:29:18PM +0300, Konstantin Knizhnik wrote:
>
>
>On 16.09.2019 19:54, Alexey Kondratov wrote:
>>On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>>>
>>>I think that instead of defining savepoints it is simpler and more
>>>efficient to use
>>>
>>>BeginInternalSubTransaction +
>>>ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>>>
>>>as it is done in PL/pgSQL (pl_exec.c).
>>>Not sure if it can pr
>>>
>>
>>Both BeginInternalSubTransaction and DefineSavepoint use
>>PushTransaction() internally for a normal subtransaction start. So
>>they seems to be identical from the performance perspective, which
>>is also stated in the comment section:
>
>Yes, definitely them are using the same mechanism and most likely
>provides similar performance.
>But BeginInternalSubTransaction does not require to generate some
>savepoint name which seems to be redundant in this case.
>
>
>>
>>Anyway, I've performed a profiling of my apply worker (flamegraph is
>>attached) and it spends the vast amount of time (>90%) applying
>>changes. So the problem is not in the savepoints their-self, but in
>>the fact that we first apply all changes and then abort all the
>>work. Not sure, that it is possible to do something in this case.
>>
>
>Looks like the only way to increase apply speed is to do it in
>parallel: make it possible to concurrently execute non-conflicting
>transactions.
>
True, although it seems like a massive can of worms to me. I'm not aware
of a way to identify non-conflicting transactions in advance, so it
would have to be implemented as optimistic apply, with detection of and
recovery from conflicts.
I'm not against doing that, and I'm willing to spend some time on
reviews etc., but it seems like a completely separate effort.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-25 13:25:01 |
Message-ID: | CAA4eK1KtxLpSp2rP6Rt8izQnPmhiA=2QUpLk+voagTjKowc0HA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>
> In the interest of moving things forward, how far are we from making
> 0001 committable? If I understand correctly, the rest of this patchset
> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> moving at a glacial pace (or, actually, slower, because glaciers do
> move, which cannot be said of that other patch.)
>
I am not sure it is completely correct that the other part of the patch
is dependent on that CF entry. I have studied both threads (not every
detail) and it seems to me it depends on just one of the patches from
that series, the one which handles concurrent aborts. It is patch
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
from what Nikhil has posted on that thread [1]. Am I wrong?
So IIUC, the problem with concurrent aborts is that if we allow catalog
scans for in-progress transactions, then we might get wrong answers in
cases where somebody has performed Alter-Abort-Alter, which is clearly
explained with an example in email [2]. To solve that problem, Nikhil
seems to have written a patch [1] which detects these concurrent aborts
during a system table scan and then aborts the decoding of such a
transaction.
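In other words, after each catalog scan performed on behalf of the
in-progress transaction, we would re-check that it has not aborted in
the meantime, roughly like this (a sketch of the idea only; the variable
name is invented and the real patch may differ):

#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"

/* invented name: xid of the in-progress transaction being decoded */
static TransactionId DecodedXidAlive = InvalidTransactionId;

static void
check_concurrent_abort(void)
{
    if (TransactionIdIsValid(DecodedXidAlive) &&
        !TransactionIdIsInProgress(DecodedXidAlive) &&
        !TransactionIdDidCommit(DecodedXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during logical decoding")));
}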
Now, the problem is that that patch was written with 2PC transactions in
mind and might not deal with all cases for in-progress transactions,
especially when sub-transactions are involved, as alluded to by Arseny
Sher [3]. So the problem seems to be cases where some sub-transaction
aborts but the main transaction continues, and we try to decode it.
Nikhil's patch won't be able to deal with that, because I think it just
checks the top-level xid, whereas for this we need to check all subxids,
which I think is possible now that Tomas seems to have added WAL logging
for each xid-assignment. Checking the status of all subxids might or
might not be the best solution, but I think first we need to agree that
the problem is just concurrent aborts, and that we can solve it by using
some part of the technology being developed as part of the "Logical
decoding of two-phase transactions" patch
(https://commitfest.postgresql.org/24/944/) rather than the entire
patchset.
I hope I am not saying something very obvious here and it helps in
moving this patch forward.
Thoughts?
[1] - https://www.postgresql.org/message-id/CAMGcDxcBmN6jNeQkgWddfhX8HbSjQpW%3DUo70iBY3P_EPdp%2BLTQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/EEBD82AA-61EE-46F4-845E-05B94168E8F2%40postgrespro.ru
[3] - https://www.postgresql.org/message-id/87a7py4iwl.fsf%40ars-thinkpad
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-26 13:28:17 |
Message-ID: | CAA4eK1+pFyM99vtYxmj=vqqpnoeS+nWGT=3bQ5yUNQjrsmK-fA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
> >In the interest of moving things forward, how far are we from making
> >0001 committable? If I understand correctly, the rest of this patchset
> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >moving at a glacial pace (or, actually, slower, because glaciers do
> >move, which cannot be said of that other patch.)
> >
>
> I think 0001 is mostly there. I think there's one bug in this patch
> version, but I need to check and I'll post an updated version shortly if
> needed.
>
Did you get a chance to work on 0001? I have a few comments on that patch:
1.
+ * To limit the amount of memory used by decoded changes, we track memory
+ * used at the reorder buffer level (i.e. total amount of memory), and for
+ * each toplevel transaction. When the total amount of used memory exceeds
+ * the limit, the toplevel transaction consuming the most memory is either
+ * serialized or streamed.
Do we need to mention 'streamed' as part of this patch? It seems to
me that this is an independent patch which can be committed without
patches that stream the changes. So, we can remove it from here and
other places where it is used.
2.
+ * deserializing and applying very few changes). We probably to give more
+ * memory to the oldest subtransactions.
/We probably to/
It seems some word is missing after probably.
3.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
What is the guarantee that after evicting the largest transaction we
won't immediately hit the memory limit again? Say all of the
transactions are of almost similar size, which I don't think is that
uncommon a case. Instead, the strategy mentioned in point (c), or
something like it, seems more promising. In that strategy there is some
risk that it might lead to many smaller disk writes, which we might want
to control via some threshold (like we should not flush more than N
xacts). With this, we also need to ensure that the total memory freed is
greater than the current change.
I think we have some discussion around this point but didn't reach any
conclusion which means some more brainstorming is required.
4.
+int logical_work_mem; /* 4MB */
What this 4MB in comments indicate?
5.
+/*
+ * Check whether the logical_work_mem limit was reached, and if yes pick
+ * the transaction tx should spill its data to disk.
The second part of the sentence "pick the transaction tx should spill"
seems to be incomplete.
Apart from this, I see that Peter E. has raised some other points on
this patch which are not yet addressed as those also need some
discussion, so I will respond to those separately with my opinion.
These comments are based on the last patch posted by you on this
thread [1]. You might have fixed some of these already, so ignore if
that is the case.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-26 18:07:52 |
Message-ID: | 20190926175845.pau2qocmlffd37qg@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).
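For context, the accounting scheme itself is conceptually simple - every
change queued into (or removed from) the reorder buffer updates two
counters, roughly like this (a simplified sketch, not the exact patch
code; the "size" fields and ReorderBufferChangeSize() are what the patch
adds):

#include "postgres.h"
#include "replication/reorderbuffer.h"

/* patch-added helper: compute the memory footprint of one change */
static Size ReorderBufferChangeSize(ReorderBufferChange *change);

static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
                                ReorderBufferChange *change,
                                bool addition)
{
    Size        sz = ReorderBufferChangeSize(change);
    ReorderBufferTXN *txn = change->txn;

    if (addition)
    {
        txn->size += sz;    /* per-xact, used to pick eviction victims */
        rb->size += sz;     /* total, compared against logical_work_mem */
    }
    else
    {
        txn->size -= sz;
        rb->size -= sz;
    }
}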
I've also included the patch series with decoding of 2PC transactions,
which this depends on. This way we have a chance of making the cfbot
happy. So parts 0001-0004 and 0009-0014 are "this" patch series, while
0005-0008 are the extra pieces from the other patch.
I've done it like this because the initial parts are independent, and so
might be committed irrespective of the other patch series. In practice
that's only reasonable for 0001, which adds the memory limit - the rest
is infrastructure for the streaming of in-progress transactions.
On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:
>On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>>
>> In the interest of moving things forward, how far are we from making
>> 0001 committable? If I understand correctly, the rest of this patchset
>> depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> moving at a glacial pace (or, actually, slower, because glaciers do
>> move, which cannot be said of that other patch.)
>>
>
>I am not sure if it is completely correct that the other part of the
>patch is dependent on that CF entry. I have studied both the threads
>(not every detail) and it seems to me it is dependent on one of the
>patches from that series which handles concurrent aborts. It is patch
>0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
>from what the Nikhil has posted on that thread [1]. Am, I wrong?
>
You're right - the part handling aborts is the only part required. There
are dependencies on some other changes from the 2PC patch, but those are
mostly refactorings that can be undone (e.g. switch from independent
flags to a single bitmap in reorderbuffer).
>So IIUC, the problem of concurrent aborts is that if we allow catalog
>scans for in-progress transactions, then we might get wrong answers in
>cases where somebody has performed Alter-Abort-Alter which is clearly
>explained with an example in email [2]. To solve that problem Nikhil
>seems to have written a patch [1] which detects these concurrent
>aborts during a system table scan and then aborts the decoding of such
>a transaction.
>
>Now, the problem is that patch has written considering 2PC
>transactions and might not deal with all cases for in-progress
>transactions especially when sub-transactions are involved as alluded
>by Arseny Sher [3]. So, the problem seems to be for cases when some
>sub-transaction aborts, but the main transaction still continued and
>we try to decode it. Nikhil's patch won't be able to deal with it
>because I think it just checks top-level xid whereas for this we need
>to check all-subxids which I think is possible now as Tomas seems to
>have written WAL for each xid-assignment. It might or might not be
>the best solution to check the status of all-subxids, but I think
>first we need to agree that the problem is just for concurrent aborts
>and that we can solve it by using some part of the technology being
>developed as part of patch "Logical decoding of two-phase
>transactions" (https://commitfest.postgresql.org/24/944/) rather than
>the entire patchset.
>
>I hope I am not saying something very obvious here and it helps in
>moving this patch forward.
>
No, that's a good question, and I'm not sure what the answer is at the
moment. My understanding was that the infrastructure in the 2PC patch is
enough even for subtransactions, but I might be wrong. I need to think
about that for a while.
Maybe we should focus on the 0001 part for now - it can be committed
independently and does provide a useful feature.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-26 18:36:45 |
Message-ID: | 20190926183645.6getukzf4dxxclp2@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:
>On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>> >In the interest of moving things forward, how far are we from making
>> >0001 committable? If I understand correctly, the rest of this patchset
>> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> >moving at a glacial pace (or, actually, slower, because glaciers do
>> >move, which cannot be said of that other patch.)
>> >
>>
>> I think 0001 is mostly there. I think there's one bug in this patch
>> version, but I need to check and I'll post an updated version shortly if
>> needed.
>>
>
>Did you get a chance to work on 0001? I have a few comments on that patch:
>1.
>+ * To limit the amount of memory used by decoded changes, we track memory
>+ * used at the reorder buffer level (i.e. total amount of memory), and for
>+ * each toplevel transaction. When the total amount of used memory exceeds
>+ * the limit, the toplevel transaction consuming the most memory is either
>+ * serialized or streamed.
>
>Do we need to mention 'streamed' as part of this patch? It seems to
>me that this is an independent patch which can be committed without
>patches that stream the changes. So, we can remove it from here and
>other places where it is used.
>
You're right - this patch should not mention streaming because the parts
adding that capability are later in the series. So it can trigger just
the serialization to disk.
>2.
>+ * deserializing and applying very few changes). We probably to give more
>+ * memory to the oldest subtransactions.
>
>/We probably to/
>It seems some word is missing after probably.
>
Yes.
>3.
>+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
>+ *
>+ * XXX With many subtransactions this might be quite slow, because we'll have
>+ * to walk through all of them. There are some options how we could improve
>+ * that: (a) maintain some secondary structure with transactions sorted by
>+ * amount of changes, (b) not looking for the entirely largest transaction,
>+ * but e.g. for transaction using at least some fraction of the memory limit,
>+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
>+ * of the memory limit (e.g. 50%).
>+ */
>+static ReorderBufferTXN *
>+ReorderBufferLargestTXN(ReorderBuffer *rb)
>
>What is the guarantee that after evicting largest transaction, we
>won't immediately hit the memory limit? Say, all of the transactions
>are of almost similar size which I don't think is that uncommon a
>case.
Not sure I understand - what do you mean by 'immediately hit'?
We do check the limit after queueing a change, and we know that this
change is what got us over the limit. We pick the largest transaction
(which has to be larger than the change we just entered) and evict it,
getting below the memory limit again.
The next change can get us over the memory limit again, of course, but
there's not much we could do about that.
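Roughly, the check-and-evict step looks like this (a simplified sketch
of the logic inside reorderbuffer.c, not the exact patch code;
ReorderBufferLargestTXN and ReorderBufferSerializeTXN are internal to
that file):

#include "postgres.h"
#include "replication/reorderbuffer.h"

extern int  logical_work_mem;   /* the new GUC, in kilobytes */

static ReorderBufferTXN *ReorderBufferLargestTXN(ReorderBuffer *rb);
static void ReorderBufferSerializeTXN(ReorderBuffer *rb,
                                      ReorderBufferTXN *txn);

static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    /* rb->size is the patch-added total memory counter, in bytes */
    while (rb->size >= (Size) logical_work_mem * 1024)
    {
        /* pick the transaction currently using the most memory */
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        /* spill it to disk, which updates rb->size and txn->size */
        ReorderBufferSerializeTXN(rb, txn);
    }
}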
> Instead, the strategy mentioned in point (c) or something like
>that seems more promising. In that strategy, there is some risk that
>it might lead to many smaller disk writes which we might want to
>control via some threshold (like we should not flush more than N
>xacts). In this, we also need to ensure that the total memory freed
>must be greater than the current change.
>
>I think we have some discussion around this point but didn't reach any
>conclusion which means some more brainstorming is required.
>
I agree it's worth investigating, but I'm not sure it's necessary before
committing v1 of the feature. I don't think there's a clear winner
strategy, and the current approach works fairly well, I think.
The comment is concerned with the cost of ReorderBufferLargestTXN with
many transactions, but we can only have a certain number of top-level
transactions (max_connections plus a certain number of not-yet-assigned
subtransactions). And the 0002 patch essentially gets rid of the subxacts
entirely, further reducing the maximum number of xacts to walk.
>4.
>+int logical_work_mem; /* 4MB */
>
>What this 4MB in comments indicate?
>
Sorry, that's a mistake.
>5.
>+/*
>+ * Check whether the logical_work_mem limit was reached, and if yes pick
>+ * the transaction tx should spill its data to disk.
>
>The second part of the sentence "pick the transaction tx should spill"
>seems to be incomplete.
>
Yeah, that's a poor wording. Will fix.
>Apart from this, I see that Peter E. has raised some other points on
>this patch which are not yet addressed as those also need some
>discussion, so I will respond to those separately with my opinion.
>
OK, thanks.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-26 19:33:59 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2019-Sep-26, Tomas Vondra wrote:
> Hi,
>
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).
Cool.
Can we aim to get 0001 pushed during this commitfest, or is that a lost
cause?
The large new comment in reorderbuffer.c says that a transaction might
get spilled *or streamed*, but surely that second thing is not correct,
since before the subsequent patches it's not possible to stream
transactions that have not yet finished?
How certain are you about the approach to measure memory used by a
reorderbuffer transaction ... does it not cause a measurable performance
drop? I wonder if it would make more sense to use a separate context
per transaction and use context-level accounting (per the patch Jeff
Davis posted elsewhere for hash joins ... though I see now that that
only works for aset.c, not other memcxt implementations), or something
like that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-26 19:36:20 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2019-Sep-26, Alvaro Herrera wrote:
> How certain are you about the approach to measure memory used by a
> reorderbuffer transaction ... does it not cause a measurable performance
> drop? I wonder if it would make more sense to use a separate contexts
> per transaction and use context-level accounting (per the patch Jeff
> Davis posted elsewhere for hash joins ... though I see now that that
> only works fot aset.c, not other memcxt implementations), or something
> like that.
Oh, I just noticed that that patch was posted separately in its own
thread, and that that improved version does include support for other
memory context implementations. Excellent.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-26 21:24:55 |
Message-ID: | 20190926212455.p6lp3q3lg7jske57@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Sep 26, 2019 at 04:36:20PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-26, Alvaro Herrera wrote:
>
>> How certain are you about the approach to measure memory used by a
>> reorderbuffer transaction ... does it not cause a measurable performance
>> drop? I wonder if it would make more sense to use a separate contexts
>> per transaction and use context-level accounting (per the patch Jeff
>> Davis posted elsewhere for hash joins ... though I see now that that
>> only works fot aset.c, not other memcxt implementations), or something
>> like that.
>
>Oh, I just noticed that that patch was posted separately in its own
>thread, and that that improved version does include support for other
>memory context implementations. Excellent.
>
Unfortunately, that won't fly, for two simple reasons:
1) The memory accounting patch is known to perform poorly with many
child contexts - this is why array_agg/string_agg were problematic,
before we rewrote them not to create a memory context for each group.
It could be done differently (eager accounting), but then the overhead
for the regular/common cases (with just a couple of contexts) is higher.
So that seems like a much inferior option.
2) We can't actually have a single context per transaction. Some parts
(REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID) of a transaction are not
evicted, so we'd have to keep them in a separate context.
It'd also mean higher allocation overhead, because right now we can
reuse chunks across transactions. So one transaction commits or gets
serialized, and we reuse its chunks for something else. With
per-transaction contexts we'd lose some of this benefit - we could only
reuse chunks within a transaction (i.e. in large transactions that get
spilled to disk) but not across commits.
I don't have any numbers, of course, but I wouldn't be surprised if it
was significant e.g. for small transactions that don't get spilled. And
creating/destroying the contexts is not free either, I think.
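To illustrate, the accounting in 0001 is deliberately much simpler than
context-level tracking - conceptually it boils down to something like
this (simplified sketch; the actual patch differs in the details):

static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
                                ReorderBufferChange *change,
                                bool addition)
{
    Size    sz = ReorderBufferChangeSize(change);

    if (addition)
    {
        change->txn->size += sz;  /* per-xact size, used to pick eviction victim */
        rb->size += sz;           /* total size, compared against the limit */
    }
    else
    {
        Assert(change->txn->size >= sz && rb->size >= sz);
        change->txn->size -= sz;
        rb->size -= sz;
    }
}

So it's just two additions per queued change, with no extra memory
contexts involved at all.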
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-26 21:46:06 |
Message-ID: | 20190926214606.q6ad65r4umimc5w3@development |
Lists: | pgsql-hackers |
On Thu, Sep 26, 2019 at 04:33:59PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-26, Tomas Vondra wrote:
>
>> Hi,
>>
>> Attached is an updated patch series, rebased on current master. It does
>> fix one memory accounting bug in ReorderBufferToastReplace (the code was
>> not properly updating the amount of memory).
>
>Cool.
>
>Can we aim to get 0001 pushed during this commitfest, or is that a lost
>cause?
>
It's tempting. The patch has been in the queue for quite a bit of time,
and I think it's solid (at least 0001). I'll address the comments from
Peter's review about separating the GUC etc. and polish it a bit more.
If I manage to do that by Monday, I'll consider pushing it.
If anyone feels I shouldn't do that, let me know.
The one open question pointed out by Amit is how the patch picks the
transaction for eviction. My feeling is that's fine and can be improved
later if necessary, but I'll try to construct a worst case
(max_connections xacts, each with 64 subxacts) to verify.
>The large new comment in reorderbuffer.c says that a transaction might
>get spilled *or streamed*, but surely that second thing is not correct,
>since before the subsequent patches it's not possible to stream
>transactions that have not yet finished?
>
True. That's a residue of reordering the patch series repeatedly, I
think. I'll fix that while polishing the patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-27 05:48:45 |
Message-ID: | CAA4eK1+0T=y02gSJ0SKgDLiDjJgZY551dwB8bjwGn80kf6_N8g@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Sep 27, 2019 at 12:06 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:
>
> >3.
> >+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
> >+ *
> >+ * XXX With many subtransactions this might be quite slow, because we'll have
> >+ * to walk through all of them. There are some options how we could improve
> >+ * that: (a) maintain some secondary structure with transactions sorted by
> >+ * amount of changes, (b) not looking for the entirely largest transaction,
> >+ * but e.g. for transaction using at least some fraction of the memory limit,
> >+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
> >+ * of the memory limit (e.g. 50%).
> >+ */
> >+static ReorderBufferTXN *
> >+ReorderBufferLargestTXN(ReorderBuffer *rb)
> >
> >What is the guarantee that after evicting largest transaction, we
> >won't immediately hit the memory limit? Say, all of the transactions
> >are of almost similar size which I don't think is that uncommon a
> >case.
>
> Not sure I understand - what do you mean 'immediately hit'?
>
> We do check the limit after queueing a change, and we know that this
> change is what got us over the limit. We pick the largest transaction
> (which has to be larger than the change we just entered) and evict it,
> getting below the memory limit again.
>
> The next change can get us over the memory limit again, of course,
>
Yeah, this is what I wanted to say when I wrote that it can immediately hit the limit again.
> but
> there's not much we could do about that.
>
> > Instead, the strategy mentioned in point (c) or something like
> >that seems more promising. In that strategy, there is some risk that
> >it might lead to many smaller disk writes which we might want to
> >control via some threshold (like we should not flush more than N
> >xacts). In this, we also need to ensure that the total memory freed
> >must be greater than the current change.
> >
> >I think we have some discussion around this point but didn't reach any
> >conclusion which means some more brainstorming is required.
> >
>
> I agree it's worth investigating, but I'm not sure it's necessary before
> committing v1 of the feature. I don't think there's a clear winner
> strategy, and the current approach works fairly well I think.
>
> The comment is concerned with the cost of ReorderBufferLargestTXN with
> many transactions, but we can only have certain number of top-level
> transactions (max_connections + certain number of not-yet-assigned
> subtransactions). And 0002 patch essentially gets rid of the subxacts
> entirely, further reducing the maximum number of xacts to walk.
>
That would be good, but I don't understand how. The second patch will
update the subxacts in the top-level ReorderBufferTXN, but it won't
remove them from the hash table. It also doesn't seem to account for
the size of subxacts in the top-level xact, so I am not sure how it
will reduce the number of xacts to walk. I might be missing something
here. Can you explain a bit how the 0002 patch would help in reducing
the maximum number of xacts to walk?
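To make the discussion concrete, my understanding of the current
eviction logic is roughly this (simplified sketch, not the actual patch
code):

static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    while (rb->size >= logical_work_mem * 1024L)
    {
        /* walk all (sub)xacts and pick the one using the most memory */
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        /* spilling the xact subtracts its size from rb->size */
        ReorderBufferSerializeTXN(rb, txn);
    }
}

So the cost concern is the full walk inside ReorderBufferLargestTXN,
repeated every time the limit is hit.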
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-27 09:03:32 |
Message-ID: | CAA4eK1Jz_8Sa8rHmUfJZgNpUUWqu9uqiyCmwAzqmGbRxyObgSg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>
> On 1/3/18 14:53, Tomas Vondra wrote:
> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> maintenance_work_mem is often set to very large values, which could
> >> then have undesirable side effects on this use.
> >
> > Well, we need to pick some default value, and we can either use a fixed
> > value (not sure what would be a good default) or tie it to an existing
> > GUC. We only really have work_mem and maintenance_work_mem, and the
> > walsender process will never use more than one such buffer. Which seems
> > to be closer to maintenance_work_mem.
> >
> > Pretty much any default value can have undesirable side effects.
>
> Let's just make it an independent setting unless we know any better. We
> don't have a lot of settings that depend on other settings, and the ones
> we do have a very specific relationship.
>
> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> version of work_mem. Maybe we could think of another name.
> >
> > I won't object to a better name, of course. Any proposals?
>
> logical_decoding_[work_]mem?
>
Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which users might not have a good idea
how to set. What problems do we see if we directly use work_mem for
this case?
If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-27 09:40:46 |
Message-ID: | CAA4eK1LtV+uMzf=_B1CKoSLr_K-+i9tUCjnFtouwqABNJPVobQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:
> >On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> >>
> >> In the interest of moving things forward, how far are we from making
> >> 0001 committable? If I understand correctly, the rest of this patchset
> >> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >> moving at a glacial pace (or, actually, slower, because glaciers do
> >> move, which cannot be said of that other patch.)
> >>
> >
> >I am not sure if it is completely correct that the other part of the
> >patch is dependent on that CF entry. I have studied both the threads
> >(not every detail) and it seems to me it is dependent on one of the
> >patches from that series which handles concurrent aborts. It is patch
> >0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
> >from what the Nikhil has posted on that thread [1]. Am, I wrong?
> >
>
> You're right - the part handling aborts is the only part required. There
> are dependencies on some other changes from the 2PC patch, but those are
> mostly refactorings that can be undone (e.g. switch from independent
> flags to a single bitmap in reorderbuffer).
>
> >So IIUC, the problem of concurrent aborts is that if we allow catalog
> >scans for in-progress transactions, then we might get wrong answers in
> >cases where somebody has performed Alter-Abort-Alter which is clearly
> >explained with an example in email [2]. To solve that problem Nikhil
> >seems to have written a patch [1] which detects these concurrent
> >aborts during a system table scan and then aborts the decoding of such
> >a transaction.
> >
> >Now, the problem is that patch has written considering 2PC
> >transactions and might not deal with all cases for in-progress
> >transactions especially when sub-transactions are involved as alluded
> >by Arseny Sher [3]. So, the problem seems to be for cases when some
> >sub-transaction aborts, but the main transaction still continued and
> >we try to decode it. Nikhil's patch won't be able to deal with it
> >because I think it just checks top-level xid whereas for this we need
> >to check all-subxids which I think is possible now as Tomas seems to
> >have written WAL for each xid-assignment. It might or might not be
> >the best solution to check the status of all-subxids, but I think
> >first we need to agree that the problem is just for concurrent aborts
> >and that we can solve it by using some part of the technology being
> >developed as part of patch "Logical decoding of two-phase
> >transactions" (https://commitfest.postgresql.org/24/944/) rather than
> >the entire patchset.
> >
> >I hope I am not saying something very obvious here and it helps in
> >moving this patch forward.
> >
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong.
>
I also think the patch that handles concurrent aborts should be
sufficient, but it needs to be integrated with your patch. Earlier,
I thought we needed to check whether any of the subtransactions is
aborted, as mentioned by Arseny Sher, but after thinking about that
problem again, it seems that checking only the status of the current
subtransaction should be sufficient. If the user concurrently does a
Rollback to Savepoint which aborts multiple subtransactions, the latest
one must be aborted as well, which is what I think we want to detect.
Once we detect that, we have two options: (a) restart the decoding of
that transaction by removing the changes of all subxacts, or (b)
somehow mark the transaction such that it gets decoded only at commit
time.
>
> Maybe we should focus on the 0001 part for now - it can be committed
> indepently and does provide useful feature.
>
If that can be done sooner, then it is fine, but otherwise, preparing
the patches on top of HEAD would facilitate their review.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-27 11:25:20 |
Message-ID: | 20190927112520.3kco6tqbhcvlk6cd@development |
Lists: | pgsql-hackers |
On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
>On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
><peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>>
>> On 1/3/18 14:53, Tomas Vondra wrote:
>> >> I don't see the need to tie this setting to maintenance_work_mem.
>> >> maintenance_work_mem is often set to very large values, which could
>> >> then have undesirable side effects on this use.
>> >
>> > Well, we need to pick some default value, and we can either use a fixed
>> > value (not sure what would be a good default) or tie it to an existing
>> > GUC. We only really have work_mem and maintenance_work_mem, and the
>> > walsender process will never use more than one such buffer. Which seems
>> > to be closer to maintenance_work_mem.
>> >
>> > Pretty much any default value can have undesirable side effects.
>>
>> Let's just make it an independent setting unless we know any better. We
>> don't have a lot of settings that depend on other settings, and the ones
>> we do have a very specific relationship.
>>
>> >> Moreover, the name logical_work_mem makes it sound like it's a logical
>> >> version of work_mem. Maybe we could think of another name.
>> >
>> > I won't object to a better name, of course. Any proposals?
>>
>> logical_decoding_[work_]mem?
>>
>
>Having a separate variable for this can give more flexibility, but
>OTOH it will add one more knob which user might not have a good idea
>to set. What are the problems we see if directly use work_mem for
>this case?
>
IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people leave it at -1, so we use maintenance_work_mem
as a default value. I think it makes sense to do the same thing here.
It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.
FWIW I don't think we should use work_mem, maintenance_work_mem seems
somewhat more appropriate here (not related to queries, etc.).
>If we can't use work_mem, then I think the name proposed by you
>(logical_decoding_work_mem) sounds good to me.
>
Yes, that name seems better.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-28 08:06:46 |
Message-ID: | CAA4eK1J7NajKPE58Vnf5pBtwR3upxATUCRFoR04BPKbKBes5hw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
> ><peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> >>
> >> On 1/3/18 14:53, Tomas Vondra wrote:
> >> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> >> maintenance_work_mem is often set to very large values, which could
> >> >> then have undesirable side effects on this use.
> >> >
> >> > Well, we need to pick some default value, and we can either use a fixed
> >> > value (not sure what would be a good default) or tie it to an existing
> >> > GUC. We only really have work_mem and maintenance_work_mem, and the
> >> > walsender process will never use more than one such buffer. Which seems
> >> > to be closer to maintenance_work_mem.
> >> >
> >> > Pretty much any default value can have undesirable side effects.
> >>
> >> Let's just make it an independent setting unless we know any better. We
> >> don't have a lot of settings that depend on other settings, and the ones
> >> we do have a very specific relationship.
> >>
> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> >> version of work_mem. Maybe we could think of another name.
> >> >
> >> > I won't object to a better name, of course. Any proposals?
> >>
> >> logical_decoding_[work_]mem?
> >>
> >
> >Having a separate variable for this can give more flexibility, but
> >OTOH it will add one more knob which user might not have a good idea
> >to set. What are the problems we see if directly use work_mem for
> >this case?
> >
>
> IMHO it's similar to autovacuum_work_mem - we have an independent
> setting, but most people use it as -1 so we use maintenance_work_mem as
> a default value. I think it makes sense to do the same thing here.
>
> It does add an extra knob anyway (I don't think we should just use
> maintenance_work_mem directly, the user should have an option to
> override it when needed). But most users will not notice.
>
> FWIW I don't think we should use work_mem, maintenance_work_mem seems
> somewhat more appropriate here (not related to queries, etc.).
>
I have the same concern about using maintenance_work_mem as Peter E.,
which is that the value of maintenance_work_mem will generally be
higher, which is suitable for its current purpose but not for the
purpose this patch is using it. AFAIU, at this stage we want a better
memory accounting system for logical decoding, and we are not sure what
a good value for this variable is. So, I think using work_mem or
maintenance_work_mem should serve the purpose. Later, if people
require better control over the memory used for this purpose, we can
introduce a new variable.
I understand that currently work_mem is primarily tied to memory
used for query workspaces, but it might be okay to extend it for this
purpose. Another point is that its default sounds more appealing for
this case. I can see the argument against it, which is that having a
separate variable will make things look cleaner and give better
control. So, if we can't convince ourselves to use work_mem, we can
introduce a new GUC variable and keep the default as 4MB or work_mem.
I feel it is always tempting to introduce a new GUC for each different
task unless there is an exact match, but OTOH having fewer GUCs has
its own advantage: people don't have to bother about yet another
setting which they need to tune and which they can't easily decide how
to set. I am not saying that we should not introduce a new GUC when it
is required, just that we should give it more thought before doing so.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-28 12:03:10 |
Message-ID: | CAA4eK1+DrxvMNnApf94HxUb47a2GxjNUDy4q05y_kYmOAXPyTQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hi,
>
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).
>
Few comments on 0001
1.
I am getting the below linking error in pgoutput when compiling the patch
on my Windows system:
pgoutput.obj : error LNK2001: unresolved external symbol _logical_work_mem
You need to use PGDLLIMPORT for logical_work_mem.
2. After I fixed the above and tried some basic tests, it failed with the
below callstack:
postgres.exe!ExceptionalCondition(const char *
conditionName=0x00d92854, const char * errorType=0x00d928bc, const
char * fileName=0x00d92e60,
int lineNumber=2148) Line 55
postgres.exe!ReorderBufferChangeMemoryUpdate(ReorderBuffer *
rb=0x02693390, ReorderBufferChange * change=0x0269dd38, bool
addition=true) Line 2148
postgres.exe!ReorderBufferQueueChange(ReorderBuffer * rb=0x02693390,
unsigned int xid=525, unsigned __int64 lsn=36083720,
ReorderBufferChange
* change=0x0269dd38) Line 635
postgres.exe!DecodeInsert(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718) Line 716 + 0x24 bytes C
postgres.exe!DecodeHeapOp(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718) Line 437 + 0xd bytes C
postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx=0x0268ef80, XLogReaderState * record=0x0268f228) Line 129
postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo=0x02688680, bool confirm=true, bool binary=false) Line 307
postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo=0x02688680) Line 376
Basically, the assert you added in ReorderBufferChangeMemoryUpdate
failed. Then, I explored a bit, and it seems that you have missed
assigning a value to txn, a new field added by this patch to the
ReorderBufferChange structure:
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
/* The type of change. */
enum ReorderBufferChangeType action;
+ /* Transaction this change belongs to. */
+ struct ReorderBufferTXN *txn;
3.
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>work_mem</literal> (<type>integer</type>)</term>
+ <listitem>
+ <para>
+ Limits the amount of memory used to decode changes on the
+ publisher. If not specified, the publisher will use the default
+ specified by <varname>logical_work_mem</varname>.
+ </para>
+ </listitem>
+ </varlistentry>
I don't see any explanation of how this will be useful. How can a
subscriber predict the amount of memory required by a publisher for
decoding? This is even more unpredictable because when the changes are
initially recorded in the ReorderBuffer, they are not even filtered
for any particular publisher. Do we really need this? I think
giving more knobs to the user is helpful when they have some way to
know how to use them. In this case, it is not clear whether the user
can ever use this.
4. Can we somehow expose the memory consumed by the ReorderBuffer? If
so, we might be able to write some tests covering the new functionality.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-28 19:09:17 |
Message-ID: | 20190928190917.hrpknmq76v3ts3lj@development |
Lists: | pgsql-hackers |
On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:
>On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
>> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
>> ><peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>> >>
>> >> On 1/3/18 14:53, Tomas Vondra wrote:
>> >> >> I don't see the need to tie this setting to maintenance_work_mem.
>> >> >> maintenance_work_mem is often set to very large values, which could
>> >> >> then have undesirable side effects on this use.
>> >> >
>> >> > Well, we need to pick some default value, and we can either use a fixed
>> >> > value (not sure what would be a good default) or tie it to an existing
>> >> > GUC. We only really have work_mem and maintenance_work_mem, and the
>> >> > walsender process will never use more than one such buffer. Which seems
>> >> > to be closer to maintenance_work_mem.
>> >> >
>> >> > Pretty much any default value can have undesirable side effects.
>> >>
>> >> Let's just make it an independent setting unless we know any better. We
>> >> don't have a lot of settings that depend on other settings, and the ones
>> >> we do have a very specific relationship.
>> >>
>> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical
>> >> >> version of work_mem. Maybe we could think of another name.
>> >> >
>> >> > I won't object to a better name, of course. Any proposals?
>> >>
>> >> logical_decoding_[work_]mem?
>> >>
>> >
>> >Having a separate variable for this can give more flexibility, but
>> >OTOH it will add one more knob which user might not have a good idea
>> >to set. What are the problems we see if directly use work_mem for
>> >this case?
>> >
>>
>> IMHO it's similar to autovacuum_work_mem - we have an independent
>> setting, but most people use it as -1 so we use maintenance_work_mem as
>> a default value. I think it makes sense to do the same thing here.
>>
>> It does add an extra knob anyway (I don't think we should just use
>> maintenance_work_mem directly, the user should have an option to
>> override it when needed). But most users will not notice.
>>
>> FWIW I don't think we should use work_mem, maintenance_work_mem seems
>> somewhat more appropriate here (not related to queries, etc.).
>>
>
>I have the same concern for using maintenance_work_mem as Peter E.
>which is that the value of maintenance_work_mem will generally be
>higher which is suitable for its current purpose, but not for the
>purpose this patch is using. AFAIU, at this stage we want a better
>memory accounting system for logical decoding and we are not sure what
>is a good value for this variable. So, I think using work_mem or
>maintenance_work_mem should serve the purpose. Later, if we have
>requirements from people to have better control over the memory
>required for this purpose then we can introduce a new variable.
>
>I understand that currently work_mem is primarily tied with memory
>used for query workspaces, but it might be okay to extend it for this
>purpose. Another point is that the default for that sounds more
>appealing for this case. I can see the argument against it which is
>having a separate variable will make the things look clean and give
>better control. So, if we can't convince ourselves for using
>work_mem, we can introduce a new guc variable and keep the default as
>4MB or work_mem.
>
>I feel it is always tempting to introduce a new guc for the different
>tasks unless there is an exact match, but OTOH, having lesser guc's
>has its own advantage which is that people don't have to bother about
>a new setting which they need to tune and especially for which they
>can't decide with ease. I am not telling that we should not introduce
>new guc when it is required, but just to give more thought before
>doing so.
>
I do think having a separate GUC is a must, irrespective of what other
GUC (if any) is used as a default. You're right the maintenance_work_mem
value might be too high (e.g. in cases with many subscriptions), but the
same issue applies to work_mem - there's no guarantee work_mem is lower
than maintenance_work_mem, and in analytics databases it may be set very
high. So work_mem does not really solve the issue.
IMHO we can't really do without a new GUC. It's not difficult to create
examples that would benefit from small/large memory limit, depending on
the number of subscriptions etc.
I do however agree the GUC does not have to be tied to any existing one,
it was just an attempt to use a more sensible default value. I do think
m_w_m would be fine, but I can live with using an explicit value.
So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m). It should also fix all the issues
from the recent reviews (at least I believe so).
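In guc.c that boils down to an entry roughly like this (sketch - the
exact description strings and limits may differ from the attached
patch):

{
    {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Sets the maximum memory to be used for logical decoding."),
        gettext_noop("This much memory can be used by each internal "
                     "reorder buffer before spilling to disk."),
        GUC_UNIT_KB
    },
    &logical_decoding_work_mem,
    65536, 64, MAX_KILOBYTES,   /* 64MB default, 64kB minimum */
    NULL, NULL, NULL
},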
I've realized that one of the subsequent patches allows overriding the
limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
I think it'd be good to move this bit forward, but I think it can be
done in a separate patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-29 04:34:55 |
Message-ID: | CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Sep 26, 2019 at 11:38 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong. I need to think
> about that for a while.
>
IIUC, for 2PC it's enough to check whether the main transaction is
aborted or not, but for an in-progress transaction it's possible that
the current subtransaction might have done catalog changes and might
get aborted while we are decoding. So we need to extend the
infrastructure such that we can check the status of the transaction
for which we are decoding the change. Also, I think we need to handle
ERRCODE_TRANSACTION_ROLLBACK and ignore it.
I have attached a small patch to handle this which can be applied on
top of your patch set.
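The gist of it is to catch the error raised during a catalog scan when
the transaction being decoded is aborted concurrently - roughly like
this (simplified sketch; the variable names and cleanup details in the
actual patch differ):

PG_TRY();
{
    /* ... decode changes of the in-progress transaction ... */
}
PG_CATCH();
{
    /* switch back to a sane memory context before inspecting the error */
    MemoryContextSwitchTo(ccxt);
    errdata = CopyErrorData();

    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
    {
        /* concurrent abort - discard this xact and continue decoding */
        FlushErrorState();
        FreeErrorData(errdata);
        ReorderBufferCleanupTXN(rb, txn);
    }
    else
        PG_RE_THROW();
}
PG_END_TRY();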
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
handle_concurrent_abort_for_in_progress_transaction.patch | application/octet-stream | 1.7 KB |
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-29 05:54:10 |
Message-ID: | CAA4eK1LCk4aAeTv8SofwgmKLaFakj+Sn4Y7MLEJORX0BjY2hGw@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:
> >On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I do think having a separate GUC is a must, irrespective of what other
> GUC (if any) is used as a default. You're right the maintenance_work_mem
> value might be too high (e.g. in cases with many subscriptions), but the
> same issue applies to work_mem - there's no guarantee work_mem is lower
> than maintenance_work_mem, and in analytics databases it may be set very
> high. So work_mem does not really solve the issue
>
> IMHO we can't really do without a new GUC. It's not difficult to create
> examples that would benefit from small/large memory limit, depending on
> the number of subscriptions etc.
>
> I do however agree the GUC does not have to be tied to any existing one,
> it was just an attempt to use a more sensible default value. I do think
> m_w_m would be fine, but I can live with using an explicit value.
>
> So that's what I did in the attached patch - I've renamed the GUC to
> logical_decoding_work_mem, detached it from m_w_m and set the default to
> 64MB (i.e. the same default as m_w_m).
Fair enough, let's not argue more on this unless someone else wants to
share his opinion.
> It should also fix all the issues
> from the recent reviews (at least I believe so).
>
Have you given any thought to creating a test case for this patch? I
think you also said that you would test some worst-case scenarios and
report the numbers so that we are convinced that the current eviction
algorithm is good.
> I've realized that one of the subsequent patches allows overriding the
> limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
> I think it'd be good to move this bit forward, but I think it can be
> done in a separate patch.
>
Yeah, it is better to deal with it separately, as I am also not entirely
convinced about this parameter at this stage. I have mentioned the
same in the previous email as well.
While glancing through the changes, I noticed a small thing:
+#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem
I guess this needs to be updated.
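Presumably to something like the below, assuming the 64kB minimum clamp
from the patch stays:
#logical_decoding_work_mem = 64MB # min 64kB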
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-29 17:30:44 |
Message-ID: | [email protected] |
Lists: | pgsql-hackers |
On 2019-Sep-29, Amit Kapila wrote:
> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > So that's what I did in the attached patch - I've renamed the GUC to
> > logical_decoding_work_mem, detached it from m_w_m and set the default to
> > 64MB (i.e. the same default as m_w_m).
>
> Fair enough, let's not argue more on this unless someone else wants to
> share his opinion.
I just read this part of the conversation and I agree that having a
separate GUC with its own value independent from other GUCs is a good
solution. Tying it to m_w_m seemed reasonable, but it's true that
people frequently set m_w_m very high, and it would be undesirable to
propagate that value to logical decoding memory usage.
I wonder what would constitute good advice on how to set this value - I
mean, what is the metric that the user needs to be thinking about? Is
it the total memory required to keep all concurrent write transactions
in memory? (Quick example: if you do 2048 wTPS and each transaction
lasts 1s, and each transaction does 1kB of logically-decoded changes,
then ~2MB are sufficient for the average case. Is that correct? I
*think* that full-page images do not count, correct? With these things
in mind users could go through pg_waldump output and figure out what to
set the value to.)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-09-29 17:56:30 |
Message-ID: | 20190929175630.jr3a6xnbksohqjwh@development |
Lists: | pgsql-hackers |
On Sun, Sep 29, 2019 at 02:30:44PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-29, Amit Kapila wrote:
>
>> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> > So that's what I did in the attached patch - I've renamed the GUC to
>> > logical_decoding_work_mem, detached it from m_w_m and set the default to
>> > 64MB (i.e. the same default as m_w_m).
>>
>> Fair enough, let's not argue more on this unless someone else wants to
>> share his opinion.
>
>I just read this part of the conversation and I agree that having a
>separate GUC with its own value independent from other GUCs is a good
>solution. Tying it to m_w_m seemed reasonable, but it's true that
>people frequently set m_w_m very high, and it would be undesirable to
>propagate that value to logical decoding memory usage.
>
>
>I wonder what would constitute good advice on how to set this value, I
>mean what is the metric that the user needs to be thinking about. Is
>it the total of memory required to keep all concurrent write transactions
>in memory? (Quick example: if you do 2048 wTPS and each transaction
>lasts 1s, and each transaction does 1kB of logically-decoded changes,
>then ~2MB are sufficient for the average case. Is that correct?
Yes, something like that. Essentially we'd like to keep all concurrent
transactions decoded in memory, to eliminate the need to spill to disk.
One of the subsequent patches adds some subscription-level stats, so
maybe we don't need to worry about this too much - the stats seem like a
better source of information for tuning.
>I *think* that full-page images do not count, correct? With these
>things in mind users could go through pg_waldump output and figure out
>what to set the value to.)
>
Right, FPW do not matter here.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-01 13:25:52 |
Message-ID: | CAA4eK1+-F_ELz+n3S9bBN8kOgZ6ww67tOoy99f8Ug39RWXH20w@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
>
> Yeah, it is better to deal it separately as I am also not entirely
> convinced at this stage about this parameter. I have mentioned the
> same in the previous email as well.
>
> While glancing through the changes, I noticed a small thing:
> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
maintenance_work_mem
>
> I guess this needs to be updated.
>
On further testing, I found that the patch seems to have problems with
toast. Consider the below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
*--kaboom*
The second statement in Session-2 leads to a crash.
Other than that, I am not sure if the spill-to-disk changes triggered
by logical_decoding_work_mem work for toast tables, as I couldn't hit
that code path for the toast table case, but I might be missing
something. As mentioned previously, I feel there should be some way to
test whether this patch works for the cases it claims to handle. As of
now, I have to check via debugging. Let me know if there is any way I
can test this.
I am reluctant to say it, but I think this patch still needs some more
work (review, test, rework) before we can commit it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-01 13:51:51 |
Message-ID: | 20191001135151.swlspdnkxxnhszhn@development |
Lists: | pgsql-hackers |
On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>wrote:
>> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >
>>
>> Yeah, it is better to deal it separately as I am also not entirely
>> convinced at this stage about this parameter. I have mentioned the
>> same in the previous email as well.
>>
>> While glancing through the changes, I noticed a small thing:
>> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
>maintenance_work_mem
>>
>> I guess this needs to be updated.
>>
>
>On further testing, I found that the patch seems to have problems with
>toast. Consider below scenario:
>Session-1
>Create table large_text(t1 text);
>INSERT INTO large_text
>SELECT (SELECT string_agg('x', ',')
>FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>
>Session-2
>SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>'test_decoding');
>SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>*--kaboom*
>
>The second statement in Session-2 leads to a crash.
>
OK, thanks for the report - will investigate.
>Other than that, I am not sure if the changes related to spill to disk
>after logical_decoding_work_mem works for toast table as I couldn't hit
>that code for toast table case, but I might be missing something. As
>mentioned previously, I feel there should be some way to test whether this
>patch works for the cases it claims to work. As of now, I have to check
>via debugging. Let me know if there is any way, I can test this.
>
That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) forward from a
later patch in the series. I don't think there's a better way.
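What I have in mind is a couple of counters on the ReorderBuffer,
bumped whenever a transaction gets serialized - something like this
(sketch, field names approximate):

/* in ReorderBufferSerializeTXN, after writing the changes out */
rb->spillCount += 1;      /* number of spill-to-disk operations */
rb->spillBytes += size;   /* amount of decoded data written to disk */
if (!txn->serialized)     /* first time this transaction is spilled */
    rb->spillTxns += 1;

Those can then be exposed per walsender, which should be enough to
write tests verifying that serialization actually happened.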
>I am reluctant to say, but I think this patch still needs some more work
>(review, test, rework) before we can commit it.
>
I agree.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-01 22:57:30 |
Message-ID: | CAA4eK1KB_N-gUNgUN9KysJyBnnVWd7_sE9x53EYQyv3XjVGojw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >
> >On further testing, I found that the patch seems to have problems with
> >toast. Consider below scenario:
> >Session-1
> >Create table large_text(t1 text);
> >INSERT INTO large_text
> >SELECT (SELECT string_agg('x', ',')
> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >
> >Session-2
> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >'test_decoding');
> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >*--kaboom*
> >
> >The second statement in Session-2 leads to a crash.
> >
>
> OK, thanks for the report - will investigate.
>
It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);
> >Other than that, I am not sure if the changes related to spill to disk
> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >that code for toast table case, but I might be missing something. As
> >mentioned previously, I feel there should be some way to test whether this
> >patch works for the cases it claims to work. As of now, I have to check
> >via debugging. Let me know if there is any way, I can test this.
> >
>
> That's one of the reasons why I proposed to move the statistics (which
> say how many transactions / bytes were spilled to disk) from a later
> patch in the series. I don't think there's a better way.
>
>
I like that idea, but I think you need to split that patch so we only
get the stats related to spilling. It would be easier to review if you
can prepare it on top of
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-02 22:32:54 |
Message-ID: | 20191002223254.wgftz4rzwd2m57gn@development |
Lists: | pgsql-hackers |
On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
>wrote:
>
>> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >
>> >On further testing, I found that the patch seems to have problems with
>> >toast. Consider below scenario:
>> >Session-1
>> >Create table large_text(t1 text);
>> >INSERT INTO large_text
>> >SELECT (SELECT string_agg('x', ',')
>> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >
>> >Session-2
>> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >'test_decoding');
>> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >*--kaboom*
>> >
>> >The second statement in Session-2 leads to a crash.
>> >
>>
>> OK, thanks for the report - will investigate.
>>
>
>It was an assertion failure in ReorderBufferCleanupTXN at below line:
>+ /* Check we're not mixing changes from different transactions. */
>+ Assert(change->txn == txn);
>
Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.
>> >Other than that, I am not sure if the changes related to spill to disk
>> >after logical_decoding_work_mem works for toast table as I couldn't hit
>> >that code for toast table case, but I might be missing something. As
>> >mentioned previously, I feel there should be some way to test whether this
>> >patch works for the cases it claims to work. As of now, I have to check
>> >via debugging. Let me know if there is any way, I can test this.
>> >
>>
>> That's one of the reasons why I proposed to move the statistics (which
>> say how many transactions / bytes were spilled to disk) from a later
>> patch in the series. I don't think there's a better way.
>>
>>
>I like that idea, but I think you need to split that patch to only get the
>stats related to the spill. It would be easier to review if you can
>prepare that atop of
>0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
>
Sure, I wasn't really proposing to add all the stats from that patch,
including those related to streaming. We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-03 07:48:26 |
Message-ID: | CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug@mail.gmail.com |
Lists: | pgsql-hackers |
I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool) and I can see a gain similar to what Alexey had
shown [1].
In addition to this, I have rebased the latest patchset [2] without
the two-phase logical decoding patch set.
Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk data, and
here are my results:
Stream + Spill
N      time on master (sec)    Total xact time (sec)
1kk    6                       21
3kk    18                      55
Stream + BGW pool
N      time on master (sec)    Total xact time (sec)
1kk    6                       13
3kk    19                      35
Patch details:
All the patches are the same as posted on [2], except:
1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
removed the error handling which is specific to 2PC
2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC
3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
patch to handle the concurrent abort error for in-progress transactions,
and also add handling for subtransaction aborts
4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
Alexey's patch
[1] https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru
[2] https://www.postgresql.org/message-id/20190928190917.hrpknmq76v3ts3lj%40development
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast. Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
> >> >Other than that, I am not sure if the changes related to spill to disk
> >> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >> >that code for toast table case, but I might be missing something. As
> >> >mentioned previously, I feel there should be some way to test whether this
> >> >patch works for the cases it claims to work. As of now, I have to check
> >> >via debugging. Let me know if there is any way, I can test this.
> >> >
> >>
> >> That's one of the reasons why I proposed to move the statistics (which
> >> say how many transactions / bytes were spilled to disk) from a later
> >> patch in the series. I don't think there's a better way.
> >>
> >>
> >I like that idea, but I think you need to split that patch to only get the
> >stats related to the spill. It would be easier to review if you can
> >prepare that atop of
> >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
> >
>
> Sure, I wasn't really proposing to adding all stats from that patch,
> including those related to streaming. We need to extract just those
> related to spilling. And yes, it needs to be moved right after 0001.
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-03 09:13:09 |
Message-ID: | CAA4eK1LnO=Zf_wb0TbfJzXidxbijOtGoVb2FwJ4DH+Bt1JCbVg@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com
> >
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast. Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids. I
think in session-2 you need to create the replication slot before creating
the table in session-1 to see this problem.
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
change->data.tuplecid.cmax = cmax;
change->data.tuplecid.combocid = combocid;
change->lsn = lsn;
+ change->txn = txn;
change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
dlist_push_tail(&txn->tuplecids, &change->node);
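For context, the assertion that fired is in ReorderBufferCleanupTXN. A
rough sketch of the relevant loop (not the exact hunk, just to show why
every change must carry the back-pointer):

dlist_mutable_iter iter;

dlist_foreach_modify(iter, &txn->tuplecids)
{
    ReorderBufferChange *change;

    change = dlist_container(ReorderBufferChange, node, iter.cur);

    /* Check we're not mixing changes from different transactions. */
    Assert(change->txn == txn);

    ReorderBufferReturnChange(rb, change);
}

So every code path that queues a change on a transaction, including
ReorderBufferAddNewTupleCids, has to set change->txn.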
Few more comments:
-----------------------------------
1.
+static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
+{
+ /*
+ * -1 indicates fallback.
+ *
+ * If we haven't yet changed the boot_val default of -1, just let it be.
+ * logical decoding will look to maintenance_work_mem instead.
+ */
+ if (*newval == -1)
+ return true;
+
+ /*
+ * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+ * uses a higher minimum value (1MB), so this is OK.
+ */
+ if (*newval < 64)
+ *newval = 64;
I think this needs to be changed, as we no longer rely on
maintenance_work_mem. Another thing related to this is that the
default value for logical_decoding_work_mem still seems to be -1. We need
to make it 64MB. I noticed this while debugging the memory accounting
changes. I think this is the reason why I was not seeing toast-related
changes being serialized: in that test, I hadn't changed the
default value of logical_decoding_work_mem.
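Presumably the guc.c entry then also loses its check hook and gets a real
default. A sketch of what I would expect (64MB default, 64kB minimum,
value in kB):

{
    {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Sets the maximum memory to be used for logical decoding."),
        gettext_noop("This much memory can be used by each internal "
                     "reorder buffer before spilling to disk."),
        GUC_UNIT_KB
    },
    &logical_decoding_work_mem,
    65536, 64, MAX_KILOBYTES,
    NULL, NULL, NULL
},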
2.
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
/going modify/going to modify/
3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
*/
static void
ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
if (txn->toast_hash == NULL)
return;
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);
It is not very clear why this change is required. Basically, this is done
at commit time after which actually we shouldn't attempt to spill these
changes. This is mentioned in comments as well, but it is not clear if
that is the case, then how and when accounting can create a problem. If
possible, can you explain it with an example?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-10 10:27:49 |
Message-ID: | CAFiTN-vT+42xRbkw=hBnp44XkAyZaKZVA5hcvAMsYth3rk7vhg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have attempted to test the performance of (Stream + Spill) vs
> (Stream + BGW pool), and I can see a gain similar to what Alexey had
> shown[1].
>
> In addition to this, I have rebased the latest patchset [2] without
> the two-phase logical decoding patch set.
>
> Test results:
> I have repeated the same test as Alexey[1] for 1kk and 3kk data and
> here is my result
> Stream + Spill
> N time on master(sec) Total xact time (sec)
> 1kk 6 21
> 3kk 18 55
>
> Stream + BGW pool
> N time on master(sec) Total xact time (sec)
> 1kk 6 13
> 3kk 19 35
>
> Patch details:
> All the patches are the same as posted on [2] except
> 1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
> removed the handling of error which is specific for 2PC
Here[1], I mentioned that I have removed the 2PC changes from
this [0006] patch, but I mistakenly attached the original patch
instead of the modified version. So I am attaching the modified version
of only this patch; the other patches are the same.
> 2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC
> 3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
> patch to handle concurrent abort error for the in-progress transaction
> and also add handling for the sub transaction's abort.
> 4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
> Alexey's patch
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
0006-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch | application/octet-stream | 12.4 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-13 06:54:50 |
Message-ID: | CAFiTN-v0sQ0_xER+=BEo5ZPduLn17AZ8pQs6GN1bsy0xnd=Vvw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
>> >wrote:
>> >
>> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >> >
>> >> >On further testing, I found that the patch seems to have problems with
>> >> >toast. Consider below scenario:
>> >> >Session-1
>> >> >Create table large_text(t1 text);
>> >> >INSERT INTO large_text
>> >> >SELECT (SELECT string_agg('x', ',')
>> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >> >
>> >> >Session-2
>> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >> >'test_decoding');
>> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >> >*--kaboom*
>> >> >
>> >> >The second statement in Session-2 leads to a crash.
>> >> >
>> >>
>> >> OK, thanks for the report - will investigate.
>> >>
>> >
>> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
>> >+ /* Check we're not mixing changes from different transactions. */
>> >+ Assert(change->txn == txn);
>> >
>>
>> Can you still reproduce this issue with the patch I sent on 28/9? I have
>> been unable to trigger the failure, and it seems pretty similar to the
>> failure you reported (and I fixed) on 28/9.
>
>
> Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids. I think in session-2 you need to create the replication slot before creating the table in session-1 to see this problem.
>
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
> change->data.tuplecid.cmax = cmax;
> change->data.tuplecid.combocid = combocid;
> change->lsn = lsn;
> + change->txn = txn;
> change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
> dlist_push_tail(&txn->tuplecids, &change->node);
>
> Few more comments:
> -----------------------------------
> 1.
> +static bool
> +check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
> +{
> + /*
> + * -1 indicates fallback.
> + *
> + * If we haven't yet changed the boot_val default of -1, just let it be.
> + * logical decoding will look to maintenance_work_mem instead.
> + */
> + if (*newval == -1)
> + return true;
> +
> + /*
> + * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> + * uses a higher minimum value (1MB), so this is OK.
> + */
> + if (*newval < 64)
> + *newval = 64;
>
> I think this needs to be changed, as we no longer rely on maintenance_work_mem. Another thing related to this is that the default value for logical_decoding_work_mem still seems to be -1. We need to make it 64MB. I noticed this while debugging the memory accounting changes. I think this is the reason why I was not seeing toast-related changes being serialized: in that test, I hadn't changed the default value of logical_decoding_work_mem.
>
> 2.
> + /*
> + * We're going modify the size of the change, so to make sure the
> + * accounting is correct we'll make it look like we're removing the
> + * change now (with the old size), and then re-add it at the end.
> + */
>
>
> /going modify/going to modify/
>
> 3.
> + *
> + * While updating the existing change with detoasted tuple data, we need to
> + * update the memory accounting info, because the change size will differ.
> + * Otherwise the accounting may get out of sync, triggering serialization
> + * at unexpected times.
> + *
> + * We simply subtract size of the change before rejiggering the tuple, and
> + * then adding the new size. This makes it look like the change was removed
> + * and then added back, except it only tweaks the accounting info.
> + *
> + * In particular it can't trigger serialization, which would be pointless
> + * anyway as it happens during commit processing right before handing
> + * the change to the output plugin.
> */
> static void
> ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> if (txn->toast_hash == NULL)
> return;
>
> + /*
> + * We're going modify the size of the change, so to make sure the
> + * accounting is correct we'll make it look like we're removing the
> + * change now (with the old size), and then re-add it at the end.
> + */
> + ReorderBufferChangeMemoryUpdate(rb, change, false);
>
> It is not very clear why this change is required. Basically, this is done at commit time after which actually we shouldn't attempt to spill these changes. This is mentioned in comments as well, but it is not clear if that is the case, then how and when accounting can create a problem. If possible, can you explain it with an example?
>
IIUC, we are keeping track of the memory in the ReorderBuffer, which is
common across the transactions. So even though this transaction is
committing and will not spill to disk, we still need to keep the memory
accounting correct for the future changes in other transactions.
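In other words, a minimal sketch of the pattern (using the function from
the patch):

/* The memory accounting is shared by all transactions in the buffer. */
ReorderBufferChangeMemoryUpdate(rb, change, false);  /* subtract old size */

/* ... swap in the detoasted tuple; the change size now differs ... */

ReorderBufferChangeMemoryUpdate(rb, change, true);   /* re-add new size */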
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-13 11:46:25 |
Message-ID: | CAA4eK1LpyYwiNVVpfbUwvPWAgiz5eBorWAvz2ft2ZEmC1B4ckA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > 3.
> > + *
> > + * While updating the existing change with detoasted tuple data, we need to
> > + * update the memory accounting info, because the change size will differ.
> > + * Otherwise the accounting may get out of sync, triggering serialization
> > + * at unexpected times.
> > + *
> > + * We simply subtract size of the change before rejiggering the tuple, and
> > + * then adding the new size. This makes it look like the change was removed
> > + * and then added back, except it only tweaks the accounting info.
> > + *
> > + * In particular it can't trigger serialization, which would be pointless
> > + * anyway as it happens during commit processing right before handing
> > + * the change to the output plugin.
> > */
> > static void
> > ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > if (txn->toast_hash == NULL)
> > return;
> >
> > + /*
> > + * We're going modify the size of the change, so to make sure the
> > + * accounting is correct we'll make it look like we're removing the
> > + * change now (with the old size), and then re-add it at the end.
> > + */
> > + ReorderBufferChangeMemoryUpdate(rb, change, false);
> >
> > It is not very clear why this change is required. Basically, this is done at commit time after which actually we shouldn't attempt to spill these changes. This is mentioned in comments as well, but it is not clear if that is the case, then how and when accounting can create a problem. If possible, can you explain it with an example?
> >
> IIUC, we are keeping track of the memory in the ReorderBuffer, which is
> common across the transactions. So even though this transaction is
> committing and will not spill to disk, we still need to keep the memory
> accounting correct for the future changes in other transactions.
>
You are right. I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer. You can ignore this point or maybe
slightly adjust the comment to make it explicit.
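Perhaps something along these lines (just a suggestion for the wording):

/*
 * We're going to modify the size of the change, so to keep the
 * accounting correct we make it look like we're removing the change
 * now (with the old size) and re-adding it at the end.  Note that the
 * memory accounting is shared by all transactions in the reorder
 * buffer, so this matters even at commit time: other in-progress
 * transactions may still trigger serialization based on the total.
 */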
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Craig Ringer <craig(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-14 01:21:20 |
Message-ID: | CAMsr+YHsoNoLncPBN2gsh9W7MQMc9ARiPMb061zBznnBnXp9HQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
> >
> > On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > 3.
> > > + *
> > > + * While updating the existing change with detoasted tuple data, we need to
> > > + * update the memory accounting info, because the change size will differ.
> > > + * Otherwise the accounting may get out of sync, triggering serialization
> > > + * at unexpected times.
> > > + *
> > > + * We simply subtract size of the change before rejiggering the tuple, and
> > > + * then adding the new size. This makes it look like the change was removed
> > > + * and then added back, except it only tweaks the accounting info.
> > > + *
> > > + * In particular it can't trigger serialization, which would be pointless
> > > + * anyway as it happens during commit processing right before handing
> > > + * the change to the output plugin.
> > > */
> > > static void
> > > ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > if (txn->toast_hash == NULL)
> > > return;
> > >
> > > + /*
> > > + * We're going modify the size of the change, so to make sure the
> > > + * accounting is correct we'll make it look like we're removing the
> > > + * change now (with the old size), and then re-add it at the end.
> > > + */
> > > + ReorderBufferChangeMemoryUpdate(rb, change, false);
> > >
> > > It is not very clear why this change is required. Basically, this is done at commit time after which actually we shouldn't attempt to spill these changes. This is mentioned in comments as well, but it is not clear if that is the case, then how and when accounting can create a problem. If possible, can you explain it with an example?
> > >
> > IIUC, we are keeping track of the memory in the ReorderBuffer, which is
> > common across the transactions. So even though this transaction is
> > committing and will not spill to disk, we still need to keep the memory
> > accounting correct for the future changes in other transactions.
> >
>
> You are right. I somehow missed that we need to keep the size
> computation in sync even during commit for other in-progress
> transactions in the ReorderBuffer. You can ignore this point or maybe
> slightly adjust the comment to make it explicit.
Does anyone object if we add the reorder buffer total size & in-memory size
to struct WalSnd too, so we can report it in pg_stat_replication?
I can follow up with a patch to add on top of this one if you think it's
reasonable. I'll also take the opportunity to add a number of tracepoints
across the walsender and logical decoding, since right now it's very opaque
in production systems ... and everyone just LOVES hunting down debug syms
and attaching gdb to production DBs.
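To make that concrete, what I have in mind is roughly the following
(field names invented for illustration, not a final proposal):

typedef struct WalSnd
{
    /* ... existing fields ... */

    Size    rbTotalBytes;   /* reorder buffer size, including spilled data */
    Size    rbInMemBytes;   /* portion currently held in memory */
} WalSnd;

plus the two corresponding columns in pg_stat_replication.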
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Craig Ringer <craig(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-14 02:11:23 |
Message-ID: | CAA4eK1KuqVOScXzgd37nefhzV=K4Q9Jtnno355snsFNRyO60Eg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 14, 2019 at 6:51 AM Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
> On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>
>
> Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it in pg_stat_replication?
>
There is already a patch
(0011-Track-statistics-for-streaming-spilling) in this series posted
by Tomas[1] which tracks important statistics in WalSnd, which I think
are good enough. Have you checked that? I am not sure if adding the
additional size information will help, but I might be missing something.
> I can follow up with a patch to add on top of this one if you think it's reasonable. I'll also take the opportunity to add a number of tracepoints across the walsender and logical decoding, since right now it's very opaque in production systems ... and everyone just LOVES hunting down debug syms and attaching gdb to production DBs.
>
Sure, adding tracepoints can be helpful, but isn't it better to start
that as a separate thread?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-14 09:39:02 |
Message-ID: | CAFiTN-v3-pEeY31C+9G9vRX+G=yg8QZL66mWws3bLcdg_58HcA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast. Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
> >> >Other than that, I am not sure if the changes related to spill to disk
> >> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >> >that code for toast table case, but I might be missing something. As
> >> >mentioned previously, I feel there should be some way to test whether this
> >> >patch works for the cases it claims to work. As of now, I have to check
> >> >via debugging. Let me know if there is any way, I can test this.
> >> >
> >>
> >> That's one of the reasons why I proposed to move the statistics (which
> >> say how many transactions / bytes were spilled to disk) from a later
> >> patch in the series. I don't think there's a better way.
> >>
> >>
> >I like that idea, but I think you need to split that patch to only get the
> >stats related to the spill. It would be easier to review if you can
> >prepare that atop of
> >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
> >
>
> Sure, I wasn't really proposing to add all stats from that patch,
> including those related to streaming. We need to extract just those
> related to spilling. And yes, it needs to be moved right after 0001.
>
I have extracted the spilling related code to a separate patch on top
of 0001. I have also fixed some bugs and review comments and attached
as a separate patch. Later I can merge it to the main patch if you
agree with the changes.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch | application/octet-stream | 33.6 KB |
bugs_and_review_comments_fix.patch | application/octet-stream | 3.7 KB |
0002-Track-statistics-for-spilling.patch | application/octet-stream | 11.0 KB |
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-18 12:01:58 |
Message-ID: | CAA4eK1LXen1B2dXZeimMvb+St9gm0+NQ_jFDsE=oVtCTwZY19A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> >
> > Sure, I wasn't really proposing to add all stats from that patch,
> > including those related to streaming. We need to extract just those
> > related to spilling. And yes, it needs to be moved right after 0001.
> >
> I have extracted the spilling related code to a separate patch on top
> of 0001. I have also fixed some bugs and review comments and attached
> as a separate patch. Later I can merge it to the main patch if you
> agree with the changes.
>
Few comments
-------------------------
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
1.
+ {
+ {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum memory to be used for logical decoding."),
+ gettext_noop("This much memory can be used by each internal "
+ "reorder buffer before spilling to disk or streaming."),
+ GUC_UNIT_KB
+ },
I think we can remove 'or streaming' from the above sentence for now. We
can add it back later with the patch where streaming will be allowed.
2.
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>work_mem</literal> (<type>integer</type>)</term>
+ <listitem>
+ <para>
+ Limits the amount of memory used to decode changes on the
+ publisher. If not specified, the publisher will use the default
+ specified by <varname>logical_decoding_work_mem</varname>. When
+ needed, additional data are spilled to disk.
+ </para>
+ </listitem>
+ </varlistentry>
It is not clear why we need this parameter, at least with this patch.
I have raised this multiple times [1][2].
bugs_and_review_comments_fix
1.
},
&logical_decoding_work_mem,
- -1, -1, MAX_KILOBYTES,
- check_logical_decoding_work_mem, NULL, NULL
+ 65536, 64, MAX_KILOBYTES,
+ NULL, NULL, NULL
I think the default value should be 1MB similar to
maintenance_work_mem. The same was true before this change.
2.
-#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem
+i#logical_decoding_work_mem = 64MB # min 64kB
It seems the 'i' is a leftover character in the above change. Also,
change the default value considering the previous point.
3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
/* update the statistics */
rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;
Why is this change required? Shouldn't we increase the spillTxns
count only when the txn is serialized?
0002-Track-statistics-for-spilling
1.
+ <row>
+ <entry><structfield>spill_txns</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>Number of transactions spilled to disk after the memory used by
+ logical decoding exceeds <literal>logical_work_mem</literal>. The
+ counter gets incremented both for toplevel transactions and
+ subtransactions.
+ </entry>
+ </row>
The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem
2.
+ <row>
+ <entry><structfield>spill_txns</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>Number of transactions spilled to disk after the memory used by
+ logical decoding exceeds <literal>logical_work_mem</literal>. The
+ counter gets incremented both for toplevel transactions and
+ subtransactions.
+ </entry>
+ </row>
+ <row>
+ <entry><structfield>spill_count</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>Number of times transactions were spilled to disk. Transactions
+ may get spilled repeatedly, and this counter gets incremented on every
+ such invocation.
+ </entry>
+ </row>
+ <row>
+ <entry><structfield>spill_bytes</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>Amount of decoded transaction data spilled to disk.
+ </entry>
+ </row>
In all the above cases, the explanation text starts immediately after
the <entry> tag, but the general coding practice is to start from the next
line; see the explanation of nearby parameters.
It seems these parameters are added to pg-stat-wal-receiver-view in
the docs, but in code they are present as part of pg_stat_replication.
It seems the docs need to be updated. Am I missing something?
3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;
Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}
I am not able to understand the above code. We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?
Also, isn't spillTxns count bit confusing, because in some cases it
will include subtransactions and other cases (where the largest picked
transaction is a subtransaction) it won't include it?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-21 05:18:33 |
Message-ID: | CAFiTN-szFoN-wuZ8SXV4Xed+ekyvBM9WZDm9ocp=iauDf6CWJQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
I have replied to some of your questions inline. I will work on the
remaining comments and post the patch for the same.
> > >
> > > Sure, I wasn't really proposing to add all stats from that patch,
> > > including those related to streaming. We need to extract just those
> > > related to spilling. And yes, it needs to be moved right after 0001.
> > >
> > I have extracted the spilling related code to a separate patch on top
> > of 0001. I have also fixed some bugs and review comments and attached
> > as a separate patch. Later I can merge it to the main patch if you
> > agree with the changes.
> >
>
> Few comments
> -------------------------
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
> 1.
> + {
> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
> + gettext_noop("This much memory can be used by each internal "
> + "reorder buffer before spilling to disk or streaming."),
> + GUC_UNIT_KB
> + },
>
> I think we can remove 'or streaming' from the above sentence for now. We
> can add it back later with the patch where streaming will be allowed.
>
> 2.
> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
> </para>
> </listitem>
> </varlistentry>
> +
> + <varlistentry>
> + <term><literal>work_mem</literal> (<type>integer</type>)</term>
> + <listitem>
> + <para>
> + Limits the amount of memory used to decode changes on the
> + publisher. If not specified, the publisher will use the default
> + specified by <varname>logical_decoding_work_mem</varname>. When
> + needed, additional data are spilled to disk.
> + </para>
> + </listitem>
> + </varlistentry>
>
> It is not clear why we need this parameter, at least with this patch.
> I have raised this multiple times [1][2].
>
> bugs_and_review_comments_fix
> 1.
> },
> &logical_decoding_work_mem,
> - -1, -1, MAX_KILOBYTES,
> - check_logical_decoding_work_mem, NULL, NULL
> + 65536, 64, MAX_KILOBYTES,
> + NULL, NULL, NULL
>
> I think the default value should be 1MB similar to
> maintenance_work_mem. The same was true before this change.
>
> 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
> maintenance_work_mem
> +i#logical_decoding_work_mem = 64MB # min 64kB
>
> It seems the 'i' is a leftover character in the above change. Also,
> change the default value considering the previous point.
>
> 3.
> @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
> /* update the statistics */
> rb->spillCount += 1;
> - rb->spillTxns += txn->serialized ? 1 : 0;
> + rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Why is this change required? Shouldn't we increase the spillTxns
> count only when the txn is serialized?
Prior to this change it was increasing rb->spillTxns every time
we tried to serialize the changes of the transaction. Now, we only
increase it the first time, when the transaction is not yet serialized.
> 0002-Track-statistics-for-spilling
> 1.
> + <row>
> + <entry><structfield>spill_txns</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Number of transactions spilled to disk after the memory used by
> + logical decoding exceeds <literal>logical_work_mem</literal>. The
> + counter gets incremented both for toplevel transactions and
> + subtransactions.
> + </entry>
> + </row>
>
> The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem
>
> 2.
> + <row>
> + <entry><structfield>spill_txns</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Number of transactions spilled to disk after the memory used by
> + logical decoding exceeds <literal>logical_work_mem</literal>. The
> + counter gets incremented both for toplevel transactions and
> + subtransactions.
> + </entry>
> + </row>
> + <row>
> + <entry><structfield>spill_count</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Number of times transactions were spilled to disk. Transactions
> + may get spilled repeatedly, and this counter gets incremented on every
> + such invocation.
> + </entry>
> + </row>
> + <row>
> + <entry><structfield>spill_bytes</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Amount of decoded transaction data spilled to disk.
> + </entry>
> + </row>
>
> In all the above cases, the explanation text starts immediately after
> the <entry> tag, but the general coding practice is to start from the next
> line; see the explanation of nearby parameters.
>
> It seems these parameters are added to pg-stat-wal-receiver-view in
> the docs, but in code they are present as part of pg_stat_replication.
> It seems the docs need to be updated. Am I missing something?
>
> 3.
> ReorderBufferSerializeTXN()
> {
> ..
> /* update the statistics */
> rb->spillCount += 1;
> rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Assert(spilled == txn->nentries_mem);
> Assert(dlist_is_empty(&txn->changes));
> txn->nentries_mem = 0;
> txn->serialized = true;
> ..
> }
>
> I am not able to understand the above code. We are setting the
> serialized parameter a few lines after we check it and increment the
> spillTxns count. Can you please explain it?
Basically, the first time we attempt to serialize a transaction,
txn->serialized will be false; at that point we increment
rb->spillTxns and after that set txn->serialized to true. From then
onwards, if we try to serialize the same transaction we will not
increment rb->spillTxns, so that we count each transaction only
once.
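Written out with an explicit branch, the logic is roughly this (sketch):

/* update the statistics */
rb->spillCount += 1;

/* count the transaction only on its first serialization */
if (!txn->serialized)
    rb->spillTxns += 1;

rb->spillBytes += size;

/* ... serialize and release the in-memory changes ... */

txn->nentries_mem = 0;
txn->serialized = true;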
>
> Also, isn't spillTxns count bit confusing, because in some cases it
> will include subtransactions and other cases (where the largest picked
> transaction is a subtransaction) it won't include it?
I did not understand your comment completely. Basically, for every
transaction we serialize, we increase the count the first time,
right? Whether it is the main transaction or a sub-transaction.
Am I missing something?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-21 09:20:14 |
Message-ID: | CAA4eK1+5sJcrGwkktcJmrTtxdD-XwwngsTYy0LucqfuTvoSyHw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > 3.
> > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn)
> >
> > /* update the statistics */
> > rb->spillCount += 1;
> > - rb->spillTxns += txn->serialized ? 1 : 0;
> > + rb->spillTxns += txn->serialized ? 0 : 1;
> > rb->spillBytes += size;
> >
> > Why is this change required? Shouldn't we increase the spillTxns
> > count only when the txn is serialized?
>
> Prior to this change it was increasing rb->spillTxns every time
> we tried to serialize the changes of the transaction. Now, we only
> increase it the first time, when the transaction is not yet serialized.
>
> >
> > 3.
> > ReorderBufferSerializeTXN()
> > {
> > ..
> > /* update the statistics */
> > rb->spillCount += 1;
> > rb->spillTxns += txn->serialized ? 0 : 1;
> > rb->spillBytes += size;
> >
> > Assert(spilled == txn->nentries_mem);
> > Assert(dlist_is_empty(&txn->changes));
> > txn->nentries_mem = 0;
> > txn->serialized = true;
> > ..
> > }
> >
> > I am not able to understand the above code. We are setting the
> > serialized parameter a few lines after we check it and increment the
> > spillTxns count. Can you please explain it?
>
> Basically, the first time we attempt to serialize a transaction,
> txn->serialized will be false; at that point we increment
> rb->spillTxns and after that set txn->serialized to true. From then
> onwards, if we try to serialize the same transaction we will not
> increment rb->spillTxns, so that we count each transaction only
> once.
>
Your explanation for both the above comments makes sense to me. Can
you please add some comments along these lines because it is not
apparent why one wants to increase the spillTxns counter when
txn->serialized is false?
> >
> > Also, isn't spillTxns count bit confusing, because in some cases it
> > will include subtransactions and other cases (where the largest picked
> > transaction is a subtransaction) it won't include it?
>
> I did not understand your comment completely. Basically, for every
> transaction we serialize, we increase the count the first time,
> right? Whether it is the main transaction or a sub-transaction.
>
It was not clear to me earlier whether we always increase the
spillTxns counter for subtransactions or not. But now, looking at the
code carefully, it is clear that it is getting increased in every
case. In short, you don't need to do anything for this comment.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-21 09:41:23 |
Message-ID: | CAFiTN-ukD55eaQhCeGwEWnbKUkZ+5K98OV8OckptwpZNL_v80g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Oct 21, 2019 at 2:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > > 3.
> > > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn)
> > >
> > > /* update the statistics */
> > > rb->spillCount += 1;
> > > - rb->spillTxns += txn->serialized ? 1 : 0;
> > > + rb->spillTxns += txn->serialized ? 0 : 1;
> > > rb->spillBytes += size;
> > >
> > > Why is this change required? Shouldn't we increase the spillTxns
> > > count only when the txn is serialized?
> >
> > Prior to this change it was increasing rb->spillTxns every time
> > we tried to serialize the changes of the transaction. Now, we only
> > increase it the first time, when the transaction is not yet serialized.
> >
> > >
> > > 3.
> > > ReorderBufferSerializeTXN()
> > > {
> > > ..
> > > /* update the statistics */
> > > rb->spillCount += 1;
> > > rb->spillTxns += txn->serialized ? 0 : 1;
> > > rb->spillBytes += size;
> > >
> > > Assert(spilled == txn->nentries_mem);
> > > Assert(dlist_is_empty(&txn->changes));
> > > txn->nentries_mem = 0;
> > > txn->serialized = true;
> > > ..
> > > }
> > >
> > > I am not able to understand the above code. We are setting the
> > > serialized parameter a few lines after we check it and increment the
> > > spillTxns count. Can you please explain it?
> >
> > Basically, the first time we attempt to serialize a transaction,
> > txn->serialized will be false; at that point we increment
> > rb->spillTxns and after that set txn->serialized to true. From then
> > onwards, if we try to serialize the same transaction we will not
> > increment rb->spillTxns, so that we count each transaction only
> > once.
> >
>
> Your explanation for both the above comments makes sense to me. Can
> you please add some comments along these lines because it is not
> apparent why one wants to increase the spillTxns counter when
> txn->serialized is false?
Ok, I will add comments in the next patch.
>
> > >
> > > Also, isn't spillTxns count bit confusing, because in some cases it
> > > will include subtransactions and other cases (where the largest picked
> > > transaction is a subtransaction) it won't include it?
> >
> > I did not understand your comment completely. Basically, for every
> > transaction we serialize, we increase the count the first time,
> > right? Whether it is the main transaction or a sub-transaction.
> >
>
> It was not clear to me earlier whether we always increase the
> spillTxns counter for subtransactions or not. But now, looking at the
> code carefully, it is clear that it is getting increased in every
> case. In short, you don't need to do anything for this comment.
ok
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-22 05:00:16 |
Message-ID: | CAFiTN-vkFB0RBEjVkLWhdgTYShSrSu3kCYObMghgXEwKA1FXRA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >
> > >
> > > Sure, I wasn't really proposing to add all stats from that patch,
> > > including those related to streaming. We need to extract just those
> > > related to spilling. And yes, it needs to be moved right after 0001.
> > >
> > I have extracted the spilling related code to a separate patch on top
> > of 0001. I have also fixed some bugs and review comments and attached
> > as a separate patch. Later I can merge it to the main patch if you
> > agree with the changes.
> >
>
> Few comments
> -------------------------
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
> 1.
> + {
> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
> + gettext_noop("This much memory can be used by each internal "
> + "reorder buffer before spilling to disk or streaming."),
> + GUC_UNIT_KB
> + },
>
> I think we can remove 'or streaming' from the above sentence for now. We
> can add it back later with the patch where streaming will be allowed.
Done
>
> 2.
> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
> </para>
> </listitem>
> </varlistentry>
> +
> + <varlistentry>
> + <term><literal>work_mem</literal> (<type>integer</type>)</term>
> + <listitem>
> + <para>
> + Limits the amount of memory used to decode changes on the
> + publisher. If not specified, the publisher will use the default
> + specified by <varname>logical_decoding_work_mem</varname>. When
> + needed, additional data are spilled to disk.
> + </para>
> + </listitem>
> + </varlistentry>
>
> It is not clear why we need this parameter, at least with this patch.
> I have raised this multiple times [1][2].
I have moved it out as a separate patch (0003) so that if we need it
for streaming transactions, we can keep it.
>
> bugs_and_review_comments_fix
> 1.
> },
> &logical_decoding_work_mem,
> - -1, -1, MAX_KILOBYTES,
> - check_logical_decoding_work_mem, NULL, NULL
> + 65536, 64, MAX_KILOBYTES,
> + NULL, NULL, NULL
>
> I think the default value should be 1MB similar to
> maintenance_work_mem. The same was true before this change.
The default value for maintenance_work_mem is also 64MB. Did you mean the min value?
>
> 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
> maintenance_work_mem
> +i#logical_decoding_work_mem = 64MB # min 64kB
>
> It seems the 'i' is a leftover character in the above change. Also,
> change the default value considering the previous point.
oops, fixed.
>
> 3.
> @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
> /* update the statistics */
> rb->spillCount += 1;
> - rb->spillTxns += txn->serialized ? 1 : 0;
> + rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Why is this change required? Shouldn't we increase the spillTxns
> count only when the txn is serialized?
Already agreed in the previous mail, so I have added comments.
>
> 0002-Track-statistics-for-spilling
> 1.
> + <row>
> + <entry><structfield>spill_txns</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Number of transactions spilled to disk after the memory used by
> + logical decoding exceeds <literal>logical_work_mem</literal>. The
> + counter gets incremented both for toplevel transactions and
> + subtransactions.
> + </entry>
> + </row>
>
> The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem
done
>
> 2.
> + <row>
> + <entry><structfield>spill_txns</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Number of transactions spilled to disk after the memory used by
> + logical decoding exceeds <literal>logical_work_mem</literal>. The
> + counter gets incremented both for toplevel transactions and
> + subtransactions.
> + </entry>
> + </row>
> + <row>
> + <entry><structfield>spill_count</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Number of times transactions were spilled to disk. Transactions
> + may get spilled repeatedly, and this counter gets incremented on every
> + such invocation.
> + </entry>
> + </row>
> + <row>
> + <entry><structfield>spill_bytes</structfield></entry>
> + <entry><type>integer</type></entry>
> + <entry>Amount of decoded transaction data spilled to disk.
> + </entry>
> + </row>
>
> In all the above cases, the explanation text starts immediately after
> the <entry> tag, but the general coding practice is to start from the next
> line; see the explanation of nearby parameters.
It seems it's mixed, for example, you can see
<entry>Timeline number of last write-ahead log location received and
flushed to disk, the initial value of this field being the timeline
number of the first log location used when WAL receiver is started
</entry>
>
> It seems these parameters are added to pg-stat-wal-receiver-view in
> the docs, but in code they are present as part of pg_stat_replication.
> It seems the docs need to be updated. Am I missing something?
Fixed
>
> 3.
> ReorderBufferSerializeTXN()
> {
> ..
> /* update the statistics */
> rb->spillCount += 1;
> rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Assert(spilled == txn->nentries_mem);
> Assert(dlist_is_empty(&txn->changes));
> txn->nentries_mem = 0;
> txn->serialized = true;
> ..
> }
>
> I am not able to understand the above code. We are setting the
> serialized parameter a few lines after we check it and increment the
> spillTxns count. Can you please explain it?
>
> Also, isn't spillTxns count bit confusing, because in some cases it
> will include subtransactions and other cases (where the largest picked
> transaction is a subtransaction) it won't include it?
>
Already discussed in the last mail.
I have merged the bugs_and_review_comments_fix.patch changes into 0001 and 0002.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch | application/octet-stream | 21.3 KB |
0002-Track-statistics-for-spilling.patch | application/octet-stream | 11.2 KB |
0003-Support-logical_decoding_work_mem-set-from-create-su.patch | application/octet-stream | 13.0 KB |
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, a(dot)kondratov(at)postgrespro(dot)ru |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-22 05:16:35 |
Message-ID: | CAA4eK1KAaiGBwZ=aDJ4f+Pazq3UJ8fsjPqa0+JfpM49ieSjtBA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have attempted to test the performance of (Stream + Spill) vs
> > (Stream + BGW pool), and I can see a gain similar to what Alexey had
> shown[1].
>
> In addition to this, I have rebased the latest patchset [2] without
> the two-phase logical decoding patch set.
>
> Test results:
> I have repeated the same test as Alexey[1] for 1kk and 3kk data and
> here is my result
> Stream + Spill
> N time on master(sec) Total xact time (sec)
> 1kk 6 21
> 3kk 18 55
>
> Stream + BGW pool
> N time on master(sec) Total xact time (sec)
> 1kk 6 13
> 3kk 19 35
>
I think the test results for the master are missing. Also, how about
running these tests over a network (meaning master and subscriber are
not on the same machine)? In general, yours and Alexey's test results
show that there is merit in having workers apply such transactions.
OTOH, as noted above [1], we are also worried about the performance
of rollbacks if we follow that approach. I am not sure how much we
need to worry about rollbacks if commits are faster, but can we think
of recording the changes in memory and only writing them to a file if the
changes are above a certain threshold? I think that might help save
I/O in many cases. I am not very sure how much additional workers can
help if we do that, but they might still help. I think we need to do
some tests and experiments to figure out what the best approach is.
What do you think?
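To sketch what I mean (all names here are invented, just to make the
threshold idea concrete):

static void
stream_buffer_add(StreamBuffer *buf, const char *data, Size len)
{
    /*
     * Keep the streamed changes of an in-progress transaction in memory
     * and switch to a file only once they exceed a threshold, so small
     * chunks (and transactions that abort early) never touch the disk.
     */
    if (!buf->spilled && buf->mem.len + len > STREAM_SPILL_THRESHOLD)
    {
        stream_buffer_spill_to_file(buf);   /* flush what we have so far */
        buf->spilled = true;
    }

    if (buf->spilled)
        stream_buffer_append_file(buf, data, len);
    else
        appendBinaryStringInfo(&buf->mem, data, len);   /* stay in memory */
}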
Tomas, Alexey, do you have any thoughts on this matter? I think it is
important that we figure out the way to proceed with this patch.
[1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | a(dot)kondratov(at)postgrespro(dot)ru, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-22 05:31:48 |
Message-ID: | CAFiTN-vdWS66KQDkz+Xf4KK3hWBJs6durSwTUuG+PQs-P29hxA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have attempted to test the performance of (Stream + Spill) vs
> > (Stream + BGW pool), and I can see a gain similar to what Alexey had
> > shown[1].
> >
> > In addition to this, I have rebased the latest patchset [2] without
> > the two-phase logical decoding patch set.
> >
> > Test results:
> > I have repeated the same test as Alexey[1] for 1kk and 3kk data and
> > here is my result
> > Stream + Spill
> > N time on master(sec) Total xact time (sec)
> > 1kk 6 21
> > 3kk 18 55
> >
> > Stream + BGW pool
> > N time on master(sec) Total xact time (sec)
> > 1kk 6 13
> > 3kk 19 35
> >
>
> I think the test results for the master are missing.
Yeah. At that time, I was planning to compare spill vs. bgworker.
> Also, how about
> running these tests over a network (meaning master and subscriber are
> not on the same machine)?
Yeah, we should do that; it will show the merit of streaming the
in-progress transactions.
> In general, yours and Alexey's test results
> show that there is merit in having workers apply such transactions.
> OTOH, as noted above [1], we are also worried about the performance
> of rollbacks if we follow that approach. I am not sure how much we
> need to worry about rollbacks if commits are faster, but can we think
> of recording the changes in memory and only writing them to a file if the
> changes are above a certain threshold? I think that might help save
> I/O in many cases. I am not very sure how much additional workers can
> help if we do that, but they might still help. I think we need to do
> some tests and experiments to figure out what the best approach is.
> What do you think?
I agree with the point. I think we might need to make some small
changes and run tests to see what the best method is to handle the
streamed changes at the subscriber end.
>
> Tomas, Alexey, do you have any thoughts on this matter? I think it is
> important that we figure out the way to proceed with this patch.
>
> [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-22 17:12:26 |
Message-ID: | 20191022171226.cmrv4vle3ghm4wdm@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:
>On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> >
>> > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
>> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> > >
>> > >
>> > > Sure, I wasn't really proposing to add all stats from that patch,
>> > > including those related to streaming. We need to extract just those
>> > > related to spilling. And yes, it needs to be moved right after 0001.
>> > >
>> > I have extracted the spilling-related code to a separate patch on top
>> > of 0001. I have also fixed some bugs and addressed review comments,
>> > attached as a separate patch. Later I can merge it into the main patch
>> > if you agree with the changes.
>> >
>>
>> Few comments
>> -------------------------
>> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
>> 1.
>> + {
>> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
>> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
>> + gettext_noop("This much memory can be used by each internal "
>> + "reorder buffer before spilling to disk or streaming."),
>> + GUC_UNIT_KB
>> + },
>>
>> I think we can remove 'or streaming' from the above sentence for now. We
>> can add it back later with the patch where streaming will be allowed.
>Done
>>
>> 2.
>> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
>> class="parameter">subscription_name</replaceabl
>> </para>
>> </listitem>
>> </varlistentry>
>> +
>> + <varlistentry>
>> + <term><literal>work_mem</literal> (<type>integer</type>)</term>
>> + <listitem>
>> + <para>
>> + Limits the amount of memory used to decode changes on the
>> + publisher. If not specified, the publisher will use the default
>> + specified by <varname>logical_decoding_work_mem</varname>. When
>> + needed, additional data are spilled to disk.
>> + </para>
>> + </listitem>
>> + </varlistentry>
>>
>> It is not clear why we need this parameter, at least with this patch.
>> I have raised this multiple times [1][2].
>
>I have moved it out as a separate patch (0003) so that if we need this
>for the streaming transactions then we can keep it.
>>
I'm OK with moving it to a separate patch. That being said, I think the
ability to control memory usage for individual subscriptions is very
useful. Saying "we don't need such a parameter" is essentially equivalent
to saying "one size fits all", and I think we know that's not true.
Imagine a system with multiple subscriptions, some of them mostly
replicating OLTP changes, but one or two replicating tables that are
updated in batches. What we'd want is to allow a higher limit for the
batch subscriptions, but a much lower limit for the OLTP ones (which they
should never hit in practice).
With a single global GUC, you'll either have a high value - risking
OOM when the OLTP subscriptions happen to decode a batch update - or a
low value affecting the batch subscriptions.
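For illustration, the kind of setup I have in mind would look something
like this (a sketch only - subscription names and connection strings are
placeholders, and work_mem is the per-subscription parameter from the
0003 patch, specified in kilobytes):
-- on the publisher: a conservative instance-wide default
SET logical_decoding_work_mem = '64MB';
-- batch subscription (on the subscriber): allow a much higher limit (1GB)
CREATE SUBSCRIPTION batch_sub CONNECTION 'host=primary dbname=test'
    PUBLICATION batch_pub WITH (work_mem = 1048576);
-- OLTP subscription: keep the limit low, it should never be hit anyway
CREATE SUBSCRIPTION oltp_sub CONNECTION 'host=primary dbname=test'
    PUBLICATION oltp_pub WITH (work_mem = 65536);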
It's not strictly necessary (and we already have such a limit), so I'm OK
with treating it as an enhancement for the future.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, a(dot)kondratov(at)postgrespro(dot)ru, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-22 17:22:10 |
Message-ID: | 20191022172210.26bdiv44vwvunrh3@development |
On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:
>On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> >
>> > I have attempted to test the performance of (Stream + Spill) vs
>> > (Stream + BGW pool) and I can see a similar gain to what Alexey had
>> > shown [1].
>> >
>> > In addition to this, I have rebased the latest patchset [2] without
>> > the two-phase logical decoding patch set.
>> >
>> > Test results:
>> > I have repeated the same test as Alexey [1] for 1kk and 3kk data and
>> > here is my result
>> > Stream + Spill
>> > N time on master(sec) Total xact time (sec)
>> > 1kk 6 21
>> > 3kk 18 55
>> >
>> > Stream + BGW pool
>> > N time on master(sec) Total xact time (sec)
>> > 1kk 6 13
>> > 3kk 19 35
>> >
>>
>> I think the test results for the master are missing.
>Yeah, at that time I was planning to compare spill vs bgworker.
>> Also, how about
>> running these tests over a network (meaning master and subscriber are
>> not on the same machine)?
>
>Yeah, we should do that; it will show the merit of streaming the
>in-progress transactions.
>
While I agree it's an interesting feature, I think we need to stop
adding more stuff to this patch series - it's already complex enough, and
adding more (unnecessary) stuff is a distraction and will make it harder
to get anything committed. Typical "scope creep".
I think the current behavior (spill to file) is sufficient for v0 and
can be improved later - that's fine. I don't think we need to bother
with comparisons to master very much, because while it might be a bit
slower in some cases, you can always disable streaming (so if there's a
regression for your workload, you can undo that).
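(With the streaming patches applied that opt-out would presumably be a
per-subscription switch, something like
ALTER SUBSCRIPTION mysub SET (streaming = off);
where mysub is just a placeholder and the exact option name is whatever
the streaming patches end up defining.)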
>> In general, yours and Alexey's test results
>> show that there is merit in having workers applying such transactions.
>> OTOH, as noted above [1], we are also worried about the performance
>> of Rollbacks if we follow that approach. I am not sure how much we
>> need to worry about Rollbacks if commits are faster, but can we think
>> of recording the changes in memory and only writing to a file if the
>> changes are above a certain threshold? I think that might help save
>> I/O in many cases. I am not very sure how much additional workers can
>> help if we do that, but they might still help. I think we need to do
>> some tests and experiments to figure out the best approach. What do
>> you think?
>I agree with the point. I think we might need to do some small
>changes and tests to see what would be the best method to handle the
>streamed changes at the subscriber end.
>
>>
>> Tomas, Alexey, do you have any thoughts on this matter? I think it is
>> important that we figure out the way to proceed in this patch.
>>
>> [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>>
>
I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-22 19:02:38 |
Message-ID: | [email protected] |
On 22.10.2019 20:22, Tomas Vondra wrote:
> On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:
>> On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila
>> <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> In general, yours and Alexey's test results
>>> show that there is merit in having workers applying such transactions.
>>> OTOH, as noted above [1], we are also worried about the performance
>>> of Rollbacks if we follow that approach. I am not sure how much we
>>> need to worry about Rollbacks if commits are faster, but can we think
>>> of recording the changes in memory and only writing to a file if the
>>> changes are above a certain threshold? I think that might help save
>>> I/O in many cases. I am not very sure how much additional workers can
>>> help if we do that, but they might still help. I think we need to do
>>> some tests and experiments to figure out the best approach. What do
>>> you think?
>> I agree with the point. I think we might need to do some small
>> changes and tests to see what would be the best method to handle the
>> streamed changes at the subscriber end.
>>
>>>
>>> Tomas, Alexey, do you have any thoughts on this matter? I think it is
>>> important that we figure out the way to proceed in this patch.
>>>
>>> [1] -
>>> https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>>>
>>
>
> I think the patch should do the simplest thing possible, i.e. what it
> does today. Otherwise we'll never get it committed.
>
I have to agree with Tomas that keeping things as simple as possible
should be the main priority right now. Otherwise, the entire patch set
will go through the next release cycle without being committed even
partially.
At the same time, it resolves an important problem from my perspective:
it moves I/O overhead from primary to replica by streaming large
transactions, which is a nice-to-have feature I guess.
Later it would be possible to replace the logical apply worker with a
bgworker pool in a separate patch, if we decide that it is a viable
solution. Anyway, regarding Amit's questions:
- I doubt that maintaining a separate buffer on the apply side before
spilling to disk would help enough. We already have the ReorderBuffer
with the logical_work_mem limit, and if we exceeded that limit on the
sender side, then most probably we will exceed it on the applier side
as well, except the case when this new buffer size is significantly
higher than logical_work_mem, to keep multiple open xacts.
- I still think that we should optimize the database for commits, not
rollbacks. A bgworker pool is dramatically slower for a rollbacks-only
load, though at least twice as fast for commits-only. I do not know how
it will perform with a real-life load, but this drawback may be
inappropriate for a general-purpose database like Postgres.
- Tomas' implementation of streaming with spilling does not have this
bias between commits/aborts. However, it has a noticeable performance
drop (~5x slower compared with master [1]) for large transactions
consisting of many small rows, although it is not an order of magnitude
slower.
Another thing is that about a year ago I found some problems with
MVCC/visibility and fixed them somehow [1]. If I understand correctly,
Tomas adapted some of those fixes into his patch set, but I think that
this part should be reviewed carefully again. I would be glad to check
it, but right now I am a little bit confused by all the patch set
variants in the thread. Which is the latest one? Is it still dependent
on 2PC decoding?
Thanks for moving this patch forward!
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-23 09:13:09 |
Message-ID: | CAA4eK1J+3kab6RSZrgj0YiQV1r+H3FWVaNjKhWvpEe5-bpZiBw@mail.gmail.com |
On Tue, Oct 22, 2019 at 10:42 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:
> >
> >I have moved it out as a separate patch (0003) so that if we need this
> >for the streaming transactions then we can keep it.
> >>
>
> I'm OK with moving it to a separate patch. That being said, I think the
> ability to control memory usage for individual subscriptions is very
> useful. Saying "we don't need such a parameter" is essentially equivalent
> to saying "one size fits all", and I think we know that's not true.
>
> Imagine a system with multiple subscriptions, some of them mostly
> replicating OLTP changes, but one or two replicating tables that are
> updated in batches. What we'd want is to allow a higher limit for the
> batch subscriptions, but a much lower limit for the OLTP ones (which they
> should never hit in practice).
>
This point is not clear to me. The changes are recorded in the
ReorderBuffer, which doesn't do any filtering, i.e. it will have all
the changes irrespective of the subscriber. How will it make a
difference to have different limits?
> With a single global GUC, you'll either have a high value - risking
> OOM when the OLTP subscriptions happen to decode a batch update - or a
> low value affecting the batch subscriptions.
>
> It's not strictly necessary (and we already have such a limit), so I'm OK
> with treating it as an enhancement for the future.
>
I am fine too if its usage is clear. I might be missing something here.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-23 10:50:33 |
Message-ID: | CAA4eK1KZ56PoT--pCXNyE-ebrXjWvOz_ntqhqBQs-8fn6S9m2w@mail.gmail.com |
On Wed, Oct 23, 2019 at 12:32 AM Alexey Kondratov
<a(dot)kondratov(at)postgrespro(dot)ru> wrote:
>
> On 22.10.2019 20:22, Tomas Vondra wrote:
> >
> > I think the patch should do the simplest thing possible, i.e. what it
> > does today. Otherwise we'll never get it committed.
> >
>
> I have to agree with Tomas that keeping things as simple as possible
> should be the main priority right now. Otherwise, the entire patch set
> will go through the next release cycle without being committed even
> partially.
> At the same time, it resolves an important problem from my perspective:
> it moves I/O overhead from primary to replica by streaming large
> transactions, which is a nice-to-have feature I guess.
>
> Later it would be possible to replace the logical apply worker with a
> bgworker pool in a separate patch, if we decide that it is a viable
> solution. Anyway, regarding Amit's questions:
>
> - I doubt that maintaining a separate buffer on the apply side before
> spilling to disk would help enough. We already have the ReorderBuffer
> with the logical_work_mem limit, and if we exceeded that limit on the
> sender side, then most probably we will exceed it on the applier side
> as well,
>
I think on the sender side the limit is for unfiltered changes (i.e.
on the ReorderBuffer, which has all the changes), whereas on the
receiver side we will only have the requested changes, which can make
a difference?
> except the case when this new buffer size is significantly
> higher than logical_work_mem, to keep multiple open xacts.
>
I am not sure, but I think we can have different controlling parameters
on the subscriber side.
> - I still think that we should optimize the database for commits, not
> rollbacks. A bgworker pool is dramatically slower for a rollbacks-only
> load, though at least twice as fast for commits-only. I do not know how
> it will perform with a real-life load, but this drawback may be
> inappropriate for a general-purpose database like Postgres.
>
> - Tomas' implementation of streaming with spilling does not have this
> bias between commits/aborts. However, it has a noticeable performance
> drop (~5x slower compared with master [1]) for large transactions
> consisting of many small rows, although it is not an order of magnitude
> slower.
>
Did you ever identify the reason why it was slower in that case? I
can see the numbers shared by you and Dilip, which show that the
bgworker pool is a really good idea and will work great for
commit-mostly workloads, whereas the numbers without it are not very
encouraging; maybe we have not benchmarked enough. This is the reason
I am trying to see if we can do something to get benefits similar
to what is shown by your idea.
I am not against doing something simple for the first version and then
enhancing it later, but it won't be good if we commit it with a
regression in some typical cases and depend on the user to use it only
when it seems favorable to their case. Also, sometimes it becomes
difficult to generate enthusiasm to enhance a feature once the main
patch is committed. I am not saying that always happens or will happen
in this case. It is better if we put in some energy and get things as
good as possible in the first go itself. I am as interested as you,
Tomas, or others are; otherwise, I wouldn't have spent a lot of time on
this to disentangle it from the 2PC patch, which seems to have stalled
due to lack of interest.
> Another thing is that about a year ago I found some problems with
> MVCC/visibility and fixed them somehow [1]. If I understand correctly,
> Tomas adapted some of those fixes into his patch set, but I think that
> this part should be reviewed carefully again.
>
Agreed, I have read your emails and could see that you have done very
good work on this project along with Tomas. But unfortunately, it
didn't get committed. At this stage, we are working on just the first
part of the patch, which is to allow the data to spill once it crosses
logical_decoding_work_mem on the master side. I think we will have
more problems to discuss and solve once that is done.
> I would be glad to check
> it, but right now I am a little bit confused by all the patch set
> variants in the thread. Which is the latest one? Is it still dependent on 2PC decoding?
>
I think the latest patches posted by Dilip are not dependent on the
2PC decoding patch, but I haven't studied them yet. You can find those
at [1][2]. As per the discussion in this thread, we are also trying to
see if we can get some part of the patch series committed first; the
latest patches corresponding to that are posted at [3].
[1] - https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAFiTN-vT%2B42xRbkw%3DhBnp44XkAyZaKZVA5hcvAMsYth3rk7vhg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-vkFB0RBEjVkLWhdgTYShSrSu3kCYObMghgXEwKA1FXRA%40mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-24 13:37:12 |
Message-ID: | CAA4eK1+qwkFonAWovoOtTw7TiiyFroP9_KmCM8fQWKNNpXvK+w@mail.gmail.com |
On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
>
I was wondering whether we have checked the code coverage after this
patch? Previously, the existing tests seemed to cover most parts
of the function ReorderBufferSerializeTXN [1]. After this patch, the
timing of calls to ReorderBufferSerializeTXN will change, so that might
impact the testing of the same. If it is already covered, then I
would like to either add a new test or extend an existing test with the
help of the new spill counters. If it is not getting covered, then we
need to think of extending an existing test or writing a new test to
cover the function ReorderBufferSerializeTXN.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | vignesh C <vignesh21(at)gmail(dot)com> |
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, a(dot)kondratov(at)postgrespro(dot)ru, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-30 04:08:00 |
Message-ID: | CALDaNm02dKYU6Kt8we9WeGgYWzpYvTSPkxU9hXqwzvCkNGATnw@mail.gmail.com |
On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I think the patch should do the simplest thing possible, i.e. what it
> does today. Otherwise we'll never get it committed.
>
I found a couple of crashes while reviewing and testing flushing of
open transaction data:
Issue 1:
#0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
#1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec5390 in ExceptionalCondition
(conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
"FailedAssertion",
fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
off=64) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
oldestRunningXid=3834) at reorderbuffer.c:1966
#5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
buf=0x7ffcbc26dc50) at decode.c:332
#6 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x19af990,
record=0x19afc50) at decode.c:121
#7 0x0000000000b7109e in XLogSendLogical () at walsender.c:2845
#8 0x0000000000b6f5e4 in WalSndLoop (send_data=0xb70f77
<XLogSendLogical>) at walsender.c:2199
#9 0x0000000000b6c7e1 in StartLogicalReplication (cmd=0x1983168) at
walsender.c:1128
#10 0x0000000000b6da6f in exec_replication_command
(cmd_string=0x18f70a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
at walsender.c:1545
Issue 2:
#0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
These failures come randomly.
I'm not able to reproduce this issue with a simple test case.
I have attached the test case which I used to test.
I will further try to find a scenario which reproduces it consistently.
Posting it so that it can help someone identify the problem in
parallel through code review by experts.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
mix_data_test.c | text/x-c-code | 7.1 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
To: | vignesh C <vignesh21(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, a(dot)kondratov(at)postgrespro(dot)ru, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-10-30 07:48:59 |
Message-ID: | CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt=DYS=jhH+jiCoBODdaw@mail.gmail.com |
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
I have noticed one more problem in the logic of setting the logical
decoding work mem from the CREATE SUBSCRIPTION command. Suppose we
don't give work_mem in the subscription command; then it sends a
garbage value to the walsender, and the walsender overwrites its value
with that garbage value. After investigating a bit, I have found the
reason for the same.
@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
appendStringInfo(&cmd, "proto_version '%u'",
options->proto.logical.proto_version);
+ appendStringInfo(&cmd, ", work_mem '%d'",
+ options->proto.logical.work_mem);
I think the problem is that we are unconditionally sending the work_mem
as part of the START_REPLICATION command, without checking whether it's
valid or not.
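A minimal sketch of the kind of guard I mean (assuming an unset
work_mem is represented by a non-positive value):
/* only send work_mem when it was actually specified */
if (options->proto.logical.work_mem > 0)
    appendStringInfo(&cmd, ", work_mem '%d'",
                     options->proto.logical.work_mem);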
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
sub->name = pstrdup(NameStr(subform->subname));
sub->owner = subform->subowner;
sub->enabled = subform->subenabled;
+ sub->workmem = subform->subworkmem;
Another problem is that there is no handling for the case where subform->subworkmem is NULL.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, a(dot)kondratov(at)postgrespro(dot)ru, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-04 09:13:38 |
Message-ID: | CAGz5QC+WsBwdY3=qH6qZ38siocF1T1Ub7xB0JnPZiDjrOuFLTQ@mail.gmail.com |
Hello hackers,
I've done some performance testing of this feature. Following is my
test case (taken from an earlier thread):
postgres=# CREATE TABLE large_test (num1 bigint, num2 double
precision, num3 double precision);
postgres=# \timing on
postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
num2, num3) SELECT round(random()*10), random(), random()*142 FROM
generate_series(1, 1000000) s(i);
I've kept the publisher and subscriber on two different systems.
HEAD:
With 1000000 tuples,
Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
With 10000000 tuples (10 times more),
Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442
With the memory accounting patch, following are the performance results:
With 1000000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
9648.223 ms (00:09.648), Spill count: 2315
logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
9895.161 ms (00:09.895), Spill count: 3
With 10000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
105761.978 ms (01:45.762), Spill count: 23149
logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
89985.342 ms (01:29.985), Spill count: 23
With the logical decoding of in-progress transactions patch and with
streaming on, following are the performance results:
With 1000000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
20779.601 ms (00:20.780)
logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
9559.953 ms (00:09.560)
With 10000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
196261.892 ms (03:16.262)
logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
90079.286 ms (01:30.079)
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
To: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, a(dot)kondratov(at)postgrespro(dot)ru, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-04 10:02:33 |
Message-ID: | CAFiTN-uvs1jO4QTvnR-UDxZHuE47-tSLc-BRzayjzyA0Zyz0pg@mail.gmail.com |
On Mon, Nov 4, 2019 at 2:43 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> Hello hackers,
>
> I've done some performance testing of this feature. Following is my
> test case (taken from an earlier thread):
>
> postgres=# CREATE TABLE large_test (num1 bigint, num2 double
> precision, num3 double precision);
> postgres=# \timing on
> postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
> num2, num3) SELECT round(random()*10), random(), random()*142 FROM
> generate_series(1, 1000000) s(i);
>
> I've kept the publisher and subscriber on two different systems.
>
> HEAD:
> With 1000000 tuples,
> Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
> With 10000000 tuples (10 times more),
> Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442
>
> With the memory accounting patch, following are the performance results:
> With 1000000 tuples,
> logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
> 9648.223 ms (00:09.648), Spill count: 2315
> logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
> 9895.161 ms (00:09.895), Spill count: 3
> With 10000000 tuples (10 times more),
> logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
> 105761.978 ms (01:45.762), Spill count: 23149
> logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
> 89985.342 ms (01:29.985), Spill count: 23
>
> With the logical decoding of in-progress transactions patch and with
> streaming on, following are the performance results:
> With 1000000 tuples,
> logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
> 20779.601 ms (00:20.780)
> logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
> 9559.953 ms (00:09.560)
> With 10000000 tuples (10 times more),
> logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
> 196261.892 ms (03:16.262)
> logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
> 90079.286 ms (01:30.079)
So your results show that with "streaming on", performance is
degrading? By any chance did you try to see where the bottleneck is?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, a(dot)kondratov(at)postgrespro(dot)ru, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-04 10:05:12 |
Message-ID: | CAGz5QCKwfgGnuzzJ3wRi=4Afpgen4Ar7gJqJqEsSzDfM1+zrAw@mail.gmail.com |
On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> So your results show that with "streaming on", performance is
> degrading? By any chance did you try to see where the bottleneck is?
>
Right. But as we increase logical_decoding_work_mem, the
performance improves. I've not analyzed the bottleneck yet. I'm
looking into it.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From: | vignesh C <vignesh21(at)gmail(dot)com> |
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-04 10:16:48 |
Message-ID: | CALDaNm3v-UGm3x0c9rak6VF42OGEHySUoHOD+kM3enu1uEHG=A@mail.gmail.com |
On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
> >
>
> > I was wondering whether we have checked the code coverage after this
> > patch? Previously, the existing tests seemed to cover most parts
> > of the function ReorderBufferSerializeTXN [1]. After this patch, the
> > timing of calls to ReorderBufferSerializeTXN will change, so that might
> > impact the testing of the same. If it is already covered, then I
> > would like to either add a new test or extend an existing test with the
> > help of the new spill counters. If it is not getting covered, then we
> > need to think of extending an existing test or writing a new test to
> > cover the function ReorderBufferSerializeTXN.
>
I have run the tests with coverage and found that
ReorderBufferSerializeTXN is not being hit.
The reason it is not being hit is the following check in
ReorderBufferCheckMemoryLimit:
/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;
Previously the tests from contrib/test_decoding could hit the
ReorderBufferSerializeTXN function.
I'm checking if we can modify an existing test or add a new test to hit
the ReorderBufferSerializeTXN function.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
To: | vignesh C <vignesh21(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-04 11:51:57 |
Message-ID: | CAA4eK1KW8KzhnNiLk3ayKUA4CkVNb_fm8USqXDw0nUK_0togJg@mail.gmail.com |
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> > I think the patch should do the simplest thing possible, i.e. what it
> > does today. Otherwise we'll never get it committed.
> >
> I found a couple of crashes while reviewing and testing flushing of
> open transaction data:
>
Thanks for doing these tests. However, I don't think these issues are
in any way related to this patch. They seem to be base code issues
manifested by this patch. See my analysis below.
> Issue 1:
> #0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
> #1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> #2 0x0000000000ec5390 in ExceptionalCondition
> (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> "FailedAssertion",
> fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> off=64) at ../../../../src/include/lib/ilist.h:458
> #4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> oldestRunningXid=3834) at reorderbuffer.c:1966
> #5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> buf=0x7ffcbc26dc50) at decode.c:332
>
This seems to be a problem of the base code, where we abort immediately
after serializing the changes because in that case the changes list
will be empty. I think you can try to reproduce it via the debugger,
or by hacking the code such that it serializes after every change;
then if you abort after one change, it should hit this problem.
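As an aside, the serialize-everything hack can be as small as disabling
the early-exit check in ReorderBufferCheckMemoryLimit (a debug-only
change, of course, never something to commit):
#if 0
    /* debug hack: never bail out early, so changes are always serialized */
    if (rb->size < logical_decoding_work_mem * 1024L)
        return;
#endif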
>
> Issue 2:
> #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2 0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
> fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
>
This seems to be again a problem with the base code, as we don't update
the final_lsn for subtransactions during ReorderBufferAbortOld. This
can also be reproduced with some hacking in the code or via the debugger
in a similar way as explained for the previous problem, but with the
difference that there must be a subtransaction involved in this case.
> #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> These failures come randomly.
> I'm not able to reproduce this issue with a simple test case.
Yeah, it appears to be difficult to reproduce unless you hack the code
to serialize every change or use the debugger to forcefully flush the
changes every time.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-05 05:38:02 |
Message-ID: | CAFiTN-uR5e2JvrMUUjRLSYTafm9tqrir2aZSXBfACWFFECkwjg@mail.gmail.com |
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >
> > > I think the patch should do the simplest thing possible, i.e. what it
> > > does today. Otherwise we'll never get it committed.
> > >
> > I found a couple of crashes while reviewing and testing flushing of
> > open transaction data:
> >
>
> Thanks for doing these tests. However, I don't think these issues are
> in any way related to this patch. They seem to be base code issues
> manifested by this patch. See my analysis below.
>
> > Issue 1:
> > #0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
> > #1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> > #2 0x0000000000ec5390 in ExceptionalCondition
> > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> > "FailedAssertion",
> > fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> > off=64) at ../../../../src/include/lib/ilist.h:458
> > #4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> > oldestRunningXid=3834) at reorderbuffer.c:1966
> > #5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> > buf=0x7ffcbc26dc50) at decode.c:332
> >
>
> This seems to be a problem of the base code, where we abort immediately
> after serializing the changes because in that case the changes list
> will be empty. I think you can try to reproduce it via the debugger,
> or by hacking the code such that it serializes after every change;
> then if you abort after one change, it should hit this problem.
>
I think you might need to kill the server after all the changes are
serialized; otherwise a normal abort will hit ReorderBufferAbort, which
will remove your ReorderBufferTXN entry, and you will never hit
this case.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | vignesh C <vignesh21(at)gmail(dot)com> |
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-06 06:03:43 |
Message-ID: | CALDaNm2-Q3Ua0c-Ja7Fe7pmaab9YNEUG1e1eDDLmqTFPCm3RoA@mail.gmail.com |
On Mon, Nov 4, 2019 at 3:46 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
> > >
> >
> > I was wondering whether we have checked the code coverage after this
> > patch? Previously, the existing tests seemed to cover most parts
> > of the function ReorderBufferSerializeTXN [1]. After this patch, the
> > timing of calls to ReorderBufferSerializeTXN will change, so that might
> > impact the testing of the same. If it is already covered, then I
> > would like to either add a new test or extend an existing test with the
> > help of the new spill counters. If it is not getting covered, then we
> > need to think of extending an existing test or writing a new test to
> > cover the function ReorderBufferSerializeTXN.
> >
> I have run the tests with coverage and found that
> ReorderBufferSerializeTXN is not being hit.
> The reason it is not being hit is the following check in
> ReorderBufferCheckMemoryLimit:
> /* bail out if we haven't exceeded the memory limit */
> if (rb->size < logical_decoding_work_mem * 1024L)
> return;
> Previously the tests from contrib/test_decoding could hit the
> ReorderBufferSerializeTXN function.
> I'm checking if we can modify an existing test or add a new test to hit
> the ReorderBufferSerializeTXN function.
I have made one change to the configuration file in the
contrib/test_decoding directory; with that, the coverage seems to be
fine. I have seen that the coverage is almost like the code before
applying the patch. I have attached the test change and the coverage
report for reference. The coverage report includes the core logical work
memory files for the base code and after applying the
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
0002-Track-statistics-for-spilling patches.
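For reference, the change is presumably just a one-line addition to the
test configuration (a sketch - the attached patch is authoritative):
# contrib/test_decoding/logical.conf
logical_decoding_work_mem = 64kB
With the much lower limit, the existing test_decoding tests exceed the
memory budget again and exercise the serialization path.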
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
0001-Add-logical_decoding_work_mem-configuration-for-test.patch | text/x-patch | 747 bytes |
coverage.tar | application/x-tar | 1.3 MB |
From: | vignesh C <vignesh21(at)gmail(dot)com> |
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-06 11:03:12 |
Message-ID: | CALDaNm1pXV7qpFiX77Li-jyPP+LOKcYuu2EvGeGeNpxGeAZGPA@mail.gmail.com |
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >
> > > I think the patch should do the simplest thing possible, i.e. what it
> > > does today. Otherwise we'll never get it committed.
> > >
> > I found a couple of crashes while reviewing and testing flushing of
> > open transaction data:
> >
>
> Thanks for doing these tests. However, I don't think these issues are
> in any way related to this patch. They seem to be base code issues
> manifested by this patch. See my analysis below.
>
> > Issue 1:
> > #0 0x00007f22c5722337 in raise () from /lib64/libc.so.6
> > #1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> > #2 0x0000000000ec5390 in ExceptionalCondition
> > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> > "FailedAssertion",
> > fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> > off=64) at ../../../../src/include/lib/ilist.h:458
> > #4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> > oldestRunningXid=3834) at reorderbuffer.c:1966
> > #5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> > buf=0x7ffcbc26dc50) at decode.c:332
> >
>
> This seems to be a problem of the base code, where we abort immediately
> after serializing the changes because in that case the changes list
> will be empty. I think you can try to reproduce it via the debugger,
> or by hacking the code such that it serializes after every change;
> then if you abort after one change, it should hit this problem.
>
> >
> > Issue 2:
> > #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> > #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> > #2 0x0000000000ec4e1d in ExceptionalCondition
> > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> > errorType=0x10ea284 "FailedAssertion",
> > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:3052
> > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:1318
> > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2b9d778) at reorderbuffer.c:1257
> > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> > oldestRunningXid=3835) at reorderbuffer.c:1973
> >
>
> This seems to be again a problem with the base code, as we don't update
> the final_lsn for subtransactions during ReorderBufferAbortOld. This
> can also be reproduced with some hacking in the code or via the debugger
> in a similar way as explained for the previous problem, but with the
> difference that there must be a subtransaction involved in this case.
>
> > #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> > buf=0x7ffcbc74cc00) at decode.c:332
> > #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> > record=0x2b67990) at decode.c:121
> > #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
> >
> > These failures come randomly.
> > I'm not able to reproduce this issue with a simple test case.
>
> Yeah, it appears to be difficult to reproduce unless you hack the code
> to serialize every change or use the debugger to forcefully flush the
> changes every time.
>
Thanks Amit for your analysis. I was able to reproduce the above issue
consistently by making some code changes and with the help of a debugger.
I made one change so that it flushes every time instead of flushing after
the buffer size exceeds logical_decoding_work_mem, attached one of
the transactions, and called abort. When the server restarts after the
abort, this problem occurs consistently. I could reproduce the issue
with the base code also. It seems this issue is not an issue of the
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer patch and
exists in the base code. I will post the issue on hackers with details.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
To: | vignesh C <vignesh21(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-07 09:49:45 |
Message-ID: | CAA4eK1Kdmi6VVguKEHV6Ho2isCPVFdQtt0WLsK10fiuE59_0Yw@mail.gmail.com |
On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> I have made one change to the configuration file in the
> contrib/test_decoding directory; with that, the coverage seems to be
> fine. I have seen that the coverage is almost like the code before
> applying the patch. I have attached the test change and the coverage
> report for reference. The coverage report includes the core logical work
> memory files for the base code and after applying the
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
> 0002-Track-statistics-for-spilling patches.
>
Thanks, I have incorporated your test changes and modified the two
patches. Please see attached.
Changes:
---------------
1. In guc.c, we should include reorderbuffer.h, not logical.h, as we
define logical_decoding_work_mem in the former.
2.
+ * To limit the amount of memory used by decoded changes, we track memory
+ * used at the reorder buffer level (i.e. total amount of memory), and for
+ * each toplevel transaction. When the total amount of used memory exceeds
+ * the limit, the toplevel transaction consuming the most memory is then
+ * serialized to disk.
In the above comments, removed 'toplevel' as we track memory usage for
both toplevel and subtransactions.
3. There were still a few mentions of streaming which I have removed.
4. In the docs, the type for stats spill_* was integer whereas it
should be bigint.
5.
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+ ReorderBuffer *rb = ctx->reorder;
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+
+ MyWalSnd->spillTxns = rb->spillTxns;
+ MyWalSnd->spillCount = rb->spillCount;
+ MyWalSnd->spillBytes = rb->spillBytes;
+
+ elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+ rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
Changed the above elog to DEBUG1 as otherwise it was getting printed
very frequently. I think we can make it DEBUG2 if we want.
6. There was an extra space in rules.out due to which a test was
failing. I have fixed it.
What do you think?
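As a usage note, once these go in, the new counters should be visible
per walsender, e.g. (a sketch, assuming they are exposed through
pg_stat_replication as the UpdateSpillStats() snippet above suggests):
SELECT pid, spill_txns, spill_count, spill_bytes
FROM pg_stat_replication;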
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch | application/octet-stream | 21.6 KB |
0002-Track-statistics-for-spilling.patch | application/octet-stream | 11.1 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-07 10:19:51 |
Message-ID: | CAFiTN-uF2=+U56BYBLerU-gjqvhnxx9=ak9NkGUoYnKXx7mn9w@mail.gmail.com |
On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > I have made one change to the configuration file in
> > contrib/test_decoding directory, with that the coverage seems to be
> > fine. I have seen that the coverage is almost like the code before
> > applying the patch. I have attached the test change and the coverage
> > report for reference. Coverage report includes the core logical work
> > memory files for base code and by applying
> > 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
> > 0002-Track-statistics-for-spilling patches.
> >
>
> Thanks, I have incorporated your test changes and modified the two
> patches. Please see attached.
>
> Changes:
> ---------------
> 1. In guc.c, we should include reorderbuffer.h, not logical.h, as we
> define logical_decoding_work_mem in the former.
Yeah Right.
>
> 2.
> + * To limit the amount of memory used by decoded changes, we track memory
> + * used at the reorder buffer level (i.e. total amount of memory), and for
> + * each toplevel transaction. When the total amount of used memory exceeds
> + * the limit, the toplevel transaction consuming the most memory is then
> + * serialized to disk.
>
> In the above comments, removed 'toplevel' as we track memory usage for
> both toplevel and subtransactions.
Correct.
>
> 3. There were still a few mentions of streaming which I have removed.
>
ok
> 4. In the docs, the type for stats spill_* was integer whereas it
> should be bigint.
ok
>
> 5.
> +UpdateSpillStats(LogicalDecodingContext *ctx)
> +{
> + ReorderBuffer *rb = ctx->reorder;
> +
> + SpinLockAcquire(&MyWalSnd->mutex);
> +
> + MyWalSnd->spillTxns = rb->spillTxns;
> + MyWalSnd->spillCount = rb->spillCount;
> + MyWalSnd->spillBytes = rb->spillBytes;
> +
> + elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
> + rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
>
> Changed the above elog to DEBUG1 as otherwise it was getting printed
> very frequently. I think we can make it DEBUG2 if we want.
Yeah, it should not be WARNING.
>
> 6. There was an extra space in rules.out due to which a test was
> failing. I have fixed it.
My bad. I introduced it while separating out the changes for the spilling.
> What do you think?
I have reviewed your changes and they look fine to me.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-07 11:43:13 |
Message-ID: | CAA4eK1JM0=RwODZQrn8DTQ3dbcb9xwKDdHCmVOryAk_xoKf9Nw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 7, 2019 at 3:50 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > What do you think?
> I have reviewed your changes and looks fine to me.
>
Okay, thanks. I am also happy with the two patches I have posted in
my last email [1].
Tomas, would you like to take a look at those patches and commit them
if you are happy or would you like me to do the same?
Some notes before commit:
--------------------------------------
1.
Commit message needs to be changed for the first patch
-------------------------------------------------------------------------
A.
> The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
SET logical_decoding_work_mem = '128kB'
> to trigger very aggressive streaming. The minimum value is 64kB.
I think this patch doesn't contain streaming, so we either need to
reword it or remove it.
B.
> The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
> subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).
We need to reword this as we have decided to remove the setting from
the subscription side as of now.
2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
3. I think we need a catversion bump for the second patch.
4. I think we can combine both patches and commit as one patch, but it
is okay to commit them separately as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
---|---|
To: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-12 10:42:35 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 04.11.2019 13:05, Kuntal Ghosh wrote:
> On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> So your result shows that with "streaming on", performance is
>> degrading? By any chance did you try to see where is the bottleneck?
>>
> Right. But, as we increase the logical_decoding_work_mem, the
> performance improves. I've not analyzed the bottleneck yet. I'm
> looking into the same.
My guess is that 64 kB is just too small a value. In the table schema
used for the tests, every row takes at least 24 bytes for storing column
values. Thus, with this logical_decoding_work_mem value the limit should
be hit after roughly 2500+ rows, i.e. about 400 times during a
transaction of 1,000,000 rows.
That is just too frequent, while ReorderBufferStreamTXN includes a whole
bunch of logic, e.g. it always starts an internal transaction:
/*
* Decoding needs access to syscaches et al., which in turn use
* heavyweight locks and such. Thus we need to have enough state around to
* keep track of those. The easiest way is to simply use a transaction
* internally. That also allows us to easily enforce that nothing writes
* to the database by checking for xid assignments. ...
*/
Also, it issues separate stream_start/stop messages around each streamed
transaction chunk. So if streaming starts and stops too frequently it
adds extra overhead and may even interfere with the current in-progress
transaction.
If I understand it correctly, this is rather expected with too-small
values of logical_decoding_work_mem. It could probably be optimized, but
I am not sure that is worth doing right now.
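To sanity-check the numbers above, here is a back-of-the-envelope
calculation (a minimal sketch in C; the 24-byte row size is the figure
from this discussion, and the real per-change overhead in the reorder
buffer is larger, so in practice the limit is hit even more often):

    #include <stdio.h>

    int main(void)
    {
        const long work_mem = 64 * 1024;   /* logical_decoding_work_mem = 64kB */
        const long row_size = 24;          /* bytes of column values per row */
        const long txn_rows = 1000000;     /* rows in the test transaction */

        long rows_per_limit = work_mem / row_size;   /* ~2730 rows */
        long times_hit = txn_rows / rows_per_limit;  /* ~366 times */

        printf("limit hit after every ~%ld rows, ~%ld times per txn\n",
               rows_per_limit, times_hit);
        return 0;
    }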
Regards
--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
---|---|
To: | Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-12 11:34:05 |
Message-ID: | CAGz5QCL8uBV1af--prpETYamjZJaqF=+mbVSeyWFJAqETvVDaw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 12, 2019 at 4:12 PM Alexey Kondratov
<a(dot)kondratov(at)postgrespro(dot)ru> wrote:
>
> On 04.11.2019 13:05, Kuntal Ghosh wrote:
> > On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >> So your result shows that with "streaming on", performance is
> >> degrading? By any chance did you try to see where is the bottleneck?
> >>
> > Right. But, as we increase the logical_decoding_work_mem, the
> > performance improves. I've not analyzed the bottleneck yet. I'm
> > looking into the same.
>
> My guess is that 64 kB is just too small value. In the table schema used
> for tests every rows takes at least 24 bytes for storing column values.
> Thus, with this logical_decoding_work_mem value the limit should be hit
> after about 2500+ rows, or about 400 times during transaction of 1000000
> rows size.
>
> It is just too frequent, while ReorderBufferStreamTXN includes a whole
> bunch of logic, e.g. it always starts internal transaction:
>
> /*
> * Decoding needs access to syscaches et al., which in turn use
> * heavyweight locks and such. Thus we need to have enough state around to
> * keep track of those. The easiest way is to simply use a transaction
> * internally. That also allows us to easily enforce that nothing writes
> * to the database by checking for xid assignments. ...
> */
>
> Also it issues separated stream_start/stop messages around each streamed
> transaction chunk. So if streaming starts and stops too frequently it
> adds additional overhead and may even interfere with current in-progress
> transaction.
>
Yeah, I've found the same. With each stream_start/stop message, it
writes 1 byte of checksum and 4 bytes for the number of subtransactions,
which increases the write amplification significantly.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-13 12:25:04 |
Message-ID: | CAA4eK1L6BGb9BaK929Z2Z40F6Hr=0r=UhQthDFidKXf73doiqw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
As I mentioned a few days back, the first patch in this series is ready
to go [1] (I am hoping Tomas will pick it up), so I have started the
review of the other patches.
Review/Questions on 0002-Immediately-WAL-log-assignments.patch
-------------------------------------------------------------------------------------------------
1. This patch adds the top_xid to WAL the first time WAL is written for
a subtransaction XID, so that the changes of an in-progress transaction
can be decoded correctly. This patch also removes logging and applying
WAL for XLOG_XACT_ASSIGNMENT, which might have some effect: replay of
that record prunes KnownAssignedXids to prevent overflow of that array.
See comments in procarray.c (KnownAssignedTransactionIds sub-module).
Can you please explain how we will handle that after removing the WAL
for XLOG_XACT_ASSIGNMENT, or am I missing something and there is no
impact?
2.
+#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */
This doesn't seem to be used in this patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-14 04:07:30 |
Message-ID: | CAFiTN-tVFsktyTkJb=j72o7cK=yeNA9b=LqRUGvCPdEz66rYKw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
>
> As mentioned by me a few days back that the first patch in this series
> is ready to go [1] (I am hoping Tomas will pick it up), so I have
> started the review of other patches
>
> Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> -------------------------------------------------------------------------------------------------
> 1. This patch adds the top_xid in WAL whenever the first time WAL for
> a subtransaction XID is written to correctly decode the changes of
> in-progress transaction. This patch also removes logging and applying
> WAL for XLOG_XACT_ASSIGNMENT which might have some effect. As replay
> of that, it prunes KnownAssignedXids to prevent overflow of that
> array. See comments in procarray.c (KnownAssignedTransactionIds
> sub-module). Can you please explain how after removing the WAL for
> XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> and there is no impact of same?
It seems like a problem to me as well. One option could be that, since
we now add the top transaction id to the first WAL record of the
subtransaction, we can directly update pg_subtrans, avoid adding the
subtransaction id to KnownAssignedXids, and mark it as
lastOverflowedXid. But I don't think we should go in that direction, as
it would hurt the performance of visibility checks on the hot standby.
Let's see what Tomas has in mind.
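To make the option above concrete, a toy model of it (illustrative only,
not backend code; pg_subtrans is modeled as a simple parent array, and
the overflow marker stands in for lastOverflowedXid):

    #include <stdio.h>

    #define MAX_XID 1024

    static unsigned parent_of[MAX_XID];    /* stands in for pg_subtrans */
    static unsigned last_overflowed_xid;   /* forces parent-map lookups */

    /* On a subtransaction's first WAL record, record only the parent
     * instead of tracking the subxid in KnownAssignedXids. */
    static void
    apply_first_wal_of_subxact(unsigned subxid, unsigned top_xid)
    {
        parent_of[subxid] = top_xid;
        last_overflowed_xid = subxid;
    }

    /* Visibility checks resolve subxids via the parent map. */
    static unsigned
    resolve_top_xid(unsigned xid)
    {
        while (parent_of[xid] != 0)
            xid = parent_of[xid];
        return xid;
    }

    int main(void)
    {
        apply_first_wal_of_subxact(101, 100);  /* subxact 101 of txn 100 */
        apply_first_wal_of_subxact(102, 101);  /* nested subxact */

        printf("xid 102 belongs to top-level txn %u\n", resolve_top_xid(102));
        printf("subxid cache overflowed up to xid %u\n", last_overflowed_xid);
        return 0;
    }

The performance concern comes from the last line of the apply step:
keeping the snapshot permanently "overflowed" pushes every visibility
check onto the parent-map lookup path.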
>
> 2.
> +#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */
>
> This doesn't seem to be used in this patch.
>
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-14 06:40:07 |
Message-ID: | CAA4eK1JNhoF8WT2ObMaLqXpqOwOS-oe8dtq3n-FSY4ptWF7BTQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> >
> > As mentioned by me a few days back that the first patch in this series
> > is ready to go [1] (I am hoping Tomas will pick it up), so I have
> > started the review of other patches
> >
> > Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> > -------------------------------------------------------------------------------------------------
> > 1. This patch adds the top_xid in WAL whenever the first time WAL for
> > a subtransaction XID is written to correctly decode the changes of
> > in-progress transaction. This patch also removes logging and applying
> > WAL for XLOG_XACT_ASSIGNMENT which might have some effect. As replay
> > of that, it prunes KnownAssignedXids to prevent overflow of that
> > array. See comments in procarray.c (KnownAssignedTransactionIds
> > sub-module). Can you please explain how after removing the WAL for
> > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> > and there is no impact of same?
>
> It seems like a problem to me as well. One option could be that
> since now we are adding the top transaction id in the first WAL of the
> subtransaction we can directly update the pg_subtrans and avoid adding
> sub transaction id in the KnownAssignedXids and mark it as
> lastOverflowedXid.
>
Hmm, I am not sure if we can do that easily, because I think in
RecordKnownAssignedTransactionIds we add those based on the gap via
KnownAssignedXidsAdd and only remove them later while applying WAL for
XLOG_XACT_ASSIGNMENT. I think if we really want to go in this
direction then for each WAL record we need to check whether it has
XLR_BLOCK_ID_TOPLEVEL_XID set and then call
ProcArrayApplyXidAssignment() with the required information. I think
this line of attack has WAL overhead both on the master whenever
subtransactions are involved and also on the hot standby, which has to
do the work for each subtransaction separately. The WAL apply would
need to acquire and release ProcArrayLock in exclusive mode for each
subtransaction, whereas now it does that once per
PGPROC_MAX_CACHED_SUBXIDS subtransactions, and that can conflict with
queries running on the standby.
The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
mechanism (WAL logging and applying it on the hot standby) as it is, and
additionally log top_xid the first time WAL is written for a
subtransaction, but only when wal_level >= WAL_LEVEL_LOGICAL. Then use
that for logical decoding. The advantage of this approach is that we
incur the overhead of the additional transaction id only when required,
and in particular not with the default server configuration.
Thoughts?
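A minimal sketch of the gating this would imply (a toy model, not
backend code; the enum and helper are illustrative stand-ins for the
real wal_level machinery):

    #include <stdbool.h>
    #include <stdio.h>

    enum WalLevel { WAL_LEVEL_REPLICA, WAL_LEVEL_LOGICAL };

    static enum WalLevel wal_level = WAL_LEVEL_LOGICAL;

    /* Tag the subtransaction's first WAL record with its top_xid only
     * under wal_level >= logical; XLOG_XACT_ASSIGNMENT stays untouched. */
    static bool
    need_toplevel_xid(bool is_subxact, bool first_record_of_subxact)
    {
        return wal_level >= WAL_LEVEL_LOGICAL &&
               is_subxact && first_record_of_subxact;
    }

    int main(void)
    {
        printf("logical: tag top_xid? %s\n",
               need_toplevel_xid(true, true) ? "yes" : "no");

        wal_level = WAL_LEVEL_REPLICA;   /* the default configuration */
        printf("replica: tag top_xid? %s\n",
               need_toplevel_xid(true, true) ? "yes" : "no");
        return 0;
    }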
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-14 10:09:56 |
Message-ID: | CAFiTN-u2qG5eBPNeciW7UCxGasCbkx_X5n-YdMkvOA4b90w5=g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 14, 2019 at 12:10 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > >
> > > As mentioned by me a few days back that the first patch in this series
> > > is ready to go [1] (I am hoping Tomas will pick it up), so I have
> > > started the review of other patches
> > >
> > > Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> > > -------------------------------------------------------------------------------------------------
> > > 1. This patch adds the top_xid in WAL whenever the first time WAL for
> > > a subtransaction XID is written to correctly decode the changes of
> > > in-progress transaction. This patch also removes logging and applying
> > > WAL for XLOG_XACT_ASSIGNMENT which might have some effect. As replay
> > > of that, it prunes KnownAssignedXids to prevent overflow of that
> > > array. See comments in procarray.c (KnownAssignedTransactionIds
> > > sub-module). Can you please explain how after removing the WAL for
> > > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> > > and there is no impact of same?
> >
> > It seems like a problem to me as well. One option could be that
> > since now we are adding the top transaction id in the first WAL of the
> > subtransaction we can directly update the pg_subtrans and avoid adding
> > sub transaction id in the KnownAssignedXids and mark it as
> > lastOverflowedXid.
> >
>
> Hmm, I am not sure if we can do that easily because I think in
> RecordKnownAssignedTransactionIds, we add those based on the gap via
> KnownAssignedXidsAdd and only remove them later while applying WAL for
> XLOG_XACT_ASSIGNMENT. I think if we really want to go in this
> direction then for each WAL record we need to check if it has
> XLR_BLOCK_ID_TOPLEVEL_XID set and then call function
> ProcArrayApplyXidAssignment() with the required information. I think
> this line of attack has WAL overhead both on master whenever
> subtransactions are involved and also on hot-standby for doing the
> work for each subtransaction separately. The WAL apply needs to
> acquire and release PROCArrayLock in exclusive mode for each
> subtransaction whereas now it does it once for
> PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict
> with queries running on standby.
Right
>
> The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
> mechanism (WAL logging and apply of same on hot-standby) as it is and
> additionally log top_xid the first time when WAL is written for a
> subtransaction only when wal_level >= WAL_LEVEL_LOGICAL. Then use the
> same for logical decoding. The advantage of this approach is that we
> will incur the overhead of additional transactionid only when required
> especially not with default server configuration.
>
> Thoughts?
The idea seems reasonable to me.
Apart from this, I have another question in
0003-Issue-individual-invalidations-with-wal_level-logical.patch
@@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
{
AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
dbId, relId);
+
+ /* Issue an invalidation WAL record (when wal_level=logical) */
+ if (XLogLogicalInfoActive())
+ {
+ SharedInvalidationMessage msg;
+
+ msg.sn.id = SHAREDINVALSNAPSHOT_ID;
+ msg.sn.dbId = dbId;
+ msg.sn.relId = relId;
+
+ LogLogicalInvalidations(1, &msg, false);
+ }
}
I am not sure why we need to explicitly WAL-log the snapshot
invalidation, because this is logged to invalidate the catalog
snapshot, and for logical decoding we use the historic snapshot, not
the catalog snapshot. Am I missing something?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-15 10:20:10 |
Message-ID: | CAA4eK1J5r4m3aRN=MU=xLMNG9xDyot90+meptkogYEyDBGWKig@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
>
> Apart from this, I have another question in
> 0003-Issue-individual-invalidations-with-wal_level-logical.patch
>
> @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
> {
> AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
> dbId, relId);
> +
> + /* Issue an invalidation WAL record (when wal_level=logical) */
> + if (XLogLogicalInfoActive())
> + {
> + SharedInvalidationMessage msg;
> +
> + msg.sn.id = SHAREDINVALSNAPSHOT_ID;
> + msg.sn.dbId = dbId;
> + msg.sn.relId = relId;
> +
> + LogLogicalInvalidations(1, &msg, false);
> + }
> }
>
> I am not sure why do we need to explicitly WAL log the snapshot
> invalidation? because this is logged for invalidating the catalog
> snapshot and for logical decoding we use HistoricSnapshot, not the
> catalog snapshot.
>
I think it has been logged because, without this patch as well, we log
all the invalidation messages at commit time and process them during
decoding. However, I agree that this particular invalidation message
is not required for logical decoding, for the reason you mentioned. As
we are explicitly logging invalidations, it is better to avoid this one
if we can.
Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;
Here, why are we executing messages individually? Can't we just follow
what we do in DecodeCommit, which is to record the invalidations in
ReorderBufferTXN as we encounter them and then execute them on each
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID? Is there a reason why we
don't do ReorderBufferXidSetCatalogChanges when we receive an
invalidation message?
2.
@@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
* although we don't check the memory limit when restoring the changes in
* this branch (we only do that when initially queueing the changes after
* decoding), because we will release the changes later, and that will
- * update the accounting too (subtracting the size from the counters).
- * And we don't want to underflow there.
+ * update the accounting too (subtracting the size from the counters). And
+ * we don't want to underflow there.
*/
This seems like an unrelated change.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-15 10:31:15 |
Message-ID: | CAFiTN-vD9ui2pgewJv+r=Y4brYdRg--gSLy_2XG_zFRWiiu5eA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> >
> > Apart from this, I have another question in
> > 0003-Issue-individual-invalidations-with-wal_level-logical.patch
> >
> > @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
> > {
> > AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
> > dbId, relId);
> > +
> > + /* Issue an invalidation WAL record (when wal_level=logical) */
> > + if (XLogLogicalInfoActive())
> > + {
> > + SharedInvalidationMessage msg;
> > +
> > + msg.sn.id = SHAREDINVALSNAPSHOT_ID;
> > + msg.sn.dbId = dbId;
> > + msg.sn.relId = relId;
> > +
> > + LogLogicalInvalidations(1, &msg, false);
> > + }
> > }
> >
> > I am not sure why do we need to explicitly WAL log the snapshot
> > invalidation? because this is logged for invalidating the catalog
> > snapshot and for logical decoding we use HistoricSnapshot, not the
> > catalog snapshot.
> >
>
> I think it has been logged because without this patch as well we log
> all the invalidation messages at commit time and process them during
> decoding. However, I agree that this particular invalidation message
> is not required for logical decoding for the reason you mentioned. I
> think as we are explicitly logging invalidations, so it is better to
> avoid this if we can.
Ok
>
> Few other comments on this patch:
> 1.
> + case REORDER_BUFFER_CHANGE_INVALIDATION:
> +
> + /*
> + * Execute the invalidation message locally.
> + *
> + * XXX Do we need to care about relcacheInitFileInval and
> + * the other fields added to ReorderBufferChange, or just
> + * about the message itself?
> + */
> + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> + break;
>
> Here, why are we executing messages individually? Can't we just
> follow what we do in DecodeCommit which is to record the invalidations
> in ReorderBufferTXN as we encounter them and then allow them to
> execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> reason why we don't do ReorderBufferXidSetCatalogChanges when we
> receive any invalidation message?
IMHO, the reason is that in DecodeCommit we get all the invalidations
at one time, so at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID we don't
know which invalidation messages to execute, and to be safe we have to
execute them all. But since we are logging every invalidation
individually, at this stage we know exactly which cache to invalidate.
So it is better to invalidate only the required cache, not all of them.
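A toy model of the trade-off (illustrative only): with commit-time
accumulation, every collected message must be replayed at each new
command id because we cannot tell which one is needed, whereas
individually logged invalidations can each be executed exactly where
they appear in the stream:

    #include <stdio.h>

    typedef struct
    {
        const char *cache;   /* which cache to invalidate */
        int         id;
    } InvalMsg;

    static void
    execute(const InvalMsg *m)
    {
        printf("invalidate %s(%d)\n", m->cache, m->id);
    }

    int main(void)
    {
        InvalMsg msgs[] = {
            {"relcache", 16384},
            {"catcache", 1259},
        };

        /* per-message WAL records: run just the one that was decoded */
        printf("-- individual --\n");
        execute(&msgs[1]);

        /* commit-time accumulation: replay everything collected so far */
        printf("-- accumulated --\n");
        for (int i = 0; i < 2; i++)
            execute(&msgs[i]);

        return 0;
    }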
>
> 2.
> @@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
> * although we don't check the memory limit when restoring the changes in
> * this branch (we only do that when initially queueing the changes after
> * decoding), because we will release the changes later, and that will
> - * update the accounting too (subtracting the size from the counters).
> - * And we don't want to underflow there.
> + * update the accounting too (subtracting the size from the counters). And
> + * we don't want to underflow there.
> */
>
> This seems like an unrelated change.
Indeed.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-15 10:49:27 |
Message-ID: | CAA4eK1JN3Cn_ZsyS3hKHEqGsvHvhuBgyLpKb+Qeefuub2r0Krw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > Few other comments on this patch:
> > 1.
> > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > +
> > + /*
> > + * Execute the invalidation message locally.
> > + *
> > + * XXX Do we need to care about relcacheInitFileInval and
> > + * the other fields added to ReorderBufferChange, or just
> > + * about the message itself?
> > + */
> > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > + break;
> >
> > Here, why are we executing messages individually? Can't we just
> > follow what we do in DecodeCommit which is to record the invalidations
> > in ReorderBufferTXN as we encounter them and then allow them to
> > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > receive any invalidation message?
> IMHO, the reason is that in DecodeCommit, we get all the invalidation
> at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
> know which invalidation message to execute so for being safe we have
> to execute all. But, since we are logging all invalidation
> individually, we exactly know at this stage which cache to invalidate.
> So it is better to only invalidate required cache not all.
>
In that case, invalidations can be processed multiple times: first when
these individual WAL records for invalidations are processed, and then
later at commit time, when we accumulate all invalidation messages and
execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Can we
avoid executing invalidations from other places after this patch,
including executing them as part of XLOG_INVALIDATIONS processing?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-16 13:14:23 |
Message-ID: | CAA4eK1L5u=L42CjAb44CrgH4c10-ew8Q=1SqqJ0QSVNr-a6XaQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> Some notes before commit:
> --------------------------------------
> 1.
> Commit message need to be changed for the first patch
> -------------------------------------------------------------------------
> A.
> > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
>
> SET logical_decoding_work_mem = '128kB'
>
> > to trigger very aggressive streaming. The minimum value is 64kB.
>
> I think this patch doesn't contain streaming, so we either need to
> reword it or remove it.
>
> B.
> > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
> > subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).
>
> We need to reword this as we have decided to remove the setting from
> the subscription side as of now.
>
> 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
>
I have made these modifications and additionally ran pgindent.
> 4. I think we can combine both patches and commit as one patch, but it
> is okay to commit them separately as well.
>
I am not sure if this is a good idea, so I have kept them separate.
Tomas, do let me know if you want to commit these or if you have any
comments; otherwise, I will commit them on Tuesday (19-Nov).
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.nov16.patch | application/octet-stream | 21.8 KB |
0002-Track-statistics-for-spilling-of-changes-from-Reorde.nov16.patch | application/octet-stream | 11.5 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-18 11:32:32 |
Message-ID: | CAFiTN-v+Yzrg8kr4uUzWKV3LOgavgmzr54+6CXT+p30XAAjRMw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > Few other comments on this patch:
> > > 1.
> > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > +
> > > + /*
> > > + * Execute the invalidation message locally.
> > > + *
> > > + * XXX Do we need to care about relcacheInitFileInval and
> > > + * the other fields added to ReorderBufferChange, or just
> > > + * about the message itself?
> > > + */
> > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > + break;
> > >
> > > Here, why are we executing messages individually? Can't we just
> > > follow what we do in DecodeCommit which is to record the invalidations
> > > in ReorderBufferTXN as we encounter them and then allow them to
> > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > receive any invalidation message?
I think it's fine to call ReorderBufferXidSetCatalogChanges only on
commit, because this is required to add a committed transaction to the
snapshot only if it has done catalog changes. So I think there is no
point in setting that flag every time we get an invalidation message.
> > IMHO, the reason is that in DecodeCommit, we get all the invalidation
> > at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
> > know which invalidation message to execute so for being safe we have
> > to execute all. But, since we are logging all invalidation
> > individually, we exactly know at this stage which cache to invalidate.
> > So it is better to only invalidate required cache not all.
> >
>
> In that case, invalidations can be processed multiple times, the first
> time when these individual WAL logs for invalidation are processed and
> then later at commit time when we accumulate all invalidation messages
> and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.
> Can we avoid to execute invalidations from other places after this
> patch which also includes executing them as part of XLOG_INVALIDATIONS
> processing?
I think we can avoid the invalidations done as part of
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. I need to further
investigate the invalidations done as part of XLOG_INVALIDATIONS.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-19 11:53:00 |
Message-ID: | CAA4eK1+GgWdK=VJXQYJv2S6A8VNGa0mMrzq1nG-R0b0y0-repg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > >
> > > > Few other comments on this patch:
> > > > 1.
> > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > +
> > > > + /*
> > > > + * Execute the invalidation message locally.
> > > > + *
> > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > + * the other fields added to ReorderBufferChange, or just
> > > > + * about the message itself?
> > > > + */
> > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > + break;
> > > >
> > > > Here, why are we executing messages individually? Can't we just
> > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > receive any invalidation message?
>
> I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> commit. Because this is required to add any committed transaction to
> the snapshot if it has done any catalog changes.
>
Hmm, this is also used to build the cid hash map (see
ReorderBufferBuildTupleCidHash), which we need while streaming changes
for in-progress transactions. So I think it would be required earlier
(before commit) as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-19 11:55:33 |
Message-ID: | CAA4eK1Kd0Dw1+R+vgN1ScJn5tJZdbDv9siS-mvko-gncAAHang@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > Some notes before commit:
> > --------------------------------------
> > 1.
> > Commit message need to be changed for the first patch
> > -------------------------------------------------------------------------
> > A.
> > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
> >
> > SET logical_decoding_work_mem = '128kB'
> >
> > > to trigger very aggressive streaming. The minimum value is 64kB.
> >
> > I think this patch doesn't contain streaming, so we either need to
> > reword it or remove it.
> >
> > B.
> > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
> > > subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).
> >
> > We need to reword this as we have decided to remove the setting from
> > the subscription side as of now.
> >
> > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
> >
>
> I have made these modifications and additionally ran pgindent.
>
> > 4. I think we can combine both patches and commit as one patch, but it
> > is okay to commit them separately as well.
> >
>
> I am not sure if this is a good idea, so still kept them as separate.
>
I have committed the first patch. I will commit the second one,
related to stats of spilled xacts, on Thursday. The second patch needs
a catalog version bump as well, because we are modifying the catalog
contents in that patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-20 05:45:40 |
Message-ID: | CAFiTN-u_rrLwKSz2eV_TbjAhd+32WrpkY6wjTeS1dxU7ExdjkA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > >
> > > > > Few other comments on this patch:
> > > > > 1.
> > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > +
> > > > > + /*
> > > > > + * Execute the invalidation message locally.
> > > > > + *
> > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > + * about the message itself?
> > > > > + */
> > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > + break;
> > > > >
> > > > > Here, why are we executing messages individually? Can't we just
> > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > receive any invalidation message?
> >
> > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > commit. Because this is required to add any committed transaction to
> > the snapshot if it has done any catalog changes.
> >
>
> Hmm, this is also used to build cid hash map (see
> ReorderBufferBuildTupleCidHash) which we need to use while streaming
> changes for the in-progress transactions. So, I think that it would
> be required earlier (before commit) as well.
>
Oh right, I guess I missed that part.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-20 14:52:03 |
Message-ID: | CAFiTN-vckEvo_4+Hv=fNkM1for68Lk7O4YnajAEGH3azo=Oxdw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > >
> > > > > > Few other comments on this patch:
> > > > > > 1.
> > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > +
> > > > > > + /*
> > > > > > + * Execute the invalidation message locally.
> > > > > > + *
> > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > + * about the message itself?
> > > > > > + */
> > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > + break;
> > > > > >
> > > > > > Here, why are we executing messages individually? Can't we just
> > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > receive any invalidation message?
> > >
> > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > commit. Because this is required to add any committed transaction to
> > > the snapshot if it has done any catalog changes.
> > >
> >
> > Hmm, this is also used to build cid hash map (see
> > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > changes for the in-progress transactions. So, I think that it would
> > be required earlier (before commit) as well.
> >
> Oh right, I guess I missed that part.
Attached is a new rebased version of the patch set. I have fixed all
the issues discussed and agreed upon up-thread.
Pending issues:
1. The default value of logical_decoding_work_mem is set to 64kB in
test_decoding/logical.conf, so we need to change the expected output
files for the test_decoding module.
2. Need to complete the patch for concurrent abort handling of the
(sub)transaction. There are some pending issues with the existing
patch [1].
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-21 03:32:12 |
Message-ID: | CAFiTN-skxG9U-kkz-9thN_o3_yC-=S=DjObk7LbibN8iOr-HyA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Few other comments on this patch:
> > > > > > > 1.
> > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Execute the invalidation message locally.
> > > > > > > + *
> > > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > > + * about the message itself?
> > > > > > > + */
> > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > > + break;
> > > > > > >
> > > > > > > Here, why are we executing messages individually? Can't we just
> > > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > > receive any invalidation message?
> > > >
> > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > > commit. Because this is required to add any committed transaction to
> > > > the snapshot if it has done any catalog changes.
> > > >
> > >
> > > Hmm, this is also used to build cid hash map (see
> > > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > > changes for the in-progress transactions. So, I think that it would
> > > be required earlier (before commit) as well.
> > >
> > Oh right, I guess I missed that part.
>
> Attached a new rebased version of the patch set. I have fixed all
> the issues discussed up-thread and agreed upon.
>
> Pending Issues:
> 1. The default value of the logical_decoding_work_mem is set to 64kb
> in test_decoding/logical.conf. So we need to change the expected
> output files for the test decoding module.
> 2. Need to complete the patch for concurrent abort handling of the
> (sub)transaction. There are some pending issues with the existing
> patch[1].
> [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com
Apart from these, there is one more issue reported upthread [2].
[2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-21 10:44:43 |
Message-ID: | CAA4eK1+3fsw0_zZCdSOHi+B-Mu_4R7RQwgiDcbdVYB9iLicOhA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 19, 2019 at 5:25 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > Some notes before commit:
> > > --------------------------------------
> > > 1.
> > > Commit message need to be changed for the first patch
> > > -------------------------------------------------------------------------
> > > A.
> > > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
> > >
> > > SET logical_decoding_work_mem = '128kB'
> > >
> > > > to trigger very aggressive streaming. The minimum value is 64kB.
> > >
> > > I think this patch doesn't contain streaming, so we either need to
> > > reword it or remove it.
> > >
> > > B.
> > > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the
> > > > subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).
> > >
> > > We need to reword this as we have decided to remove the setting from
> > > the subscription side as of now.
> > >
> > > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
> > >
> >
> > I have made these modifications and additionally ran pgindent.
> >
> > > 4. I think we can combine both patches and commit as one patch, but it
> > > is okay to commit them separately as well.
> > >
> >
> > I am not sure if this is a good idea, so still kept them as separate.
> >
>
> I have committed the first patch. I will commit the second one
> related to stats of spilled xacts on Thursday. The second patch needs
> catalog version bump as well because we are modifying the catalog
> contents in that patch.
>
Committed the second one as well. Now we can move on to reviewing the
patches for "streaming of in-progress transactions".
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-11-22 07:48:11 |
Message-ID: | CAFiTN-uiCmaTFg5LG6PpkQDdWvyjn3S2qidJWuVGpx=niGq1Eg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 21, 2019 at 9:02 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Few other comments on this patch:
> > > > > > > > 1.
> > > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * Execute the invalidation message locally.
> > > > > > > > + *
> > > > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > > > + * about the message itself?
> > > > > > > > + */
> > > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > > > + break;
> > > > > > > >
> > > > > > > > Here, why are we executing messages individually? Can't we just
> > > > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a
> > > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > > > receive any invalidation message?
> > > > >
> > > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > > > commit. Because this is required to add any committed transaction to
> > > > > the snapshot if it has done any catalog changes.
> > > > >
> > > >
> > > > Hmm, this is also used to build cid hash map (see
> > > > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > > > changes for the in-progress transactions. So, I think that it would
> > > > be required earlier (before commit) as well.
> > > >
> > > Oh right, I guess I missed that part.
> >
> > Attached a new rebased version of the patch set. I have fixed all
> > the issues discussed up-thread and agreed upon.
> >
> > Pending Issues:
> > 1. The default value of the logical_decoding_work_mem is set to 64kb
> > in test_decoding/logical.conf. So we need to change the expected
> > output files for the test decoding module.
> > 2. Need to complete the patch for concurrent abort handling of the
> > (sub)transaction. There are some pending issues with the existing
> > patch[1].
> > [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com
> Apart from these there is one more issue reported upthread[2]
> [2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com
>
I have rebased the patch on the latest head and also fixed the issue of
"concurrent abort handling of the (sub)transaction", attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set. I have added the version number so that we
can track the changes.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-01 02:28:34 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> I have rebased the patch on the latest head and also fix the issue of
> "concurrent abort handling of the (sub)transaction." and attached as
> (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> the complete patch set. I have added the version number so that we
> can track the changes.
The patch has rotten a bit and does not apply anymore. Could you
please send a rebased version? I have moved it to next CF, waiting on
author.
--
Michael
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-02 08:31:50 |
Message-ID: | CAFiTN-sKDPk+B38UAw7O27GmfuECVT65nMZFco9MgeMj=0Vs0w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > I have rebased the patch on the latest head and also fix the issue of
> > "concurrent abort handling of the (sub)transaction." and attached as
> > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > the complete patch set. I have added the version number so that we
> > can track the changes.
>
> The patch has rotten a bit and does not apply anymore. Could you
> please send a rebased version? I have moved it to next CF, waiting on
> author.
I have rebased the patch set on the latest head.
Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction
then there is an error on the standby (no relation map entry for remote
relation ID 16390). After analyzing it, I have found that for
streaming transactions an "is_schema_sent" flag is kept in
ReorderBufferTXN. I think that is done so that we can send the
schema for each transaction stream, so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema. But this
solution has induced a very basic issue: if a transaction operates
on more than one relation, then after sending the schema for the first
relation it will mark the flag true, and the schema for the subsequent
relations will never be sent. I am still working on finding a better
solution for this; if anyone has any opinion/solution about this, feel
free to suggest.
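To illustrate, the problematic pattern is roughly this (simplified;
send_schema() is just a placeholder name, not the actual function in
the patch):

    if (!txn->is_schema_sent)
    {
        send_schema(relation);       /* only the first relation's schema */
        txn->is_schema_sent = true;  /* later relations in this transaction
                                      * never get their schema sent */
    }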
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-09 07:56:59 |
Message-ID: | CAFiTN-vgh4My9NpnEaxFDfzYJBtoZYf+WyqQQiS_LJ2O8APKpw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 2, 2019 at 2:01 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set. I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore. Could you
> > please send a rebased version? I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>
I have reviewed the patch set and here are a few comments/questions:
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
Should we show the tuple in the streamed change like we do for the
pg_decode_change?
2. pg_logical_slot_get_changes_guts
It recreates the decoding context [ctx =
CreateDecodingContext(InvalidXLogRecPtr)] but doesn't set streaming
to false. Should we pass a parameter to
pg_logical_slot_get_changes_guts saying whether we want streamed results or not?
3.
+ XLogRecPtr prev_lsn = InvalidXLogRecPtr;
ReorderBufferChange *change;
ReorderBufferChange *specinsert = NULL;
@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
Relation relation = NULL;
Oid reloid;
+ /*
+ * Enforce correct ordering of changes, merged from multiple
+ * subtransactions. The changes may have the same LSN due to
+ * MULTI_INSERT xlog records.
+ */
+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);
+
+ prev_lsn = change->lsn;
I did not understand how this change is relevant to this patch.
4.
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn, command_id);
In which case will txn->is_schema_sent be true? At the end of
the stream, in ReorderBufferExecuteInvalidations, we are always setting
it to false,
so while sending the next stream it will always be false. That means we
never need the snapshot_now variable in ReorderBufferTXN.
5.
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * We read catalog changes from WAL, which are not yet sent, so
+ * invalidate current schema in order output plugin can resend
+ * schema again.
+ */
+ txn->is_schema_sent = false;
Same as point 4, during decode time it will never be true.
6.
+ /* send fields */
+ pq_sendint64(out, commit_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);
commit_time and end_lsn are used in standby_feedback
7.
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
We cannot roll back an intermediate subtransaction without rolling back
the latest subtransaction, so why do we need
to search the array? It will always be the last subxact, no?
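If that holds, something as simple as this should suffice (untested
sketch, reusing the variables from the quoted code):

    /* aborts arrive innermost-first, so only the last entry can match */
    if (nsubxacts > 0 && subxacts[nsubxacts - 1].xid == subxid)
    {
        subidx = nsubxacts - 1;
        found = true;
    }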
8.
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);
Why is feedback sent for every change?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-10 04:22:13 |
Message-ID: | CAA4eK1JJiU-w1NS9WB4NYCp-P8+5eLM5o83qVjKxbUT1rG8+qA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set. I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore. Could you
> > please send a rebased version? I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>
> Apart from this, there is one issue reported by my colleague Vignesh.
> The issue is that if we use more than two relations in a transaction
> then there is an error on standby (no relation map entry for remote
> relation ID 16390). After analyzing I have found that for the
> streaming transaction an "is_schema_sent" flag is kept in
> ReorderBufferTXN. And, I think that is done so that we can send the
> schema for each transaction stream so that if any subtransaction gets
> aborted we don't lose the logical WAL for that schema. But, this
> solution has induced a very basic issue that if a transaction operate
> on more than 1 relation then after sending the schema for the first
> relation it will mark the flag true and the schema for the subsequent
> relations will never be sent.
>
How about keeping a list of top-level xids in each RelationSyncEntry?
Basically, whenever we send the schema for any transaction, we note
that in the RelationSyncEntry, and at abort time we can remove the xid
from the list. Now, whenever we check whether to send the schema for
any operation in a transaction, we will check if our xid is present in
that list for the particular RelationSyncEntry and take an action based
on that (if the xid is present, then we won't send the schema; otherwise,
we send it).
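A rough sketch of what I have in mind (all names are hypothetical, and
a plain fixed-size array is used purely for illustration; the actual
patch may prefer a List or a hash):

    #define MAX_STREAMED_XIDS 64    /* arbitrary, just for this sketch */

    typedef struct RelationSyncEntry
    {
        Oid             relid;              /* relation oid */
        bool            schema_sent;        /* for the non-streamed case */
        int             nstreamed_xids;
        TransactionId   streamed_xids[MAX_STREAMED_XIDS];
    } RelationSyncEntry;

    /* has the schema already been sent within this top-level xid? */
    static bool
    schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
    {
        int     i;

        for (i = 0; i < entry->nstreamed_xids; i++)
        {
            if (TransactionIdEquals(entry->streamed_xids[i], xid))
                return true;
        }
        return false;
    }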
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-10 04:53:19 |
Message-ID: | CAFiTN-u7tUT46-=8-yEkp9koJYipk3caXOz=7emZOTyqZfG7KQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > >
> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > I have rebased the patch on the latest head and also fix the issue of
> > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > the complete patch set. I have added the version number so that we
> > > > can track the changes.
> > >
> > > The patch has rotten a bit and does not apply anymore. Could you
> > > please send a rebased version? I have moved it to next CF, waiting on
> > > author.
> >
> > I have rebased the patch set on the latest head.
> >
> > Apart from this, there is one issue reported by my colleague Vignesh.
> > The issue is that if we use more than two relations in a transaction
> > then there is an error on standby (no relation map entry for remote
> > relation ID 16390). After analyzing I have found that for the
> > streaming transaction an "is_schema_sent" flag is kept in
> > ReorderBufferTXN. And, I think that is done so that we can send the
> > schema for each transaction stream so that if any subtransaction gets
> > aborted we don't lose the logical WAL for that schema. But, this
> > solution has induced a very basic issue that if a transaction operate
> > on more than 1 relation then after sending the schema for the first
> > relation it will mark the flag true and the schema for the subsequent
> > relations will never be sent.
> >
>
> How about keeping a list of top-level xids in each RelationSyncEntry?
> Basically, whenever we send the schema for any transaction, we note
> that in RelationSyncEntry and at abort time we can remove xid from the
> list. Now, whenever, we check whether to send schema for any
> operation in a transaction, we will check if our xid is present in
> that list for a particular RelationSyncEntry and take an action based
> on that (if xid is present, then we won't send the schema, otherwise,
> send it).
The idea makes sense to me. I will try to write a patch for this and test it.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-11 11:52:28 |
Message-ID: | CAA4eK1J_YwXrenOjEVF-uz0+2YVdQD2e8mOsyvSfiOXkhjzMtA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have reviewed the patch set and here are a few comments/questions:
>
> 1.
> +static void
> +pg_decode_stream_change(LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + Relation relation,
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
>
> Should we show the tuple in the streamed change like we do for the
> pg_decode_change?
>
I think so. The patch shows the message in
pg_decode_stream_message(), so why prohibit showing the tuple here?
> 2. pg_logical_slot_get_changes_guts
> It recreate the decoding slot [ctx =
> CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> to false, should we pass a parameter to
> pg_logical_slot_get_changes_guts saying whether we want streamed results or not
>
CreateDecodingContext internally calls StartupDecodingContext, which
sets the value of streaming based on whether the plugin has provided
callbacks for the streaming functions. Isn't that sufficient? Why do we
need additional parameters here?
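For reference, that logic amounts to something like the following (the
exact callback set is per my reading of the 0003 patch, so treat it as
approximate):

    /* streaming is enabled iff the plugin registered the stream callbacks */
    ctx->streaming = (ctx->callbacks.stream_start_cb != NULL &&
                      ctx->callbacks.stream_stop_cb != NULL &&
                      ctx->callbacks.stream_abort_cb != NULL &&
                      ctx->callbacks.stream_commit_cb != NULL &&
                      ctx->callbacks.stream_change_cb != NULL);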
> 3.
> + XLogRecPtr prev_lsn = InvalidXLogRecPtr;
> ReorderBufferChange *change;
> ReorderBufferChange *specinsert = NULL;
>
> @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> Relation relation = NULL;
> Oid reloid;
>
> + /*
> + * Enforce correct ordering of changes, merged from multiple
> + * subtransactions. The changes may have the same LSN due to
> + * MULTI_INSERT xlog records.
> + */
> + if (prev_lsn != InvalidXLogRecPtr)
> + Assert(prev_lsn <= change->lsn);
> +
> + prev_lsn = change->lsn;
> I did not understand how this change is relevant to this patch.
>
This is just to ensure that changes are in LSN order. I think as we
are merging the changes before commit for streaming, it is good to
have such an assertion for ReorderBufferStreamTXN. And if we want
to have it in ReorderBufferStreamTXN, then there is no harm in keeping
it in ReorderBufferCommit() as well, at least to keep the code
consistent. Do you see any problem with this?
> 4.
> + /*
> + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> + * information about subtransactions, which could arrive after streaming start.
> + */
> + if (!txn->is_schema_sent)
> + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> + txn, command_id);
>
> In which case, txn->is_schema_sent will be true, because at the end of
> the stream in ReorderBufferExecuteInvalidations we are always setting
> it false,
> so while sending next stream it will always be false. That means we
> never required snapshot_now variable in ReorderBufferTXN.
>
You are probably right, but as discussed we need to change this part
of design/code (when to send schema changes) due to the issues
discovered. So, I think this part will anyway change when we fix that
problem.
> 5.
> @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> *rb, TransactionId xid,
> txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
>
> txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> +
> + /*
> + * We read catalog changes from WAL, which are not yet sent, so
> + * invalidate current schema in order output plugin can resend
> + * schema again.
> + */
> + txn->is_schema_sent = false;
>
> Same as point 4, during decode time it will never be true.
>
Sure, my previous point's reply applies here as well.
> 6.
> + /* send fields */
> + pq_sendint64(out, commit_lsn);
> + pq_sendint64(out, txn->end_lsn);
> + pq_sendint64(out, txn->commit_time);
>
> Commit_time and end_lsn is used in standby_feedback
>
I don't understand what you mean by this. Can you be a bit more clear?
>
> 7.
> + /* FIXME optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> We cannot roll back an intermediate subtransaction without rolling back the
> latest subtransaction, so why do we need
> to search the array? It will always be the last subxact, no?
>
The same thing is already mentioned in the comments above this code
("XXX Or perhaps we can rely on the aborts to arrive in the reverse
order, i.e. from the inner-most subxact (when nested)? In which case
we could simply check the last element."). I think what you are
saying is probably right, but we can leave this as it is for now
because this is a minor optimization which can be done later as well
if required. However, if you see any correctness issue, then we can
discuss.
> 8.
> + /*
> + * send feedback to upstream
> + *
> + * XXX Probably should send a valid LSN. But which one?
> + */
> + send_feedback(InvalidXLogRecPtr, false, false);
>
> Why feedback is sent for every change?
>
I will study this part of the patch and let you know my opinion.
Few comments on this patch series:
0001-Immediately-WAL-log-assignments:
------------------------------------------------------------
The commit message still refers to the old design for this patch. I
think you need to modify the commit message as per the latest patch.
0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);
You have removed logging for the above cache but forgot to remove its
reference from one of the places. Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others.
0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------
1.
+ are required, while <function>stream_message_cb</function> and
+ <function>stream_message_cb</function> are optional.
stream_message_cb is mentioned twice. It seems the second one is for truncate.
2.
size of the transaction size and network bandwidth, the transfer time
+ may significantly increase the apply lag.
/size of the transaction size/size of the transaction
no need to mention size twice.
3.
+ Similarly to spill-to-disk behavior, streaming is triggered when the total
+ amount of changes decoded from the WAL (for all in-progress
transactions)
+ exceeds limit defined by <varname>logical_work_mem</varname> setting.
The GUC name used is wrong. /Similarly to/Similar to/
4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}
See if we can fix these, and similar ones in the callback for the stop. I
think we don't have final_lsn till we commit/abort. Can we compute
it before calling these APIs?
0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
PG_CATCH();
{
/* TODO: Encapsulate cleanup
from the PG_TRY and PG_CATCH blocks */
+
if (iterstate)
ReorderBufferIterTXNFinish(rb, iterstate);
Spurious line change.
2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.
0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------
1.
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState *ReorderBufferStreamIterTXNInit(ReorderBuffer *rb,
+                                                                       ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(ReorderBuffer *rb,
+                                                           ReorderBufferStreamIterTXNState *state);
+
+static void ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+                                             ReorderBufferStreamIterTXNState *state);
Do we really need to introduce new APIs for iterating over changes
from streamed transactions? Why can't we reuse the same API's as we
use for committed xacts?
2.
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
Please write some comments atop ReorderBufferStreamCommit.
3.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
..
+ if (txn->snapshot_now == NULL)
+ {
+     dlist_iter subxact_i;
+
+     /* make sure this transaction is streamed for the first time */
+     Assert(!rbtxn_is_streamed(txn));
+
+     /* at the beginning we should have invalid command ID */
+     Assert(txn->command_id == InvalidCommandId);
+
+     dlist_foreach(subxact_i, &txn->subtxns)
+     {
+         ReorderBufferTXN *subtxn;
+
+         subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+         if (subtxn->base_snapshot != NULL &&
+             (txn->base_snapshot == NULL ||
+              txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+         {
+             txn->base_snapshot = subtxn->base_snapshot;
The logic here seems to be correct, but I am not sure why purging the
base snapshot before assigning the subtxn's snapshot was not
considered, and similarly why we have not purged the snapshot for the
subtxn once we are done with it. I think we can use
ReorderBufferTransferSnapToParent to replace part of the logic here.
Do you see any reason for doing things differently here?
4. In ReorderBufferStreamTXN, why do you need to use
ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now?
5. I see a lot of code similarity between ReorderBufferStreamTXN and
the existing ReorderBufferCommit. I understand that there are some subtle
differences due to which we need to write this new function, but can't
we encapsulate the specific parts of the code in functions and then call
them from both places? I am talking about the code in the different
cases for change->action.
6. + * Note: We never stream and serialize a transaction at the same time (e
/(e/(we
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-11 18:16:28 |
Message-ID: | CA+TgmoYH6N_YDvKH9AaAJo5ZTHn142K=B75VO9yKvjjjHcoZhA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I have rebased the patch set on the latest head.
0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.
With regard to 0002, logging a separate WAL record for each
invalidation seems painful; I think most operations that generate
invalidations generate a bunch of them all at once. Perhaps you could
just queue up invalidations as they happen, and then force anything
that's been queued up to be emitted into WAL just before you emit any
WAL record that might need to be decoded.
Regarding 0005, it seems to me that this is no good:
+ errmsg("improper heap_getnext call")));
I think we should be using elog() rather than ereport() here, because
this should only happen if there's a bug in a logical decoding plugin.
At first, I thought maybe this should just be an Assert(), but since
there are third-party logical decoding plugins available, checking
this even in non-assert builds seems like a good idea. However, I
think making it translatable is overkill; users should never see this,
only developers.
I also think that the message is really bad, because it just tells you
that you did something bad. It gives no inkling as to why it was bad.
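Something along these lines, perhaps (the exact wording and the
invariant it names would need to come from whoever understands the
check; this is just the shape I mean):

    /* plugin bug, not a user-facing condition: elog(), untranslated,
     * and say which rule was violated */
    elog(ERROR, "improper heap_getnext call: catalog scans during logical decoding must go through the systable_* routines");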
0006 contains lots of XXX comments that look like real issues. I guess
those need to be fixed. Also, why don't we do the thing that the
commit message for 0006 says we could "theoretically" do? I don't
understand why we need the k-way merge at all.
+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);
There is no reason to ever write an if statement that contains only an
Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
|| prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
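That is:

    Assert(XLogRecPtrIsInvalid(prev_lsn) || prev_lsn <= change->lsn);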
The purpose and mechanism of the is_schema_sent flag is not clear to
me. The word "schema" here seems to be being used to mean "snapshot,"
which is rather confusing.
I'm also somewhat unclear on what's happening here with invalidations.
Perhaps that's as much a defect in my understanding as it is
reflective of any problem with the patch, but I also don't see any
comments either in 0002 or later patches explaining the theory of
operation. If I've missed some, please point me in the right
direction. Hypothetically speaking, it seems to me that if you just
did InvalidateSystemCaches() every time the snapshot changed, you
wouldn't need anything else (unless we're concerned with
non-transactional invalidation messages like smgr and relmapper
invalidations; not quite sure how those are handled). And, on the
other hand, if we don't do InvalidateSystemCaches() every time the
snapshot changes, then I don't understand why this works now, even
without streaming.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-12 04:14:50 |
Message-ID: | CAFiTN-twgco+2XcmVz=T-ZNtgOwc5vKEtBo_KLGHZnh3CqwJzA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have reviewed the patch set and here are a few comments/questions:
> >
> > 1.
> > +static void
> > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > + ReorderBufferTXN *txn,
> > + Relation relation,
> > + ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> >
> > Should we show the tuple in the streamed change like we do for the
> > pg_decode_change?
> >
>
> I think so. The patch shows the message in
> pg_decode_stream_message(), so why to prohibit showing tuple here?
>
> > 2. pg_logical_slot_get_changes_guts
> > It recreate the decoding slot [ctx =
> > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> > to false, should we pass a parameter to
> > pg_logical_slot_get_changes_guts saying whether we want streamed results or not
> >
>
> CreateDecodingContext internally calls StartupDecodingContext which
> sets the value of streaming based on if the plugin has provided
> callbacks for streaming functions. Isn't that sufficient? Why do we
> need additional parameters here?
I don't think that we should stream just because the plugin provides
the streaming functions. For example, the pgoutput plugin provides the
streaming functions, but we only stream if streaming is turned on in
the CREATE SUBSCRIPTION command. So I feel that should be true for any plugin.
>
> > 3.
> > + XLogRecPtr prev_lsn = InvalidXLogRecPtr;
> > ReorderBufferChange *change;
> > ReorderBufferChange *specinsert = NULL;
> >
> > @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > Relation relation = NULL;
> > Oid reloid;
> >
> > + /*
> > + * Enforce correct ordering of changes, merged from multiple
> > + * subtransactions. The changes may have the same LSN due to
> > + * MULTI_INSERT xlog records.
> > + */
> > + if (prev_lsn != InvalidXLogRecPtr)
> > + Assert(prev_lsn <= change->lsn);
> > +
> > + prev_lsn = change->lsn;
> > I did not understand how this change is relevant to this patch.
> >
>
> This is just to ensure that changes are in LSN order. I think as we
> are merging the changes before commit for streaming, it is good to
> have such an Assertion for ReorderBufferStreamTXN. And, if we want
> to have it in ReorderBufferStreamTXN, then there is no harm in keeping
> it in ReorderBufferCommit() at least to keep the code consistent. Do
> you see any problem with this?
I am fine with this.
>
> > 4.
> > + /*
> > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > + * information about subtransactions, which could arrive after streaming start.
> > + */
> > + if (!txn->is_schema_sent)
> > + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > + txn, command_id);
> >
> > In which case, txn->is_schema_sent will be true, because at the end of
> > the stream in ReorderBufferExecuteInvalidations we are always setting
> > it false,
> > so while sending next stream it will always be false. That means we
> > never required snapshot_now variable in ReorderBufferTXN.
> >
>
> You are probably right, but as discussed we need to change this part
> of design/code (when to send schema changes) due to the issues
> discovered. So, I think this part will anyway change when we fix that
> problem.
Makes sense.
>
> > 5.
> > @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > *rb, TransactionId xid,
> > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> >
> > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > +
> > + /*
> > + * We read catalog changes from WAL, which are not yet sent, so
> > + * invalidate current schema in order output plugin can resend
> > + * schema again.
> > + */
> > + txn->is_schema_sent = false;
> >
> > Same as point 4, during decode time it will never be true.
> >
>
> Sure, my previous point's reply applies here as well.
ok
>
> > 6.
> > + /* send fields */
> > + pq_sendint64(out, commit_lsn);
> > + pq_sendint64(out, txn->end_lsn);
> > + pq_sendint64(out, txn->commit_time);
> >
> > Commit_time and end_lsn is used in standby_feedback
> >
>
> I don't understand what you mean by this. Can you be a bit more clear?
I think I pasted it here by mistake; just ignore it.
>
> >
> > 7.
> > + /* FIXME optimize the search by bsearch on sorted data */
> > + for (i = nsubxacts; i > 0; i--)
> > + {
> > + if (subxacts[i - 1].xid == subxid)
> > + {
> > + subidx = (i - 1);
> > + found = true;
> > + break;
> > + }
> > + }
> > We cannot roll back an intermediate subtransaction without rolling back the
> > latest subtransaction, so why do we need
> > to search the array? It will always be the last subxact, no?
> >
>
> The same thing is already mentioned in the comments above this code
> ("XXX Or perhaps we can rely on the aborts to arrive in the reverse
> order, i.e. from the inner-most subxact (when nested)? In which case
> we could simply check the last element."). I think what you are
> saying is probably right, but we can leave this as it is for now
> because this is a minor optimization which can be done later as well
> if required. However, if you see any correctness issue, then we can
> discuss.
I think more than an optimization, here we have the question of whether
this loop is required at all. By optimizing it we are not adding
complexity; in fact, it will be simpler. I think what we need here is
more analysis of whether we need to traverse the array at all.
So maybe for the time being we can leave this as it is.
>
> > 8.
> > + /*
> > + * send feedback to upstream
> > + *
> > + * XXX Probably should send a valid LSN. But which one?
> > + */
> > + send_feedback(InvalidXLogRecPtr, false, false);
> >
> > Why feedback is sent for every change?
> >
>
> I will study this part of the patch and let you know my opinion.
Sure.
>
> Few comments on this patch series:
>
> 0001-Immediately-WAL-log-assignments:
> ------------------------------------------------------------
>
> The commit message still refers to the old design for this patch. I
> think you need to modify the commit message as per the latest patch.
>
> 0002-Issue-individual-invalidations-with-wal_level-log
> ----------------------------------------------------------------------------
> 1.
> xact_desc_invalidations(StringInfo buf,
> {
> ..
> + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
>
> You have removed logging for the above cache but forgot to remove its
> reference from one of the places. Also, I think you need to add a
> comment somewhere in inval.c to say why you are writing for WAL for
> some types of invalidations and not for others?
>
> 0003-Extend-the-output-plugin-API-with-stream-methods
> --------------------------------------------------------------------------------
> 1.
> + are required, while <function>stream_message_cb</function> and
> + <function>stream_message_cb</function> are optional.
>
> stream_message_cb is mentioned twice. It seems the second one is for truncate.
>
> 2.
> size of the transaction size and network bandwidth, the transfer time
> + may significantly increase the apply lag.
>
> /size of the transaction size/size of the transaction
>
> no need to mention size twice.
>
> 3.
> + Similarly to spill-to-disk behavior, streaming is triggered when the total
> + amount of changes decoded from the WAL (for all in-progress
> transactions)
> + exceeds limit defined by <varname>logical_work_mem</varname> setting.
>
> The guc name used is wrong. /Similarly to/Similar to/
>
> 4.
> stream_start_cb_wrapper()
> {
> ..
> + /* state.report_location = apply_lsn; */
> ..
> + /* FIXME ctx->write_location = apply_lsn; */
> ..
> }
>
> See, if we can fix these and similar in the callback for the stop. I
> think we don't have final_lsn till we commit/abort. Can we compute
> before calling these API's?
>
>
> 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> ----------------------------------------------------------------------------------
> 1.
> @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> PG_CATCH();
> {
> /* TODO: Encapsulate cleanup
> from the PG_TRY and PG_CATCH blocks */
> +
> if (iterstate)
> ReorderBufferIterTXNFinish(rb, iterstate);
>
> Spurious line change.
>
> 2. The commit message of this patch refers to Prepared transactions.
> I think that needs to be changed.
>
> 0006-Implement-streaming-mode-in-ReorderBuffer
> -------------------------------------------------------------------------
> 1.
> +/* iterator for streaming (only get data from memory) */
> +static ReorderBufferStreamIterTXNState *ReorderBufferStreamIterTXNInit(ReorderBuffer *rb,
> +                                                                       ReorderBufferTXN *txn);
> +
> +static ReorderBufferChange *ReorderBufferStreamIterTXNNext(ReorderBuffer *rb,
> +                                                           ReorderBufferStreamIterTXNState *state);
> +
> +static void ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
> +                                             ReorderBufferStreamIterTXNState *state);
>
> Do we really need to introduce new APIs for iterating over changes
> from streamed transactions? Why can't we reuse the same API's as we
> use for committed xacts?
>
> 2.
> +static void
> +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
>
> Please write some comments atop ReorderBufferStreamCommit.
>
> 3.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> ..
> + if (txn->snapshot_now == NULL)
> + {
> +     dlist_iter subxact_i;
> +
> +     /* make sure this transaction is streamed for the first time */
> +     Assert(!rbtxn_is_streamed(txn));
> +
> +     /* at the beginning we should have invalid command ID */
> +     Assert(txn->command_id == InvalidCommandId);
> +
> +     dlist_foreach(subxact_i, &txn->subtxns)
> +     {
> +         ReorderBufferTXN *subtxn;
> +
> +         subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> +
> +         if (subtxn->base_snapshot != NULL &&
> +             (txn->base_snapshot == NULL ||
> +              txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
> +         {
> +             txn->base_snapshot = subtxn->base_snapshot;
>
> The logic here seems to be correct, but I am not sure why it is not
> considered to purge the base snapshot before assigning the subtxn's
> snapshot and similarly, we have not purged snapshot for subtxn once we
> are done with it. I think we can use
> ReorderBufferTransferSnapToParent to replace part of the logic here.
> Do you see any reason for doing things differently here?
>
> 4. In ReorderBufferStreamTXN, why do you need to use
> ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.
>
> 5. I see a lot of code similarity in ReorderBufferStreamTXN and
> existing ReorderBufferCommit. I understand that there are some subtle
> differences due to which we need to write this new function but can't
> we encapsulate the specific parts of code in functions and then call
> from both places. I am talking about code in different cases for
> change->action.
>
> 6. + * Note: We never stream and serialize a transaction at the same time (e
> /(e/(we
>
I will look into these comments and reply separately.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-12 11:41:33 |
Message-ID: | CAA4eK1KFPT6dsOenSfHORYcYvd1PVGxs_695kED=-xLfHGUw8A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > I have rebased the patch set on the latest head.
>
> 0001 looks like a clever approach, but are you sure it doesn't hurt
> performance when many small XLOG records are being inserted? I think
> XLogRecordAssemble() can get pretty hot in some workloads.
>
I don't think we have evaluated it yet, but we should do it. The
point to note is that it is only for the case when wal_level is
'logical' (see IsSubTransactionAssignmentPending) in which case we
already log more WAL, so this might not impact much. I guess that it
might be better to have that check in XLogRecordAssemble for the sake
of clarity.
>
> Regarding 0005, it seems to me that this is no good:
>
> + errmsg("improper heap_getnext call")));
>
> I think we should be using elog() rather than ereport() here, because
> this should only happen if there's a bug in a logical decoding plugin.
> At first, I thought maybe this should just be an Assert(), but since
> there are third-party logical decoding plugins available, checking
> this even in non-assert builds seems like a good idea. However, I
> think making it translatable is overkill; users should never see this,
> only developers.
>
Makes sense. I think we should change it.
>
> + if (prev_lsn != InvalidXLogRecPtr)
> + Assert(prev_lsn <= change->lsn);
>
> There is no reason to ever write an if statement that contains only an
> Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
> || prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
>
Agreed.
> The purpose and mechanism of the is_schema_sent flag is not clear to
> me. The word "schema" here seems to be being used to mean "snapshot,"
> which is rather confusing.
>
I have explained this flag below along with invalidations as both are
slightly related.
> I'm also somewhat unclear on what's happening here with invalidations.
> Perhaps that's as much a defect in my understanding as it is
> reflective of any problem with the patch, but I also don't see any
> comments either in 0002 or later patches explaining the theory of
> operation. If I've missed some, please point me in the right
> direction. Hypothetically speaking, it seems to me that if you just
> did InvalidateSystemCaches() every time the snapshot changed, you
> wouldn't need anything else (unless we're concerned with
> non-transactional invalidation messages like smgr and relmapper
> invalidations; not quite sure how those are handled). And, on the
> other hand, if we don't do InvalidateSystemCaches() every time the
> snapshot changes, then I don't understand why this works now, even
> without streaming.
>
I think the way invalidations work for logical replication is that
normally, we always start a new transaction before decoding each
commit which allows us to accept the invalidations (via
AtStart_Cache). However, if there are catalog changes within the
transaction being decoded, we need to reflect those before trying to
decode the WAL of operation which happened after that catalog change.
As we are not logging the WAL for each invalidation, we need to
execute all the invalidation messages for this transaction at each
catalog change. We are able to do that now as we decode the entire WAL
for a transaction only once we get the commit's WAL which contains all
the invalidation messages. So, we queue them up and execute them for
each catalog change which we identify by WAL record
XLOG_HEAP2_NEW_CID.
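For context, the relevant piece of the existing ReorderBufferCommit()
loop looks roughly like this (abridged, from memory, so not an exact
quote of the tree):

    case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
        Assert(change->data.command_id != InvalidCommandId);

        if (command_id < change->data.command_id)
        {
            command_id = change->data.command_id;
            snapshot_now->curcid = command_id;

            TeardownHistoricSnapshot(false);
            SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);

            /*
             * A new CommandId may expose new catalog contents, so
             * replay every invalidation queued for this transaction.
             */
            ReorderBufferExecuteInvalidations(rb, txn);
        }
        break;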
The second related concept is that before sending each change to the
downstream (via pgoutput), we check whether we need to send the
schema. We decide this based on the local map entry
(RelationSyncEntry), which indicates whether the schema for the
relation has already been sent or not. Once the schema of the relation is
sent, the entry for that relation in the map will indicate it. At the
time of invalidation processing we also blow away this map, so it always
reflects the correct state.
Now, to decode an in-progress transaction, we need to ensure that we
have received the WAL for all the invalidations before decoding the
WAL of action that happened immediately after that catalog change.
This is the reason we started WAL logging individual Invalidations.
So, with this change we don't need to execute all the invalidations
for each catalog change, rather execute them as and when their WAL is
being decoded.
The current mechanism to send schema changes won't work for streaming
transactions because after sending the change, the subtransaction might
abort. On subtransaction abort, the downstream will simply discard
the changes, and we will thereby lose the previous schema change sent. There
is no such problem currently because we process all the aborts before
sending any change. So, the current idea of having a schema_sent flag
in each map entry (RelationSyncEntry) won't work for streaming
transactions. To solve this problem, the patch initially kept a flag
'is_schema_sent' for each top-level transaction (in ReorderBufferTXN)
so that we can always send a schema for each (sub)transaction for
streaming transactions, but that won't work if we access multiple
relations in the same subtransaction. To solve this problem, we are
thinking of keeping a list/array of top-level xids in each
RelationSyncEntry. Basically, whenever we send the schema for any
transaction, we note that in the RelationSyncEntry, and at abort/commit
time we can remove the xid from the list. Now, whenever we check whether
to send the schema for any operation in a transaction, we will check if
our xid is present in that list for the particular RelationSyncEntry and
take an action based on that (if the xid is present, then we won't send
the schema; otherwise, we send it). I think during decoding we should not
have that many open transactions, so the search in the array should be
cheap enough, but we can consider some other data structure like a hash
as well.
I will think some more and respond to your remaining comments/suggestions.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-13 09:16:20 |
Message-ID: | CAA4eK1LOa+2KqNX=m=1qMBDW+o50AuwjAOX6ZqL-rWGiH1F9MQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > I have rebased the patch set on the latest head.
>
> 0001 looks like a clever approach, but are you sure it doesn't hurt
> performance when many small XLOG records are being inserted? I think
> XLogRecordAssemble() can get pretty hot in some workloads.
>
> With regard to 0002, logging a separate WAL record for each
> invalidation seems painful; I think most operations that generate
> invalidations generate a bunch of them all at once. Perhaps you could
> just queue up invalidations as they happen, and then force anything
> that's been queued up to be emitted into WAL just before you emit any
> WAL record that might need to be decoded.
>
I feel we can log the invalidations of the entire command in one go if
we log at CommandEndInvalidationMessages. We already have all the
invalidations of the current command in
transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of
maintaining a new separate list/queue for invalidations and, to a good
extent, it will ameliorate your concern about logging each invalidation
separately.
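A minimal sketch of that idea (the record type and the helper that
flattens CurrentCmdInvalidMsgs into an array are both hypothetical
here):

    /* at the end of CommandEndInvalidationMessages(), wal_level=logical only */
    if (XLogLogicalInfoActive())
    {
        SharedInvalidationMessage *msgs;
        int         nmsgs;

        /* hypothetical helper returning a flat copy of CurrentCmdInvalidMsgs */
        nmsgs = xactGetCurrentCmdInvalidMessages(&msgs);

        if (nmsgs > 0)
        {
            XLogBeginInsert();
            XLogRegisterData((char *) msgs,
                             nmsgs * sizeof(SharedInvalidationMessage));
            XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);  /* new record */
        }
    }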
>
> 0006 contains lots of XXX comments that look like real issues. I guess
> those need to be fixed. Also, why don't we do the thing that the
> commit message for 0006 says we could "theoretically" do? I don't
> understand why we need the k-way merge at all,
>
I think we can do what is written in the commit message, but then we
need to maintain two paths (one for streaming contexts and another for
non-streaming contexts), unless we want to entirely get rid of storing
subtransaction changes separately, which seems like a more fundamental
change. Right now such duplication is also there to some extent, but I
have already given a comment to minimize it. Having said that, I
think we can go either way. I think the original intention was to
avoid doing more stuff unless it is really required, as this is already
a big patch set, but maybe Tomas has a different idea about this.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-13 10:05:05 |
Message-ID: | CAA4eK1Kf=XHnGpSV-c9iwQ9T=ms1F4e0GpuDXoyL+g++XUGkjg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Dec 12, 2019 at 9:45 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > I have reviewed the patch set and here are a few comments/questions:
> > >
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > Should we show the tuple in the streamed change like we do for the
> > > pg_decode_change?
> > >
> >
> > I think so. The patch shows the message in
> > pg_decode_stream_message(), so why to prohibit showing tuple here?
> >
> > > 2. pg_logical_slot_get_changes_guts
> > > It recreate the decoding slot [ctx =
> > > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> > > to false, should we pass a parameter to
> > > pg_logical_slot_get_changes_guts saying whether we want streamed results or not
> > >
> >
> > CreateDecodingContext internally calls StartupDecodingContext which
> > sets the value of streaming based on if the plugin has provided
> > callbacks for streaming functions. Isn't that sufficient? Why do we
> > need additional parameters here?
>
> I don't think that if plugin provides streaming function then we
> should stream. Like pgoutput plugin provides streaming function but
> we only stream if streaming is on in create subscription command. So
> I feel that should be true with any plugin.
>
How about adding a new boolean parameter (streaming) in
pg_create_logical_replication_slot()?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-20 06:17:08 |
Message-ID: | CA+fd4k5KBRw5O45D32=zR_zgU-fMhD_0iudZwm-MYY1P2bUZ7Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set. I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore. Could you
> > please send a rebased version? I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
Thank you for working on this.
This might have already been discussed, but I have a question about the
changes to the logical replication worker. In the current logical
replication there is a problem that the response time is doubled when
using synchronous replication, because wal senders send changes only after
commit. It's worse especially when a transaction makes a lot of
changes. So I expected this feature to reduce the response time by
sending changes even while the transaction is in progress, but it
doesn't seem to. The logical replication worker writes changes to
temporary files and applies these changes when the worker receives the
commit record (STREAM COMMIT). Since the worker sends the LSN of the
commit record as the flush LSN to the publisher after applying all
changes, the publisher must wait until all changes are applied on the
subscriber. Another problem would be that the worker doesn't receive
changes while applying the changes of other transactions. These things
make me think it's better to have a new worker dedicated to applying
changes, like we have the wal receiver process and the startup process.
Maybe we can have 2 workers (receiver and applier) per subscription.
Any thoughts?
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | amit(dot)kapila16(at)gmail(dot)com |
Cc: | robertmhaas(at)gmail(dot)com, dilipbalaut(at)gmail(dot)com, michael(at)paquier(dot)xyz, tomas(dot)vondra(at)2ndquadrant(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-20 08:29:21 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello.
At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in
> On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > I have rebased the patch set on the latest head.
> >
> > 0001 looks like a clever approach, but are you sure it doesn't hurt
> > performance when many small XLOG records are being inserted? I think
> > XLogRecordAssemble() can get pretty hot in some workloads.
> >
> > With regard to 0002, logging a separate WAL record for each
> > invalidation seems painful; I think most operations that generate
> > invalidations generate a bunch of them all at once. Perhaps you could
> > just queue up invalidations as they happen, and then force anything
> > that's been queued up to be emitted into WAL just before you emit any
> > WAL record that might need to be decoded.
> >
>
> I feel we can log the invalidations of the entire command at one go if
> we log at CommandEndInvalidationMessages. We already have all the
> invalidations of current command in
> transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of
> maintaining a new separate list/queue for invalidations and to a good
> extent, it will ameliorate your concern of logging each invalidation
> separately.
I have a question on this. Does that mean that the current logical
decoder (or reorderbuffer) may emit incorrect results if the
transaction being decoded made a catalog change? If so,
this is not a feature but a bug fix.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-20 13:30:21 |
Message-ID: | CAA4eK1KOzF6wF3AyQENs-JYNOL0oPMhxMXbOMMZBqjJm7aAjnQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > >
> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > I have rebased the patch on the latest head and also fix the issue of
> > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > the complete patch set. I have added the version number so that we
> > > > can track the changes.
> > >
> > > The patch has rotten a bit and does not apply anymore. Could you
> > > please send a rebased version? I have moved it to next CF, waiting on
> > > author.
> >
> > I have rebased the patch set on the latest head.
>
> Thank you for working on this.
>
> This might have already been discussed, but I have a question about
> the changes to the logical replication worker. In the current logical
> replication there is a problem that the response time is doubled when
> using synchronous replication, because wal senders send changes only
> after commit. It's especially bad when a transaction makes a lot of
> changes. So I expected this feature to reduce the response time by
> sending changes even while the transaction is in progress, but it
> doesn't seem to. The logical replication worker writes changes to
> temporary files and applies them once it has received the commit
> record (STREAM COMMIT). Since the worker sends the LSN of the commit
> record as the flush LSN to the publisher only after applying all
> changes, the publisher must wait until all changes are applied on the
> subscriber.
>
The main aim of this feature is to reduce apply lag. Because if we
send all the changes together, it can delay their apply because of
network delay, whereas if most of the changes are already sent, then
we will save the effort of sending the entire data at commit time.
This in itself gives us decent benefits. Sure, we can further improve
it by having separate workers (dedicated to applying the changes) as
you are suggesting, and in fact, there is a patch for that as well
(see the performance results and bgworker patch at [1]), but if we
try to shove all the things in one go, then it will be difficult to
get this patch committed (there are already enough things, and the
patch is quite big, so getting it right takes a lot of energy). So
the plan is something like this: first we get the basic feature, and
then we try to improve it by having dedicated workers or things like
that. Does this make sense to you?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-20 13:42:27 |
Message-ID: | CAA4eK1LpJu=+AMZw56ovtw3m-zF3yH=UGgdnmE10MAgt+e-JmA@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Dec 20, 2019 at 2:00 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> Hello.
>
> At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in
> > On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > I have rebased the patch set on the latest head.
> > >
> > > 0001 looks like a clever approach, but are you sure it doesn't hurt
> > > performance when many small XLOG records are being inserted? I think
> > > XLogRecordAssemble() can get pretty hot in some workloads.
> > >
> > > With regard to 0002, logging a separate WAL record for each
> > > invalidation seems painful; I think most operations that generate
> > > invalidations generate a bunch of them all at once. Perhaps you could
> > > just queue up invalidations as they happen, and then force anything
> > > that's been queued up to be emitted into WAL just before you emit any
> > > WAL record that might need to be decoded.
> > >
> >
> > I feel we can log the invalidations of the entire command at one go if
> > we log at CommandEndInvalidationMessages. We already have all the
> > invalidations of current command in
> > transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of
> > maintaining a new separate list/queue for invalidations and to a good
> > extent, it will ameliorate your concern of logging each invalidation
> > separately.
>
> I have a question on this. Does that mean that the current logical
> decoder (or reorderbuffer)
>
What does "current" refer to here? Is it about HEAD or about the
patch? Without the patch, we decode only at commit time, and by that
time we have all invalidations (logged with the commit WAL record), so
we just execute them at each catalog change (see the actions in
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID). The patch has to
separately WAL-log each invalidation because we can decode the
intermediate changes, so we can't wait till commit. The above is just
an optimization for the patch. AFAIK, there is no correctness issue
here, but let me know if you see any.
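To illustrate, the optimization above would amount to roughly the
following at the end of CommandEndInvalidationMessages() (an untested
sketch; LogLogicalInvalidations is the new function introduced by the
0002 patch, and XLogLogicalInfoActive() just tests wal_level >=
logical):

void
CommandEndInvalidationMessages(void)
{
	/* Quietly return if there is no transaction-local state. */
	if (transInvalInfo == NULL)
		return;

	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
								LocalExecuteInvalidationMessage);

	/* WAL-log all invalidations of this command in a single record. */
	if (XLogLogicalInfoActive())
		LogLogicalInvalidations();

	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
							   &transInvalInfo->CurrentCmdInvalidMsgs);
}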
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | vignesh C <vignesh21(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-22 11:34:19 |
Message-ID: | CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set. I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore. Could you
> > please send a rebased version? I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>
Few comments:
The assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+ int64 subidx;
+ bool found = false;
+ char path[MAXPGPATH];
+
+ subidx = -1;
+ subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ /* We should not receive aborts for unknown subtransactions. */
+ Assert(found);
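The FIXME above could probably be a plain binary search; an untested
sketch, assuming the subxacts array is kept sorted by xid (and
ignoring xid wraparound for simplicity):

static int
subxact_find(SubXactInfo *subxacts, int nsubxacts, TransactionId subxid)
{
	int			lo = 0;
	int			hi = nsubxacts - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (subxacts[mid].xid == subxid)
			return mid;
		else if (subxacts[mid].xid < subxid)
			lo = mid + 1;
		else
			hi = mid - 1;
	}

	return -1;					/* not found */
}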
Add the typedefs below to typedefs.lst; this is common across the patches:
xl_xact_invalidations, ReorderBufferStreamIterTXNEntry,
ReorderBufferStreamIterTXNState, SubXactInfo
"are written" appears twice in commit message of
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
The individual invalidations are written are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.
Patch v2-0002-Issue-individual-invalidations-with-wal_level-log.patch
does not compile by itself:
reorderbuffer.c:1822:9: error: ‘ReorderBufferTXN’ has no member named
‘is_schema_sent’
+
LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ txn->is_schema_sent = false;
+ break;
Should we include printing of the id here, like in the earlier cases, in
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
+ appendStringInfo(buf, " relcache %u", msg->rc.relId);
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALSMGR_ID)
+ appendStringInfoString(buf, " smgr");
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALRELMAP_ID)
+ appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
There is some code duplication in stream_change_cb_wrapper,
stream_truncate_cb_wrapper, stream_message_cb_wrapper,
stream_abort_cb_wrapper, stream_commit_cb_wrapper,
stream_start_cb_wrapper and stream_stop_cb_wrapper functions in
v2-0003-Extend-the-output-plugin-API-with-stream-methods.patch patch.
Should we have a separate function for common code?
Should we add a function header for AssertChangeLsnOrder in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
loop, can be checked only once:
+ dlist_foreach(iter, &txn->changes)
+ {
+ ReorderBufferChange *cur_change;
+
+ cur_change = dlist_container(ReorderBufferChange,
node, iter.cur);
+
+ Assert(txn->first_lsn != InvalidXLogRecPtr);
+ Assert(cur_change->lsn != InvalidXLogRecPtr);
+ Assert(txn->first_lsn <= cur_change->lsn);
Should we add a function header for ReorderBufferDestroyTupleCidHash in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ if (txn->tuplecid_hash != NULL)
+ {
+ hash_destroy(txn->tuplecid_hash);
+ txn->tuplecid_hash = NULL;
+ }
+}
+
Should we add a function header for ReorderBufferStreamCommit in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ /* we should only call this for previously streamed transactions */
+ Assert(rbtxn_is_streamed(txn));
+
+ ReorderBufferStreamTXN(rb, txn);
+
+ rb->stream_commit(rb, txn, txn->final_lsn);
+
+ ReorderBufferCleanupTXN(rb, txn);
+}
+
Should we add a function header for ReorderBufferCanStream in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ return ctx->streaming;
+}
patch v2-0008-Add-support-for-streaming-to-built-in-replication.patch
does not apply:
Hunk #18 FAILED at 2035.
Hunk #19 succeeded at 2199 (offset -16 lines).
1 out of 19 hunks FAILED -- saving rejects to file
src/backend/replication/logical/worker.c.rej
Header inclusion may not be required in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
*
*-------------------------------------------------------------------------
*/
+#include <sys/types.h>
+#include <unistd.h>
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | vignesh C <vignesh21(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-24 04:24:13 |
Message-ID: | CAA4eK1+ZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Dec 22, 2019 at 5:04 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> Few comments:
> assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
> v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
> + int64 subidx;
> + bool found = false;
> + char path[MAXPGPATH];
> +
> + subidx = -1;
> + subxact_info_read(MyLogicalRepWorker->subid, xid);
> +
> + /* FIXME optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> +
> + /* We should not receive aborts for unknown subtransactions. */
> + Assert(found);
>
We can use PG_USED_FOR_ASSERTS_ONLY for that variable.
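That is, something like:

	bool		found PG_USED_FOR_ASSERTS_ONLY = false;

which keeps the variable in all builds but silences the
unused-variable warning when assertions are disabled, so the #ifdef
is not needed.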
>
> Should we include printing of the id here, like in the earlier cases, in
> v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
> + appendStringInfo(buf, " relcache %u", msg->rc.relId);
> + /* not expected, but print something anyway */
> + else if (msg->id == SHAREDINVALSMGR_ID)
> + appendStringInfoString(buf, " smgr");
> + /* not expected, but print something anyway */
> + else if (msg->id == SHAREDINVALRELMAP_ID)
> + appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
>
I am not sure if this patch is logging these invalidations, so I am
not sure it makes sense to add more ids for the cases you are
referring to. However, if we change it to log all invalidations at
command end, as being discussed in this thread, then it might be
better to do what you are suggesting.
>
> Should we add a function header for AssertChangeLsnOrder in
> v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
> +static void
> +AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
>
> This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
> loop, can be checked only once:
> + dlist_foreach(iter, &txn->changes)
> + {
> + ReorderBufferChange *cur_change;
> +
> + cur_change = dlist_container(ReorderBufferChange,
> node, iter.cur);
> +
> + Assert(txn->first_lsn != InvalidXLogRecPtr);
> + Assert(cur_change->lsn != InvalidXLogRecPtr);
> + Assert(txn->first_lsn <= cur_change->lsn);
>
This makes sense to me. Another thing about this function: do we
really need the "ReorderBuffer *rb" parameter here?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-24 05:28:01 |
Message-ID: | CA+TgmoaKhE2DyowbXeaD-pvdLV8vx0kukJ1arA74+eY_969hNw@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I don't think we have evaluated it yet, but we should do it. The
> point to note is that it is only for the case when wal_level is
> 'logical' (see IsSubTransactionAssignmentPending) in which case we
> already log more WAL, so this might not impact much. I guess that it
> might be better to have that check in XLogRecordAssemble for the sake
> of clarity.
I don't think that this is really a valid argument. Just because we
have some overhead now doesn't mean that adding more won't hurt. Even
testing the wal_level costs a little something.
> I think the way invalidations work for logical replication is that
> normally, we always start a new transaction before decoding each
> commit which allows us to accept the invalidations (via
> AtStart_Cache). However, if there are catalog changes within the
> transaction being decoded, we need to reflect those before trying to
> decode the WAL of operation which happened after that catalog change.
> As we are not logging the WAL for each invalidation, we need to
> execute all the invalidation messages for this transaction at each
> catalog change. We are able to do that now as we decode the entire WAL
> for a transaction only once we get the commit's WAL which contains all
> the invalidation messages. So, we queue them up and execute them for
> each catalog change which we identify by WAL record
> XLOG_HEAP2_NEW_CID.
Thanks for the explanation. That makes sense. But, it's still true,
AFAICS, that instead of doing this stuff with logging invalidations
you could just InvalidateSystemCaches() in the cases where you are
currently applying all of the transaction's invalidations. That
approach might be worse than changing the way invalidations are
logged, but the two approaches deserve to be compared. One approach
has more CPU overhead and the other has more WAL overhead, so it's a
little hard to compare them, but it seems worth mulling over.
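For illustration, the alternative would look roughly like this at each
catalog change in ReorderBufferCommit() (a sketch of the idea, not a
recommendation):

	case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
		if (command_id < change->data.command_id)
		{
			command_id = change->data.command_id;

			/*
			 * Rather than replaying the transaction's queued
			 * invalidation messages, discard all caches wholesale:
			 * more CPU overhead, but no extra WAL.
			 */
			InvalidateSystemCaches();
		}
		break;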
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-24 05:46:55 |
Message-ID: | CA+fd4k4kCBVOFDGxKPen7A8SXvheodeE1=4VBdoKs9b4YvCewg@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > >
> > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > > I have rebased the patch on the latest head and also fix the issue of
> > > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > > the complete patch set. I have added the version number so that we
> > > > > can track the changes.
> > > >
> > > > The patch has rotten a bit and does not apply anymore. Could you
> > > > please send a rebased version? I have moved it to next CF, waiting on
> > > > author.
> > >
> > > I have rebased the patch set on the latest head.
> >
> > Thank you for working on this.
> >
> > This might have already been discussed, but I have a question about
> > the changes to the logical replication worker. In the current logical
> > replication there is a problem that the response time is doubled when
> > using synchronous replication, because wal senders send changes only
> > after commit. It's especially bad when a transaction makes a lot of
> > changes. So I expected this feature to reduce the response time by
> > sending changes even while the transaction is in progress, but it
> > doesn't seem to. The logical replication worker writes changes to
> > temporary files and applies them once it has received the commit
> > record (STREAM COMMIT). Since the worker sends the LSN of the commit
> > record as the flush LSN to the publisher only after applying all
> > changes, the publisher must wait until all changes are applied on the
> > subscriber.
> >
>
> The main aim of this feature is to reduce apply lag. Because if we
> send all the changes together, it can delay their apply because of
> network delay, whereas if most of the changes are already sent, then
> we will save the effort of sending the entire data at commit time.
> This in itself gives us decent benefits. Sure, we can further improve
> it by having separate workers (dedicated to applying the changes) as
> you are suggesting, and in fact, there is a patch for that as well
> (see the performance results and bgworker patch at [1]), but if we
> try to shove all the things in one go, then it will be difficult to
> get this patch committed (there are already enough things, and the
> patch is quite big, so getting it right takes a lot of energy). So
> the plan is something like this: first we get the basic feature, and
> then we try to improve it by having dedicated workers or things like
> that. Does this make sense to you?
>
Thank you for the explanation. The plan makes sense. But I think in
the current design it's a problem that the logical replication worker
doesn't receive changes (and doesn't check interrupts) while applying
committed changes, even if we don't have a worker dedicated to
applying. I think the worker should continue to receive changes and
save them to temporary files even while applying changes. Otherwise
the buffer would easily fill up and replication would get stuck.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-24 08:21:12 |
Message-ID: | CAA4eK1L-KYycdTYanqo3nDzw=XWvADOuerHtbBSnBiRejmE3Qg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > The main aim of this feature is to reduce apply lag. Because if we
> > send all the changes together, it can delay their apply because of
> > network delay, whereas if most of the changes are already sent, then
> > we will save the effort of sending the entire data at commit time.
> > This in itself gives us decent benefits. Sure, we can further improve
> > it by having separate workers (dedicated to applying the changes) as
> > you are suggesting, and in fact, there is a patch for that as well
> > (see the performance results and bgworker patch at [1]), but if we
> > try to shove all the things in one go, then it will be difficult to
> > get this patch committed (there are already enough things, and the
> > patch is quite big, so getting it right takes a lot of energy). So
> > the plan is something like this: first we get the basic feature, and
> > then we try to improve it by having dedicated workers or things like
> > that. Does this make sense to you?
> >
>
> Thank you for the explanation. The plan makes sense. But I think in
> the current design it's a problem that the logical replication worker
> doesn't receive changes (and doesn't check interrupts) while applying
> committed changes, even if we don't have a worker dedicated to
> applying. I think the worker should continue to receive changes and
> save them to temporary files even while applying changes.
>
Won't it defeat the purpose of this feature, which is to reduce the
apply lag? Basically, it can so happen that while applying a commit,
the worker constantly gets changes of other transactions, which will
delay the apply of the current transaction. Also, won't it create
some further work to identify the order of commits? Say while
applying commit-1, it receives 5 other commits that are written to
separate temporary files. How will we later identify which
transaction's WAL we need to apply first? We might deduce it from
LSNs, but I think that could be tricky. Another thing is that I think
it could lead to some design complications as well, because while
applying a commit, you need some sort of callback or something like
that to receive and flush totally unrelated changes. It could lead to
another kind of failure mode wherein, while applying a commit, it
tries to receive another transaction's data and some failure happens
while writing the data of that transaction. I am not sure if it is a
good idea to try something like that.
> Otherwise
> the buffer would easily fill up and replication would get stuck.
>
Are you talking about the network buffer? I think the best way, as
discussed, is to launch new workers for streamed transactions, but we
can do that as an additional feature. Anyway, as proposed, users can
choose the streaming mode per subscription, so there is an option to
turn this on selectively.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-26 07:05:35 |
Message-ID: | CA+fd4k4ZO2mR34fZOrG_DFp4kr1sMXtSEP5_5MrG3SxVOW8XBA@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > The main aim of this feature is to reduce apply lag. Because if we
> > > send all the changes together, it can delay their apply because of
> > > network delay, whereas if most of the changes are already sent, then
> > > we will save the effort of sending the entire data at commit time.
> > > This in itself gives us decent benefits. Sure, we can further improve
> > > it by having separate workers (dedicated to applying the changes) as
> > > you are suggesting, and in fact, there is a patch for that as well
> > > (see the performance results and bgworker patch at [1]), but if we
> > > try to shove all the things in one go, then it will be difficult to
> > > get this patch committed (there are already enough things, and the
> > > patch is quite big, so getting it right takes a lot of energy). So
> > > the plan is something like this: first we get the basic feature, and
> > > then we try to improve it by having dedicated workers or things like
> > > that. Does this make sense to you?
> > >
> >
> > Thank you for the explanation. The plan makes sense. But I think in
> > the current design it's a problem that the logical replication worker
> > doesn't receive changes (and doesn't check interrupts) while applying
> > committed changes, even if we don't have a worker dedicated to
> > applying. I think the worker should continue to receive changes and
> > save them to temporary files even while applying changes.
> >
>
> Won't it defeat the purpose of this feature, which is to reduce the
> apply lag? Basically, it can so happen that while applying a commit,
> the worker constantly gets changes of other transactions, which will
> delay the apply of the current transaction.
You're right. But it seems to me that it optimizes the apply lag of
only a transaction that made many changes. On the other hand, if a
transaction made many changes, the applying of subsequent changes is
delayed.
> Also, won't it create
> some further work to identify the order of commits? Say while
> applying commit-1, it receives 5 other commits that are written to
> separate temporary files. How will we later identify which
> transaction's WAL we need to apply first? We might deduce it from
> LSNs, but I think that could be tricky. Another thing is that I think
> it could lead to some design complications as well, because while
> applying a commit, you need some sort of callback or something like
> that to receive and flush totally unrelated changes. It could lead to
> another kind of failure mode wherein, while applying a commit, it
> tries to receive another transaction's data and some failure happens
> while writing the data of that transaction. I am not sure if it is a
> good idea to try something like that.
It's just an idea, but we might want to have new workers dedicated to
applying changes first, and then add the streaming option later. That
way we can reduce the flush lag depending on the use case. The commit
order can be determined by the receiver and shared with the applier
via shared memory. Once we have separated the workers, the streaming
option can be introduced without such a downside.
>
> > Otherwise
> > the buffer would easily fill up and replication would get stuck.
> >
>
> Are you talking about the network buffer?
Yes.
> I think the best way, as
> discussed, is to launch new workers for streamed transactions, but we
> can do that as an additional feature. Anyway, as proposed, users can
> choose the streaming mode per subscription, so there is an option to
> turn this on selectively.
Yes. But a user who wants to use this feature would want to replicate
many changes, and I guess the side effect is quite big. I think that
at least we need to make logical replication tolerate such a
situation.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-28 16:03:27 |
Message-ID: | 20191228160327.u5ttzrpawh3widyc@development |
Lists: | pgsql-hackers |
On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:
>On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> >
>> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> > >
>> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
>> > > > I have rebased the patch on the latest head and also fix the issue of
>> > > > "concurrent abort handling of the (sub)transaction." and attached as
>> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
>> > > > the complete patch set. I have added the version number so that we
>> > > > can track the changes.
>> > >
>> > > The patch has rotten a bit and does not apply anymore. Could you
>> > > please send a rebased version? I have moved it to next CF, waiting on
>> > > author.
>> >
>> > I have rebased the patch set on the latest head.
>> >
>> > Apart from this, there is one issue reported by my colleague Vignesh.
>> > The issue is that if we use more than two relations in a transaction
>> > then there is an error on standby (no relation map entry for remote
>> > relation ID 16390). After analyzing I have found that for the
>> > streaming transaction an "is_schema_sent" flag is kept in
>> > ReorderBufferTXN. And, I think that is done so that we can send the
>> > schema for each transaction stream so that if any subtransaction gets
>> > aborted we don't lose the logical WAL for that schema. But, this
>> > solution has induced a very basic issue that if a transaction operate
>> > on more than 1 relation then after sending the schema for the first
>> > relation it will mark the flag true and the schema for the subsequent
>> > relations will never be sent.
>> >
>>
>> How about keeping a list of top-level xids in each RelationSyncEntry?
>> Basically, whenever we send the schema for any transaction, we note
>> that in RelationSyncEntry and at abort time we can remove xid from the
>> list. Now, whenever, we check whether to send schema for any
>> operation in a transaction, we will check if our xid is present in
>> that list for a particular RelationSyncEntry and take an action based
>> on that (if xid is present, then we won't send the schema, otherwise,
>> send it).
>The idea make sense to me. I will try to write a patch for this and test.
>
Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
needs to be in the RelationSyncEntry. In fact, I already have code for
that in my private repository - I thought the patches I sent here do
include this, but apparently I forgot to include this bit :-(
Attached is a rebased patch series, fixing this. It's essentially v2
with a couple of patches (0003, 0008, 0009 and 0012) replacing the
is_schema_sent with correct handling.
0003 - removes an is_schema_sent reference added prematurely (it's added
by a later patch, causing compile failure)
0008 - adds the is_schema_sent back (essentially reverting 0003)
0009 - removes is_schema_sent entirely
0012 - adds the correct handling of schema flags in pgoutput
I don't know what other changes you've made since v2, so this way it
should be possible to just take 0003, 0008, 0009 and 0012 and slip them
in with minimal hassle.
FWIW thanks to everyone (and Amit and Dilip in particular) working on
this patch series. There's been a lot of great reviews and improvements
since I abandoned this thread for a while. I expect to be able to spend
more time working on this in January.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-29 08:04:07 |
Message-ID: | CAFiTN-v9ShPzhQ3RnsZkgidL5Mwc7XwaNzNLfAHgMHjfwFv4eQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:
> >On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >>
> >> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >> >
> >> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >> > >
> >> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> >> > > > I have rebased the patch on the latest head and also fix the issue of
> >> > > > "concurrent abort handling of the (sub)transaction." and attached as
> >> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> >> > > > the complete patch set. I have added the version number so that we
> >> > > > can track the changes.
> >> > >
> >> > > The patch has rotten a bit and does not apply anymore. Could you
> >> > > please send a rebased version? I have moved it to next CF, waiting on
> >> > > author.
> >> >
> >> > I have rebased the patch set on the latest head.
> >> >
> >> > Apart from this, there is one issue reported by my colleague Vignesh.
> >> > The issue is that if we use more than two relations in a transaction
> >> > then there is an error on standby (no relation map entry for remote
> >> > relation ID 16390). After analyzing I have found that for the
> >> > streaming transaction an "is_schema_sent" flag is kept in
> >> > ReorderBufferTXN. And, I think that is done so that we can send the
> >> > schema for each transaction stream so that if any subtransaction gets
> >> > aborted we don't lose the logical WAL for that schema. But, this
> >> > solution has induced a very basic issue that if a transaction operate
> >> > on more than 1 relation then after sending the schema for the first
> >> > relation it will mark the flag true and the schema for the subsequent
> >> > relations will never be sent.
> >> >
> >>
> >> How about keeping a list of top-level xids in each RelationSyncEntry?
> >> Basically, whenever we send the schema for any transaction, we note
> >> that in RelationSyncEntry and at abort time we can remove xid from the
> >> list. Now, whenever, we check whether to send schema for any
> >> operation in a transaction, we will check if our xid is present in
> >> that list for a particular RelationSyncEntry and take an action based
> >> on that (if xid is present, then we won't send the schema, otherwise,
> >> send it).
> >The idea make sense to me. I will try to write a patch for this and test.
> >
>
> Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
> needs to be in the RelationSyncEntry. In fact, I already have code for
> that in my private repository - I thought the patches I sent here do
> include this, but apparently I forgot to include this bit :-(
>
> Attached is a rebased patch series, fixing this. It's essentially v2
> with a couple of patches (0003, 0008, 0009 and 0012) replacing the
> is_schema_sent with correct handling.
>
>
> 0003 - removes an is_schema_sent reference added prematurely (it's added
> by a later patch, causing compile failure)
>
> 0008 - adds the is_schema_sent back (essentially reverting 0003)
>
> 0009 - removes is_schema_sent entirely
>
> 0012 - adds the correct handling of schema flags in pgoutput
>
>
> I don't know what other changes you've made since v2, so this way it
> should be possible to just take 0003, 0008, 0009 and 0012 and slip them
> in with minimal hassle.
>
> FWIW thanks to everyone (and Amit and Dilip in particular) working on
> this patch series. There's been a lot of great reviews and improvements
> since I abandoned this thread for a while. I expect to be able to spend
> more time working on this in January.
>
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+ MemoryContextSwitchTo(oldctx);
+}
I was looking into the schema tracking solution and I have one
question: shouldn't we remove the topxid from the list if the
(sub)transaction is aborted? Because once it is aborted we need to
resend the schema. I think we can remove the xid from the list in the
cleanup_rel_sync_cache function.
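Something like the following, perhaps (an untested sketch; it assumes
cleanup_rel_sync_cache scans the RelationSyncCache hash and is told
whether the transaction committed or aborted):

static void
cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
{
	HASH_SEQ_STATUS hash_seq;
	RelationSyncEntry *entry;

	hash_seq_init(&hash_seq, RelationSyncCache);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
	{
		/* on abort, forget that we sent the schema for this xid */
		if (!is_commit)
			entry->streamed_txns =
				list_delete_int(entry->streamed_txns, xid);
	}
}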
I have observed some more issues:
1. Currently, in ReorderBufferCommit, it is always expected that
whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those
two messages can be in different streams, so we need to find a way to
handle this. Maybe once we get SPEC_INSERT we can remember the tuple,
and then if we get the SPEC_CONFIRM in the next stream we can send
that tuple?
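To sketch the idea (untested; "specinsert" would be a hypothetical new
field in ReorderBufferTXN, so that the pending tuple survives across
streams instead of living in a local variable):

	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
		/* remember the tuple; CONFIRM may arrive in a later stream */
		txn->specinsert = change;
		break;

	case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
		/* send the remembered tuple now that the insert is confirmed */
		if (txn->specinsert != NULL)
			rb->stream_change(rb, txn, relation, txn->specinsert);
		txn->specinsert = NULL;
		break;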
2. At commit time in DecodeCommit, we check whether we need to skip
the changes of the transaction by calling SnapBuildXactNeedsSkip, but
since we now support streaming, it's possible that before we decode
the commit WAL we have already sent the changes to the output plugin,
even though we could have skipped those changes. So my question is:
instead of checking at commit time, can't we check before adding to
the ReorderBuffer itself, or truncate the changes if
SnapBuildXactNeedsSkip is true whenever the logical_decoding_work_mem
limit is reached? Am I missing something here?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-30 09:40:47 |
Message-ID: | CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
Yesterday, Tomas posted the latest version of the patch set, which
contains the fix for the schema-send part. Meanwhile, I was working on
a few review comments/bugfixes and refactoring. I have tried to merge
those changes with the latest patch set, except the refactoring
related to the "0006-Implement-streaming-mode-in-ReorderBuffer" patch,
because Tomas has also made some changes in the same patch. I have
created a separate patch for that, so that we can review the changes
and then merge them into the main patch.
> On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > I have review the patch set and here are few comments/questions
> > >
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > Should we show the tuple in the streamed change like we do for the
> > > pg_decode_change?
> > >
> >
> > I think so. The patch shows the message in
> > pg_decode_stream_message(), so why to prohibit showing tuple here?
Yeah, we can do that. One option is to directly register the
"pg_decode_change" function as the stream_change_cb callback, and that
will show the tuple. Another option is to write a function similar to
pg_decode_change whose message includes the text "STREAM", so that the
user can distinguish between a tuple from a committed transaction and
one from an in-progress transaction.
While analyzing this solution I encountered one more issue: currently,
at commit time in DecodeCommit, we check whether we need to skip the
changes of the transaction by calling SnapBuildXactNeedsSkip. But
since we now support streaming, it's possible that before the commit
WAL arrives we have already sent the changes to the output plugin,
even though we could have skipped those changes. So my question is:
instead of checking at commit time, can't we check before adding to
the ReorderBuffer itself, or truncate the changes if
SnapBuildXactNeedsSkip is true whenever the logical_decoding_work_mem
limit is reached?
> > Few comments on this patch series:
> >
> > 0001-Immediately-WAL-log-assignments:
> > ------------------------------------------------------------
> >
> > The commit message still refers to the old design for this patch. I
> > think you need to modify the commit message as per the latest patch.
Done
> >
> > 0002-Issue-individual-invalidations-with-wal_level-log
> > ----------------------------------------------------------------------------
> > 1.
> > xact_desc_invalidations(StringInfo buf,
> > {
> > ..
> > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> >
> > You have removed logging for the above cache but forgot to remove its
> > reference from one of the places. Also, I think you need to add a
> > comment somewhere in inval.c to say why you are writing for WAL for
> > some types of invalidations and not for others?
Done
> >
> > 0003-Extend-the-output-plugin-API-with-stream-methods
> > --------------------------------------------------------------------------------
> > 1.
> > + are required, while <function>stream_message_cb</function> and
> > + <function>stream_message_cb</function> are optional.
> >
> > stream_message_cb is mentioned twice. It seems the second one is for truncate.
Done
> >
> > 2.
> > size of the transaction size and network bandwidth, the transfer time
> > + may significantly increase the apply lag.
> >
> > /size of the transaction size/size of the transaction
> >
> > no need to mention size twice.
Done
> >
> > 3.
> > + Similarly to spill-to-disk behavior, streaming is triggered when the total
> > + amount of changes decoded from the WAL (for all in-progress
> > transactions)
> > + exceeds limit defined by <varname>logical_work_mem</varname> setting.
> >
> > The guc name used is wrong. /Similarly to/Similar to/
Done
> >
> > 4.
> > stream_start_cb_wrapper()
> > {
> > ..
> > + /* state.report_location = apply_lsn; */
> > ..
> > + /* FIXME ctx->write_location = apply_lsn; */
> > ..
> > }
> >
> > See, if we can fix these and similar in the callback for the stop. I
> > think we don't have final_lsn till we commit/abort. Can we compute
> > before calling these API's?
Done
> >
> >
> > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > ----------------------------------------------------------------------------------
> > 1.
> > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > PG_CATCH();
> > {
> > /* TODO: Encapsulate cleanup
> > from the PG_TRY and PG_CATCH blocks */
> > +
> > if (iterstate)
> > ReorderBufferIterTXNFinish(rb, iterstate);
> >
> > Spurious line change.
> >
Done
> > 2. The commit message of this patch refers to Prepared transactions.
> > I think that needs to be changed.
> >
> > 0006-Implement-streaming-mode-in-ReorderBuffer
> > -------------------------------------------------------------------------
> > 1.
> > +
> > +/* iterator for streaming (only get data from memory) */
> > +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
> > +
> > ReorderBuffer *rb,
> > +
> > ReorderBufferTXN
> > *txn);
> > +
> > +static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
> > + ReorderBuffer *rb,
> > +
> > ReorderBufferStreamIterTXNState * state);
> > +
> > +static void ReorderBufferStreamIterTXNFinish(
> > +
> > ReorderBuffer *rb,
> > +
> > ReorderBufferStreamIterTXNState * state);
> >
> > Do we really need to introduce new APIs for iterating over changes
> > from streamed transactions? Why can't we reuse the same API's as we
> > use for committed xacts?
Done
> >
> > 2.
> > +static void
> > +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
> >
> > Please write some comments atop ReorderBufferStreamCommit.
Done
> >
> > 3.
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > ..
> > + if (txn->snapshot_now
> > == NULL)
> > + {
> > + dlist_iter subxact_i;
> > +
> > + /* make sure this transaction is streamed for the first time */
> > +
> > Assert(!rbtxn_is_streamed(txn));
> > +
> > + /* at the beginning we should have invalid command ID */
> > + Assert(txn->command_id ==
> > InvalidCommandId);
> > +
> > + dlist_foreach(subxact_i, &txn->subtxns)
> > + {
> > + ReorderBufferTXN *subtxn;
> > +
> > +
> > subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> > +
> > + if (subtxn->base_snapshot != NULL &&
> > +
> > (txn->base_snapshot == NULL ||
> > + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
> > + {
> > +
> > txn->base_snapshot = subtxn->base_snapshot;
> >
> > The logic here seems to be correct, but I am not sure why it is not
> > considered to purge the base snapshot before assigning the subtxn's
> > snapshot and similarly, we have not purged snapshot for subtxn once we
> > are done with it. I think we can use
> > ReorderBufferTransferSnapToParent to replace part of the logic here.
> > Do you see any reason for doing things differently here?
Done
> >
> > 4. In ReorderBufferStreamTXN, why do you need to use
> > ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.
IMHO, here instead of directly using the base snapshot we are
modifying it by passing the command id, and that's the reason we are
copying it.
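That is, something like (sketch):

	snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
										 txn, command_id);

so the copy carries the command id while the original base snapshot
stays untouched.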
> >
> > 5. I see a lot of code similarity in ReorderBufferStreamTXN and
> > existing ReorderBufferCommit. I understand that there are some subtle
> > differences due to which we need to write this new function but can't
> > we encapsulate the specific parts of code in functions and then call
> > from both places. I am talking about code in different cases for
> > change->action.
Done
> >
> > 6. + * Note: We never stream and serialize a transaction at the same time (e
> > /(e/(we
Done
I have also found one bug in
"v3-0012-fixup-add-proper-schema-tracking.patch" due to which some of
the streaming test cases were failing; I have created a separate patch
to fix it.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-30 10:13:09 |
Message-ID: | CAA4eK1KiLzSn8P=rdemZNUs8pkCf9q3VUtWiS9jOjfX2tv=0Mw@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have observed some more issues
>
> 1. Currently, in ReorderBufferCommit, it is always expected that
> whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
> SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those
> two messages can be in different streams, so we need to find a way to
> handle this. Maybe once we get SPEC_INSERT we can remember the tuple,
> and then if we get the SPEC_CONFIRM in the next stream we can send
> that tuple?
>
Your suggestion makes sense to me. So, we can try it.
> 2. At commit time in DecodeCommit, we check whether we need to skip
> the changes of the transaction by calling SnapBuildXactNeedsSkip, but
> since we now support streaming, it's possible that before we decode
> the commit WAL we have already sent the changes to the output plugin,
> even though we could have skipped those changes. So my question is:
> instead of checking at commit time, can't we check before adding to
> the ReorderBuffer itself
>
I think if we can do that, then the same will be true for the current
code irrespective of this patch. I think it is possible that we can't
take that decision while decoding because we haven't assembled a
consistent snapshot yet. I think we might be able to do that while we
try to stream the changes. We need to take care of all the conditions
during streaming (when the logical_decoding_work_mem limit is reached)
as we do in DecodeCommit. This needs a bit more study.
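If it turns out to be workable, the check might take a shape roughly
like this at the top of the streaming path (a sketch only;
ReorderBufferTruncateTXN is a hypothetical helper that discards the
transaction's accumulated changes, and which LSN to test is exactly
what needs the study):

	/* before streaming anything for this transaction: */
	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, txn->first_lsn))
	{
		/* the whole transaction would be skipped; drop its changes */
		ReorderBufferTruncateTXN(rb, txn);
		return;
	}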
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2019-12-30 10:19:03 |
Message-ID: | CAA4eK1KTpX5rtMkmhja5NbeNkbhwUruUdg2z=87EqE6uQUpMPA@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Dec 26, 2019 at 12:36 PM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > >
> > > Thank you for the explanation. The plan makes sense. But I think in
> > > the current design it's a problem that the logical replication worker
> > > doesn't receive changes (and doesn't check interrupts) while applying
> > > committed changes, even if we don't have a worker dedicated to
> > > applying. I think the worker should continue to receive changes and
> > > save them to temporary files even while applying changes.
> > >
> >
> > Won't it defeat the purpose of this feature, which is to reduce the
> > apply lag? Basically, it can so happen that while applying a commit,
> > the worker constantly gets changes of other transactions, which will
> > delay the apply of the current transaction.
>
> You're right. But it seems to me that it optimizes the apply lag of
> only a transaction that made many changes. On the other hand, if a
> transaction made many changes, the applying of subsequent changes is
> delayed.
>
Hmm, how would it be worse than the current situation, where once a
commit is encountered on the publisher, we won't start on other
transactions until the replay of that one is finished on the
subscriber?
>
> > I think the best way, as
> > discussed, is to launch new workers for streamed transactions, but we
> > can do that as an additional feature. Anyway, as proposed, users can
> > choose the streaming mode per subscription, so there is an option to
> > turn this on selectively.
>
> Yes. But a user who wants to use this feature would want to replicate
> many changes, and I guess the side effect is quite big. I think that
> at least we need to make logical replication tolerate such a
> situation.
>
What exactly do you mean by "at least we need to make logical
replication tolerate such a situation"? Do you have something
specific in mind?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-02 05:50:10 |
Message-ID: | CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb=FMPpr9_hEB7hozQ-Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have observed some more issues
> >
> > 1. Currently, in ReorderBufferCommit, it is always expected that
> > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
> > SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those
> > two messages can be in different streams, so we need to find a way to
> > handle this. Maybe once we get SPEC_INSERT we can remember the tuple,
> > and then if we get the SPEC_CONFIRM in the next stream we can send
> > that tuple?
> >
>
> Your suggestion makes sense to me. So, we can try it.
Sure.
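Something like the below rough sketch is what I have in mind (the
spec_insert field is hypothetical, not in the current patch set):

/* hypothetical field in ReorderBufferTXN: the speculative insert seen
 * but not yet confirmed, possibly from an earlier stream */
ReorderBufferChange *spec_insert;

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    /* remember the tuple; don't send it until the confirm arrives */
    txn->spec_insert = change;
    break;

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    /* send the remembered tuple, even across a stream boundary */
    Assert(txn->spec_insert != NULL);
    rb->apply_change(rb, txn, relation, txn->spec_insert);
    txn->spec_insert = NULL;
    break;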
>
> > 2. At commit time in DecodeCommit we check whether we need to skip
> > the changes of the transaction or not by calling
> > SnapBuildXactNeedsSkip, but since we now support streaming, it's
> > possible that before we decode the commit WAL, we might have already
> > sent the changes to the output plugin even though we could have
> > skipped those changes. So my question is: instead of checking at
> > commit time, can't we check before adding to the ReorderBuffer itself?
> >
>
> I think if we can do that, then the same will be true for the current
> code irrespective of this patch. I think it is possible that we can't
> take that decision while decoding because we haven't assembled a
> consistent snapshot yet. I think we might be able to do that while we
> try to stream the changes. I think we need to take care of all the
> conditions during streaming (when the logical_decoding_workmem limit
> is reached) as we do in DecodeCommit. This needs a bit more study.
I agree.
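Just to illustrate the idea, a rough (untested) sketch of such a check at
stream time:

/* rough sketch: before streaming a transaction's changes, skip them
 * entirely if they predate the point the snapshot builder cares about */
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, txn->first_lsn))
    return;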
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-03 09:24:26 |
Message-ID: | CAA4eK1JaKW1mj4L6DPnk-V4vXJ6hM=Kcf6+-X+93Jk56UN+kGw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Dec 24, 2019 at 10:58 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > I think the way invalidations work for logical replication is that
> > normally, we always start a new transaction before decoding each
> > commit which allows us to accept the invalidations (via
> > AtStart_Cache). However, if there are catalog changes within the
> > transaction being decoded, we need to reflect those before trying to
> > decode the WAL of operations which happened after that catalog change.
> > As we are not logging the WAL for each invalidation, we need to
> > execute all the invalidation messages for this transaction at each
> > catalog change. We are able to do that now as we decode the entire WAL
> > for a transaction only once we get the commit's WAL which contains all
> > the invalidation messages. So, we queue them up and execute them for
> > each catalog change which we identify by WAL record
> > XLOG_HEAP2_NEW_CID.
>
> Thanks for the explanation. That makes sense. But, it's still true,
> AFAICS, that instead of doing this stuff with logging invalidations
> you could just InvalidateSystemCaches() in the cases where you are
> currently applying all of the transaction's invalidations. That
> approach might be worse than changing the way invalidations are
> logged, but the two approaches deserve to be compared. One approach
> has more CPU overhead and the other has more WAL overhead, so it's a
> little hard to compare them, but it seems worth mulling over.
>
I have given some thought to it, and it seems to me that this will
increase not only CPU usage but also network usage. The increase in
CPU usage will be for all WALSenders that decode a transaction that
has performed DDL. The increase in network usage comes from the fact
that we need to send the schema of relations again even though they
don't require to be invalidated. That is because the invalidation blew
away our local map that remembers which relation schemas were sent.
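For context, the relevant part of the local map entry in pgoutput looks
roughly like this (simplified):

/* simplified from pgoutput.c: per-relation entry in the local map */
typedef struct RelationSyncEntry
{
    Oid     relid;          /* relation oid */
    bool    schema_sent;    /* did we already send the schema? */
} RelationSyncEntry;

Any invalidation callback for the relation resets schema_sent, so the
next change for that relation sends the schema again.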
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-04 04:30:26 |
Message-ID: | CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV+ZcGb3BH6U3x2uxew@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> >
> > Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
> > needs to be in the RelationSyncEntry. In fact, I already have code for
> > that in my private repository - I thought the patches I sent here do
> > include this, but apparently I forgot to include this bit :-(
> >
> > Attached is a rebased patch series, fixing this. It's essentially v2
> > with a couple of patches (0003, 0008, 0009 and 0012) replacing the
> > is_schema_sent with correct handling.
> >
> >
> > 0003 - removes an is_schema_sent reference added prematurely (it's added
> > by a later patch, causing compile failure)
> >
> > 0008 - adds the is_schema_sent back (essentially reverting 0003)
> >
> > 0009 - removes is_schema_sent entirely
> >
> > 0012 - adds the correct handling of schema flags in pgoutput
> >
Thanks for splitting the changes. They are quite clear.
> >
> > I don't know what other changes you've made since v2, so this way it
> > should be possible to just take 0003, 0008, 0009 and 0012 and slip them
> > in with minimal hassle.
> >
> > FWIW thanks to everyone (and Amit and Dilip in particular) working on
> > this patch series. There's been a lot of great reviews and improvements
> > since I abandoned this thread for a while. I expect to be able to spend
> > more time working on this in January.
> >
> +static void
> +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> +{
> + MemoryContext oldctx;
> +
> + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> +
> + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
> +
> + MemoryContextSwitchTo(oldctx);
> +}
> I was looking into the schema tracking solution and I have one
> question: shouldn't we remove the topxid from the list if the
> (sub)transaction is aborted? Because once it is aborted we need to
> resend the schema.
>
I think you are right because, at abort, the subscriber would remove
the changes (for a subtransaction) including the schema changes sent,
and then it won't be able to understand the subsequent changes sent by
the publisher. Won't we need to remove the xid from the list at commit
time as well? Otherwise, the list will keep on growing. One more
thing: we need to search the lists of all the relations in the local
map to find the xid being aborted/committed, right? If so, won't it be
costly doing that at each transaction abort/commit?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-04 06:24:19 |
Message-ID: | CAFiTN-vhQdU3Oi3acgdO-Sx+zEtKogyTbGB83+zmTpAHG+FpVw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Jan 4, 2020 at 10:00 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > +static void
> > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> > +{
> > + MemoryContext oldctx;
> > +
> > + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > +
> > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
> > +
> > + MemoryContextSwitchTo(oldctx);
> > +}
> > I was looking into the schema tracking solution and I have one
> > question: shouldn't we remove the topxid from the list if the
> > (sub)transaction is aborted? Because once it is aborted we need to
> > resend the schema.
> >
>
> I think you are right because, at abort, the subscriber would remove
> the changes (for a subtransaction) including the schema changes sent,
> and then it won't be able to understand the subsequent changes sent by
> the publisher. Won't we need to remove the xid from the list at commit
> time as well? Otherwise, the list will keep on growing.
Yes, we need to remove the xid from the list at the time of commit as well.
> One more thing: we need to search the lists of all the relations in the
> local map to find the xid being aborted/committed, right? If so, won't
> it be costly doing that at each transaction abort/commit?
Yeah, if multiple concurrent transactions operate on common
relations, then the lists can grow longer. I am not sure how many
concurrent large transactions are possible; maybe the lists won't be so
huge that searching becomes very costly. Otherwise, we can maintain a
sorted array of the xids and do a binary search, or we can maintain a hash?
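For example, a self-contained sketch of the sorted-array idea (ignoring
xid wraparound for simplicity):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* probe a sorted array of streamed xids at commit/abort time */
static bool
streamed_txn_contains(const TransactionId *xids, int nxids, TransactionId xid)
{
    int     low = 0;
    int     high = nxids - 1;

    while (low <= high)
    {
        int     mid = low + (high - low) / 2;

        if (xids[mid] == xid)
            return true;
        else if (xids[mid] < xid)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return false;
}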
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-04 10:37:30 |
Message-ID: | CAA4eK1L8AfbXRrEam6KLK5HMPO4CjJp-NqQtTxqOjP+84iDPnQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> Yesterday, Tomas posted the latest version of the patch set, which
> contains the fix for the schema-send part. Meanwhile, I was working on a
> few review comments/bugfixes and refactoring. I have tried to merge those
> changes with the latest patch set except the refactoring related to the
> "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
> has also made some changes in the same patch.
>
I don't see any changes by Tomas in that particular patch, am I
missing something?
> I have created a
> separate patch for the same so that we can review the changes and then
> we can merge them into the main patch.
>
It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.
> > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > ----------------------------------------------------------------------------
> > > 1.
> > > xact_desc_invalidations(StringInfo buf,
> > > {
> > > ..
> > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > >
> > > You have removed logging for the above cache but forgot to remove its
> > > reference from one of the places. Also, I think you need to add a
> > > comment somewhere in inval.c to say why you are writing WAL for
> > > some types of invalidations and not for others.
> Done
>
I don't see any new comments as asked by me. I think we should also
consider WAL logging at each command end instead of doing it piecemeal, as
discussed in another email [1], which will have fewer code changes and
may perform better. You might want to evaluate the
performance of both approaches.
> > >
> > > 0003-Extend-the-output-plugin-API-with-stream-methods
> > > --------------------------------------------------------------------------------
> > >
> > > 4.
> > > stream_start_cb_wrapper()
> > > {
> > > ..
> > > + /* state.report_location = apply_lsn; */
> > > ..
> > > + /* FIXME ctx->write_location = apply_lsn; */
> > > ..
> > > }
> > >
> > > See if we can fix these and similar ones in the callback for the stop. I
> > > think we don't have final_lsn till we commit/abort. Can we compute it
> > > before calling these APIs?
> Done
>
You have just used final_lsn, but I don't see where you have ensured
that it is set before the API stream_stop_cb_wrapper. I think we need
something similar to what Vignesh has done in one of his bug-fix patches
[2]. See my comment below in this regard.
> > >
> > >
> > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > ----------------------------------------------------------------------------------
> > > 1.
> > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > PG_CATCH();
> > > {
> > > /* TODO: Encapsulate cleanup
> > > from the PG_TRY and PG_CATCH blocks */
> > > +
> > > if (iterstate)
> > > ReorderBufferIterTXNFinish(rb, iterstate);
> > >
> > > Spurious line change.
> > >
> Done
+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+ RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");
Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that, and if they do so during decoding (with historic snapshots), the
same should not be allowed.
How about changing the error message to "unexpected heap_getnext call
during logical decoding" or something like that?
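That is, something like:

if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
             !(IsCatalogRelation(scan->rs_base.rs_rd) ||
               RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
    elog(ERROR, "unexpected heap_getnext call during logical decoding");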
> > > 2. The commit message of this patch refers to Prepared transactions.
> > > I think that needs to be changed.
> > >
> > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > -------------------------------------------------------------------------
Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }
Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?
2.
+ if (streaming)
+ {
+ /*
+ * Set the CheckXidAlive to the current (sub)xid for which this
+ * change belongs to so that we can detect the abort while we are
+ * decoding.
+ */
+ CheckXidAlive = change->txn->xid;
+
+ /* Increment the stream count. */
+ streamed++;
+ }
Is the variable 'streamed' used anywhere?
3.
+ /*
+ * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+ * any memory. We could also keep the hash table and update it with
+ * new ctid values, but this seems simpler and good enough for now.
+ */
+ ReorderBufferDestroyTupleCidHash(rb, txn);
Won't this be required only when we are streaming changes?
As per my understanding, apart from the above comments, the known
pending work for this patchset is as follows:
a. The two open items agreed to you in the email [3].
b. Complete the handling of schema_sent as discussed above [4].
c. Few comments by Vignesh and the response on the same by me [5][6].
d. WAL overhead and performance testing for additional WAL logging by
this patchset.
e. Some way to see the tuple for streamed transactions by decoding API
as speculated by you [7].
Have I missed anything?
[1] - https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
[6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
[7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-06 03:50:57 |
Message-ID: | CAFiTN-tpq-5_9f8fG1Y9iqZaxFXfKbFQgYD=KqC8mmhKEN=N-g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > Yesterday, Tomas posted the latest version of the patch set, which
> > contains the fix for the schema-send part. Meanwhile, I was working on a
> > few review comments/bugfixes and refactoring. I have tried to merge those
> > changes with the latest patch set except the refactoring related to the
> > "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
> > has also made some changes in the same patch.
> >
>
> I don't see any changes by Tomas in that particular patch, am I
> missing something?
He has created some sub-patches from the main patch for handling the
schema-sent issue. So if I make changes in that patch, all other
patches will conflict.
>
> > I have created a
> > separate patch for the same so that we can review the changes and then
> > we can merge them to the main patch.
> >
>
> It is better to merge it with the main patch for
> "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> difficult to review.
Actually, we can merge 0008, 0009, 0012, 0018 into the main patch
(0007). Basically, if we merge all of them then we don't need to deal
with the conflicts. I think Tomas has kept them separate so that we
can review the solution for the schema-sent issue. And I kept 0018 as a
separate patch to avoid conflicts and rebasing in 0008, 0009 and 0012.
In the next patch set, I will merge all of them into 0007.
>
> > > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > > ----------------------------------------------------------------------------
> > > > 1.
> > > > xact_desc_invalidations(StringInfo buf,
> > > > {
> > > > ..
> > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > > >
> > > > You have removed logging for the above cache but forgot to remove its
> > > > reference from one of the places. Also, I think you need to add a
> > > > comment somewhere in inval.c to say why you are writing WAL for
> > > > some types of invalidations and not for others.
> > Done
> >
>
> I don't see any new comments as asked by me.
Oh, I just fixed one part of the comment and overlooked the rest. Will fix.
> I think we should also
> consider WAL logging at each command end instead of doing it piecemeal, as
> discussed in another email [1], which will have fewer code changes and
> may perform better. You might want to evaluate the
> performance of both approaches.
Ok
>
> > > >
> > > > 0003-Extend-the-output-plugin-API-with-stream-methods
> > > > --------------------------------------------------------------------------------
> > > >
> > > > 4.
> > > > stream_start_cb_wrapper()
> > > > {
> > > > ..
> > > > + /* state.report_location = apply_lsn; */
> > > > ..
> > > > + /* FIXME ctx->write_location = apply_lsn; */
> > > > ..
> > > > }
> > > >
> > > > See if we can fix these and similar ones in the callback for the stop. I
> > > > think we don't have final_lsn till we commit/abort. Can we compute it
> > > > before calling these APIs?
> > Done
> >
>
> You have just used final_lsn, but I don't see where you have ensured
> that it is set before the API stream_stop_cb_wrapper. I think we need
> something similar to what Vignesh has done in one of his bug-fix patches
> [2]. See my comment below in this regard.
You can refer to the below hunk in 0018.
+ /*
+ * Done with current changes, call stream_stop callback for streaming
+ * transaction, commit callback otherwise.
+ */
+ if (streaming)
+ {
+ /*
+ * Set the last lsn of the stream as the final lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }
>
> > > >
> > > >
> > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > > ----------------------------------------------------------------------------------
> > > > 1.
> > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > > PG_CATCH();
> > > > {
> > > > /* TODO: Encapsulate cleanup
> > > > from the PG_TRY and PG_CATCH blocks */
> > > > +
> > > > if (iterstate)
> > > > ReorderBufferIterTXNFinish(rb, iterstate);
> > > >
> > > > Spurious line change.
> > > >
> > Done
>
> + /*
> + * We don't expect direct calls to heap_getnext with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> + elog(ERROR, "improper heap_getnext call");
>
> Earlier, I thought we don't need to check if it is a regular table in
> this check, but it is required because output plugins can try to do
> that
I did not understand that, can you give an example?
> and if they do so during decoding (with historic snapshots), the
> same should not be allowed.
>
> How about changing the error message to "unexpected heap_getnext call
> during logical decoding" or something like that?
Ok
>
> > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > I think that needs to be changed.
> > > >
> > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > -------------------------------------------------------------------------
>
> Few comments on v4-0018-Review-comment-fix-and-refactoring:
> 1.
> + if (streaming)
> + {
> + /*
> + * Set the last lsn of the stream as the final lsn before calling
> + * stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?
Isn't it the same? There we are doing it while serializing and here we
are doing it while streaming; basically, the last LSN we streamed. Am I
missing something?
>
> 2.
> + if (streaming)
> + {
> + /*
> + * Set the CheckXidAlive to the current (sub)xid for which this
> + * change belongs to so that we can detect the abort while we are
> + * decoding.
> + */
> + CheckXidAlive = change->txn->xid;
> +
> + /* Increment the stream count. */
> + streamed++;
> + }
>
> Is the variable 'streamed' used anywhere?
>
> 3.
> + /*
> + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
> + * any memory. We could also keep the hash table and update it with
> + * new ctid values, but this seems simpler and good enough for now.
> + */
> + ReorderBufferDestroyTupleCidHash(rb, txn);
>
> Won't this be required only when we are streaming changes?
I will work on these review comments and reply to them separately, along
with the patch.
>
> As per my understanding, apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed to you in the email [3].
> b. Complete the handling of schema_sent as discussed above [4].
> c. Few comments by Vignesh and the response on the same by me [5][6].
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset.
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7].
>
> Have I missed anything?
I think this is the list I remember. Apart from these, there are a few
points by Robert which are still under discussion [8].
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com
> [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
> [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
> [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
> [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-06 08:41:38 |
Message-ID: | CAA4eK1J-TgA_GdcjFHs9xQv_65hE=EvBXQkGo-mhOoB23uk2jA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > It is better to merge it with the main patch for
> > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > difficult to review.
> Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> (0007). Basically, if we merge all of them then we don't need to deal
> with the conflict. I think Tomas has kept them separate so that we
> can review the solution for the schema sent. And, I kept 0018 as a
> separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> In the next patch set, I will merge all of them to 0007.
>
Okay, I think we can merge those patches.
> >
> > + /*
> > + * We don't expect direct calls to heap_getnext with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> > + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> > + elog(ERROR, "improper heap_getnext call");
> >
> > Earlier, I thought we don't need to check if it is a regular table in
> > this check, but it is required because output plugins can try to do
> > that
> I did not understand that, can you give some example?
>
I think it can lead to the same problem of concurrent aborts as for
catalog scans.
> >
> > > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > > I think that needs to be changed.
> > > > >
> > > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > > -------------------------------------------------------------------------
> >
> > Few comments on v4-0018-Review-comment-fix-and-refactoring:
> > 1.
> > + if (streaming)
> > + {
> > + /*
> > + * Set the last lsn of the stream as the final lsn before calling
> > + * stream stop.
> > + */
> > + txn->final_lsn = prev_lsn;
> > + rb->stream_stop(rb, txn);
> > + }
> >
> > Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?
> > Isn't it the same? There we are doing it while serializing and here we
> > are doing it while streaming; basically, the last LSN we streamed. Am I
> > missing something?
>
No, I think you are right.
Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+  * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+  * information about subtransactions, which could arrive after streaming start.
+  */
+ if (!txn->is_schema_sent)
+     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+                                          txn, command_id);
..
}
Why are we using the base snapshot here instead of the snapshot we saved
the first time streaming happened? And as mentioned in the comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we streamed the changes?
2.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+                                           txn, command_id);
I don't see where the txn->snapshot_now is getting freed. The
base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
this getting freed.
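For instance, a sketch of the cleanup, assuming the existing
ReorderBufferFreeSnap() is the right way to release it:

/* in ReorderBufferCleanupTXN, next to the base_snapshot cleanup */
if (txn->snapshot_now != NULL)
{
    ReorderBufferFreeSnap(rb, txn->snapshot_now);
    txn->snapshot_now = NULL;
}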
3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+     ReorderBufferStreamTXN(rb, txn->toptxn);
+     return;
+ }
Is it ever possible that we reach here for a subtransaction? If not,
then it should be an Assert rather than an if condition.
4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
fields like origin_id and origin_lsn as we do in ReorderBufferCommit(),
especially to cover the case when it gets called due to memory
overflow (aka via ReorderBufferCheckMemoryLimit)?
v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
1.
@@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 if (using_subtxn)
     RollbackAndReleaseCurrentSubTransaction();

- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+     MemoryContextSwitchTo(ecxt);
+     PG_RE_THROW();
+ }
+ else
+ {
+     /* remember the command ID and snapshot for the streaming run */
+     txn->command_id = command_id;
+     txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+                                               txn, command_id);
+     rb->stream_stop(rb, txn);
+
+     FlushErrorState();
+ }
Can you update the comments, either in the above code block or some other
place, to explain what the concurrent abort problem is and how we dealt
with it? Also, please explain how the above error handling is
sufficient to address all the various scenarios (a sub-transaction got
aborted when we have already sent some changes, or when we have not
sent any changes yet).
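For reference, a draft of such a comment (wording is only a suggestion):

/*
 * While streaming an in-progress transaction with a historic snapshot,
 * the transaction (or one of its subtransactions) may abort
 * concurrently, and catalog scans done during decoding could then see
 * inconsistent catalog contents.  So we set CheckXidAlive to the xid
 * being decoded and make catalog access raise
 * ERRCODE_TRANSACTION_ROLLBACK if that xid turns out to be aborted.
 * Here we catch that error, stop the current stream cleanly, and
 * discard the aborted changes instead of re-throwing.
 */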
v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
1.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
Why can't we use TransactionIdDidAbort here? If we can't use it, then
can you add comments stating the reason for the same?
2.
/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
In the comments, there is a mention of a prepared transaction. Do we
allow prepared transactions to be decoded as part of this patch?
3.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+     !TransactionIdIsInProgress(CheckXidAlive) &&
+     !TransactionIdDidCommit(CheckXidAlive))
This comment just says what the code below is doing; can you explain the
rationale behind this check? It would be better if it were clear from
the comments why we are doing this check after fetching the
tuple. I think this can refer to the comment I suggested adding for
changes in patch
v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-06 10:26:25 |
Message-ID: | CAFiTN-vB5BHhg1iJ0Nx5GX7xYcxWiY_=3sGyym87AzVACZHqdA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > It is better to merge it with the main patch for
> > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > > difficult to review.
> > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> > (0007). Basically, if we merge all of them then we don't need to deal
> > with the conflict. I think Tomas has kept them separate so that we
> > can review the solution for the schema sent. And, I kept 0018 as a
> > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> > In the next patch set, I will merge all of them to 0007.
> >
>
> Okay, I think we can merge those patches.
ok
>
> > >
> > > + /*
> > > + * We don't expect direct calls to heap_getnext with valid
> > > + * CheckXidAlive for regular tables. Track that below.
> > > + */
> > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> > > + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> > > + elog(ERROR, "improper heap_getnext call");
> > >
> > > Earlier, I thought we don't need to check if it is a regular table in
> > > this check, but it is required because output plugins can try to do
> > > that
> > I did not understand that, can you give some example?
> >
>
> I think it can lead to the same problem of concurrent aborts as for
> catalog scans.
Yeah, got it.
>
> > >
> > > > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > > > I think that needs to be changed.
> > > > > >
> > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > > > -------------------------------------------------------------------------
> > >
> > > Few comments on v4-0018-Review-comment-fix-and-refactoring:
> > > 1.
> > > + if (streaming)
> > > + {
> > > + /*
> > > + * Set the last lsn of the stream as the final lsn before calling
> > > + * stream stop.
> > > + */
> > > + txn->final_lsn = prev_lsn;
> > > + rb->stream_stop(rb, txn);
> > > + }
> > >
> > > Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?
> > Isn't it the same? There we are doing it while serializing and here we
> > are doing it while streaming; basically, the last LSN we streamed. Am I
> > missing something?
> >
>
> No, I think you are right.
>
> Few more comments:
> --------------------------------
> v4-0007-Implement-streaming-mode-in-ReorderBuffer
> 1.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> +  * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> +  * information about subtransactions, which could arrive after streaming start.
> +  */
> + if (!txn->is_schema_sent)
> +     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> +                                          txn, command_id);
> ..
> }
>
> Why are we using the base snapshot here instead of the snapshot we saved
> the first time streaming happened? And as mentioned in the comments,
> won't we need to consider the snapshots for subtransactions that
> arrived after the last time we streamed the changes?
>
> 2.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +                                           txn, command_id);
>
> I don't see where the txn->snapshot_now is getting freed. The
> base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
> this getting freed.
Ok, I will check that and fix.
>
> 3.
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * If this is a subxact, we need to stream the top-level transaction
> + * instead.
> + */
> + if (txn->toptxn)
> + {
> +     ReorderBufferStreamTXN(rb, txn->toptxn);
> +     return;
> + }
>
> Is it ever possible that we reach here for a subtransaction? If not,
> then it should be an Assert rather than an if condition.
ReorderBufferCheckMemoryLimit can call it either for a
subtransaction or for the main transaction, depending upon in which
ReorderBufferTXN you are adding the current change.
I will analyze your other comments and fix them in the next version.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-06 11:06:04 |
Message-ID: | CAA4eK1+OhaQpc3tqsAHkbJY7CGh+uB7hputB5V+zroz6O1WmtQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > 3.
> > +static void
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > + /*
> > + * If this is a subxact, we need to stream the top-level transaction
> > + * instead.
> > + */
> > + if (txn->toptxn)
> > + {
> > +     ReorderBufferStreamTXN(rb, txn->toptxn);
> > +     return;
> > + }
> >
> > Is it ever possible that we reach here for a subtransaction? If not,
> > then it should be an Assert rather than an if condition.
>
> ReorderBufferCheckMemoryLimit can call it either for a
> subtransaction or for the main transaction, depending upon in which
> ReorderBufferTXN you are adding the current change.
>
That function has code like below:
ReorderBufferCheckMemoryLimit()
{
..
if (ReorderBufferCanStream(rb))
{
/*
* Pick the largest toplevel transaction and evict it from memory by
* streaming the already decoded part.
*/
txn = ReorderBufferLargestTopTXN(rb);
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
..
ReorderBufferStreamTXN(rb, txn);
..
}
How can it pass a ReorderBufferTXN for a subtransaction?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-06 11:14:17 |
Message-ID: | CAFiTN-vc7havbtwSKCzei6T9-N2MJ3cHxYMh66Rt5oSu7WXiRg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > 3.
> > > +static void
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * If this is a subxact, we need to stream the top-level transaction
> > > + * instead.
> > > + */
> > > + if (txn->toptxn)
> > > + {
> > > +     ReorderBufferStreamTXN(rb, txn->toptxn);
> > > +     return;
> > > + }
> > >
> > > Is it ever possible that we reach here for a subtransaction? If not,
> > > then it should be an Assert rather than an if condition.
> >
> > ReorderBufferCheckMemoryLimit can call it either for a
> > subtransaction or for the main transaction, depending upon in which
> > ReorderBufferTXN you are adding the current change.
> >
>
> That function has code like below:
>
> ReorderBufferCheckMemoryLimit()
> {
> ..
> if (ReorderBufferCanStream(rb))
> {
> /*
> * Pick the largest toplevel transaction and evict it from memory by
> * streaming the already decoded part.
> */
> txn = ReorderBufferLargestTopTXN(rb);
> /* we know there has to be one, because the size is not zero */
> Assert(txn && !txn->toptxn);
> ..
> ReorderBufferStreamTXN(rb, txn);
> ..
> }
>
> How can it pass a ReorderBufferTXN for a subtransaction?
>
Hmm, I missed it. You are right, will fix it.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-08 07:42:22 |
Message-ID: | CAFiTN-uRPwrW4tuNDWmohqCCsyyJ71kunCFNcVn7ot5KQ67ULQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 6, 2020 at 4:44 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > 3.
> > > > +static void
> > > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > > {
> > > > ..
> > > > + /*
> > > > + * If this is a subxact, we need to stream the top-level transaction
> > > > + * instead.
> > > > + */
> > > > + if (txn->toptxn)
> > > > + {
> > > > +     ReorderBufferStreamTXN(rb, txn->toptxn);
> > > > +     return;
> > > > + }
> > > >
> > > > Is it ever possible that we reach here for a subtransaction? If not,
> > > > then it should be an Assert rather than an if condition.
> > >
> > > ReorderBufferCheckMemoryLimit can call it either for a
> > > subtransaction or for the main transaction, depending upon in which
> > > ReorderBufferTXN you are adding the current change.
> > >
> >
> > That function has code like below:
> >
> > ReorderBufferCheckMemoryLimit()
> > {
> > ..
> > if (ReorderBufferCanStream(rb))
> > {
> > /*
> > * Pick the largest toplevel transaction and evict it from memory by
> > * streaming the already decoded part.
> > */
> > txn = ReorderBufferLargestTopTXN(rb);
> > /* we know there has to be one, because the size is not zero */
> > Assert(txn && !txn->toptxn);
> > ..
> > ReorderBufferStreamTXN(rb, txn);
> > ..
> > }
> >
> > How can it pass a ReorderBufferTXN for a subtransaction?
> >
> Hmm, I missed it. You are right, will fix it.
>
I have observed one more design issue. The problem is that when we
get toasted chunks, we remember the changes in memory (a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the changes of the toast table
and the main table in different streams. So basically, in a stream,
if we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-09 04:05:45 |
Message-ID: | CAA4eK1LkjGExGKDWSLAFYmNW3c2pKm0oCmPYqnkLiMi-60mHDg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have observed one more design issue.
>
Good observation.
> The problem is that when we
> get toasted chunks, we remember the changes in memory (a hash table)
> but don't stream them until we get the actual change on the main table.
> Now, the problem is that we might get the changes of the toast table
> and the main table in different streams. So basically, in a stream,
> if we have only got the toasted tuples, then even after
> ReorderBufferStreamTXN the memory usage will not be reduced.
>
I think we can't split such changes across streams (unless we
design an entirely new solution to send partial changes of toast
data), so we need to send them together. We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we
are able to assemble the entire tuple. Now, whenever we try to
stream the changes once we reach the memory threshold, we can check
whether the data_complete flag is true; if so, then only send the
changes, otherwise, we can pick the next largest transaction. I think
we can retry it a few times, and if we get incomplete data for
multiple transactions, then we can decide to spill the transaction, or
maybe we can directly spill the first largest transaction which has
incomplete data.
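A rough sketch of that retry loop (MAX_STREAM_RETRIES, the data_complete
flag, and the extra "skip" argument to ReorderBufferLargestTopTXN are all
hypothetical, just to convey the idea):

/* hypothetical sketch, not the actual patch */
#define MAX_STREAM_RETRIES 3

txn = NULL;
for (int retry = 0; retry < MAX_STREAM_RETRIES; retry++)
{
    /* pick the largest toplevel transaction, skipping rejected ones */
    txn = ReorderBufferLargestTopTXN(rb, txn);
    if (txn == NULL || txn->data_complete)
        break;
}

if (txn != NULL && txn->data_complete)
    ReorderBufferStreamTXN(rb, txn);    /* stream the complete one */
else
    ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));  /* spill */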
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-09 05:00:10 |
Message-ID: | CAFiTN-v8X3Hhm+b6mG57FHp7-ORKZYcOOtSND33eyEicZ_cKRA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have observed one more design issue.
> >
>
> Good observation.
>
> > The problem is that when we
> > get toasted chunks, we remember the changes in memory (a hash table)
> > but don't stream until we get the actual change on the main table.
> > Now, the problem is that we might get the change of the toasted table
> > and the main table in different streams. So basically, in a stream,
> > if we have only got the toasted tuples then even after
> > ReorderBufferStreamTXN the memory usage will not be reduced.
> >
>
> I think we can't split such changes in a different stream (unless we
> design an entirely new solution to send partial changes of toast
> data), so we need to send them together. We can keep a flag like
> data_complete in ReorderBufferTXN and mark it complete only when we
> are able to assemble the entire tuple. Now, whenever, we try to
> stream the changes once we reach the memory threshold, we can check
> whether the data_complete flag is true, if so, then only send the
> changes, otherwise, we can pick the next largest transaction. I think
> we can retry it for few times and if we get the incomplete data for
> multiple transactions, then we can decide to spill the transaction or
> maybe we can directly spill the first largest transaction which has
> incomplete data.
>
Yeah, we might do something along this line. Basically, we need to mark
the top-transaction as data-incomplete if any of its subtransactions is
data-incomplete (it will always be the latest sub-transaction
of the top transaction). Also, for streaming we are checking the
largest top transaction, whereas for spilling we just need the largest
(sub)transaction. So we also need to decide, while picking the
largest top transaction for streaming, if we get a few transactions
with incomplete data, then how we will go for the spill. Do we spill
all the sub-transactions under this top transaction, or do we again
find the largest (sub)transaction for spilling?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-09 06:39:06 |
Message-ID: | CAA4eK1L5PyRZMS0B8C+d_RCHo0VX6hu6D6tPnXnqPhy4tcNtFQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > I have observed one more design issue.
> > >
> >
> > Good observation.
> >
> > > The problem is that when we
> > > get toasted chunks, we remember the changes in memory (a hash table)
> > > but don't stream until we get the actual change on the main table.
> > > Now, the problem is that we might get the change of the toasted table
> > > and the main table in different streams. So basically, in a stream,
> > > if we have only got the toasted tuples then even after
> > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > >
> >
> > I think we can't split such changes in a different stream (unless we
> > design an entirely new solution to send partial changes of toast
> > data), so we need to send them together. We can keep a flag like
> > data_complete in ReorderBufferTXN and mark it complete only when we
> > are able to assemble the entire tuple. Now, whenever, we try to
> > stream the changes once we reach the memory threshold, we can check
> > whether the data_complete flag is true, if so, then only send the
> > changes, otherwise, we can pick the next largest transaction. I think
> > we can retry it for few times and if we get the incomplete data for
> > multiple transactions, then we can decide to spill the transaction or
> > maybe we can directly spill the first largest transaction which has
> > incomplete data.
> >
> Yeah, we might do something on this line. Basically, we need to mark
> the top-transaction as data-incomplete if any of its subtransaction is
> having data-incomplete (it will always be the latest sub-transaction
> of the top transaction). Also, for streaming, we are checking the
> largest top transaction whereas for spilling we just need the largest
> (sub) transaction. So we also need to decide while picking the
> largest top transaction for streaming, if we get a few transactions
> with in-complete data then how we will go for the spill. Do we spill
> all the sub-transactions under this top transaction or we will again
> find the largest (sub) transaction for spilling.
>
I think it is better to do the latter, as that will lead to the spill of
only the required (minimum changes to get the memory below the threshold)
changes.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-09 07:10:52 |
Message-ID: | CAFiTN-vV_eO7xjJq0iHyFqcMA2GiohPPnd-ohLhib9vMqG0Z1w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > I have observed one more design issue.
> > > >
> > >
> > > Good observation.
> > >
> > > > The problem is that when we
> > > > get toasted chunks, we remember the changes in memory (a hash table)
> > > > but don't stream until we get the actual change on the main table.
> > > > Now, the problem is that we might get the change of the toasted table
> > > > and the main table in different streams. So basically, in a stream,
> > > > if we have only got the toasted tuples then even after
> > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > >
> > >
> > > I think we can't split such changes in a different stream (unless we
> > > design an entirely new solution to send partial changes of toast
> > > data), so we need to send them together. We can keep a flag like
> > > data_complete in ReorderBufferTXN and mark it complete only when we
> > > are able to assemble the entire tuple. Now, whenever, we try to
> > > stream the changes once we reach the memory threshold, we can check
> > > whether the data_complete flag is true, if so, then only send the
> > > changes, otherwise, we can pick the next largest transaction. I think
> > > we can retry it for few times and if we get the incomplete data for
> > > multiple transactions, then we can decide to spill the transaction or
> > > maybe we can directly spill the first largest transaction which has
> > > incomplete data.
> > >
> > Yeah, we might do something on this line. Basically, we need to mark
> > the top-transaction as data-incomplete if any of its subtransactions
> > has incomplete data (it will always be the latest subtransaction of
> > the top transaction). Also, for streaming we are picking the largest
> > top transaction, whereas for spilling we just need the largest
> > (sub)transaction. So while picking the largest top transaction for
> > streaming, we also need to decide how to spill if we get a few
> > transactions with incomplete data: do we spill all the
> > subtransactions under this top transaction, or do we again find the
> > largest (sub)transaction for spilling?
> >
>
> I think it is better to do the latter, as that will lead to spilling
> only the required changes (the minimum needed to get the memory below
> the threshold).
Makes sense to me.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-10 04:44:05 |
Message-ID: | CAFiTN-snMb=53oqkM8av8Lqfxojjm4OBwCNxmFssgLCceY_zgg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > It is better to merge it with the main patch for
> > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > > difficult to review.
> > Actually, we can merge 0008, 0009, 0012, and 0018 into the main patch
> > (0007). Basically, if we merge all of them then we don't need to deal
> > with the conflicts. I think Tomas has kept them separate so that we
> > can review the solution for the schema-sent issue. And I kept 0018 as a
> > separate patch to avoid conflicts and rebasing in 0008, 0009 and 0012.
> > In the next patch set, I will merge all of them into 0007.
> >
>
> Okay, I think we can merge those patches.
Done. 0008, 0009, 0017 and 0018 are merged into 0007; 0012 is merged
into 0010.
>
> Few more comments:
> --------------------------------
> v4-0007-Implement-streaming-mode-in-ReorderBuffer
> 1.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> +  * TOCHECK: We have to rebuild historic snapshot to be sure it includes
> +  * all information about subtransactions, which could arrive after
> +  * streaming start.
> +  */
> + if (!txn->is_schema_sent)
> +     snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> +                                          txn, command_id);
> ..
> }
>
> Why are we using the base snapshot here instead of the snapshot we saved
> the first time streaming happened? And as mentioned in the comments,
> won't we need to consider the snapshots for subtransactions that
> arrived after the last time we streamed the changes?
Fixed
>
> 2.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +                                           txn, command_id);
>
> I don't see where the txn->snapshot_now is getting freed. The
> base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
> this getting freed.
I have freed this in ReorderBufferCleanupTXN.
>
> 3.
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> +  * If this is a subxact, we need to stream the top-level transaction
> +  * instead.
> +  */
> + if (txn->toptxn)
> + {
> +     ReorderBufferStreamTXN(rb, txn->toptxn);
> +     return;
> + }
>
> Is it ever possible that we reach here for a subtransaction? If not,
> then it should be an Assert rather than an if condition.
Fixed
>
> 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> fields like origin_id and origin_lsn as we do in ReorderBufferCommit(),
> especially to cover the case when it gets called due to memory
> overflow (aka via ReorderBufferCheckMemoryLimit)?
We only get origin_lsn at commit time, so I am not sure how we can do
that. I have also noticed that currently we are not using origin_lsn
on the subscriber side. I think this needs more investigation: if we
want this, do we need to log it earlier?
>
> v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
> 1.
> @@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
>   if (using_subtxn)
>       RollbackAndReleaseCurrentSubTransaction();
>
> - PG_RE_THROW();
> + /* re-throw only if it's not an abort */
> + if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
> + {
> +     MemoryContextSwitchTo(ecxt);
> +     PG_RE_THROW();
> + }
> + else
> + {
> +     /* remember the command ID and snapshot for the streaming run */
> +     txn->command_id = command_id;
> +     txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +                                               txn, command_id);
> +     rb->stream_stop(rb, txn);
> +
> +     FlushErrorState();
> + }
>
> Can you update comments, either in the above code block or some other
> place, to explain what the concurrent abort problem is and how we dealt
> with it? Also, please explain how the above error handling is
> sufficient to address all the various scenarios (a sub-transaction got
> aborted when we had already sent some changes, or when we had not
> sent any changes yet).
Done
>
> v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> 1.
> + /*
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out
> + */
> + if (TransactionIdIsValid(CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
> + ereport(ERROR,
> + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> + errmsg("transaction aborted during system catalog scan")));
>
> Why can't we use TransactionIdDidAbort here? If we can't use it, then
> can you add a comment stating the reason?
Done
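
To spell out the rationale next to the check itself, here is the same
code annotated (a sketch; the comment wording in the patch may differ):

/*
 * We can't use TransactionIdDidAbort() here: transactions that were
 * running when the server crashed are never explicitly marked as
 * aborted in clog, so DidAbort() can return false for a transaction
 * that is in fact gone.  Testing "not in progress and not committed"
 * covers both explicit aborts and crash aborts.
 */
if (TransactionIdIsValid(CheckXidAlive) &&
    !TransactionIdIsInProgress(CheckXidAlive) &&
    !TransactionIdDidCommit(CheckXidAlive))
    ereport(ERROR,
            (errcode(ERRCODE_TRANSACTION_ROLLBACK),
             errmsg("transaction aborted during system catalog scan")));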
>
> 2.
> /*
> + * An xid value pointing to a possibly ongoing or a prepared transaction.
> + * Currently used in logical decoding. It's possible that such transactions
> + * can get aborted while the decoding is ongoing.
> + */
> +TransactionId CheckXidAlive = InvalidTransactionId;
>
> In comments, there is a mention of a prepared transaction. Do we
> allow prepared transactions to be decoded as part of this patch?
Fixed
>
> 3.
> + /*
> +  * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> +  * error out.
> +  */
> + if (TransactionIdIsValid(CheckXidAlive) &&
> +     !TransactionIdIsInProgress(CheckXidAlive) &&
> +     !TransactionIdDidCommit(CheckXidAlive))
>
> This comment just says what the code below is doing. Can you explain
> the rationale behind this check? It would be better if it were clear
> from the comments why we are doing this check after fetching the
> tuple. I think this can refer to the comment I suggested adding for
> changes in patch
> v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-10 04:51:55 |
Message-ID: | CAFiTN-s+tKU0amysEyj0p=LMR7ooJYZ3-p7O-kJF7rsB9uTHjw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > > ----------------------------------------------------------------------------
> > > > 1.
> > > > xact_desc_invalidations(StringInfo buf,
> > > > {
> > > > ..
> > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > > >
> > > > You have removed logging for the above cache but forgot to remove its
> > > > reference from one of the places. Also, I think you need to add a
> > > > comment somewhere in inval.c to say why you are writing for WAL for
> > > > some types of invalidations and not for others?
> > Done
> >
>
> I don't see any new comments as asked by me.
Done
> I think we should also
> consider WAL logging at each command end instead of doing it piecemeal
> as discussed in another email [1], which will need fewer code changes
> and may perform better. You might want to evaluate the performance of
> both approaches.
Still pending; I will work on this.
>
> > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > > ----------------------------------------------------------------------------------
> > > > 1.
> > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > > PG_CATCH();
> > > > {
> > > > /* TODO: Encapsulate cleanup
> > > > from the PG_TRY and PG_CATCH blocks */
> > > > +
> > > > if (iterstate)
> > > > ReorderBufferIterTXNFinish(rb, iterstate);
> > > >
> > > > Spurious line change.
> > > >
> > Done
>
> + /*
> + * We don't expect direct calls to heap_getnext with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> + elog(ERROR, "improper heap_getnext call");
>
> Earlier, I thought we don't need to check if it is a regular table in
> this check, but it is required because output plugins can try to do
> that, and if they do so during decoding (with historic snapshots), the
> same should not be allowed.
>
> How about changing the error message to "unexpected heap_getnext call
> during logical decoding" or something like that?
Done
>
> > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > I think that needs to be changed.
> > > >
> > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > -------------------------------------------------------------------------
>
> Few comments on v4-0018-Review-comment-fix-and-refactoring:
> 1.
> + if (streaming)
> + {
> + /*
> + * Set the last LSN of the stream as the final_lsn before calling
> + * stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?
We have already agreed on the current implementation.
>
> 2.
> + if (streaming)
> + {
> + /*
> + * Set CheckXidAlive to the current (sub)xid to which this change
> + * belongs, so that we can detect an abort while we are decoding.
> + */
> + CheckXidAlive = change->txn->xid;
> +
> + /* Increment the stream count. */
> + streamed++;
> + }
>
> Is the variable 'streamed' used anywhere?
Removed
>
> 3.
> + /*
> + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
> + * any memory. We could also keep the hash table and update it with
> + * new ctid values, but this seems simpler and good enough for now.
> + */
> + ReorderBufferDestroyTupleCidHash(rb, txn);
>
> Won't this be required only when we are streaming changes?
Fixed
>
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed with you in the email [3].
> b. Complete the handling of schema_sent as discussed above [4].
> c. Few comments by Vignesh and the response on the same by me [5][6].
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset.
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7].
>
> Have I missed anything?
I have worked on most of these items; I will reply to them separately.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-10 05:22:51 |
Message-ID: | CAFiTN-sn-BaLJw9HmbRe+=6Ju7FoMEFftRE=8JzzdoowmX0fcw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have observed some more issues
> >
> > 1. Currently, in ReorderBufferCommit, it is always expected that
> > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
> > SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those
> > two messages can be in different streams. So we need to find a way to
> > handle this. Maybe once we get SPEC_INSERT we can remember the
> > tuple, and then if we get the SPEC_CONFIRM in the next stream we can
> > send that tuple?
> >
>
> Your suggestion makes sense to me. So, we can try it.
I have implemented this and attached it as a separate patch in my
latest patch set [1].
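
The shape of the fix is roughly the following fragment of the
streaming apply loop (a sketch, not the exact patch; in particular,
'specinsert' has to survive stream boundaries, e.g. by being kept in
the ReorderBufferTXN rather than in a local variable, and the real
patch must also free the stashed tuple if the speculative insertion
is abandoned):

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    /* do not send yet; stash the change until the confirmation */
    specinsert = change;
    break;

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    /* send the tuple remembered at SPEC_INSERT (possibly from an
     * earlier stream) as a normal insert */
    Assert(specinsert != NULL);
    change->action = REORDER_BUFFER_CHANGE_INSERT;
    change->data.tp.newtuple = specinsert->data.tp.newtuple;
    rb->stream_change(rb, txn, change);
    specinsert = NULL;
    break;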
>
> > 2. At commit time in DecodeCommit we check whether we need to skip
> > the changes of the transaction or not by calling
> > SnapBuildXactNeedsSkip, but since we now support streaming it's
> > possible that before we decode the commit WAL, we might have already
> > sent the changes to the output plugin, even though we could have
> > skipped those changes. So my question is: instead of checking at
> > commit time, can't we check before adding to the ReorderBuffer itself?
> >
>
> I think if we can do that then the same will be true for the current
> code irrespective of this patch. I think it is possible that we can't
> take that decision while decoding because we haven't assembled a
> consistent snapshot yet. I think we might be able to do that while we
> try to stream the changes. I think we need to take care of all the
> conditions during streaming (when the logical_decoding_work_mem limit
> is reached) as we do in DecodeCommit. This needs a bit more study.
I have analyzed this further and I think we cannot decide all the
conditions even while streaming. IMHO, once we reach
SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer,
so that they can be sent if the transaction commits after we reach
SNAPBUILD_CONSISTENT. However, if we get the commit before we reach
SNAPBUILD_CONSISTENT then we need to ignore the transaction. So even
with SNAPBUILD_FULL_SNAPSHOT we can stream changes which might get
dropped later, and we cannot decide that while streaming.
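
For reference, a simplified sketch of the commit-time skip decision in
DecodeCommit that this refers to (the real code checks additional
conditions, e.g. fast-forward mode and origin filtering):

/* skip transactions that committed before we became consistent */
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
    (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database))
{
    /* discard the queued changes instead of replaying them */
    for (i = 0; i < parsed->nsubxacts; i++)
        ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
    ReorderBufferForget(ctx->reorder, xid, buf->origptr);
    return;
}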
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-10 20:56:49 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
I pushed 0005 (the rbtxn flags thing) after some light editing.
It's been around for long enough ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-10 21:35:41 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Here's a rebase of this patch series. I didn't change anything except
1. disregard what was 0005, since I already pushed it.
2. roll 0003 into 0002.
3. rebase 0007 (now 0005) to account for the reorderbuffer changes.
(I did notice that 0005 adds a new boolean any_data_sent, which is
silly -- it should be another txn_flags bit.)
However, tests don't pass for me; notably, test_decoding crashes.
OTOH I noticed that the streamed transaction support in test_decoding
writes the XID to the output, which is going to make it useless for
regression testing. It probably should not emit the numerical values.
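
One way to fix that, following test_decoding's existing include-xids
option, would be a sketch like this in the stream callbacks:

static void
pg_decode_stream_start(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    TestDecodingData *data = ctx->output_plugin_private;

    OutputPluginPrepareWrite(ctx, true);
    if (data->include_xids)
        appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u",
                         txn->xid);
    else
        appendStringInfo(ctx->out, "opening a streamed block for transaction");
    OutputPluginWrite(ctx, true);
}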
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-10 21:37:07 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-Jan-10, Alvaro Herrera wrote:
> Here's a rebase of this patch series. I didn't change anything except
... this time with attachments ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
v6-0001-Immediately-WAL-log-assignments.patch | text/x-diff | 10.4 KB |
v6-0002-Issue-individual-invalidations-with-wal_level-log.patch | text/x-diff | 16.3 KB |
v6-0003-Extend-the-output-plugin-API-with-stream-methods.patch | text/x-diff | 34.8 KB |
v6-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch | text/x-diff | 13.0 KB |
v6-0005-Implement-streaming-mode-in-ReorderBuffer.patch | text/x-diff | 37.0 KB |
v6-0006-Fix-speculative-insert-bug.patch | text/x-diff | 2.5 KB |
v6-0007-Support-logical_decoding_work_mem-set-from-create.patch | text/x-diff | 13.1 KB |
v6-0008-Add-support-for-streaming-to-built-in-replication.patch | text/x-diff | 91.3 KB |
v6-0009-Track-statistics-for-streaming.patch | text/x-diff | 11.7 KB |
v6-0010-Enable-streaming-for-all-subscription-TAP-tests.patch | text/x-diff | 14.8 KB |
v6-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch | text/x-diff | 1018 bytes |
v6-0012-Add-TAP-test-for-streaming-vs.-DDL.patch | text/x-diff | 4.4 KB |
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-11 02:48:01 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-Jan-10, Alvaro Herrera wrote:
> From 7d671806584fff71067c8bde38b2f642ba1331a9 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilip(dot)kumar(at)enterprisedb(dot)com>
> Date: Wed, 20 Nov 2019 16:41:13 +0530
> Subject: [PATCH v6 10/12] Enable streaming for all subscription TAP tests
This patch turns a lot of tests into streamed mode. While it's
great that streaming mode is tested, we should add new tests for it
rather than failing to keep tests for the non-streamed mode. I suggest
that we add two versions of each test, one for each mode. Maybe the way
to do that is to create some subroutine that can be called twice.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-13 09:48:12 |
Message-ID: | CAFiTN-sFHrCqhOm3+xT909dAhogmdFRsrTCxCOkPRCq0tVBpjw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > I have observed one more design issue.
> > > >
> > >
> > > Good observation.
> > >
> > > > The problem is that when we
> > > > get toasted chunks we remember the changes in memory (hash table)
> > > > but don't stream until we get the actual change on the main table.
> > > > Now, the problem is that we might get the change of the toasted table
> > > > and the main table in different streams. So basically, in a stream,
> > > > if we have only got the toasted tuples then even after
> > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > >
> > >
> > > I think we can't split such changes in a different stream (unless we
> > > design an entirely new solution to send partial changes of toast
> > > data), so we need to send them together. We can keep a flag like
> > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > are able to assemble the entire tuple. Now, whenever we try to
> > > stream the changes once we reach the memory threshold, we can check
> > > whether the data_complete flag is true; if so, then only send the
> > > changes; otherwise, we can pick the next largest transaction. I think
> > > we can retry it a few times and if we get incomplete data for
> > > multiple transactions, then we can decide to spill the transaction or
> > > maybe we can directly spill the first largest transaction which has
> > > incomplete data.
> > >
> > Yeah, we might do something on this line. Basically, we need to mark
> > the top-transaction as data-incomplete if any of its subtransactions
> > has incomplete data (it will always be the latest subtransaction of
> > the top transaction). Also, for streaming we are picking the largest
> > top transaction, whereas for spilling we just need the largest
> > (sub)transaction. So while picking the largest top transaction for
> > streaming, we also need to decide how to spill if we get a few
> > transactions with incomplete data: do we spill all the
> > subtransactions under this top transaction, or do we again find the
> > largest (sub)transaction for spilling?
> >
>
> I think it is better to do the latter, as that will lead to spilling
> only the required changes (the minimum needed to get the memory below
> the threshold).
I think instead of doing this, can't we just spill the changes which
are in the toast_hash? Basically, at the end of the stream we have some
toast tuples which we could not stream because we did not have the
insert for the main table, so we can spill only those changes which
are in the tuple hash. Then, in a subsequent stream, whenever we get
the insert for the main table we can restore those changes and stream
them. We can also maintain a flag saying the data is not complete
(with the LSN of some change), and after that LSN we can spill any
toast change to disk until we get the change for the main table; this
way we can avoid building the tuple hash until we get the change for
the main table.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-14 05:13:54 |
Message-ID: | CAA4eK1LatZkq2Got+DZHm5X=gsD6QxMwe0bJfdUHX3gQyXW0Rw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > > The problem is that when we
> > > > > get toasted chunks we remember the changes in memory (hash table)
> > > > > but don't stream until we get the actual change on the main table.
> > > > > Now, the problem is that we might get the change of the toasted table
> > > > > and the main table in different streams. So basically, in a stream,
> > > > > if we have only got the toasted tuples then even after
> > > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > > >
> > > >
> > > > I think we can't split such changes in a different stream (unless we
> > > > design an entirely new solution to send partial changes of toast
> > > > data), so we need to send them together. We can keep a flag like
> > > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > > are able to assemble the entire tuple. Now, whenever we try to
> > > > stream the changes once we reach the memory threshold, we can check
> > > > whether the data_complete flag is true
Here, we can also consider streaming the changes when data_complete is
false but some additional changes have been added to the txn since the
last attempt, as the new changes might complete the tuple.
> > > > ; if so, then only send the
> > > > changes; otherwise, we can pick the next largest transaction. I think
> > > > we can retry it a few times and if we get incomplete data for
> > > > multiple transactions, then we can decide to spill the transaction or
> > > > maybe we can directly spill the first largest transaction which has
> > > > incomplete data.
> > > >
> > > Yeah, we might do something on this line. Basically, we need to mark
> > > the top-transaction as data-incomplete if any of its subtransactions
> > > has incomplete data (it will always be the latest subtransaction of
> > > the top transaction). Also, for streaming we are picking the largest
> > > top transaction, whereas for spilling we just need the largest
> > > (sub)transaction. So while picking the largest top transaction for
> > > streaming, we also need to decide how to spill if we get a few
> > > transactions with incomplete data: do we spill all the
> > > subtransactions under this top transaction, or do we again find the
> > > largest (sub)transaction for spilling?
> > >
> >
> > I think it is better to do the latter, as that will lead to spilling
> > only the required changes (the minimum needed to get the memory below
> > the threshold).
> I think instead of doing this, can't we just spill the changes which
> are in the toast_hash? Basically, at the end of the stream we have
> some toast tuples which we could not stream because we did not have
> the insert for the main table, so we can spill only those changes
> which are in the tuple hash.
>
Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so. Consider
cases where part of the streamed changes are for toast, and the
remaining are changes which we would have streamed and hence can be
removed. In such cases, we could have easily consumed the remaining
changes for toast without spilling. Also, I am not sure if spilling
changes from the hash table is a good idea, as they are no longer in
the same order as they were in the ReorderBuffer, which means the
order in which we would serialize the changes would differ from the
normal order and that might have some impact, so we would need some
more study if we want to pursue this idea.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-14 05:26:37 |
Message-ID: | CAFiTN-s_GEfcnU-2i4mmr1GWyc7=ukLaVN5RnZ5O=Vqw8m5baA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>
> On 2020-Jan-10, Alvaro Herrera wrote:
>
> > Here's a rebase of this patch series. I didn't change anything except
>
> ... this time with attachments ...
The patch set fails to apply on HEAD, so I have rebased it (on
commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985).
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-21 16:36:07 |
Message-ID: | 20200121163607.7ycuizly7eljci3d@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 14, 2020 at 10:56:37AM +0530, Dilip Kumar wrote:
>On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>>
>> On 2020-Jan-10, Alvaro Herrera wrote:
>>
>> > Here's a rebase of this patch series. I didn't change anything except
>>
>> ... this time with attachments ...
>The patch set fails to apply on the head so rebased. (Rebased on
>commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985)
>
I've noticed the patch has been in WoA state since 2019/12/01, but
there's been quite a lot of traffic on this thread and a bunch of new
patch versions. So I've switched it to "needs review" - if that's not
the right status, let me know.
Also, the patch was moved forward mostly by Amit and Dilip, so I've
added them as authors in the CF app (well, what matters is the commit
message, of course, but let's keep this up to date too).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-22 05:00:25 |
Message-ID: | CAFiTN-s6ADLx8fyRYBbKtY08-eDPaBV+gtKQbn0jm1Uw2gC_oQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > > The problem is that when we
> > > > > > get toasted chunks we remember the changes in memory (hash table)
> > > > > > but don't stream until we get the actual change on the main table.
> > > > > > Now, the problem is that we might get the change of the toasted table
> > > > > > and the main table in different streams. So basically, in a stream,
> > > > > > if we have only got the toasted tuples then even after
> > > > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > > > >
> > > > >
> > > > > I think we can't split such changes in a different stream (unless we
> > > > > design an entirely new solution to send partial changes of toast
> > > > > data), so we need to send them together. We can keep a flag like
> > > > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > > > are able to assemble the entire tuple. Now, whenever we try to
> > > > > stream the changes once we reach the memory threshold, we can check
> > > > > whether the data_complete flag is true
>
> Here, we can also consider streaming the changes when data_complete is
> false but some additional changes have been added to the txn since the
> last attempt, as the new changes might complete the tuple.
>
> > > > > ; if so, then only send the
> > > > > changes; otherwise, we can pick the next largest transaction. I think
> > > > > we can retry it a few times and if we get incomplete data for
> > > > > multiple transactions, then we can decide to spill the transaction or
> > > > > maybe we can directly spill the first largest transaction which has
> > > > > incomplete data.
> > > > >
> > > > Yeah, we might do something on this line. Basically, we need to mark
> > > > the top-transaction as data-incomplete if any of its subtransactions
> > > > has incomplete data (it will always be the latest subtransaction of
> > > > the top transaction). Also, for streaming we are picking the largest
> > > > top transaction, whereas for spilling we just need the largest
> > > > (sub)transaction. So while picking the largest top transaction for
> > > > streaming, we also need to decide how to spill if we get a few
> > > > transactions with incomplete data: do we spill all the
> > > > subtransactions under this top transaction, or do we again find the
> > > > largest (sub)transaction for spilling?
> > > >
> > >
> > > I think it is better to do the latter, as that will lead to spilling
> > > only the required changes (the minimum needed to get the memory below
> > > the threshold).
> > I think instead of doing this, can't we just spill the changes which
> > are in the toast_hash? Basically, at the end of the stream we have
> > some toast tuples which we could not stream because we did not have
> > the insert for the main table, so we can spill only those changes
> > which are in the tuple hash.
> >
>
> Hmm, I think this can turn out to be inefficient because we can easily
> end up spilling the data even when we don't need to do so. Consider
> cases where part of the streamed changes are for toast, and the
> remaining are changes which we would have streamed and hence can be
> removed. In such cases, we could have easily consumed the remaining
> changes for toast without spilling. Also, I am not sure if spilling
> changes from the hash table is a good idea, as they are no longer in
> the same order as they were in the ReorderBuffer, which means the
> order in which we would serialize the changes would differ from the
> normal order and that might have some impact, so we would need some
> more study if we want to pursue this idea.
I have fixed this bug and attached it as a separate patch. I will
merge it into the main patch after we agree on the idea and after
some more testing.

The idea is that whenever we get a toasted chunk, instead of directly
inserting it into the toast hash, I insert it into a local list, so
that if we don't get the change for the main table we can insert
these changes back into txn->changes. Once we get the change for the
main table, I prepare the hash table to merge the chunks. If the
stream is over and we haven't got the changes for the main table, we
mark the txn as having pending toast changes so that next time we
will not pick the same transaction for streaming. This flag is
cleared whenever we get any change for the txn (insert or update).
There is also a possibility that even after we stream the changes,
rb->size is not below logical_decoding_work_mem because we could not
stream all the changes; to handle this, after streaming we recheck
the size, and if it is still not under control then we pick another
transaction. In some cases we might not get any transaction to stream
because every candidate has the pending toast change flag set; in
that case we go for the spill.
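
In pseudo-C, the scheme looks roughly like this (all names here are
illustrative, not the actual patch):

/* on receiving a change during streaming */
if (change_is_toast_chunk(change))
    dlist_push_tail(&txn->pending_toast_chunks, &change->node);
else
{
    /* main-table change: the buffered chunks can now be assembled */
    merge_pending_chunks_into_toast_hash(rb, txn);
    txn->has_pending_toast = false;
    stream_this_change(rb, txn, change);
}

/* at the end of the stream */
if (!dlist_is_empty(&txn->pending_toast_chunks))
{
    /* put the chunks back on txn->changes; skip this txn next time */
    return_pending_chunks_to_changes(rb, txn);
    txn->has_pending_toast = true;
}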
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-22 05:10:40 |
Message-ID: | CAFiTN-vNOGXwq7=QG=wdPez88sBVurXfmoo+d54D16owwaLndQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
Update on the open items
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed with you in the email [3]. -> The first part is done; the second part is an improvement, not a bugfix. I will try to work on it in the next patch set.
> b. Complete the handling of schema_sent as discussed above [4]. -> Done
> c. Few comments by Vignesh and the response on the same by me [5][6]. -> Done
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset. -> Pending
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7]. ->Pending
f. Bug in the toast table handling -> Submitted as a separate POC
patch, which can be merged into the main patch after review and more testing.
> [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
> [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
> [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
> [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-22 16:37:36 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
I looked at this patchset and it seemed natural to apply 0008 next
(adding work_mem to subscriptions). Attached is Dilip's latest version,
plus my review changes. This will break the patch tester's logic; sorry
about that.
What part of this change is it that sets the process's
logical_decoding_work_mem to the given value? I was unable to figure
that out. Is it missing, or am I just stupid?
Changes:
* the patch adds logical_decoding_work_mem SGML, but that has already
been applied (cec2edfa7859); remove dupe.
* parse_subscription_options() comment says that it will raise an error if a
caller does not pass the pointer for an option but option list
specifies that option. It does not really implement that behavior (an
existing problem): instead, if the pointer is not passed, the option
is ignored. Moreover, this new patch continued to fail to handle
things as the comment says. I decided to implement the documented
behavior instead; it's now inconsistent with how the other options are
implemented. I think we should fix the other options to behave as the
comment says, because it's a more convenient API; if we instead opted
to update the code comment to match the code, each caller would have
to be checked to verify that the correct options are passed, which is
pointless and error prone.
* the parse_subscription_options API is a mess. I reordered the
arguments a little bit; also change the argument layout in callers so
that each caller is grouped more sensibly. Also added comments to
simplify reading the argument lists. I think this could be fixed by
using an ad-hoc struct to pass in and out. Didn't get around to doing
that, seems an unrelated potential improvement.
* trying to do own range checking in pgoutput and subscriptioncmds.c
seems pointless and likely to get out of sync with guc.c. Simpler is
to call set_config_option() to verify that the argument is in range.
(Note a further problem in the patch series: the range check in
subscriptioncmds.c is only added in patch 0009).
* parsing integers using scanint8() seemed weird (the error messages
there do not correspond to what we want). After a couple of false
starts, I decided to rely on guc.c's set_config_option() followed by
parse_int(). That also has the benefit that you can give it units.
(A sketch of this approach appears after this list.)
* psql \dRs+ should display the work_mem; patch failed to do that.
Added. Unit display is done by pg_size_pretty(), which might be
different from what guc.c does, but I think it works OK.
It's the first place where we use pg_size_pretty to show a memory
limit, however.
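
A sketch of the guc.c-based validation mentioned above (assuming the
current set_config_option() signature; with changeVal = false it only
validates, and elevel = ERROR makes it report bad input for us):

char       *valuestr = defGetString(defel);
int         work_mem_kb;

/* let guc.c do the range check; nothing is actually assigned */
(void) set_config_option("logical_decoding_work_mem", valuestr,
                         PGC_BACKEND, PGC_S_TEST,
                         GUC_ACTION_SET,
                         false,     /* changeVal: only validate */
                         ERROR,     /* elevel: report bad input */
                         false);    /* is_reload */

/* parse_int also accepts units such as kB and MB */
if (!parse_int(valuestr, &work_mem_kb, GUC_UNIT_KB, NULL))
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("invalid value for logical_decoding_work_mem: \"%s\"",
                    valuestr)));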
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Dilip-s-original.patch | text/x-diff | 13.0 KB |
0002-Changes-by-lvaro.patch | text/x-diff | 9.9 KB |
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-23 03:37:23 |
Message-ID: | CAA4eK1LH7xzF+-qHRv9EDXQTFYjPUYZw5B7FSK9QLEg7F603OQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Jan 22, 2020 at 10:07 PM Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
>
> I looked at this patchset and it seemed natural to apply 0008 next
> (adding work_mem to subscriptions).
>
I am not so sure whether we need this patch, as the exact scenario
where it can help is not very clear to me, and neither has anyone
explained it. I have raised this concern earlier as well [1]. The point
is that 'logical_decoding_work_mem' applies to the entire
ReorderBuffer on the publisher's side, so how will a parameter from a
particular subscription help with that?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-28 05:57:47 |
Message-ID: | CAA4eK1Li8L8-Ja-5d9DixV7Mwk6HJW3Z4rOzk0hYJ7PqY5XC-A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > Hmm, I think this can turn out to be inefficient because we can easily
> > end up spilling the data even when we don't need to do so. Consider
> > cases where part of the streamed changes are for toast, and the
> > remaining are changes which we would have streamed and hence can be
> > removed. In such cases, we could have easily consumed the remaining
> > changes for toast without spilling. Also, I am not sure if spilling
> > changes from the hash table is a good idea, as they are no longer in
> > the same order as they were in the ReorderBuffer, which means the
> > order in which we would serialize the changes would differ from the
> > normal order and that might have some impact, so we would need some
> > more study if we want to pursue this idea.
> I have fixed this bug and attached it as a separate patch. I will
> merge it into the main patch after we agree on the idea and after
> some more testing.
>
> > The idea is that whenever we get a toasted chunk, instead of directly
> > inserting it into the toast hash, I insert it into a local list, so
> > that if we don't get the change for the main table we can insert
> > these changes back into txn->changes. Once we get the change for the
> > main table, I prepare the hash table to merge the chunks.
>
I think this idea will work but appears to be quite costly because (a)
you might need to serialize/deserialize the changes multiple times and
might attempt streaming multiple times even though you can't do it,
and (b) you need to remove/add the same set of changes from the main
list multiple times.
It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed. One idea to make
it work is that we identify it while decoding the WAL. I think we
need to set a bit in the insert/delete WAL record to identify if the
tuple belongs to a toast relation. This won't add any additional
overhead to the WAL, will remove a lot of complexity from the logical
decoding, and decoding will be efficient. If this is feasible, then
we can do the same for speculative insertions.
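
In heap_insert's WAL construction the idea could look like this
(XLH_INSERT_ON_TOAST_RELATION is a hypothetical flag name, not an
existing one):

/* hypothetical: mark WAL for tuples going into a toast relation */
if (RelationIsLogicallyLogged(relation))
{
    xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
    if (IsToastRelation(relation))
        xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;    /* assumed name */
}

Decoding could then test the flag on the record without any catalog
access.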
In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
below change required?
--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,3 +1,4 @@
wal_level = logical
max_replication_slots = 4
logical_decoding_work_mem = 64kB
+logging_collector=on
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-28 06:04:42 |
Message-ID: | CAFiTN-urib9G+C9QciYeiuC_qRvxusXFO5uDMUcyGQUaSRotdQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > Hmm, I think this can turn out to be inefficient because we can easily
> > > end up spilling the data even when we don't need to do so. Consider
> > > cases where part of the streamed changes are for toast, and the
> > > remaining are changes which we would have streamed and hence can be
> > > removed. In such cases, we could have easily consumed the remaining
> > > changes for toast without spilling. Also, I am not sure if spilling
> > > changes from the hash table is a good idea, as they are no longer in
> > > the same order as they were in the ReorderBuffer, which means the
> > > order in which we would serialize the changes would differ from the
> > > normal order and that might have some impact, so we would need some
> > > more study if we want to pursue this idea.
> > I have fixed this bug and attached it as a separate patch. I will
> > merge it into the main patch after we agree on the idea and after
> > some more testing.
> >
> > > The idea is that whenever we get a toasted chunk, instead of directly
> > > inserting it into the toast hash, I insert it into a local list, so
> > > that if we don't get the change for the main table we can insert
> > > these changes back into txn->changes. Once we get the change for the
> > > main table, I prepare the hash table to merge the chunks.
> >
>
>
> I think this idea will work but appears to be quite costly because (a)
> you might need to serialize/deserialize the changes multiple times and
> might attempt streaming multiple times even though you can't do it,
> and (b) you need to remove/add the same set of changes from the main
> list multiple times.
I agree with this.
>
> It seems to me that we need to add all of this new handling because,
> while taking the decision whether to stream or not, we don't know
> whether the txn has changes that can't be streamed. One idea to make
> it work is that we identify it while decoding the WAL. I think we
> need to set a bit in the insert/delete WAL record to identify if the
> tuple belongs to a toast relation. This won't add any additional
> overhead to the WAL, will remove a lot of complexity from the logical
> decoding, and decoding will be efficient. If this is feasible, then
> we can do the same for speculative insertions.
The idea looks good to me. I will work on this.
>
> In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
> below change required?
>
> --- a/contrib/test_decoding/logical.conf
> +++ b/contrib/test_decoding/logical.conf
> @@ -1,3 +1,4 @@
> wal_level = logical
> max_replication_slots = 4
> logical_decoding_work_mem = 64kB
> +logging_collector=on
Sorry, these are some local changes which got included in the patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-28 06:12:58 |
Message-ID: | CAA4eK1+UA1UJrfw=12gXwhDW=oTc2L-o8Hjp2m20Kt2MNeP=jQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > >
> > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > end up spilling the data even when we don't need to do so. Consider
> > > > cases where part of the streamed changes are for toast, and the
> > > > remaining are changes which we would have streamed and hence can be
> > > > removed. In such cases, we could have easily consumed the remaining
> > > > changes for toast without spilling. Also, I am not sure if spilling
> > > > changes from the hash table is a good idea, as they are no longer in
> > > > the same order as they were in the ReorderBuffer, which means the
> > > > order in which we would serialize the changes would differ from the
> > > > normal order and that might have some impact, so we would need some
> > > > more study if we want to pursue this idea.
> > > I have fixed this bug and attached it as a separate patch. I will
> > > merge it into the main patch after we agree on the idea and after
> > > some more testing.
> > >
> > > > The idea is that whenever we get a toasted chunk, instead of directly
> > > > inserting it into the toast hash, I insert it into a local list, so
> > > > that if we don't get the change for the main table we can insert
> > > > these changes back into txn->changes. Once we get the change for the
> > > > main table, I prepare the hash table to merge the chunks.
> > >
> >
> >
> > > I think this idea will work but appears to be quite costly because (a)
> > > you might need to serialize/deserialize the changes multiple times and
> > > might attempt streaming multiple times even though you can't do it,
> > > and (b) you need to remove/add the same set of changes from the main
> > > list multiple times.
> I agree with this.
> >
> > > It seems to me that we need to add all of this new handling because,
> > > while taking the decision whether to stream or not, we don't know
> > > whether the txn has changes that can't be streamed. One idea to make
> > > it work is that we identify it while decoding the WAL. I think we
> > > need to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation. This won't add any additional
> > > overhead to the WAL, will remove a lot of complexity from the logical
> > > decoding, and decoding will be efficient. If this is feasible, then
> > > we can do the same for speculative insertions.
> The idea looks good to me. I will work on this.
>
One more thing we can do is to identify whether the tuple belongs to
a toast relation while decoding it. However, I think to do that we
need to have access to the relcache at that time, and that might add
some overhead as we would need to do it for each tuple. Can we
investigate what it will take to do that, and whether it is better
than setting a bit during WAL logging?
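For illustration, the per-tuple check could look roughly like the
sketch below (this is only an assumption of the shape, not code from
any patch, and it presumes a valid historic snapshot is set up):

    /* Hypothetical check inside the decoding path. */
    Oid         reloid;
    Relation    relation;
    bool        is_toast = false;

    reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                change->data.tp.relnode.relNode);
    if (OidIsValid(reloid))
    {
        relation = RelationIdGetRelation(reloid);
        is_toast = IsToastRelation(relation);
        RelationClose(relation);
    }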
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-28 06:28:02 |
Message-ID: | CAFiTN-vm0BKNmN77L-2+CvQ3xRfg22mhGH8y89RbRjyJjbJ5PA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > >
> > > > > Hmm, I think this can turn out to be inefficient because we can
> > > > > easily end up spilling the data even when we don't need to do so.
> > > > > Consider cases where part of the streamed changes are for toast,
> > > > > and the remaining are changes which we would have streamed and
> > > > > hence can be removed. In such cases, we could have easily consumed
> > > > > the remaining changes for toast without spilling. Also, I am not
> > > > > sure if spilling changes from the hash table is a good idea, as
> > > > > they are no longer in the same order as they were in the
> > > > > ReorderBuffer, which means the order in which we normally
> > > > > serialize the changes would change, and that might have some
> > > > > impact, so we would need some more study if we want to pursue
> > > > > this idea.
> > > > I have fixed this bug and attached it as a separate patch. I will
> > > > merge it into the main patch after we agree on the idea and after
> > > > some more testing.
> > > >
> > > > The idea is that whenever we get a toasted chunk, instead of
> > > > directly inserting it into the toast hash, I insert it into a local
> > > > list, so that if we don't get the change for the main table we can
> > > > insert these changes back into txn->changes. Once we get the change
> > > > for the main table, I prepare the hash table to merge the chunks.
> > > >
> > >
> > >
> > > I think this idea will work, but it appears to be quite costly
> > > because (a) you might need to serialize/deserialize the changes
> > > multiple times and might attempt streaming multiple times even
> > > though you can't do so, and (b) you need to remove/add the same set
> > > of changes from the main list multiple times.
> > I agree with this.
> > >
> > > It seems to me that we need to add all of this new handling because,
> > > when taking the decision whether to stream or not, we don't know
> > > whether the txn has changes that can't be streamed. One idea to make
> > > it work is to identify this while decoding the WAL. I think we need
> > > to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation. This won't add any additional
> > > overhead in WAL, will reduce a lot of complexity in the logical
> > > decoding, and decoding will also be efficient. If this is feasible,
> > > then we can do the same for speculative insertions.
> > The idea looks good to me. I will work on this.
> >
>
> One more thing we can do is to identify whether the tuple belongs to
> a toast relation while decoding it. However, I think to do that we
> need to have access to the relcache at that time, and that might add
> some overhead as we would need to do it for each tuple. Can we
> investigate what it will take to do that, and whether it is better
> than setting a bit during WAL logging?
IMHO, for the catalog scan we would have to start/stop a transaction
for each change. So do you want us to evaluate its performance?
Also, when we get the change we might not have the complete historic
snapshot ready to fetch the relcache entry.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-28 08:00:29 |
Message-ID: | CAA4eK1+_kLwm_hCKJSHfK5u8rkguMNRh1+MCSi1J8DrHzhAPSg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > > > It seems to me that we need to add all of this new handling because,
> > > > when taking the decision whether to stream or not, we don't know
> > > > whether the txn has changes that can't be streamed. One idea to make
> > > > it work is to identify this while decoding the WAL. I think we need
> > > > to set a bit in the insert/delete WAL record to identify if the
> > > > tuple belongs to a toast relation. This won't add any additional
> > > > overhead in WAL, will reduce a lot of complexity in the logical
> > > > decoding, and decoding will also be efficient. If this is feasible,
> > > > then we can do the same for speculative insertions.
> > > The idea looks good to me. I will work on this.
> > >
> >
> > One more thing we can do is to identify whether the tuple belongs to
> > a toast relation while decoding it. However, I think to do that we
> > need to have access to the relcache at that time, and that might add
> > some overhead as we would need to do it for each tuple. Can we
> > investigate what it will take to do that, and whether it is better
> > than setting a bit during WAL logging?
>
> IMHO, for the catalog scan we would have to start/stop a transaction
> for each change. So do you want us to evaluate its performance?
>
No, I was not thinking about each change, but at the level of ReorderBufferTXN.
> Also, when we get the change we might not have the complete historic
> snapshot ready to fetch the relcache entry.
>
Before decoding each change (say DecodeInsert), we call
SnapBuildProcessChange. Isn't that sufficient?
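(For reference, the call pattern in decode.c is roughly the
following:)

    case XLOG_HEAP_INSERT:
        if (SnapBuildProcessChange(builder, xid, buf->origptr))
            DecodeInsert(ctx, buf);
        break;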
Even if the above is possible, I am not sure how good it is to fetch
the relcache entry for each change; that is the point I was worried
about.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-28 08:25:11 |
Message-ID: | CAFiTN-v_B9-_BC1_45kUyhpCA_t08fAw1uicKq_McnggNWZP5A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > > > It seems to me that we need to add all of this new handling because,
> > > > > when taking the decision whether to stream or not, we don't know
> > > > > whether the txn has changes that can't be streamed. One idea to make
> > > > > it work is to identify this while decoding the WAL. I think we need
> > > > > to set a bit in the insert/delete WAL record to identify if the
> > > > > tuple belongs to a toast relation. This won't add any additional
> > > > > overhead in WAL, will reduce a lot of complexity in the logical
> > > > > decoding, and decoding will also be efficient. If this is feasible,
> > > > > then we can do the same for speculative insertions.
> > > > The idea looks good to me. I will work on this.
> > > >
> > >
> > > One more thing we can do is to identify whether the tuple belongs to
> > > a toast relation while decoding it. However, I think to do that we
> > > need to have access to the relcache at that time, and that might add
> > > some overhead as we would need to do it for each tuple. Can we
> > > investigate what it will take to do that, and whether it is better
> > > than setting a bit during WAL logging?
> >
> > IMHO, for the catalog scan we would have to start/stop a transaction
> > for each change. So do you want us to evaluate its performance?
> >
>
> No, I was not thinking about each change, but at the level of ReorderBufferTXN.
That means we will have to keep that transaction open until we decode
the commit WAL for that ReorderBufferTXN, or do you have anything
else in mind?
>
> > Also, when we get the change we might not have the complete historic
> > snapshot ready to fetch the relcache entry.
> >
>
> Before decoding each change (say DecodeInsert), we call
> SnapBuildProcessChange. Isn't that sufficient?
Yeah, right, we can get the relcache entry based on the base
snapshot, and that might be sufficient to know whether it's a toast
relation or not.
>
> Even if the above is possible, I am not sure how good it is to fetch
> the relcache entry for each change; that is the point I was worried
> about.
We might not need to scan the catalog every time; we might get it
from the cache itself.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-28 09:06:16 |
Message-ID: | CAA4eK1Lpf7zLnGG8m15TGyQxkFyV0WGTWS+NrXNg1bi5nMfwUQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 28, 2020 at 1:55 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > > > It seems to me that we need to add all of this new handling because,
> > > > > > when taking the decision whether to stream or not, we don't know
> > > > > > whether the txn has changes that can't be streamed. One idea to make
> > > > > > it work is to identify this while decoding the WAL. I think we need
> > > > > > to set a bit in the insert/delete WAL record to identify if the
> > > > > > tuple belongs to a toast relation. This won't add any additional
> > > > > > overhead in WAL, will reduce a lot of complexity in the logical
> > > > > > decoding, and decoding will also be efficient. If this is feasible,
> > > > > > then we can do the same for speculative insertions.
> > > > > The idea looks good to me. I will work on this.
> > > > >
> > > >
> > > > One more thing we can do is to identify whether the tuple belongs to
> > > > a toast relation while decoding it. However, I think to do that we
> > > > need to have access to the relcache at that time, and that might add
> > > > some overhead as we would need to do it for each tuple. Can we
> > > > investigate what it will take to do that, and whether it is better
> > > > than setting a bit during WAL logging?
> > >
> > > IMHO, for the catalog scan we would have to start/stop a transaction
> > > for each change. So do you want us to evaluate its performance?
> > >
> >
> > No, I was not thinking about each change, but at the level of ReorderBufferTXN.
> That means we will have to keep that transaction open until we decode
> the commit WAL for that ReorderBufferTXN, or do you have anything
> else in mind?
>
or probably till we start streaming.
> >
> > > Also, when we get the change we might not have the complete historic
> > > snapshot ready to fetch the relcache entry.
> > >
> >
> > Before decoding each change (say DecodeInsert), we call
> > SnapBuildProcessChange. Isn't that sufficient?
> Yeah, right, we can get the relcache entry based on the base
> snapshot, and that might be sufficient to know whether it's a toast
> relation or not.
> >
> > Even if the above is possible, I am not sure how good it is to fetch
> > the relcache entry for each change; that is the point I was worried
> > about.
>
> We might not need to scan the catalog every time; we might get it
> from the cache itself.
>
Right, but I am not completely sure if that is better than setting a
bit in the WAL record for toast tuples.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-30 10:41:44 |
Message-ID: | CAA4eK1KbvOAiUnQQxtXAXSQAAroU9h0msp6KXrysYOEVHb34vA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
>
> >
> > Few more comments:
> > --------------------------------
> > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > 1.
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > + /*
> > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > + * information about
> > subtransactions, which could arrive after streaming start.
> > + */
> > + if (!txn->is_schema_sent)
> > + snapshot_now
> > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > + txn,
> > command_id);
> > ..
> > }
> >
> > Why are we using base snapshot here instead of the snapshot we saved
> > the first time streaming has happened? And as mentioned in comments,
> > won't we need to consider the snapshots for subtransactions that
> > arrived after the last time we have streamed the changes?
> Fixed
>
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * We can not use txn->snapshot_now directly because after we there
+ * might be some new sub-transaction which after the last streaming run
+ * so we need to add those sub-xip in the snapshot.
+ */
+ snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+ txn, command_id);
"because after we there", you seem to forget a word between 'we' and
'there'. So as we are copying it now, does this mean it will consider
the snapshots for subtransactions that arrived after the last time we
have streamed the changes? If so, have you tested it and can we add
the same in comments.
Also, if we need to copy the snapshot here, then do we need to copy
it again in ReorderBufferProcessTXN (in the below code and in the
catch block in the same function)?
{
..
+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+ txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
..
}
> >
> > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> > fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> > especially to cover the case when it gets called due to memory
> > overflow (aka via ReorderBufferCheckMemoryLimit).
> We get origin_lsn during commit time, so I am not sure how we can do
> that. I have also noticed that currently we are not using origin_lsn
> on the subscriber side. I think it needs more investigation: if we
> want this, do we need to log it early?
>
Have you done any investigation of this point? You might want to look
at the pg_replication_origin* APIs. Today, again looking at this
code, I think that with the current coding it won't be used even when
we encounter the commit record, because ReorderBufferCommit calls
ReorderBufferStreamCommit, which will make sure that origin_id and
origin_lsn are never sent. I think at least that should be fixed; if
not, we probably need a comment with the reasoning for why we think
it is okay not to do it in this case.
+ /*
+ * If we are streaming the in-progress transaction then Discard the
/Discard/discard
> >
> > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> > 1.
> > + /*
> > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> > + * error out
> > + */
> > + if (TransactionIdIsValid(CheckXidAlive) &&
> > + !TransactionIdIsInProgress(CheckXidAlive) &&
> > + !TransactionIdDidCommit(CheckXidAlive))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> > + errmsg("transaction aborted during system catalog scan")));
> >
> > Why can't we use TransactionIdDidAbort here? If we can't use it, then
> > can you add comments stating the reason for the same.
> Done
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out. Instead of directly checking the abort status we do check
+ * if it is not in progress transaction and no committed. Because if there
+ * were a system crash then status of the the transaction which were running
+ * at that time might not have marked. So we need to consider them as
+ * aborted. Refer detailed comments at snapmgr.c where the variable is
+ * declared.
How about replacing the above comment with below one:
If CheckXidAlive is valid, then we check if it aborted. If it did, we
error out. We can't directly use TransactionIdDidAbort as after crash
such transaction might not have been marked as aborted. See detailed
comments at snapmgr.c where the variable is declared.
I am not able to understand the change in
v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have
any explanation for the same?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-30 12:39:52 |
Message-ID: | CAFiTN-uJM0m8Nj__+ZRj+_t06NZKyk117v15HsErL4sshXkT9Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> >
> > >
> > > Few more comments:
> > > --------------------------------
> > > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > > 1.
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > > + * information about
> > > subtransactions, which could arrive after streaming start.
> > > + */
> > > + if (!txn->is_schema_sent)
> > > + snapshot_now
> > > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > > + txn,
> > > command_id);
> > > ..
> > > }
> > >
> > > Why are we using base snapshot here instead of the snapshot we saved
> > > the first time streaming has happened? And as mentioned in comments,
> > > won't we need to consider the snapshots for subtransactions that
> > > arrived after the last time we have streamed the changes?
> > Fixed
> >
>
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * We can not use txn->snapshot_now directly because after we there
> + * might be some new sub-transaction which after the last streaming run
> + * so we need to add those sub-xip in the snapshot.
> + */
> + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
> + txn, command_id);
>
> "because after we there", you seem to forget a word between 'we' and
> 'there'. So as we are copying it now, does this mean it will consider
> the snapshots for subtransactions that arrived after the last time we
> have streamed the changes? If so, have you tested it and can we add
> the same in comments.
Ok
> Also, if we need to copy the snapshot here, then do we need to copy
> it again in ReorderBufferProcessTXN (in the below code and in the
> catch block in the same function)?
I think so, because as part of the
"REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might directly
point to the snapshot, and that will get truncated when we truncate
all the changes of the ReorderBufferTXN. So I think we can check: if
snapshot_now->copied is true then we can avoid copying, otherwise we
can copy?
Other comments look fine to me so I will reply to them along with the
next version of the patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-01-31 02:38:01 |
Message-ID: | CAA4eK1KXf1ACLpAxQv6oFj_L9JEFjdJpPstuduD7b4zXPMsD5g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > Also, if we need to copy the snapshot here, then do we need to copy
> > it again in ReorderBufferProcessTXN (in the below code and in the
> > catch block in the same function)?
> I think so, because as part of the
> "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might directly
> point to the snapshot, and that will get truncated when we truncate
> all the changes of the ReorderBufferTXN. So I think we can check: if
> snapshot_now->copied is true then we can avoid copying, otherwise we
> can copy?
>
Yeah, that makes sense, but I think then we also need to ensure that
ReorderBufferStreamTXN frees the snapshot only when it is copied. It
seems to me it should always be copied at the place where we are
trying to free it, so probably we should have an Assert there.
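Something like the below sketch is what I have in mind (illustrative
only):

    /* The snapshot must have been copied by this point. */
    Assert(snapshot_now->copied);
    ReorderBufferFreeSnap(rb, snapshot_now);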
One more thing:
ReorderBufferProcessTXN()
{
..
+ if (streaming)
+ {
+ /*
+ * While streaming an in-progress transaction there is a
+ * possibility that the (sub)transaction might get aborted
+ * concurrently. In such case if the (sub)transaction has
+ * catalog update then we might decode the tuple using wrong
+ * catalog version. So for detecting the concurrent abort we
+ * set CheckXidAlive to the current (sub)transaction's xid for
+ * which this change belongs to. And, during catalog scan we
+ * can check the status of the xid and if it is aborted we will
+ * report an specific error which we can ignore. We might have
+ * already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the
+ * abort we will stream abort message to truncate the changes in
+ * the subscriber.
+ */
+ CheckXidAlive = change->txn->xid;
+ }
..
}
I think it is better to move the above code into an inline function
(something like SetXidAlive). It will make the code in
ReorderBufferProcessTXN look cleaner and easier to understand.
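For example, the helper could be as simple as this sketch (the name
is the one suggested above; everything else is illustrative):

    /*
     * Remember the xid of the (sub)transaction whose change we are
     * about to decode, so catalog scans can detect a concurrent abort.
     */
    static inline void
    SetXidAlive(TransactionId xid)
    {
        CheckXidAlive = xid;
    }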
> Other comments look fine to me so I will reply to them along with the
> next version of the patch.
>
Okay, thanks.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-01 05:35:42 |
Message-ID: | CAA4eK1+Y=3vwuunhYvJCop2MKi4R4LPx781Cgt+yT5zWNLFxMA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> Other comments look fine to me so I will reply to them along with the
> next version of the patch.
>
This still needs more work, so I have moved this to the next CF.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-03 04:21:12 |
Message-ID: | CAA4eK1JfKxj7fpaUAO3eqjRU54wUyRJ8TaFaVK8aQRw24vyP6g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > > 2. At commit time, in DecodeCommit, we check whether we need to skip
> > > the changes of the transaction or not by calling
> > > SnapBuildXactNeedsSkip, but since we now support streaming it's
> > > possible that before we decode the commit WAL we have already sent
> > > the changes to the output plugin, even though we could have skipped
> > > those changes. So my question is: instead of checking at commit
> > > time, can't we check before adding to the ReorderBuffer itself?
> > >
> >
> > I think if we can do that then the same will be true for the current
> > code, irrespective of this patch. I think it is possible that we
> > can't take that decision while decoding because we haven't assembled
> > a consistent snapshot yet. We might be able to do that while we try
> > to stream the changes. I think we need to take care of all the
> > conditions during streaming (when the logical_decoding_work_mem limit
> > is reached) as we do in DecodeCommit. This needs a bit more study.
>
> I have analyzed this further, and I think we cannot decide all the
> conditions even while streaming. IMHO, once we reach
> SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer,
> so that we can send them if the commit of the transaction arrives
> after we reach SNAPBUILD_CONSISTENT. However, if we get the commit
> before we reach SNAPBUILD_CONSISTENT, then we need to ignore this
> transaction. Now, even with SNAPBUILD_FULL_SNAPSHOT we may stream
> changes which might get dropped later, and that we cannot decide
> while streaming.
>
This makes sense to me, but when we are streaming we should add a
comment saying why we can't skip, similar to how we do at commit
time, because of the above reason described by you. Also, what about
the other conditions where we can skip the transaction, basically
cases like (a) when the transaction happened in another database, (b)
when the output plugin is not interested in the origin, and (c) when
we are doing fast-forwarding?
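(For reference, the commit-time checks being referred to look roughly
like this in DecodeCommit, paraphrased:)

    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
        (parsed->dbId != InvalidOid &&
         parsed->dbId != ctx->slot->data.database) ||
        ctx->fast_forward || FilterByOrigin(ctx, origin_id))
    {
        /* skip replaying this transaction's changes */
    }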
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-04 05:20:39 |
Message-ID: | CAFiTN-uAprGM5b=ydAj_2m_Yk0MYKCNN6crH9NbWHAeCAN8+RA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Feb 3, 2020 at 9:51 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > > 2. At commit time, in DecodeCommit, we check whether we need to skip
> > > > the changes of the transaction or not by calling
> > > > SnapBuildXactNeedsSkip, but since we now support streaming it's
> > > > possible that before we decode the commit WAL we have already sent
> > > > the changes to the output plugin, even though we could have skipped
> > > > those changes. So my question is: instead of checking at commit
> > > > time, can't we check before adding to the ReorderBuffer itself?
> > > >
> > >
> > > I think if we can do that then the same will be true for the current
> > > code, irrespective of this patch. I think it is possible that we
> > > can't take that decision while decoding because we haven't assembled
> > > a consistent snapshot yet. We might be able to do that while we try
> > > to stream the changes. I think we need to take care of all the
> > > conditions during streaming (when the logical_decoding_work_mem limit
> > > is reached) as we do in DecodeCommit. This needs a bit more study.
> >
> > I have analyzed this further, and I think we cannot decide all the
> > conditions even while streaming. IMHO, once we reach
> > SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer,
> > so that we can send them if the commit of the transaction arrives
> > after we reach SNAPBUILD_CONSISTENT. However, if we get the commit
> > before we reach SNAPBUILD_CONSISTENT, then we need to ignore this
> > transaction. Now, even with SNAPBUILD_FULL_SNAPSHOT we may stream
> > changes which might get dropped later, and that we cannot decide
> > while streaming.
> >
>
> This makes sense to me, but when we are streaming we should add a
> comment saying why we can't skip, similar to how we do at commit
> time, because of the above reason described by you. Also, what about
> the other conditions where we can skip the transaction, basically
> cases like (a) when the transaction happened in another database, (b)
> when the output plugin is not interested in the origin, and (c) when
> we are doing fast-forwarding?
I will analyze those and fix them in my next version of the patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-04 05:29:51 |
Message-ID: | CAFiTN-uJTHgBogP-C8fXpuqSpWvmrgYRLFH=V0FH-hwQ75eXGQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > >
> > > > > Hmm, I think this can turn out to be inefficient because we can
> > > > > easily end up spilling the data even when we don't need to do so.
> > > > > Consider cases where part of the streamed changes are for toast,
> > > > > and the remaining are changes which we would have streamed and
> > > > > hence can be removed. In such cases, we could have easily consumed
> > > > > the remaining changes for toast without spilling. Also, I am not
> > > > > sure if spilling changes from the hash table is a good idea, as
> > > > > they are no longer in the same order as they were in the
> > > > > ReorderBuffer, which means the order in which we normally
> > > > > serialize the changes would change, and that might have some
> > > > > impact, so we would need some more study if we want to pursue
> > > > > this idea.
> > > > I have fixed this bug and attached it as a separate patch. I will
> > > > merge it into the main patch after we agree on the idea and after
> > > > some more testing.
> > > >
> > > > The idea is that whenever we get a toasted chunk, instead of
> > > > directly inserting it into the toast hash, I insert it into a local
> > > > list, so that if we don't get the change for the main table we can
> > > > insert these changes back into txn->changes. Once we get the change
> > > > for the main table, I prepare the hash table to merge the chunks.
> > > >
> > >
> > >
> > > I think this idea will work, but it appears to be quite costly
> > > because (a) you might need to serialize/deserialize the changes
> > > multiple times and might attempt streaming multiple times even
> > > though you can't do so, and (b) you need to remove/add the same set
> > > of changes from the main list multiple times.
> > I agree with this.
> > >
> > > It seems to me that we need to add all of this new handling because,
> > > when taking the decision whether to stream or not, we don't know
> > > whether the txn has changes that can't be streamed. One idea to make
> > > it work is to identify this while decoding the WAL. I think we need
> > > to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation. This won't add any additional
> > > overhead in WAL, will reduce a lot of complexity in the logical
> > > decoding, and decoding will also be efficient. If this is feasible,
> > > then we can do the same for speculative insertions.
> > The idea looks good to me. I will work on this.
> >
>
> One more thing we can do is to identify whether the tuple belongs to
> a toast relation while decoding it. However, I think to do that we
> need to have access to the relcache at that time, and that might add
> some overhead as we would need to do it for each tuple. Can we
> investigate what it will take to do that, and whether it is better
> than setting a bit during WAL logging?
>
I have done some more analysis on this, and it appears that there are
a few problems in doing this. Basically, once we get the confirmed
flush location, we advance replication_slot_catalog_xmin so that
vacuum can garbage-collect the old tuples. So the problem is that
while we are collecting the changes in the ReorderBuffer, the catalog
rows for our version might have been removed, and we might not find
any relation entry with that relfilenode id (because it is dropped or
altered in the future).
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-05 03:57:41 |
Message-ID: | CAA4eK1KMSuED6u6f9+RhmAP+5Gns6S8aWVtAfnHA55v22q_dZQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > One more thing we can do is to identify whether the tuple belongs to
> > a toast relation while decoding it. However, I think to do that we
> > need to have access to the relcache at that time, and that might add
> > some overhead as we would need to do it for each tuple. Can we
> > investigate what it will take to do that, and whether it is better
> > than setting a bit during WAL logging?
> >
> I have done some more analysis on this, and it appears that there are
> a few problems in doing this. Basically, once we get the confirmed
> flush location, we advance replication_slot_catalog_xmin so that
> vacuum can garbage-collect the old tuples. So the problem is that
> while we are collecting the changes in the ReorderBuffer, the catalog
> rows for our version might have been removed, and we might not find
> any relation entry with that relfilenode id (because it is dropped or
> altered in the future).
>
Hmm, this means this can also occur while streaming the changes. The
main reason, as I understand it, is that before decoding the commit
we don't know whether these changes have already been sent to the
subscriber (based on confirmed_flush_location/start_decoding_at). I
think it is better to skip streaming such transactions as we can't
make the right decision about them, and as this can generally happen
after a crash for the first few transactions, it shouldn't matter
much if we serialize such transactions instead of streaming them.
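A rough sketch of what I mean in the memory-limit path (the helper
name txn_is_streamable is an assumption, not from the patch):

    /*
     * In ReorderBufferCheckMemoryLimit(): serialize instead of
     * streaming when we cannot yet make the right decision for this
     * transaction.
     */
    if (txn_is_streamable(rb, txn))
        ReorderBufferStreamTXN(rb, txn);
    else
        ReorderBufferSerializeTXN(rb, txn);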
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-05 04:12:10 |
Message-ID: | CAFiTN-skHvSWDHV66qpzMfnHH6AvsE2YAjvh4Kt613E8ZD8WoQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> >
> > >
> > > Few more comments:
> > > --------------------------------
> > > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > > 1.
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > > + * information about
> > > subtransactions, which could arrive after streaming start.
> > > + */
> > > + if (!txn->is_schema_sent)
> > > + snapshot_now
> > > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > > + txn,
> > > command_id);
> > > ..
> > > }
> > >
> > > Why are we using base snapshot here instead of the snapshot we saved
> > > the first time streaming has happened? And as mentioned in comments,
> > > won't we need to consider the snapshots for subtransactions that
> > > arrived after the last time we have streamed the changes?
> > Fixed
> >
>
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * We can not use txn->snapshot_now directly because after we there
> + * might be some new sub-transaction which after the last streaming run
> + * so we need to add those sub-xip in the snapshot.
> + */
> + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
> + txn, command_id);
>
> "because after we there", you seem to forget a word between 'we' and
> 'there'.
Fixed
> So, as we are copying it now, does this mean it will consider the
> snapshots for subtransactions that arrived after the last time we
> streamed the changes? If so, have you tested it, and can we add the
> same in the comments?
Yes, I have tested it. Comment added.
>
> Also, if we need to copy the snapshot here, then do we need to copy
> it again in ReorderBufferProcessTXN (in the below code and in the
> catch block in the same function)?
>
> {
> ..
> + /*
> + * Remember the command ID and snapshot if transaction is streaming
> + * otherwise free the snapshot if we have copied it.
> + */
> + if (streaming)
> + {
> + txn->command_id = command_id;
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> + txn, command_id);
> + }
> + else if (snapshot_now->copied)
> + ReorderBufferFreeSnap(rb, snapshot_now);
> ..
> }
>
Fixed
> > >
> > > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> > > fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> > > especially to cover the case when it gets called due to memory
> > > overflow (aka via ReorderBufferCheckMemoryLimit).
> > We get origin_lsn during commit time, so I am not sure how we can do
> > that. I have also noticed that currently we are not using origin_lsn
> > on the subscriber side. I think it needs more investigation: if we
> > want this, do we need to log it early?
> >
>
> Have you done any investigation of this point? You might want to look
> at the pg_replication_origin* APIs. Today, again looking at this
> code, I think that with the current coding it won't be used even when
> we encounter the commit record, because ReorderBufferCommit calls
> ReorderBufferStreamCommit, which will make sure that origin_id and
> origin_lsn are never sent. I think at least that should be fixed; if
> not, we probably need a comment with the reasoning for why we think
> it is okay not to do it in this case.
Still, the problem is the same because currently we are sending
origin_lsn as part of the "pgoutput_begin" message, and for a
streaming transaction we have already sent the stream start. We
might send it during the stream commit instead, but I am not
completely sure, because currently the consumer of this message,
"apply_handle_origin", just ignores it. I have also looked into the
pg_replication_origin* APIs; they are used for setting the origin id
and tracking the progress, but they do not consume the origin_lsn we
send in pgoutput_begin, so they are not directly related.
>
> + /*
> + * If we are streaming the in-progress transaction then Discard the
>
> /Discard/discard
Done
>
> > >
> > > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > 1.
> > > + /*
> > > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> > > + * error out
> > > + */
> > > + if (TransactionIdIsValid(CheckXidAlive) &&
> > > + !TransactionIdIsInProgress(CheckXidAlive) &&
> > > + !TransactionIdDidCommit(CheckXidAlive))
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> > > + errmsg("transaction aborted during system catalog scan")));
> > >
> > > Why can't we use TransactionIdDidAbort here? If we can't use it, then
> > > can you add comments stating the reason for the same.
> > Done
>
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out. Instead of directly checking the abort status we do check
> + * if it is not in progress transaction and no committed. Because if there
> + * were a system crash then status of the the transaction which were running
> + * at that time might not have marked. So we need to consider them as
> + * aborted. Refer detailed comments at snapmgr.c where the variable is
> + * declared.
>
>
> How about replacing the above comment with below one:
>
> If CheckXidAlive is valid, then we check if it aborted. If it did, we
> error out. We can't directly use TransactionIdDidAbort as after crash
> such transaction might not have been marked as aborted. See detailed
> comments at snapmgr.c where the variable is declared.
Done
>
> I am not able to understand the change in
> v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have
> any explanation for the same?
It appears that in ReorderBufferCommitChild we are always setting the
final_lsn of the subxacts, so it should not be invalid. For testing,
I changed this to an assert and checked, but it never hit. So maybe
we can remove this change.
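(What I tried was essentially the following in place of that
assignment; the variable name here is illustrative:)

    Assert(!XLogRecPtrIsInvalid(subtxn->final_lsn));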
Apart from that, I have fixed the toast-tuple streaming bug by
setting the flag bit in the WAL (attached as 0012). I have also
extended this solution to handle the speculative-insert bug, so the
old patch for the speculative-insert bug fix is removed. I am also
exploring how we can do this without setting the flag in the WAL, as
we discussed upthread.
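For context, the shape of that fix is roughly as follows (the flag
name here is illustrative; see 0012 for the actual change):

    /*
     * In heap_insert(), while building the xl_heap_insert record:
     * mark inserts into toast relations so that logical decoding can
     * detect them without a relcache lookup.
     */
    if (IsToastRelation(relation))
        xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;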
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-05 04:15:47 |
Message-ID: | CAFiTN-uNKhtscqUvHokdxykM8w4J5aLEgSx+_Xx2O-sBdfc=tA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Jan 31, 2020 at 8:08 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > Also, if we need to copy the snapshot here, then do we need to copy
> > > it again in ReorderBufferProcessTXN (in the below code and in the
> > > catch block in the same function)?
> > I think so, because as part of the
> > "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might directly
> > point to the snapshot, and that will get truncated when we truncate
> > all the changes of the ReorderBufferTXN. So I think we can check: if
> > snapshot_now->copied is true then we can avoid copying, otherwise we
> > can copy?
> >
>
> Yeah, that makes sense, but I think then we also need to ensure that
> ReorderBufferStreamTXN frees the snapshot only when it is copied. It
> seems to me it should always be copied at the place where we are
> trying to free it, so probably we should have an Assert there.
>
> One more thing:
> ReorderBufferProcessTXN()
> {
> ..
> + if (streaming)
> + {
> + /*
> + * While streaming an in-progress transaction there is a
> + * possibility that the (sub)transaction might get aborted
> + * concurrently. In such case if the (sub)transaction has
> + * catalog update then we might decode the tuple using wrong
> + * catalog version. So for detecting the concurrent abort we
> + * set CheckXidAlive to the current (sub)transaction's xid for
> + * which this change belongs to. And, during catalog scan we
> + * can check the status of the xid and if it is aborted we will
> + * report an specific error which we can ignore. We might have
> + * already streamed some of the changes for the aborted
> + * (sub)transaction, but that is fine because when we decode the
> + * abort we will stream abort message to truncate the changes in
> + * the subscriber.
> + */
> + CheckXidAlive = change->txn->xid;
> + }
> ..
> }
>
> I think it is better to move the above code into an inline function
> (something like SetXidAlive). It will make the code in
> ReorderBufferProcessTXN look cleaner and easier to understand.
>
Fixed in the latest version sent upthread.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-05 04:18:48 |
Message-ID: | CAFiTN-vZ2EHjQQaGBJEbf-NnkrFkTEj+VbjrGtrfv=ck7crd+w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Feb 5, 2020 at 9:27 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > One more thing we can do is to identify whether the tuple belongs to
> > > a toast relation while decoding it. However, I think to do that we
> > > need to have access to the relcache at that time, and that might add
> > > some overhead as we would need to do it for each tuple. Can we
> > > investigate what it will take to do that, and whether it is better
> > > than setting a bit during WAL logging?
> > >
> > I have done some more analysis on this, and it appears that there are
> > a few problems in doing this. Basically, once we get the confirmed
> > flush location, we advance replication_slot_catalog_xmin so that
> > vacuum can garbage-collect the old tuples. So the problem is that
> > while we are collecting the changes in the ReorderBuffer, the catalog
> > rows for our version might have been removed, and we might not find
> > any relation entry with that relfilenode id (because it is dropped or
> > altered in the future).
> >
>
> Hmm, this means this can also occur while streaming the changes. The
> main reason, as I understand it, is that before decoding the commit
> we don't know whether these changes have already been sent to the
> subscriber (based on confirmed_flush_location/start_decoding_at).
Right.
> I think it is better to skip streaming such transactions as we can't
> make the right decision about them, and as this can generally happen
> after a crash for the first few transactions, it shouldn't matter
> much if we serialize such transactions instead of streaming them.
I think the idea makes sense.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-05 10:35:28 |
Message-ID: | CAA4eK1LT2UArE9FyPqZcmV2kALhEmcXnZGdreYWCdCJ9vRUd8Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> Fixed in the latest version sent upthread.
>
Okay, thanks. I haven't looked at the latest version of the patch
series, as I was reviewing the previous version, and I think all of
these comments apply to parts of the patch that have not been
modified. Here are my comments:
I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create, as per the
discussion in one of the above emails [1], since its usage is not
clear.
v8-0008-Add-support-for-streaming-to-built-in-replication
1.
- information. The allowed options are <literal>slot_name</literal> and
- <literal>synchronous_commit</literal>
+ information. The allowed options are <literal>slot_name</literal>,
+ <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+ and <literal>streaming</literal>.
As per the discussion above [1], I don't think we need work_mem here.
You might want to remove its other usages from the patch as well.
2.
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
*connect, bool *enabled_given,
bool *slot_name_given, char **slot_name,
bool *copy_data, char **synchronous_commit,
bool *refresh, int *logical_wm,
- bool *logical_wm_given)
+ bool *logical_wm_given, bool *streaming,
+ bool *streaming_given)
It is not clear to me why we need the two parameters 'streaming' and
'streaming_given' in this API. Can't we handle this similarly to the
'refresh' parameter?
3.
diff --git a/src/backend/replication/logical/launcher.c
b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
*
*-------------------------------------------------------------------------
*/
+#include <sys/types.h>
+#include <unistd.h>
#include "postgres.h"
I see only the above change in launcher.c. Why do we need to include
these if there is no other change (at least not in this patch)?
4.
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "stream_start";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
errcallback.callback = output_plugin_error_callback;
errcallback.arg = (void *) &state;
errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn)
/* Push callback + info on the error context stack */
state.ctx = ctx;
state.callback_name = "stream_stop";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
errcallback.callback = output_plugin_error_callback;
errcallback.arg = (void *) &state;
errcallback.previous = error_context_stack;
Don't we want to set txn->final_lsn as the report location, as we do
at a few other places?
5.
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+ Relation rel, HeapTuple oldtuple)
{
+ pq_sendbyte(out, 'D'); /* action DELETE */
+
Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
- pq_sendbyte(out, 'D'); /* action DELETE */
Why does this patch need to change the above code?
6.
+void
+logicalrep_write_stream_start(StringInfo out,
+ TransactionId xid, bool first_segment)
+{
+ pq_sendbyte(out, 'S'); /* action STREAM START */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+
+ /* 1 if this is the first streaming segment for this xid */
+ pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+ TransactionId xid;
+
+ Assert(first_segment);
+
+ xid = pq_getmsgint(in, 4);
+ *first_segment = (pq_getmsgint(in, 4) == 1);
+
+ return xid;
+}
In these functions for sending bool, pq_sendint32 is used. Can't we
use pq_sendbyte, similarly to what we do in boolsend?
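For example, a minimal sketch of what I have in mind, mirroring
boolsend() (pq_getmsgbyte() being the existing read-side counterpart):

    /* 1 if this is the first streaming segment for this xid */
    pq_sendbyte(out, first_segment ? 1 : 0);

and on the read side:

    *first_segment = (pq_getmsgbyte(in) == 1);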
7.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}
In the comments, 'starting to stream' is mentioned, whereas this function
is meant to stop it.
8.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+ TransactionId xid;
+
+ xid = pq_getmsgint(in, 4);
+
+ return xid;
+}
Is there a reason to send the xid when stopping the stream? I don't see
any use of the function logicalrep_read_stream_stop.
9.
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
..
+ pgstat_report_wait_end();
..
}
I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
this function, so I am not sure the above comment makes sense.
10.
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and OID of the subscription.
Are we keeping files in /tmp or in pg's temp tablespace dir? Seeing the
code below, it doesn't seem that we place them in /tmp. If I am
correct, then can you update the comment?
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
11.
+ * The change is serialied in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
..
+ */
+static void
+stream_write_change(char action, StringInfo s)
The part of the comment which says "with length (not including the
length) .." is not clear to me. What does "not including the length"
mean?
12.
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)
I think we can implement this TODO. It is clear that when this function is
called from apply_handle_stream_commit, the file must exist. We can
similarly analyze the other callers of this API.
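A rough sketch of what I have in mind (untested, and assuming a single
file per transaction for brevity):

static void
stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
{
    char        path[MAXPGPATH];

    subxact_filename(path, subid, xid);

    if (unlink(path) < 0 && (errno != ENOENT || !missing_ok))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));
}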
13.
+apply_handle_stream_abort(StringInfo s)
{
..
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
..
I am not sure how important this optimization is, so instead of FIXME,
it is better to keep it as an XXX comment. In the future, if we hit
any performance issue due to this, we can revisit our decision.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-06 09:21:17 |
Message-ID: | CAA4eK1+M1jU8pWZYtdb35BxEkK3wC+8fZ7o5-yyAZmY7+HJHKw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Feb 5, 2020 at 9:42 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> >
> > I am not able to understand the change in
> > v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have
> > any explanation for the same?
>
> It appears that in ReorderBufferCommitChild we are always setting the
> final_lsn of the subxacts so it should not be invalid. For testing, I
> have changed this to an assert and checked, but it never hit. So maybe
> we can remove this change.
>
Tomas, do you remember anything about this change? We are talking
about the change below:
From: Tomas Vondra <tv(at)fuzzy(dot)cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v8 11/13] BUGFIX: set final_lsn for subxacts before cleanup
---
src/backend/replication/logical/reorderbuffer.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/backend/replication/logical/reorderbuffer.c
b/src/backend/replication/logical/reorderbuffer.c
index fe4e57c..beb6cd2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
+ /* make sure subtxn has final_lsn */
+ if (subtxn->final_lsn == InvalidXLogRecPtr)
+ subtxn->final_lsn = txn->final_lsn;
+
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-07 10:48:15 |
Message-ID: | CAFiTN-tf91XNfmWmMmATPHdEqK+E_7ndiE4fnggedUdDt30P_w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > Fixed in the latest version sent upthread.
> >
>
> Okay, thanks. I haven't looked at the latest version of patch series
> as I was reviewing the previous version and I think all of these
> comments are in the patch which is not modified. Here are my
> comments:
>
> I think we don't need to maintain
> v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> discussion in one of the above emails [1] as its usage is not clear.
>
> v8-0008-Add-support-for-streaming-to-built-in-replication
> 1.
> - information. The allowed options are <literal>slot_name</literal> and
> - <literal>synchronous_commit</literal>
> + information. The allowed options are <literal>slot_name</literal>,
> + <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> + and <literal>streaming</literal>.
>
> As per the discussion above [1], I don't think we need work_mem here.
> You might want to remove the other usage from the patch as well.
After putting more thought on this, it appears that there could be some
use cases for setting the work_mem from the subscription. Assume a
case where data are coming from two different origins, and based on the
origin ids, different slots might collect different types of changes.
So isn't it good to have a different work_mem for different slots? I am
not saying that the current way of implementing it is the best one, but
that we can improve it. First, we need to decide whether we have a use
case for this or not. Please let me know your thoughts on the same.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-10 08:22:13 |
Message-ID: | CAA4eK1LkqFD9xrpUKn21rSvFJC86zd+_zL3R0WQWDQ3S6jjO6A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > Fixed in the latest version sent upthread.
> > >
> >
> > Okay, thanks. I haven't looked at the latest version of patch series
> > as I was reviewing the previous version and I think all of these
> > comments are in the patch which is not modified. Here are my
> > comments:
> >
> > I think we don't need to maintain
> > v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> > discussion in one of the above emails [1] as its usage is not clear.
> >
> > v8-0008-Add-support-for-streaming-to-built-in-replication
> > 1.
> > - information. The allowed options are <literal>slot_name</literal> and
> > - <literal>synchronous_commit</literal>
> > + information. The allowed options are <literal>slot_name</literal>,
> > + <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> > + and <literal>streaming</literal>.
> >
> > As per the discussion above [1], I don't think we need work_mem here.
> > You might want to remove the other usage from the patch as well.
>
> After putting more thought on this, it appears that there could be some
> use cases for setting the work_mem from the subscription. Assume a
> case where data are coming from two different origins, and based on the
> origin ids, different slots might collect different types of changes.
> So isn't it good to have a different work_mem for different slots? I am
> not saying that the current way of implementing it is the best one, but
> that we can improve it. First, we need to decide whether we have a use
> case for this or not.
>
That is the whole point. I don't see a very clear usage of this, and
neither has anybody explained clearly how it will be useful. I am not
saying that what you are describing has no use, but as you said, we
might need to invent an entirely new way even if we have such a use.
I think it is better to avoid requirements which are not essential
for this patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-10 09:39:12 |
Message-ID: | CAFiTN-t7ZQn=BGPOwTjFS90zWdafC1aHpBUYUJjsahHhGnRQ=A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Feb 10, 2020 at 1:52 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > Fixed in the latest version sent upthread.
> > > >
> > >
> > > Okay, thanks. I haven't looked at the latest version of patch series
> > > as I was reviewing the previous version and I think all of these
> > > comments are in the patch which is not modified. Here are my
> > > comments:
> > >
> > > I think we don't need to maintain
> > > v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> > > discussion in one of the above emails [1] as its usage is not clear.
> > >
> > > v8-0008-Add-support-for-streaming-to-built-in-replication
> > > 1.
> > > - information. The allowed options are <literal>slot_name</literal> and
> > > - <literal>synchronous_commit</literal>
> > > + information. The allowed options are <literal>slot_name</literal>,
> > > + <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> > > + and <literal>streaming</literal>.
> > >
> > > As per the discussion above [1], I don't think we need work_mem here.
> > > You might want to remove the other usage from the patch as well.
> >
> > After putting more thought on this, it appears that there could be some
> > use cases for setting the work_mem from the subscription. Assume a
> > case where data are coming from two different origins, and based on the
> > origin ids, different slots might collect different types of changes.
> > So isn't it good to have a different work_mem for different slots? I am
> > not saying that the current way of implementing it is the best one, but
> > that we can improve it. First, we need to decide whether we have a use
> > case for this or not.
> >
>
> That is the whole point. I don't see a very clear usage of this, and
> neither has anybody explained clearly how it will be useful. I am not
> saying that what you are describing has no use, but as you said, we
> might need to invent an entirely new way even if we have such a use.
> I think it is better to avoid requirements which are not essential
> for this patch.
Ok, I will include this change in the next patch set.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-11 03:12:29 |
Message-ID: | CAFiTN-t1YNmoBf7k1kUrUue4q1Tf3GXjGwZTseNLivKmr9sz1Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I think we don't need to maintain
> v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> discussion in one of the above emails [1] as its usage is not clear.
Done
> v8-0008-Add-support-for-streaming-to-built-in-replication
> 1.
> - information. The allowed options are <literal>slot_name</literal> and
> - <literal>synchronous_commit</literal>
> + information. The allowed options are <literal>slot_name</literal>,
> + <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> + and <literal>streaming</literal>.
>
> As per the discussion above [1], I don't think we need work_mem here.
> You might want to remove the other usage from the patch as well.
Done
> 2.
> @@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
> *connect, bool *enabled_given,
> bool *slot_name_given, char **slot_name,
> bool *copy_data, char **synchronous_commit,
> bool *refresh, int *logical_wm,
> - bool *logical_wm_given)
> + bool *logical_wm_given, bool *streaming,
> + bool *streaming_given)
>
> It is not clear to me why we need two parameters 'streaming' and
> 'streaming_given' in this API. Can't we handle it similarly to the
> 'refresh' parameter?
The streaming option needs to be updated in the system table, so if we
don't remember whether the user has given its value or not, how will we
know whether to update this column? Or are you suggesting that we
should always mark it as updated? IMHO that is not a good idea.
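To illustrate, the parsing follows the usual pattern for such options
(a sketch, matching the shape of the surrounding code):

    else if (strcmp(defel->defname, "streaming") == 0 && streaming)
    {
        if (*streaming_given)
            ereport(ERROR,
                    (errcode(ERRCODE_SYNTAX_ERROR),
                     errmsg("conflicting or redundant options")));

        *streaming_given = true;
        *streaming = defGetBoolean(defel);
    }

and ALTER SUBSCRIPTION then updates the catalog column only when
*streaming_given is true.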
> 3.
> diff --git a/src/backend/replication/logical/launcher.c
> b/src/backend/replication/logical/launcher.c
> index aec885e..e80d00c 100644
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -14,6 +14,8 @@
> *
> *-------------------------------------------------------------------------
> */
> +#include <sys/types.h>
> +#include <unistd.h>
>
> #include "postgres.h"
>
> I see only the above change in launcher.c. Why do we need to include
> these if there is no other change (at least not in this patch)?
Removed
> 4.
> stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> /* Push callback + info on the error context stack */
> state.ctx = ctx;
> state.callback_name = "stream_start";
> - /* state.report_location = apply_lsn; */
> + state.report_location = InvalidXLogRecPtr;
> errcallback.callback = output_plugin_error_callback;
> errcallback.arg = (void *) &state;
> errcallback.previous = error_context_stack;
> @@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
> ReorderBufferTXN *txn)
> /* Push callback + info on the error context stack */
> state.ctx = ctx;
> state.callback_name = "stream_stop";
> - /* state.report_location = apply_lsn; */
> + state.report_location = InvalidXLogRecPtr;
> errcallback.callback = output_plugin_error_callback;
> errcallback.arg = (void *) &state;
> errcallback.previous = error_context_stack;
>
> Don't we want to set txn->final_lsn in the report location, as we do at a few
> other places?
Fixed
> 5.
> -logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
> +logicalrep_write_delete(StringInfo out, TransactionId xid,
> + Relation rel, HeapTuple oldtuple)
> {
> + pq_sendbyte(out, 'D'); /* action DELETE */
> +
> Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
> rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
> rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
>
> - pq_sendbyte(out, 'D'); /* action DELETE */
>
> Why does this patch need to change the above code?
Fixed
> 6.
> +void
> +logicalrep_write_stream_start(StringInfo out,
> + TransactionId xid, bool first_segment)
> +{
> + pq_sendbyte(out, 'S'); /* action STREAM START */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +
> + /* 1 if this is the first streaming segment for this xid */
> + pq_sendint32(out, first_segment ? 1 : 0);
> +}
> +
> +TransactionId
> +logicalrep_read_stream_start(StringInfo in, bool *first_segment)
> +{
> + TransactionId xid;
> +
> + Assert(first_segment);
> +
> + xid = pq_getmsgint(in, 4);
> + *first_segment = (pq_getmsgint(in, 4) == 1);
> +
> + return xid;
> +}
>
> In these functions for sending bool, pq_sendint32 is used. Can't we
> use pq_sendbyte, similarly to what we do in boolsend?
Done
> 7.
> +void
> +logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
> +{
> + pq_sendbyte(out, 'E'); /* action STREAM END */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +}
>
> In the comments, 'starting to stream' is mentioned, whereas this function
> is meant to stop it.
Fixed
> 8.
> +void
> +logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
> +{
> + pq_sendbyte(out, 'E'); /* action STREAM END */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +}
> +
> +TransactionId
> +logicalrep_read_stream_stop(StringInfo in)
> +{
> + TransactionId xid;
> +
> + xid = pq_getmsgint(in, 4);
> +
> + return xid;
> +}
>
> Is there a reason to send the xid when stopping the stream? I don't see
> any use of the function logicalrep_read_stream_stop.
Removed
> 9.
> + * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
> + */
> +static void
> +subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
> ..
> + pgstat_report_wait_end();
> ..
> }
>
> I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
> this function, so I am not sure the above comment makes sense.
Fixed
> 10.
> + * The files are placed in /tmp by default, and the filenames include both
> + * the XID of the toplevel transaction and OID of the subscription.
>
> Are we keeping files in /tmp or in pg's temp tablespace dir? Seeing the
> code below, it doesn't seem that we place them in /tmp. If I am
> correct, then can you update the comment?
> +static void
> +subxact_filename(char *path, Oid subid, TransactionId xid)
> +{
> + char tempdirpath[MAXPGPATH];
> +
> + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
Done
> 11.
> + * The change is serialied in a simple format, with length (not including
> + * the length), action code (identifying the message type) and message
> + * contents (without the subxact TransactionId value).
> + *
> ..
> + */
> +static void
> +stream_write_change(char action, StringInfo s)
>
> The part of the comment which says "with length (not including the
> length) .." is not clear to me. What does "not including the length"
> mean?
Basically, it says that the 4 bytes which store the length of the
total data do not count themselves in that length.
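In other words, the framing looks roughly like this (a sketch of
stream_write_change with error handling omitted; 'stream_fd' is just a
placeholder name for the target file):

    int         len;

    /* size of the payload: action byte + message body */
    len = sizeof(char) + s->len;

    /* the length word itself is not counted in 'len' */
    write(stream_fd, &len, sizeof(len));
    write(stream_fd, &action, sizeof(action));
    write(stream_fd, s->data, s->len);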
> 12.
> + * TODO: Add missing_ok flag to specify in which cases it's OK not to
> + * find the files, and when it's an error.
> + */
> +static void
> +stream_cleanup_files(Oid subid, TransactionId xid)
>
> I think we can implement this TODO. It is clear that when this function is
> called from apply_handle_stream_commit, the file must exist. We can
> similarly analyze the other callers of this API.
Done
> 13.
> +apply_handle_stream_abort(StringInfo s)
> {
> ..
> + /* FIXME optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> ..
>
> I am not sure how important this optimization is, so instead of FIXME,
> it is better to keep it as an XXX comment. In the future, if we hit
> any performance issue due to this, we can revisit our decision.
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v10-0002-Issue-individual-invalidations-with-wal_level-lo.patch | application/octet-stream | 16.4 KB |
v10-0003-Extend-the-output-plugin-API-with-stream-methods.patch | application/octet-stream | 34.8 KB |
v10-0005-Implement-streaming-mode-in-ReorderBuffer.patch | application/octet-stream | 37.8 KB |
v10-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch | application/octet-stream | 12.6 KB |
v10-0008-Enable-streaming-for-all-subscription-TAP-tests.patch | application/octet-stream | 14.7 KB |
v10-0009-Add-TAP-test-for-streaming-vs.-DDL.patch | application/octet-stream | 4.4 KB |
v10-0006-Add-support-for-streaming-to-built-in-replicatio.patch | application/octet-stream | 89.9 KB |
v10-0007-Track-statistics-for-streaming.patch | application/octet-stream | 11.7 KB |
v10-0010-Bugfix-handling-of-incomplete-toast-tuple.patch | application/octet-stream | 13.3 KB |
v10-0001-Immediately-WAL-log-assignments.patch | application/octet-stream | 10.3 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-13 03:12:21 |
Message-ID: | CAFiTN-vR7jQBP75FpvE=-4Ca-3jqsJA55ZDct0B8UtE0x13gCQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
The patch set was not applying on HEAD, so I have rebased it.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v11-0001-Immediately-WAL-log-assignments.patch | application/octet-stream | 10.3 KB |
v11-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch | application/octet-stream | 12.6 KB |
v11-0002-Issue-individual-invalidations-with-wal_level-lo.patch | application/octet-stream | 16.4 KB |
v11-0003-Extend-the-output-plugin-API-with-stream-methods.patch | application/octet-stream | 34.8 KB |
v11-0005-Implement-streaming-mode-in-ReorderBuffer.patch | application/octet-stream | 37.8 KB |
v11-0007-Track-statistics-for-streaming.patch | application/octet-stream | 11.7 KB |
v11-0006-Add-support-for-streaming-to-built-in-replicatio.patch | application/octet-stream | 89.9 KB |
v11-0008-Enable-streaming-for-all-subscription-TAP-tests.patch | application/octet-stream | 14.7 KB |
v11-0009-Add-TAP-test-for-streaming-vs.-DDL.patch | application/octet-stream | 4.4 KB |
v11-0010-Bugfix-handling-of-incomplete-toast-tuple.patch | application/octet-stream | 13.3 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-02-29 05:07:44 |
Message-ID: | CAFiTN-uNzEyY3YL1EZabWcG=Bah1nO8hno+-UaAkCDcfCGYC0g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Feb 13, 2020 at 8:42 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> The patch set was not applying on HEAD, so I have rebased it.
I have changed patch 0002 so that instead of WAL-logging each
invalidation individually, we now log the invalidations at each command
end, as discussed upthread [1].
Soon we will evaluate the performance of this approach and post the
results.
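The gist of it is that the invalidation messages accumulated for the
current command now get WAL-logged from CommandEndInvalidationMessages(),
along these lines (a sketch; LogLogicalInvalidations is just the helper
name I picked here):

void
CommandEndInvalidationMessages(void)
{
    ...

    /* WAL-log the invalidations collected for this command */
    if (XLogLogicalInfoActive())
        LogLogicalInvalidations();

    ...
}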
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v12-0001-Immediately-WAL-log-assignments.patch | application/octet-stream | 10.3 KB |
v12-0002-Issue-individual-invalidations-with-wal_level-lo.patch | application/octet-stream | 17.9 KB |
v12-0005-Implement-streaming-mode-in-ReorderBuffer.patch | application/octet-stream | 37.8 KB |
v12-0003-Extend-the-output-plugin-API-with-stream-methods.patch | application/octet-stream | 34.8 KB |
v12-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch | application/octet-stream | 12.6 KB |
v12-0007-Track-statistics-for-streaming.patch | application/octet-stream | 11.7 KB |
v12-0006-Add-support-for-streaming-to-built-in-replicatio.patch | application/octet-stream | 89.8 KB |
v12-0008-Enable-streaming-for-all-subscription-TAP-tests.patch | application/octet-stream | 14.7 KB |
v12-0009-Add-TAP-test-for-streaming-vs.-DDL.patch | application/octet-stream | 4.4 KB |
v12-0010-Bugfix-handling-of-incomplete-toast-tuple.patch | application/octet-stream | 13.3 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-03 21:46:46 |
Message-ID: | 20200303214646.6adway2z5wnapjem@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
I started looking at this patch series again, hoping to get it moving
for PG13. There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...
The first thing I realized is that WAL-logging of assignments in v12 does
both the "old" logging (using dedicated message) and "new" with
toplevel-XID embedded in the first message. Yes, the patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.
I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed on the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...
The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.
For example, we might do this in StartupXLOG(), I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.
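The gist of it is just a few lines in the main redo loop, before the
rm_redo call - something like this (a sketch, relying on the
XLogRecGetTopXid() macro the patch introduces):

    /* if the record carries a toplevel XID, apply the assignment */
    if (standbyState >= STANDBY_INITIALIZED &&
        TransactionIdIsValid(XLogRecGetTopXid(xlogreader)))
    {
        TransactionId xid = XLogRecGetXid(xlogreader);

        ProcArrayApplyXidAssignment(XLogRecGetTopXid(xlogreader),
                                    1, &xid);
    }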
The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).
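Something like this, perhaps (a hand-wavy sketch, all names made up):

#define MAX_CACHED_ASSIGNMENTS  64

static TransactionId cached_topxid = InvalidTransactionId;
static TransactionId cached_subxids[MAX_CACHED_ASSIGNMENTS];
static int  ncached_subxids = 0;

static void
cache_xid_assignment(TransactionId topxid, TransactionId subxid)
{
    /* flush when the toplevel xact changes or the array fills up */
    if (topxid != cached_topxid ||
        ncached_subxids == MAX_CACHED_ASSIGNMENTS)
    {
        if (ncached_subxids > 0)
            ProcArrayApplyXidAssignment(cached_topxid, ncached_subxids,
                                        cached_subxids);
        cached_topxid = topxid;
        ncached_subxids = 0;
    }

    cached_subxids[ncached_subxids++] = subxid;
}

We'd also have to flush the cache when the toplevel xact itself commits
or aborts, of course.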
Aside from that, I think there's a minor bug in xact.c - the patch adds
a "assigned" field to TransactionStateData, but then it fails to add a
default value into TopTransactionStateData. We probably interpret NULL
as false, but then there's nothing for the pointer. I suspect it might
leave some random garbage there, leading to strange things later.
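The fix itself would be trivial - spell the new field out in the
initializer, e.g. something like this (assuming we can use a designated
initializer for it):

    static TransactionStateData TopTransactionStateData = {
        ...
        .assigned = false,
    };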
Another thing I noticed is LogicalDecodingProcessRecord() extracts the
toplevel XID using a macro
txid = XLogRecGetTopXid(record);
but then it just starts accessing the fields directly again in the
ReorderBufferAssignChild call. I think we should do this instead:
ReorderBufferAssignChild(ctx->reorder,
txid,
XLogRecGetXid(record),
buf.origptr);
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-03 21:59:11 |
Message-ID: | 20200303215911.sxlyi6nsbs2lzwhm@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
D'oh! As usual I forgot to actually attach the patch I mentioned. So
here it is ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
xid-assignment-v12-fix.patch | text/plain | 947 bytes |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-04 03:43:49 |
Message-ID: | CAFiTN-sn0Bd4Q72Cn9t8-mZp1sxWRv2SmUoeR_7J05SqNTR--g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hi,
>
> I started looking at this patch series again, hoping to get it moving
> for PG13.
Nice.
There's been a tremendous amount of work done since I last
> worked on it, and a lot was discussed on this thread, so it'll take a
> while to get familiar with the new code ...
>
> The first thing I realized is that WAL-logging of assignments in v12 does
> both the "old" logging (using dedicated message) and "new" with
> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> it was trivial to crash the replica due to KnownAssignedXids overflow.
> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> right fix.
>
> I actually proposed doing this (having both ways to log assignments) so
> that there's no regression risk with (wal_level < logical). But IIRC
> Andres objected to it, arguing that we should not log the same piece
> of information in two very different ways at the same time (IIRC it was
> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> And I do agree with him ...
>
> The question is, why couldn't the replica use the same assignment info
> we already write for logical decoding? The main challenge is that now
> the assignment can be sent in many different xlog messages, from a bunch
> of resource managers (essentially, any xlog message with a xid can have
> embedded XID of the toplevel xact). So the handling would either need to
> happen in every rmgr, or we need to move it before we call the rmgr.
>
> For example, we might do this in StartupXLOG(), I think, per the
> attached patch (FWIW this particular fix was written by Masahiko Sawada,
> not me). This does the trick for me - I'm no longer able to reproduce
> the KnownAssignedXids overflow.
>
> The one difference is that we used to call ProcArrayApplyXidAssignment
> for larger groups of XIDs, as sent in the assignment message. Now we
> call it for each individual assignment. I don't know if this is an
> issue, but I suppose we might introduce some sort of local caching
> (accumulate the assignments into a local array, call the function only
> when we have enough of them).
Thanks for the pointers; I will think over these points.
>
> Aside from that, I think there's a minor bug in xact.c - the patch adds
> a "assigned" field to TransactionStateData, but then it fails to add a
> default value into TopTransactionStateData. We probably interpret NULL
> as false, but then there's nothing for the pointer. I suspect it might
> leave some random garbage there, leading to strange things later.
Actually, we will never access that field for the
TopTransactionStateData, right?
See the code below: only if IsSubTransaction() do we
access the "assigned" field.
+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
>
> Another thing I noticed is LogicalDecodingProcessRecord() extracts the
> toplevel XID using a macro
>
> txid = XLogRecGetTopXid(record);
>
> but then it just starts accessing the fields directly again in the
> ReorderBufferAssignChild call. I think we should do this instead:
>
> ReorderBufferAssignChild(ctx->reorder,
> txid,
> XLogRecGetXid(record),
> buf.origptr);
Makes sense. I will change this in the patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-04 04:58:32 |
Message-ID: | CAA4eK1JNEx9uR+qvK7CyZW9jSESOLphW+x5F51w-_BDBe79-Gw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Hi,
>
> I started looking at this patch series again, hoping to get it moving
> for PG13.
>
It is good to keep moving this forward, but there are quite a few
problems with the design which need a broader discussion. Some of
what I recall are:
a. Handling of abort of concurrent transactions. There is some code
in the patch which might work, but there is not much discussion when
it was posted.
b. Handling of partial tuples (while streaming, we came to know that
a toast tuple is not complete or a speculative insert is incomplete). For
this also, we have proposed a few solutions which need further
discussion. One of those is implemented in the patch series.
c. We might also need some handling for replication origins.
d. Try to minimize the performance overhead of WAL logging for
invalidations. We discussed different solutions for this and
implemented one of those.
e. How to skip already streamed transactions.
There might be a few more which I can't recall now. Apart from this,
I haven't done any detailed review of subscriber-side implementation
where we write streamed transactions to file. All of this will need
much more discussion and review before we can say it is ready to
commit, so I thought it might be better to pick it up for PG14 and
focus on other things that have a better chance for PG13 especially
because all the problems were not solved/discussed before last CF.
However, it is a good idea to keep moving this and have a discussion
on some of these issues.
> There's been a tremendous amount of work done since I last
> worked on it, and a lot was discussed on this thread, so it'll take a
> while to get familiar with the new code ...
>
> >> The first thing I realized is that WAL-logging of assignments in v12 does
> both the "old" logging (using dedicated message) and "new" with
> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> it was trivial to crash the replica due to KnownAssignedXids overflow.
> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> right fix.
>
> I actually proposed doing this (having both ways to log assignments) so
> that there's no regression risk with (wal_level < logical). But IIRC
> >> Andres objected to it, arguing that we should not log the same piece
> of information in two very different ways at the same time (IIRC it was
> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> And I do agree with him ...
>
So, aren't we worried about the overhead of the amount of WAL and
performance impact for the transactions? We might want to check the
pgbench read-write test to see if that will add any significant
overhead.
> The question is, why couldn't the replica use the same assignment info
> we already write for logical decoding?
>
I haven't thought about it in detail, but we can think on those lines
if the performance overhead is in the acceptable range.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-04 09:03:40 |
Message-ID: | CAA4eK1K+n6d7hhKn5jzpxNWRT51RA6mKoyo+aqvDerqPZgDDuA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> > The first thing I realized is that WAL-logging of assignments in v12 does
> > both the "old" logging (using dedicated message) and "new" with
> > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > right fix.
> >
> > I actually proposed doing this (having both ways to log assignments) so
> > that there's no regression risk with (wal_level < logical). But IIRC
> > Andres objected to it, arguing that we should not log the same piece
> > of information in two very different ways at the same time (IIRC it was
> > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > And I do agree with him ...
> >
>
> So, aren't we worried about the overhead of the amount of WAL and
> performance impact for the transactions? We might want to check the
> pgbench read-write test to see if that will add any significant
> overhead.
>
I have briefly looked at the original patch and it seems the
additional overhead is only when subtransactions are involved, so
ideally, it shouldn't impact default pgbench, but there is no harm in
checking. It might be that we need to build a custom script with
subtransactions involved to measure the impact, but I think it is
worth checking.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-04 09:10:20 |
Message-ID: | CAFiTN-vMFU=QSrfFSXNvae7zOtCtN0vNMAkaby4sgPkLwU=qTQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 4, 2020 at 2:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >
> > > The first thing I realized is that WAL-logging of assignments in v12 does
> > > both the "old" logging (using dedicated message) and "new" with
> > > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > > right fix.
> > >
> > > I actually proposed doing this (having both ways to log assignments) so
> > > that there's no regression risk with (wal_level < logical). But IIRC
> > > Andres objected to it, arguing that we should not log the same piece
> > > of information in two very different ways at the same time (IIRC it was
> > > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > > And I do agree with him ...
> > >
> >
> > So, aren't we worried about the overhead of the amount of WAL and
> > performance impact for the transactions? We might want to check the
> > pgbench read-write test to see if that will add any significant
> > overhead.
> >
>
> I have briefly looked at the original patch and it seems the
> additional overhead is only when subtransactions are involved, so
> ideally, it shouldn't impact default pgbench, but there is no harm in
> checking. It might be that we need to build a custom script with
> subtransactions involved to measure the impact, but I think it is
> worth checking.
I agree. I will test the same and post the results.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-05 17:50:32 |
Message-ID: | 20200305175032.4iolgyumq4aomiwu@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:
>On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Hi,
>>
>> I started looking at this patch series again, hoping to get it moving
>> for PG13.
>>
>
>It is good to keep moving this forward, but there are quite a few
>problems with the design which need a broader discussion. Some of
>what I recall are:
>a. Handling of abort of concurrent transactions. There is some code
>in the patch which might work, but there is not much discussion when
>it was posted.
>b. Handling of partial tuples (while streaming, we came to know that
>a toast tuple is not complete or a speculative insert is incomplete). For
>this also, we have proposed a few solutions which need further
>discussion. One of those is implemented in the patch series.
>c. We might also need some handling for replication origins.
>d. Try to minimize the performance overhead of WAL logging for
>invalidations. We discussed different solutions for this and
>implemented one of those.
>e. How to skip already streamed transactions.
>
>There might be a few more which I can't recall now. Apart from this,
>I haven't done any detailed review of subscriber-side implementation
>where we write streamed transactions to file. All of this will need
>much more discussion and review before we can say it is ready to
>commit, so I thought it might be better to pick it up for PG14 and
>focus on other things that have a better chance for PG13 especially
>because all the problems were not solved/discussed before last CF.
>However, it is a good idea to keep moving this and have a discussion
>on some of these issues.
>
Sure, there's a lot to discuss. And it's possible (likely) it's not
feasible to get this into PG13. But I think it's still worth discussing
it, instead of just punting it into the next CF right away.
>> There's been a tremendous amount of work done since I last
>> worked on it, and a lot was discussed on this thread, so it'll take a
>> while to get familiar with the new code ...
>>
>> The first thing I realized is that WAL-logging of assignments in v12 does
>> both the "old" logging (using dedicated message) and "new" with
>> toplevel-XID embedded in the first message. Yes, the patch was wrong,
>> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
>> it was trivial to crash the replica due to KnownAssignedXids overflow.
>> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
>> right fix.
>>
>> I actually proposed doing this (having both ways to log assignments) so
>> that there's no regression risk with (wal_level < logical). But IIRC
>> Andres objected to it, arguing that we should not log the same piece
>> of information in two very different ways at the same time (IIRC it was
>> discussed on the FOSDEM dev meeting, so I don't have a link to share).
>> And I do agree with him ...
>>
>
>So, aren't we worried about the overhead of the amount of WAL and
>performance impact for the transactions? We might want to check the
>pgbench read-write test to see if that will add any significant
>overhead.
>
Well, sure. I agree we need to see how this affects performance, and
I'll do some benchmarks (I think I did that when submitting the patch,
but I don't recall the numbers / details).
Isn't it a bit strange to log stuff twice, though, if we worry about
performance? Surely that's more expensive than logging it just once. Of
course, it might be useful if most systems need just the "old" way.
I know it's going to be a bit hand-wavy, but I think embedding the
assignments into existing WAL messages is about the cheapest way to log
this. I would not expect this to be measurably more expensive than what
we have now, but I might be wrong.
>> The question is, why couldn't the replica use the same assignment info
>> we already write for logical decoding?
>>
>
>I haven't thought about it in detail, but we can think on those lines
>if the performance overhead is in the acceptable range.
>
OK, let me do some measurements ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-05 17:55:47 |
Message-ID: | 20200305175547.lx7m7oynnkqppsfm@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 04, 2020 at 09:13:49AM +0530, Dilip Kumar wrote:
>On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Hi,
>>
>> I started looking at this patch series again, hoping to get it moving
>> for PG13.
>
>Nice.
>
> There's been a tremendous amount of work done since I last
>> worked on it, and a lot was discussed on this thread, so it'll take a
>> while to get familiar with the new code ...
>>
>> The first thing I realized is that WAL-logging of assignments in v12 does
>> both the "old" logging (using dedicated message) and "new" with
>> toplevel-XID embedded in the first message. Yes, the patch was wrong,
>> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
>> it was trivial to crash the replica due to KnownAssignedXids overflow.
>> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
>> right fix.
>>
>> I actually proposed doing this (having both ways to log assignments) so
>> that there's no regression risk with (wal_level < logical). But IIRC
>> Andres objected to it, arguing that we should not log the same piece
>> of information in two very different ways at the same time (IIRC it was
>> discussed on the FOSDEM dev meeting, so I don't have a link to share).
>> And I do agree with him ...
>>
>> The question is, why couldn't the replica use the same assignment info
>> we already write for logical decoding? The main challenge is that now
>> the assignment can be sent in many different xlog messages, from a bunch
>> of resource managers (essentially, any xlog message with a xid can have
>> embedded XID of the toplevel xact). So the handling would either need to
>> happen in every rmgr, or we need to move it before we call the rmgr.
>>
>> For example, we might do this in StartupXLOG(), I think, per the
>> attached patch (FWIW this particular fix was written by Masahiko Sawada,
>> not me). This does the trick for me - I'm no longer able to reproduce
>> the KnownAssignedXids overflow.
>>
>> The one difference is that we used to call ProcArrayApplyXidAssignment
>> for larger groups of XIDs, as sent in the assignment message. Now we
>> call it for each individual assignment. I don't know if this is an
>> issue, but I suppose we might introduce some sort of local caching
>> (accumulate the assignments into a local array, call the function only
>> when we have enough of them).
>
>Thanks for the pointers; I will think over these points.
>
>>
>> Aside from that, I think there's a minor bug in xact.c - the patch adds
>> a "assigned" field to TransactionStateData, but then it fails to add a
>> default value into TopTransactionStateData. We probably interpret NULL
>> as false, but then there's nothing for the pointer. I suspect it might
>> leave some random garbage there, leading to strange things later.
>
>Actually, we will never access that field for the
>TopTransactionStateData, right?
>See the code below: only if IsSubTransaction() do we
>access the "assigned" field.
>
>+bool
>+IsSubTransactionAssignmentPending(void)
>+{
>+ if (!XLogLogicalInfoActive())
>+ return false;
>+
>+ /* we need to be in a transaction state */
>+ if (!IsTransactionState())
>+ return false;
>+
>+ /* it has to be a subtransaction */
>+ if (!IsSubTransaction())
>+ return false;
>+
>+ /* the subtransaction has to have a XID assigned */
>+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
>+ return false;
>+
>+ /* and it needs to have 'assigned' */
>+ return !CurrentTransactionState->assigned;
>+
>+}
>
The problem is not with the "assigned" field, really. AFAICS we probably
initialize it to false because we interpret NULL as false. My concern
was that we essentially leave the last pointer uninitialized. That
seems like a bug, though I'm not sure if it breaks something in practice.
>>
>> Another thing I noticed is LogicalDecodingProcessRecord() extracts the
>> toplevel XID using a macro
>>
>> txid = XLogRecGetTopXid(record);
>>
>> but then it just starts accessing the fields directly again in the
>> ReorderBufferAssignChild call. I think we should do this instead:
>>
>> ReorderBufferAssignChild(ctx->reorder,
>> txid,
>> XLogRecGetXid(record),
>> buf.origptr);
>
>Makes sense. I will change this in the patch.
>
+1, thanks
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-06 04:49:24 |
Message-ID: | CAA4eK1Ki96NDraAGdDhY3yWtDGFyTAkHW+TBERC5JsM-184X1w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 5, 2020 at 11:20 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:
> >
>
> Sure, there's a lot to discuss. And it's possible (likely) it's not
> feasible to get this into PG13. But I think it's still worth discussing
> it, instead of just punting it into the next CF right away.
>
That makes sense to me.
> >> There's been a tremendous amount of work done since I last
> >> worked on it, and a lot was discussed on this thread, so it'll take a
> >> while to get familiar with the new code ...
> >>
> >> The first thing I realized is that WAL-logging of assignments in v12 does
> >> both the "old" logging (using dedicated message) and "new" with
> >> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> >> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> >> it was trivial to crash the replica due to KnownAssignedXids overflow.
> >> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> >> right fix.
> >>
> >> I actually proposed doing this (having both ways to log assignments) so
> >> that there's no regression risk with (wal_level < logical). But IIRC
> >> Andres objected to it, arguing that we should not log the same piece
> >> of information in two very different ways at the same time (IIRC it was
> >> discussed at the FOSDEM dev meeting, so I don't have a link to share).
> >> And I do agree with him ...
> >>
> >
> >So, aren't we worried about the overhead of the amount of WAL and
> >performance impact for the transactions? We might want to check the
> >pgbench read-write test to see if that will add any significant
> >overhead.
> >
>
> Well, sure. I agree we need to see how this affects performance, and
> I'll do some benchmarks (I think I did that when submitting the patch,
> but I don't recall the numbers / details).
>
> Isn't it a bit strange to log stuff twice, though, if we worry about
> performance? Surely that's more expensive than logging it just once. Of
> course, it might be useful if most systems need just the "old" way.
>
> I know it's going to be a bit hand-wavy, but I think embedding the
> assignments into existing WAL messages is about the cheapest way to log
> this. I would not expect this to be measurably more expensive than what
> we have now, but I might be wrong.
>
I agree that this shouldn't be much expensive, but it is better to be
sure in that regard.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-28 06:25:53 |
Message-ID: | CAA4eK1KyrP6m53FQV5v0gysEnaTHPdc2xYT0KeNN+fhOcedRwA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> >
> > The first thing I realized is that WAL-logging of assignments in v12 does
> > both the "old" logging (using dedicated message) and "new" with
> > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > right fix.
> >
> > I actually proposed doing this (having both ways to log assignments) so
> > that there's no regression risk with (wal_level < logical). But IIRC
> > Andres objected to it, arguing that we should not log the same piece
> > of information in two very different ways at the same time (IIRC it was
> > discussed at the FOSDEM dev meeting, so I don't have a link to share).
> > And I do agree with him ...
> >
> > The question is, why couldn't the replica use the same assignment info
> > we already write for logical decoding? The main challenge is that now
> > the assignment can be sent in many different xlog messages, from a bunch
> > of resource managers (essentially, any xlog message with a xid can have
> > embedded XID of the toplevel xact). So the handling would either need to
> > happen in every rmgr, or we need to move it before we call the rmgr.
> >
> > For example, we might do this in StartupXLOG() I think, per the
> > attached patch (FWIW this particular fix was written by Masahiko Sawada,
> > not me). This does the trick for me - I'm no longer able to reproduce
> > the KnownAssignedXids overflow.
> >
> > The one difference is that we used to call ProcArrayApplyXidAssignment
> > for larger groups of XIDs, as sent in the assignment message. Now we
> > call it for each individual assignment. I don't know if this is an
> > issue, but I suppose we might introduce some sort of local caching
> > (accumulate the assignments into a local array, call the function only
> > when we have enough of them).
>
> Thanks for the pointers, I will think over these points.
>
I have looked at the solution proposed and I would like to share my
findings. I think calling ProcArrayApplyXidAssignment for each
subtransaction is not a good idea for a couple of reasons:
(a) It will just defeat the purpose of maintaining the KnownAssignedXids
array, which is to avoid looking at pg_subtrans in
TransactionIdIsInProgress() on standby. Basically, if we remove it
for each subXid, it will consider the KnownAssignedXids to be
overflowed and check pg_subtrans frequently.
(b) Calling ProcArrayApplyXidAssignment() for each subtransaction can
be costly from the perspective of concurrency because it acquires
ProcArrayLock in Exclusive mode, so concurrently running transactions
might start blocking at this lock. Also, I see that
SubTransSetParent() makes the page dirty, so it might lead to more
writes if we spread out setting that by calling it separately for each
sub-transaction.
Apart from this, I don't see how the proposed fix is correct, because
as far as I can see it tries to remove the Xid before we even record
it via RecordKnownAssignedTransactionIds(). It seems that after the
patch, RecordKnownAssignedTransactionIds() will be called after
ProcArrayApplyXidAssignment(); how could that be correct?
Thoughts?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-28 08:49:31 |
Message-ID: | CAFiTN-uBd3p50s=kfEiSbg93_Fnf_VU+haS4FqBZAm7PmbX69g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > >
> > >
> > > The first thing I realized is that WAL-logging of assignments in v12 does
> > > both the "old" logging (using dedicated message) and "new" with
> > > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > > right fix.
> > >
> > > I actually proposed doing this (having both ways to log assignments) so
> > > that there's no regression risk with (wal_level < logical). But IIRC
> > > Andres objected to it, arguing that we should not log the same piece
> > > of information in two very different ways at the same time (IIRC it was
> > > discussed at the FOSDEM dev meeting, so I don't have a link to share).
> > > And I do agree with him ...
> > >
> > > The question is, why couldn't the replica use the same assignment info
> > > we already write for logical decoding? The main challenge is that now
> > > the assignment can be sent in many different xlog messages, from a bunch
> > > of resource managers (essentially, any xlog message with a xid can have
> > > embedded XID of the toplevel xact). So the handling would either need to
> > > happen in every rmgr, or we need to move it before we call the rmgr.
> > >
> > > For example, we might do this in StartupXLOG() I think, per the
> > > attached patch (FWIW this particular fix was written by Masahiko Sawada,
> > > not me). This does the trick for me - I'm no longer able to reproduce
> > > the KnownAssignedXids overflow.
> > >
> > > The one difference is that we used to call ProcArrayApplyXidAssignment
> > > for larger groups of XIDs, as sent in the assignment message. Now we
> > > call it for each individual assignment. I don't know if this is an
> > > issue, but I suppose we might introduce some sort of local caching
> > > (accumulate the assignments into a local array, call the function only
> > > when we have enough of them).
> >
> > Thanks for the pointers, I will think over these points.
> >
>
> I have looked at the solution proposed and I would like to share my
> findings. I think calling ProcArrayApplyXidAssignment for each
> subtransaction is not a good idea for a couple of reasons:
> (a) It will just defeat the purpose of maintaining the KnownAssignedXids
> array, which is to avoid looking at pg_subtrans in
> TransactionIdIsInProgress() on standby. Basically, if we remove it
> for each subXid, it will consider the KnownAssignedXids to be
> overflowed and check pg_subtrans frequently.
Right, I also think this is a problem with this solution. I think we
may try to avoid this by caching this information. But then we will
have to maintain this in some kind of array which stores
sub-transaction ids per top transaction, or we can maintain a list of
sub-transactions for each transaction. I haven't thought about how
much complexity this solution will add.
> (b) Calling ProcArrayApplyXidAssignment() for each subtransaction can
> be costly from the perspective of concurrency because it acquires
> ProcArrayLock in Exclusive mode, so concurrently running transactions
> might start blocking at this lock.
Right
> Also, I see that
> SubTransSetParent() makes the page dirty, so it might lead to more
> writes if we spread out setting that by calling it separately for each
> sub-transaction.
Right.
>
> Apart from this, I don't see how the proposed fix is correct, because
> as far as I can see it tries to remove the Xid before we even record
> it via RecordKnownAssignedTransactionIds(). It seems that after the
> patch, RecordKnownAssignedTransactionIds() will be called after
> ProcArrayApplyXidAssignment(); how could that be correct?
Valid point.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-28 09:59:34 |
Message-ID: | CAA4eK1LkLU5frY4kZGO38d8m6R504H+GbfkwsQLAp7sQ-4oqbw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > I have looked at the solution proposed and I would like to share my
> > findings. I think calling ProcArrayApplyXidAssignment for each
> > subtransaction is not a good idea for a couple of reasons:
> > (a) It will just defeat the purpose of maintaining the KnownAssignedXids
> > array, which is to avoid looking at pg_subtrans in
> > TransactionIdIsInProgress() on standby. Basically, if we remove it
> > for each subXid, it will consider the KnownAssignedXids to be
> > overflowed and check pg_subtrans frequently.
>
> Right, I also think this is a problem with this solution. I think we
> may try to avoid this by caching this information. But then we will
> have to maintain this in some kind of array which stores
> sub-transaction ids per top transaction, or we can maintain a list of
> sub-transactions for each transaction. I haven't thought about how
> much complexity this solution will add.
>
How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
flag in TransactionStateData and then log that as special information
whenever we write next WAL record for a new subtransaction? Then
during recovery, we can only call ProcArrayApplyXidAssignment when we
find that special flag is set in a WAL record. One idea could be to
use a flag bit in XLogRecord.xl_info. If that is feasible then the
solution can work as it is now, without any overhead or change in the
way we maintain KnownAssignedXids.
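To make this concrete, it could look roughly like this (hand-wavy
sketch; XLR_APPLY_XID_ASSIGNMENT is an invented name, and whether a
free xl_info bit is actually available needs checking):

#define XLR_APPLY_XID_ASSIGNMENT	0x02	/* hypothetical xl_info bit */

/* on the primary, while assembling the record for a new subxact */
if (CurrentTransactionState->applyAssignment)	/* flag set earlier */
	info |= XLR_APPLY_XID_ASSIGNMENT;

/* on the standby, during replay */
if (XLogRecGetInfo(record) & XLR_APPLY_XID_ASSIGNMENT)
	ProcArrayApplyXidAssignment(XLogRecGetTopXid(record),
								nsubxids, subxids);	/* gathered by caller */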
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-29 00:59:05 |
Message-ID: | 20200329005905.ooxq25qyjwn2mvxz@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
>On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>>
>> On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> >
>> >
>> > I have looked at the solution proposed and I would like to share my
>> > findings. I think calling ProcArrayApplyXidAssignment for each
>> > subtransaction is not a good idea for a couple of reasons:
>> > (a) It will just defeat the purpose of maintaining the KnownAssignedXids
>> > array, which is to avoid looking at pg_subtrans in
>> > TransactionIdIsInProgress() on standby. Basically, if we remove it
>> > for each subXid, it will consider the KnownAssignedXids to be
>> > overflowed and check pg_subtrans frequently.
>>
>> Right, I also think this is a problem with this solution. I think we
>> may try to avoid this by caching this information. But then we will
>> have to maintain this in some kind of array which stores
>> sub-transaction ids per top transaction, or we can maintain a list of
>> sub-transactions for each transaction. I haven't thought about how
>> much complexity this solution will add.
>>
>
>How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
>flag in TransactionStateData and then log that as special information
>whenever we write next WAL record for a new subtransaction? Then
>during recovery, we can only call ProcArrayApplyXidAssignment when we
>find that special flag is set in a WAL record. One idea could be to
>use a flag bit in XLogRecord.xl_info. If that is feasible then the
>solution can work as it is now, without any overhead or change in the
>way we maintain KnownAssignedXids.
>
Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?
Anyway, I think you're right that the ProcArrayApplyXidAssignment call
was done too early, but I think that can be fixed by moving it to after
the RecordKnownAssignedTransactionIds call, no? Essentially, right
before rm_redo().
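Roughly this ordering, I mean (untested sketch; error handling and the
surrounding redo loop omitted):

/* sketch: record the xid first, then apply the subxid assignment */
TransactionId xid = XLogRecGetXid(record);
TransactionId topxid = XLogRecGetTopXid(record);

if (standbyState >= STANDBY_INITIALIZED && TransactionIdIsValid(xid))
	RecordKnownAssignedTransactionIds(xid);

if (TransactionIdIsValid(topxid))
	ProcArrayApplyXidAssignment(topxid, 1, &xid);	/* one subxid at a time */

RmgrTable[XLogRecGetRmid(record)].rm_redo(record);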
You're right that calling ProcArrayApplyXidAssignment() for each
assignment may be an issue, because it acquires the ProcArrayLock
exclusively. I actually hinted that might be an issue in my original
message, suggesting we might add a local cache of assigned XIDs (a small
static array, doing essentially the same thing we used to do on the
upstream node). I haven't done that in my WIP patch to keep it simple,
but AFAICS it'd work.
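Something along these lines, I mean (untested sketch; the constant and
names are made up):

/* sketch: batch up subxid assignments, apply them in groups */
#define MAX_CACHED_ASSIGNMENTS	64		/* arbitrary */

static TransactionId cachedTopXid = InvalidTransactionId;
static TransactionId cachedSubXids[MAX_CACHED_ASSIGNMENTS];
static int	nCachedSubXids = 0;

static void
CacheXidAssignment(TransactionId topxid, TransactionId subxid)
{
	/* flush when the top-level xact changes or the array fills up */
	if (nCachedSubXids > 0 &&
		(cachedTopXid != topxid ||
		 nCachedSubXids == MAX_CACHED_ASSIGNMENTS))
	{
		ProcArrayApplyXidAssignment(cachedTopXid, nCachedSubXids,
									cachedSubXids);
		nCachedSubXids = 0;
	}

	cachedTopXid = topxid;
	cachedSubXids[nCachedSubXids++] = subxid;
}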
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-29 05:49:21 |
Message-ID: | CAA4eK1LvUDNaCJUZEh5BhV31U6OnNWP=OCUmyZM2k4JGx5CXvQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
> >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
> >flag in TransactionStateData and then log that as special information
> >whenever we write next WAL record for a new subtransaction? Then
> >during recovery, we can only call ProcArrayApplyXidAssignment when we
> >find that special flag is set in a WAL record. One idea could be to
> >use a flag bit in XLogRecord.xl_info. If that is feasible then the
> >solution can work as it is now, without any overhead or change in the
> >way we maintain KnownAssignedXids.
> >
>
> Ummm, how is that different from what the patch is doing now? I mean, we
> only write the top-level XID for the first WAL record in each subxact,
> right? Or what would be the difference with your approach?
>
We have to do what the patch is currently doing and, additionally, we
will set this flag after every PGPROC_MAX_CACHED_SUBXIDS subxacts, which
would allow us to call ProcArrayApplyXidAssignment during WAL replay
only after PGPROC_MAX_CACHED_SUBXIDS subxacts. That will let us clear
the KnownAssignedXids at the same time as we do now, so there is no
additional performance overhead.
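On the primary side that would be roughly (sketch; the counter and the
XLOG_APPLY_ASSIGNMENT flag are invented names):

/* sketch: request the assignment marker every N logged subxacts */
if (++CurrentTransactionState->nLoggedSubXids >= PGPROC_MAX_CACHED_SUBXIDS)
{
	XLogSetRecordFlags(XLOG_INCLUDE_XID | XLOG_APPLY_ASSIGNMENT);
	CurrentTransactionState->nLoggedSubXids = 0;
}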
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-29 15:31:05 |
Message-ID: | 20200329153105.mdryxzz562tg65pk@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
>On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
>> >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> >
>> >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
>> >flag in TransactionStateData and then log that as special information
>> >whenever we write next WAL record for a new subtransaction? Then
>> >during recovery, we can only call ProcArrayApplyXidAssignment when we
>> >find that special flag is set in a WAL record. One idea could be to
>> >use a flag bit in XLogRecord.xl_info. If that is feasible then the
>> >solution can work as it is now, without any overhead or change in the
>> >way we maintain KnownAssignedXids.
>> >
>>
>> Ummm, how is that different from what the patch is doing now? I mean, we
>> only write the top-level XID for the first WAL record in each subxact,
>> right? Or what would be the difference with your approach?
>>
>
>We have to do what the patch is currently doing and, additionally, we
>will set this flag after every PGPROC_MAX_CACHED_SUBXIDS subxacts, which
>would allow us to call ProcArrayApplyXidAssignment during WAL replay
>only after PGPROC_MAX_CACHED_SUBXIDS subxacts. That will let us clear
>the KnownAssignedXids at the same time as we do now, so there is no
>additional performance overhead.
>
Hmmm. So we'd still log assignment twice? Or would we keep just the
immediate assignments (embedded into xlog records), and cache the
subxids on the replica somehow?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-30 06:17:57 |
Message-ID: | CAA4eK1JHxMaeQhvfpaBpUCLoeg3=RbbQQ_h0O0WwOU2kssRFsg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
> >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> Ummm, how is that different from what the patch is doing now? I mean, we
> >> only write the top-level XID for the first WAL record in each subxact,
> >> right? Or what would be the difference with your approach?
> >>
> >
> >We have to do what the patch is currently doing and, additionally, we
> >will set this flag after every PGPROC_MAX_CACHED_SUBXIDS subxacts, which
> >would allow us to call ProcArrayApplyXidAssignment during WAL replay
> >only after PGPROC_MAX_CACHED_SUBXIDS subxacts. That will let us clear
> >the KnownAssignedXids at the same time as we do now, so there is no
> >additional performance overhead.
> >
>
> Hmmm. So we'd still log assignment twice? Or would we keep just the
> immediate assignments (embedded into xlog records), and cache the
> subxids on the replica somehow?
>
I think we need to cache the subxids on the replica somehow, but I
don't have a very good idea for it. Basically, there are two ways to
do it: (a) Change the KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits. I
can't think of a good way to do that, and even if we come up with
something, it could easily be a lot of work. (b) Cache the subxids
for a particular transaction in local memory along with
KnownAssignedXids. This is doable, but then we have two data
structures (one in shared memory and the other in local memory)
managing the same information in different ways.
Do you have any other ideas?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-03-30 15:27:58 |
Message-ID: | 20200330152758.fgp62lwb2ug7cl2a@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
>On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
>> >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
>> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >>
>> >> Ummm, how is that different from what the patch is doing now? I mean, we
>> >> only write the top-level XID for the first WAL record in each subxact,
>> >> right? Or what would be the difference with your approach?
>> >>
>> >
>> >We have to do what the patch is currently doing and, additionally, we
>> >will set this flag after every PGPROC_MAX_CACHED_SUBXIDS subxacts, which
>> >would allow us to call ProcArrayApplyXidAssignment during WAL replay
>> >only after PGPROC_MAX_CACHED_SUBXIDS subxacts. That will let us clear
>> >the KnownAssignedXids at the same time as we do now, so there is no
>> >additional performance overhead.
>> >
>>
>> Hmmm. So we'd still log assignment twice? Or would we keep just the
>> immediate assignments (embedded into xlog records), and cache the
>> subxids on the replica somehow?
>>
>
>I think we need to cache the subxids on the replica somehow, but I
>don't have a very good idea for it. Basically, there are two ways to
>do it: (a) Change the KnownAssignedXids in some way so that we can
>easily find this information without losing its current benefits. I
>can't think of a good way to do that, and even if we come up with
>something, it could easily be a lot of work. (b) Cache the subxids
>for a particular transaction in local memory along with
>KnownAssignedXids. This is doable, but then we have two data
>structures (one in shared memory and the other in local memory)
>managing the same information in different ways.
>
>Do you have any other ideas?
I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as the hash key).
I think a single array would be sufficient, but the hash table would
allow keeping the apply logic more or less as it is today. See the
attached patch that adds such a cache - I admit I haven't tested it,
but hopefully it's a sufficient illustration of the idea.
It does not handle cleanup of the cache, but I think that should not be
difficult - we simply need to remove entries for transactions that got
committed or rolled back. We also need to do something about
transactions without an explicit commit/rollback record, but that can be
done by handling XLOG_RUNNING_XACTS as well (by removing anything
preceding oldestRunningXid).
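The core of the idea is roughly this (simplified and untested, not
necessarily the exact code in the attached patch):

/* sketch: standby-side cache of subxid assignments, keyed by top xid */
typedef struct XidAssignmentEntry
{
	TransactionId topXid;		/* hash key */
	int			nsubxids;
	TransactionId subxids[PGPROC_MAX_CACHED_SUBXIDS];
} XidAssignmentEntry;

static HTAB *xidAssignmentsHash = NULL;

static void
CacheSubXidAssignment(TransactionId topXid, TransactionId subXid)
{
	XidAssignmentEntry *entry;
	bool		found;

	entry = hash_search(xidAssignmentsHash, &topXid, HASH_ENTER, &found);
	if (!found)
		entry->nsubxids = 0;

	entry->subxids[entry->nsubxids++] = subXid;

	/* once the batch fills up, apply it and forget the entry */
	if (entry->nsubxids == PGPROC_MAX_CACHED_SUBXIDS)
	{
		ProcArrayApplyXidAssignment(entry->topXid, entry->nsubxids,
									entry->subxids);
		hash_search(xidAssignmentsHash, &topXid, HASH_REMOVE, NULL);
	}
}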
I don't think this is particularly complicated or a lot of code, and I
don't see why it would require data structures in shared memory. Only
the walreceiver on standby needs to worry about this, no?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
xid-assignment-v13-fix.patch | text/plain | 3.1 KB |
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-07 06:47:44 |
Message-ID: | CAA4eK1JMdBkYQF-fjSesb713Fib_v+ihFO-ZdDWpio8CQ7Yr_A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
> >
> >I think we need to cache the subxids on the replica somehow, but I
> >don't have a very good idea for it. Basically, there are two ways to
> >do it: (a) Change the KnownAssignedXids in some way so that we can
> >easily find this information without losing its current benefits. I
> >can't think of a good way to do that, and even if we come up with
> >something, it could easily be a lot of work. (b) Cache the subxids
> >for a particular transaction in local memory along with
> >KnownAssignedXids. This is doable, but then we have two data
> >structures (one in shared memory and the other in local memory)
> >managing the same information in different ways.
> >
> >Do you have any other ideas?
>
> I don't follow. Why couldn't we have a simple cache on the standby? It
> could be either a simple array or a hash table (with the top-level xid
> as the hash key).
>
I think having something like we discussed, or what you have in the
patch, won't be sufficient to clean the KnownAssignedXids array. The
point is that with the "Immediately WAL-log assignments" patch we won't
write WAL for the xid-subxid association for unlogged relations;
however, KnownAssignedXids would contain both kinds of Xids, as we
autofill it with gaps (see RecordKnownAssignedTransactionIds). If my
understanding is correct, to make it work we might need major surgery
in the code, or we'd have to maintain the KnownAssignedXids array
differently.
>
> I don't think this is particularly complicated or a lot of code, and I
> don't see why it would require data structures in shared memory. Only
> the walreceiver on standby needs to worry about this, no?
>
Not a new data structure in shared memory, but we already have the
KnownAssignedXids structure in shared memory. So, after adding a
local cache, we will have xidAssignmentsHash and KnownAssignedXids
maintaining the same information in different ways, and we need to
ensure both are cleaned up properly. That is what I was pointing out
above about maintaining two structures. However, I think before
discussing this further, we need to think about the above problem.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-08 00:59:05 |
Message-ID: | 20200408005905.misvqmjk5c7wd5vr@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
>On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
>> >
>> >I think we need to cache the subxids on the replica somehow, but I
>> >don't have a very good idea for it. Basically, there are two ways to
>> >do it: (a) Change the KnownAssignedXids in some way so that we can
>> >easily find this information without losing its current benefits. I
>> >can't think of a good way to do that, and even if we come up with
>> >something, it could easily be a lot of work. (b) Cache the subxids
>> >for a particular transaction in local memory along with
>> >KnownAssignedXids. This is doable, but then we have two data
>> >structures (one in shared memory and the other in local memory)
>> >managing the same information in different ways.
>> >
>> >Do you have any other ideas?
>>
>> I don't follow. Why couldn't we have a simple cache on the standby? It
>> could be either a simple array or a hash table (with the top-level xid
>> as the hash key).
>>
>
>I think having something like we discussed, or what you have in the
>patch, won't be sufficient to clean the KnownAssignedXids array. The
>point is that with the "Immediately WAL-log assignments" patch we won't
>write WAL for the xid-subxid association for unlogged relations;
>however, KnownAssignedXids would contain both kinds of Xids, as we
>autofill it with gaps (see RecordKnownAssignedTransactionIds). If my
>understanding is correct, to make it work we might need major surgery
>in the code, or we'd have to maintain the KnownAssignedXids array
>differently.
Hmm, that's a good point. If I understand correctly, the issue is
that if we create a new subxact, write something into an unlogged table,
and then create another subxact, the XID of the first subxact will be
"known assigned" but we won't know it's a subxact or to which parent
xact it belongs (because there will be no WAL records that could encode
it).
I wonder if there's a simple solution (e.g. when creating the second
subxact we might notice that the xid-subxid assignment was not logged,
and write some "dummy" WAL record). But I admit it seems a bit ugly.
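Very roughly like this, when assigning the XID of the next subxact
(sketch only; XLOG_XACT_DUMMY_ASSIGNMENT and the bookkeeping flag are
invented):

/*
 * sketch: force out an assignment record if the previous subxact never
 * produced any WAL (e.g. it only touched unlogged tables)
 */
if (XLogLogicalInfoActive() && prevSubXactAssignmentNotLogged)
{
	xl_xact_assignment xlrec;

	xlrec.xtop = GetTopTransactionId();
	xlrec.nsubxacts = 1;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, MinSizeOfXactAssignment);
	XLogRegisterData((char *) &prevSubXid, sizeof(TransactionId));
	XLogInsert(RM_XACT_ID, XLOG_XACT_DUMMY_ASSIGNMENT);
}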
>>
>> I don't think this is particularly complicated or a lot of code, and I
>> don't see why it would require data structures in shared memory. Only
>> the walreceiver on standby needs to worry about this, no?
>>
>
>Not a new data structure in shared memory, but we already have the
>KnownAssignedXids structure in shared memory. So, after adding a
>local cache, we will have xidAssignmentsHash and KnownAssignedXids
>maintaining the same information in different ways, and we need to
>ensure both are cleaned up properly. That is what I was pointing out
>above about maintaining two structures. However, I think before
>discussing this further, we need to think about the above problem.
>
Sure.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-09 09:10:17 |
Message-ID: | CAFiTN-uF4O+Uh_dQX1Um1xJYSKCBUrEB2o-etumUch8F3cEhGA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
> >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
> >> >
> >> >I think we need to cache the subxids on the replica somehow, but I
> >> >don't have a very good idea for it. Basically, there are two ways to
> >> >do it: (a) Change the KnownAssignedXids in some way so that we can
> >> >easily find this information without losing its current benefits. I
> >> >can't think of a good way to do that, and even if we come up with
> >> >something, it could easily be a lot of work. (b) Cache the subxids
> >> >for a particular transaction in local memory along with
> >> >KnownAssignedXids. This is doable, but then we have two data
> >> >structures (one in shared memory and the other in local memory)
> >> >managing the same information in different ways.
> >> >
> >> >Do you have any other ideas?
> >>
> >> I don't follow. Why couldn't we have a simple cache on the standby? It
> >> could be either a simple array or a hash table (with the top-level xid
> >> as the hash key).
> >>
> >
> >I think having something like we discussed, or what you have in the
> >patch, won't be sufficient to clean the KnownAssignedXids array. The
> >point is that with the "Immediately WAL-log assignments" patch we won't
> >write WAL for the xid-subxid association for unlogged relations;
> >however, KnownAssignedXids would contain both kinds of Xids, as we
> >autofill it with gaps (see RecordKnownAssignedTransactionIds). If my
> >understanding is correct, to make it work we might need major surgery
> >in the code, or we'd have to maintain the KnownAssignedXids array
> >differently.
>
> Hmm, that's a good point. If I understand correctly, the issue is
> that if we create a new subxact, write something into an unlogged table,
> and then create another subxact, the XID of the first subxact will be
> "known assigned" but we won't know it's a subxact or to which parent
> xact it belongs (because there will be no WAL records that could encode
> it).
>
> I wonder if there's a simple solution (e.g. when creating the second
> subxact we might notice that the xid-subxid assignment was not logged,
> and write some "dummy" WAL record). But I admit it seems a bit ugly.
>
> >>
> >> I don't think this is particularly complicated or a lot of code, and I
> >> don't see why it would require data structures in shared memory. Only
> >> the walreceiver on standby needs to worry about this, no?
> >>
> >
> >Not a new data structure in shared memory, but we already have the
> >KnownAssignedXids structure in shared memory. So, after adding a
> >local cache, we will have xidAssignmentsHash and KnownAssignedXids
> >maintaining the same information in different ways, and we need to
> >ensure both are cleaned up properly. That is what I was pointing out
> >above about maintaining two structures. However, I think before
> >discussing this further, we need to think about the above problem.
I have rebased the patch on the latest head. I haven't yet changed
anything for the xid assignment thing because that discussion has not
concluded yet.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v13-0001-Immediately-WAL-log-assignments.patch | application/octet-stream | 10.5 KB |
v13-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch | application/octet-stream | 12.6 KB |
v13-0002-Issue-individual-invalidations-with-wal_level-lo.patch | application/octet-stream | 17.9 KB |
v13-0003-Extend-the-output-plugin-API-with-stream-methods.patch | application/octet-stream | 34.8 KB |
v13-0005-Implement-streaming-mode-in-ReorderBuffer.patch | application/octet-stream | 37.8 KB |
v13-0006-Add-support-for-streaming-to-built-in-replicatio.patch | application/octet-stream | 90.9 KB |
v13-0007-Track-statistics-for-streaming.patch | application/octet-stream | 11.8 KB |
v13-0009-Add-TAP-test-for-streaming-vs.-DDL.patch | application/octet-stream | 4.4 KB |
v13-0008-Enable-streaming-for-all-subscription-TAP-tests.patch | application/octet-stream | 14.7 KB |
v13-0010-Bugfix-handling-of-incomplete-toast-tuple.patch | application/octet-stream | 14.8 KB |
From: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-13 10:44:06 |
Message-ID: | CAGz5QCJZx_V96e2SrJ9RnuNHk=kbb4uKFsXaymP_Wykma1h5_g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have rebased the patch on the latest head. I haven't yet changed
> anything for the xid assignment thing because that discussion has not
> concluded yet.
>
Some review comments from 0001-Immediately-WAL-log-*.patch,
+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, each of these if conditions will only be evaluated if the
previous conditions are true. So, we could maintain some state machine
to avoid evaluating a condition multiple times inside a transaction.
But, if the overhead is not much, it's probably not worth it, I guess.
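For example, something like this (sketch; the cached flag and where
exactly to reset it are invented details):

/* sketch: short-circuit once the verdict can no longer change */
static bool assignmentDone = false;		/* reset at subxact start/end */

bool
IsSubTransactionAssignmentPending(void)
{
	if (assignmentDone)
		return false;			/* already assigned in this subxact */

	if (!XLogLogicalInfoActive() || !IsTransactionState() ||
		!IsSubTransaction() ||
		!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
		return false;

	if (CurrentTransactionState->assigned)
	{
		/* it stays assigned for the rest of this subxact, so remember */
		assignmentDone = true;
		return false;
	}

	return true;
}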
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.
@@ -195,6 +197,10 @@ XLogResetInsertion(void)
{
int i;
+ /* reset the subxact assignment flag (if needed) */
+ if (curinsert_flags & XLOG_INCLUDE_XID)
+ MarkSubTransactionAssigned();
The comment looks contradictory.
XLogSetRecordFlags(uint8 flags)
{
Assert(begininsert_called);
- curinsert_flags = flags;
+ curinsert_flags |= flags;
}
I didn't understand why we need this change in this patch.
+ txid = XLogRecGetTopXid(record);
+
+ /*
+ * If the toplevel_xid is valid, we need to assign the subxact to the
+ * toplevel transaction. We need to do this for all records, hence we
+ * do it before the switch.
+ */
s/toplevel_xid/toplevel xid or s/toplevel_xid/txid
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(r->toplevel_xid))
Perhaps, XLogRecGetTopXid() can be used.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-13 11:50:39 |
Message-ID: | CAFiTN-tWk1mztXSEyfm=kHbJsRUOkXrGs4frk0qWpwKDyuD57A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have rebased the patch on the latest head. I haven't yet changed
> > anything for xid assignment thing because it is not yet concluded.
> >
> Some review comments from 0001-Immediately-WAL-log-*.patch,
>
> +bool
> +IsSubTransactionAssignmentPending(void)
> +{
> + if (!XLogLogicalInfoActive())
> + return false;
> +
> + /* we need to be in a transaction state */
> + if (!IsTransactionState())
> + return false;
> +
> + /* it has to be a subtransaction */
> + if (!IsSubTransaction())
> + return false;
> +
> + /* the subtransaction has to have a XID assigned */
> + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
> + return false;
> +
> + /* and it needs to have 'assigned' */
> + return !CurrentTransactionState->assigned;
> +
> +}
> IMHO, it's important to reduce the complexity of this function since
> it's been called for every WAL insertion. During the lifespan of a
> transaction, any of these if conditions will only be evaluated if
> previous conditions are true. So, we can maintain some state machine
> to avoid multiple evaluation of a condition inside a transaction. But,
> if the overhead is not much, it's not worth I guess.
Yeah, maybe in some cases we can avoid checking multiple conditions by
maintaining that state. But that state will have to be kept at the
transaction level, and I am not sure it is worth adding one extra
condition just to skip a few if checks; it will also add code
complexity. And, in cases where logical decoding is not enabled, it
may even add one extra check: we'd first check the state, and that
would take us straight to the first if check anyway.
>
> +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> This looks wrong. We should change the name of this Macro or we can
> add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to add different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
>
> @@ -195,6 +197,10 @@ XLogResetInsertion(void)
> {
> int i;
>
> + /* reset the subxact assignment flag (if needed) */
> + if (curinsert_flags & XLOG_INCLUDE_XID)
> + MarkSubTransactionAssigned();
> The comment looks contradictory.
>
> XLogSetRecordFlags(uint8 flags)
> {
> Assert(begininsert_called);
> - curinsert_flags = flags;
> + curinsert_flags |= flags;
> }
> I didn't understand why we need this change in this patch.
I think it was changed so that the code below can use it, but we set
the flag directly. I will change it in the next version.
@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
scratch += sizeof(replorigin_session_origin);
}
+ /* followed by toplevel XID, if not already included in previous record */
+ if (IsSubTransactionAssignmentPending())
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+
+ /* update the flag (later used by XLogInsertRecord) */
+ curinsert_flags |= XLOG_INCLUDE_XID;
>
> + txid = XLogRecGetTopXid(record);
> +
> + /*
> + * If the toplevel_xid is valid, we need to assign the subxact to the
> + * toplevel transaction. We need to do this for all records, hence we
> + * do it before the switch.
> + */
> s/toplevel_xid/toplevel xid or s/toplevel_xid/txid
Okay, we can change that.
> if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
> - info != XLOG_XACT_ASSIGNMENT)
> + !TransactionIdIsValid(r->toplevel_xid))
> Perhaps, XLogRecGetTopXid() can be used.
ok
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-13 12:41:56 |
Message-ID: | CAGz5QC+DMNxD8wJyxBtJR7c2krJTa1m_vF6XDz--EVfcgF+qzg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> >
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > This looks wrong. We should change the name of this Macro or we can
> > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
>
> I think this is in sync with the code below (SizeOfXlogOrigin), so it
> doesn't make much sense to add different terminology, no?
> #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
>
In that case, we can rename this, for example, SizeOfXLogTransactionId.
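I.e. keep the definition, just give it an xlog-specific name:

#define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))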
Some review comments from 0002-Issue-individual-*.path,
+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);
It seems we don't call the function if xid is not valid. In fact,
@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
case XLOG_XACT_ASSIGNMENT:
break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);
Why should we insert a WAL record for such cases?
+ * When wal_level=logical, write invalidations into WAL at each command end to
+ * support the decoding of the in-progress transaction. As of now it was
+ * enough to log invalidation only at commit because we are only decoding the
+ * transaction at the commit time. We only need to log the catalog cache and
+ * relcache invalidation. There can not be any active MVCC scan in logical
+ * decoding so we don't need to log the snapshot invalidation.
The alignment is not right.
/*
* CommandEndInvalidationMessages
- * Process queued-up invalidation messages at end of one command
- * in a transaction.
+ * Process queued-up invalidation messages at end of one command
+ * in a transaction.
Looks unnecessary changes.
* Note:
- * This should be called during CommandCounterIncrement(),
- * after we have advanced the command ID.
+ * This should be called during CommandCounterIncrement(),
+ * after we have advanced the command ID.
*/
Looks unnecessary changes.
if (transInvalInfo == NULL)
- return;
+ return;
Looks unnecessary changes.
+ /* prepare record */
+ memset(&xlrec, 0, sizeof(xlrec));
We should use MinSizeOfXactInvalidations, no?
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-13 13:04:34 |
Message-ID: | CAFiTN-tLpNYVU7++teYq5gCYn8doOBawLB2pFSCLHV_c0OqsnQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> > >
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > This looks wrong. We should change the name of this Macro or we can
> > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
> >
> > I think this is in sync with the code below (SizeOfXlogOrigin), so it
> > doesn't make much sense to add different terminology, no?
> > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> >
> In that case, we can rename this, for example, SizeOfXLogTransactionId.
Make sense.
>
> Some review comments from 0002-Issue-individual-*.path,
>
> +void
> +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr lsn, int nmsgs,
> + SharedInvalidationMessage *msgs)
> +{
> + MemoryContext oldcontext;
> + ReorderBufferChange *change;
> +
> + /* XXX Should we even write invalidations without valid XID? */
> + if (xid == InvalidTransactionId)
> + return;
> +
> + Assert(xid != InvalidTransactionId);
>
> It seems we don't call the function if xid is not valid. In fact,
>
> @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf)
> }
> case XLOG_XACT_ASSIGNMENT:
> break;
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + if (!TransactionIdIsValid(xid))
> + break;
> +
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Why should we insert a WAL record for such cases?
I think we can avoid this. I will analyze it and send an update in my next patch.
>
> + * When wal_level=logical, write invalidations into WAL at each command end to
> + * support the decoding of the in-progress transaction. As of now it was
> + * enough to log invalidation only at commit because we are only decoding the
> + * transaction at the commit time. We only need to log the catalog cache and
> + * relcache invalidation. There can not be any active MVCC scan in logical
> + * decoding so we don't need to log the snapshot invalidation.
> The alignment is not right.
Will fix.
> /*
> * CommandEndInvalidationMessages
> - * Process queued-up invalidation messages at end of one command
> - * in a transaction.
> + * Process queued-up invalidation messages at end of one command
> + * in a transaction.
> Looks unnecessary changes.
Will fix.
>
> * Note:
> - * This should be called during CommandCounterIncrement(),
> - * after we have advanced the command ID.
> + * This should be called during CommandCounterIncrement(),
> + * after we have advanced the command ID.
> */
> Looks unnecessary changes.
Will fix.
> if (transInvalInfo == NULL)
> - return;
> + return;
> Looks unnecessary changes.
>
> + /* prepare record */
> + memset(&xlrec, 0, sizeof(xlrec));
> We should use MinSizeOfXactInvalidations, no?
Right.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-13 18:13:34 |
Message-ID: | CAGz5QCLJK0QxV3jBcnEWd+cjz9en22=adQNFvmqpkCmR37wvZg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch
@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
ItemPointer tid)
ItemId lp = NULL;
HeapTupleHeader htup;
+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_hot_search call during logical decoding");
The call is to heap_finish_speculative.
@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
}
}
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
s/transaction aborted/transaction aborted concurrently perhaps? Also,
can we move this check to the beginning of the function? If the
condition fails, we can skip the sys scan.
Some of the checks look repetitive in the same file. Should we
declare them as inline functions?
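For example (sketch; the helper name is invented):

/* sketch: consolidate the repeated concurrent-abort check */
static inline void
CheckConcurrentAbort(TransactionId xid)
{
	if (TransactionIdIsValid(xid) &&
		!TransactionIdIsInProgress(xid) &&
		!TransactionIdDidCommit(xid))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}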
Review comments from 0005-Implement-streaming*.patch
+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+ dlist_iter iter;
...
+#endif
+}
We can implement the same as following:
#ifdef USE_ASSERT_CHECKING
static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
dlist_iter iter;
...
}
#else
#define AssertChangeLsnOrder(txn) ((void)true)
#endif
+ * if it is aborted we will report an specific error which we can ignore. We
s/an specific/a specific
+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
s/last last/last
PG_CATCH();
{
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData *errdata = CopyErrorData();
When we don't re-throw, the errdata should be freed by calling
FreeErrorData(errdata), right?
+ /*
+ * Set the last last of the stream as the final lsn before
+ * calling stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+
+ FlushErrorState();
+ }
stream_stop() can still throw some error, right? In that case, we
should flush the error state before calling stream_stop().
+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+ txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
Hmm, it seems this part needs an assumption that after copying the
snapshot, no subsequent step can throw any error. If they do, then we
can again create a copy of the snapshot in catch block, which will
leak some memory. Is my understanding correct?
+ }
+ else
+ {
+ ReorderBufferCleanupTXN(rb, txn);
+ PG_RE_THROW();
+ }
Shouldn't we switch back to previously created error memory context
before re-throwing?
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id = FirstCommandId;
In the modified ReorderBufferCommit(), why is it necessary to declare
the above two variables as volatile? There is no try-catch block here.
@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
if (txn == NULL)
return;
+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort only if we have sent any data for this transaction.
+ */
+ if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+ rb->stream_abort(rb, txn, lsn);
+
s/When/If
+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort.
+ */
+ if (rbtxn_is_streamed(txn))
+ rb->stream_abort(rb, txn, lsn);
s/When/If. And, in this case, if we've not sent any data, why should
we send the abort message (similar to the previous one)?
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
Should we put any assert (not necessarily here) to validate the above comment?
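For instance, assuming the serialization flag has a matching
rbtxn_is_serialized() test (that name is an assumption):

/* A (sub)transaction must never be both streamed and serialized. */
Assert(!(rbtxn_is_streamed(txn) && rbtxn_is_serialized(txn)));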
+ txn = ReorderBufferLargestTopTXN(rb);
+
+ /* we know there has to be one, because the size is not zero */
+ Assert(txn && !txn->toptxn);
+ Assert(txn->size > 0);
+ Assert(rb->size >= txn->size);
The same three assertions are already there in ReorderBufferLargestTopTXN().
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ return ctx->streaming;
+}
Potential inline function.
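That is, presumably it could simply be marked static inline:

static inline bool
ReorderBufferCanStream(ReorderBuffer *rb)
{
    LogicalDecodingContext *ctx = rb->private_data;

    return ctx->streaming;
}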
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id;
Here also, do we need to declare these two variables as volatile?
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-14 06:34:11 |
Message-ID: | CAA4eK1L+rbP16UHZXgaWk38+pOyZtO0Be0Ba0VZcDmqq7Sf2Yg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
> >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> >I think having something like what we discussed, or what you have in
> >the patch, won't be sufficient to clean the KnownAssignedXid array.
> >The point is that we won't write a WAL record for the xid-subxid
> >association for unlogged relations in the "Immediately WAL-log
> >assignments" patch; however, the KnownAssignedXid array would have
> >both kinds of Xids, as we autofill it with gaps (see
> >RecordKnownAssignedTransactionIds). If my understanding is correct,
> >to make it work we might need major surgery in the code or have to
> >maintain the KnownAssignedXid array differently.
>
> Hmm, that's a good point. If I understand correctly, the issue is
> that if we create a new subxact, write something into an unlogged table,
> and then create another subxact, the XID of the first subxact will be "known
> assigned" but we won't know it's a subxact or to which parent xact it
> belongs (because there will be no WAL records that could encode it).
>
Yeah, there could be multiple such missing subxacts.
> I wonder if there's a simple solution (e.g. when creating the second
> subxact we might notice the xid-subxid assignment was not logged, and
> write some "dummy" WAL record).
>
That WAL record can have multiple xids.
> But I admit it seems a bit ugly.
>
Yeah, I guess it could be tricky as well, because while assembling some
WAL record we would need to generate an additional dummy record, or
might need to add extra information to the current record being formed.
I think the handling of such WAL records during hot-standby and in
logical decoding could vary. During logical decoding, currently, we
don't form an association for subtransactions if they don't have any
changes (see ReorderBufferCommitChild), and now with this new type of
record, I think we need to ensure that we don't form such an association.
I think that after quite some changes, tweaks and a lot of testing, we
might be able to remove XLOG_XACT_ASSIGNMENT, but I am not sure it is
worth doing along with this patch. It would have been good to do this
if the patch added any visible overhead or if it were easy to do.
However, neither seems to be true, so it might be better to write good
comments in the code indicating what we would need to do to remove
XLOG_XACT_ASSIGNMENT, so that if we feel it is important in the future
we can do so. I am not against spending effort on this, but I don't
see the urgency of doing it along with this patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-14 09:27:36 |
Message-ID: | CAA4eK1LTbtFsaQqXpSQ_dbObejDckXU8a-4Lxz7ej97fpW3Gvw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> > >
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > This looks wrong. We should change the name of this macro, or we can
> > > add the 1 byte directly in HEADER_SCRATCH_SIZE with some comments.
> >
> > I think this is in sync with the code below (SizeOfXlogOrigin), so it
> > doesn't make much sense to add different terminology, no?
> > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> >
> In that case, we can rename this, for example, SizeOfXLogTransactionId.
>
> Some review comments from 0002-Issue-individual-*.path,
>
> +void
> +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr lsn, int nmsgs,
> + SharedInvalidationMessage *msgs)
> +{
> + MemoryContext oldcontext;
> + ReorderBufferChange *change;
> +
> + /* XXX Should we even write invalidations without valid XID? */
> + if (xid == InvalidTransactionId)
> + return;
> +
> + Assert(xid != InvalidTransactionId);
>
> It seems we don't call the function if xid is not valid. In fact,
>
You have a valid point. Also, if we first check (xid ==
InvalidTransactionId) and return from the function, it is not clear
how the Assert can even be hit.
> @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf)
> }
> case XLOG_XACT_ASSIGNMENT:
> break;
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + if (!TransactionIdIsValid(xid))
> + break;
> +
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Why should we insert a WAL record for such cases?
>
Right, if there is any such case, we should avoid it.
One more point about this patch: the commit message needs to be updated:
> The new invalidations are written to WAL immediately, without any
> such caching. Perhaps it would be possible to add similar caching,
> e.g. at the command level, or something like that?
I think the above part of the commit message is no longer right, as the
patch now does such caching at the command level.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-14 10:10:50 |
Message-ID: | CAFiTN-vy3_KA-+kTff1=UvKasp0r1BDScDuJv5hN2SEw1PHn_A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 13, 2020 at 11:43 PM Kuntal Ghosh
<kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch
>
> @@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
> ItemPointer tid)
> ItemId lp = NULL;
> HeapTupleHeader htup;
>
> + /*
> + * We don't expect direct calls to heap_hot_search with
> + * valid CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> + elog(ERROR, "unexpected heap_hot_search call during logical decoding");
> The call is to heap_finish_speculative.
Fixed
> @@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
> }
> }
>
> + if (TransactionIdIsValid(CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
> + ereport(ERROR,
> + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> + errmsg("transaction aborted during system catalog scan")));
> s/transaction aborted/transaction aborted concurrently, perhaps? Also,
> can we move this check to the beginning of the function? If the
> condition fails, we can skip the sys scan.
We must check this after we get the tuple, because our goal is not to
decode based on a wrong tuple. If we moved the check earlier, the
transaction could abort after the check. Once we have got the tuple,
and the transaction was still alive at that point, it doesn't matter
if it aborts afterwards, because we already have the right tuple.
>
> Some of the checks look repetitive in the same file. Should we
> declare them as inline functions?
>
> Review comments from 0005-Implement-streaming*.patch
>
> +static void
> +AssertChangeLsnOrder(ReorderBufferTXN *txn)
> +{
> +#ifdef USE_ASSERT_CHECKING
> + dlist_iter iter;
> ...
> +#endif
> +}
>
> We can implement the same as follows:
> #ifdef USE_ASSERT_CHECKING
> static void
> AssertChangeLsnOrder(ReorderBufferTXN *txn)
> {
> dlist_iter iter;
> ...
> }
> #else
> #define AssertChangeLsnOrder(txn) ((void)true)
> #endif
I am not sure; this doesn't look clean to me. Moreover, the other
similar functions are defined in the same way, e.g. AssertTXNLsnOrder.
>
> + * if it is aborted we will report an specific error which we can ignore. We
> s/an specific/a specific
Done
>
> + * Set the last last of the stream as the final lsn before calling
> + * stream stop.
> s/last last/last
>
> PG_CATCH();
> {
> + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> + ErrorData *errdata = CopyErrorData();
> When we don't re-throw, the errdata should be freed by calling
> FreeErrorData(errdata), right?
Done
>
> + /*
> + * Set the last last of the stream as the final lsn before
> + * calling stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> +
> + FlushErrorState();
> + }
> stream_stop() can still throw some error, right? In that case, we
> should flush the error state before calling stream_stop().
Done
>
> + /*
> + * Remember the command ID and snapshot if transaction is streaming
> + * otherwise free the snapshot if we have copied it.
> + */
> + if (streaming)
> + {
> + txn->command_id = command_id;
> +
> + /* Avoid copying if it's already copied. */
> + if (snapshot_now->copied)
> + txn->snapshot_now = snapshot_now;
> + else
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> + txn, command_id);
> + }
> + else if (snapshot_now->copied)
> + ReorderBufferFreeSnap(rb, snapshot_now);
> Hmm, it seems this part needs an assumption that after copying the
> snapshot, no subsequent step can throw any error. If they do, then we
> can again create a copy of the snapshot in catch block, which will
> leak some memory. Is my understanding correct?
Actually, in the CATCH block we copy only if the error is
ERRCODE_TRANSACTION_ROLLBACK, and that can only occur during a systable
scan. Basically, in the TRY block we copy the snapshot after we have
streamed all the changes, i.e. the systable scan is done; any error
after that point will not be ERRCODE_TRANSACTION_ROLLBACK, so we will
not copy again.
>
> + }
> + else
> + {
> + ReorderBufferCleanupTXN(rb, txn);
> + PG_RE_THROW();
> + }
> Shouldn't we switch back to previously created error memory context
> before re-throwing?
Fixed.
>
> +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
> + TimestampTz commit_time,
> + RepOriginId origin_id, XLogRecPtr origin_lsn)
> +{
> + ReorderBufferTXN *txn;
> + volatile Snapshot snapshot_now;
> + volatile CommandId command_id = FirstCommandId;
> In the modified ReorderBufferCommit(), why is it necessary to declare
> the above two variables as volatile? There is no try-catch block here.
Fixed
>
> @@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
> TransactionId xid, XLogRecPtr lsn)
> if (txn == NULL)
> return;
>
> + /*
> + * When the (sub)transaction was streamed, notify the remote node
> + * about the abort only if we have sent any data for this transaction.
> + */
> + if (rbtxn_is_streamed(txn) && txn->any_data_sent)
> + rb->stream_abort(rb, txn, lsn);
> +
> s/When/If
>
> + /*
> + * When the (sub)transaction was streamed, notify the remote node
> + * about the abort.
> + */
> + if (rbtxn_is_streamed(txn))
> + rb->stream_abort(rb, txn, lsn);
> s/When/If. And, in this case, if we've not sent any data, why should
> we send the abort message (similar to the previous one)?
Fixed
>
> + * Note: We never do both stream and serialize a transaction (we only spill
> + * to disk when streaming is not supported by the plugin), so only one of
> + * those two flags may be set at any given time.
> + */
> +#define rbtxn_is_streamed(txn) \
> +( \
> + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
> +)
> Should we put any assert (not necessarily here) to validate the above comment?
Because of toast handling, this assumption has now changed, so I will
remove this note in that patch (0010).
>
> + txn = ReorderBufferLargestTopTXN(rb);
> +
> + /* we know there has to be one, because the size is not zero */
> + Assert(txn && !txn->toptxn);
> + Assert(txn->size > 0);
> + Assert(rb->size >= txn->size);
> The same three assertions are already there in ReorderBufferLargestTopTXN().
>
> +static bool
> +ReorderBufferCanStream(ReorderBuffer *rb)
> +{
> + LogicalDecodingContext *ctx = rb->private_data;
> +
> + return ctx->streaming;
> +}
> Potential inline function.
Done
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
> + volatile Snapshot snapshot_now;
> + volatile CommandId command_id;
> Here also, do we need to declare these two variables as volatile?
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v14-0001-Immediately-WAL-log-assignments.patch | application/octet-stream | 10.5 KB |
v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch | application/octet-stream | 34.8 KB |
v14-0002-Issue-individual-invalidations-with.patch | application/octet-stream | 16.7 KB |
v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch | application/octet-stream | 37.9 KB |
v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch | application/octet-stream | 12.3 KB |
v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch | application/octet-stream | 90.9 KB |
v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch | application/octet-stream | 14.7 KB |
v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch | application/octet-stream | 4.4 KB |
v14-0007-Track-statistics-for-streaming.patch | application/octet-stream | 11.8 KB |
v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch | application/octet-stream | 15.3 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-14 10:11:05 |
Message-ID: | CAFiTN-s1OutUTZ=qLqZ9eccGCxwiQHcXuiCxdNqDveXa2a88bg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 14, 2020 at 2:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> >
> > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> > > >
> > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > > This looks wrong. We should change the name of this macro, or we can
> > > > add the 1 byte directly in HEADER_SCRATCH_SIZE with some comments.
> > >
> > > I think this is in sync with the code below (SizeOfXlogOrigin), so it
> > > doesn't make much sense to add different terminology, no?
> > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > >
> > In that case, we can rename this, for example, SizeOfXLogTransactionId.
> >
> > Some review comments from 0002-Issue-individual-*.path,
> >
> > +void
> > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> > + XLogRecPtr lsn, int nmsgs,
> > + SharedInvalidationMessage *msgs)
> > +{
> > + MemoryContext oldcontext;
> > + ReorderBufferChange *change;
> > +
> > + /* XXX Should we even write invalidations without valid XID? */
> > + if (xid == InvalidTransactionId)
> > + return;
> > +
> > + Assert(xid != InvalidTransactionId);
> >
> > It seems we don't call the function if xid is not valid. In fact,
> >
>
> You have a valid point. Also, if we first check (xid ==
> InvalidTransactionId) and return from the function, it is not clear
> how the Assert can even be hit.
I have changed the code; now we only have an assert.
>
> > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > XLogRecordBuffer *buf)
> > }
> > case XLOG_XACT_ASSIGNMENT:
> > break;
> > + case XLOG_XACT_INVALIDATIONS:
> > + {
> > + TransactionId xid;
> > + xl_xact_invalidations *invals;
> > +
> > + xid = XLogRecGetXid(r);
> > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > +
> > + if (!TransactionIdIsValid(xid))
> > + break;
> > +
> > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > + invals->nmsgs, invals->msgs);
> >
> > Why should we insert a WAL record for such cases?
> >
>
> Right, if there is any such case, we should avoid it.
I think we don't have any such case because we are logging at the
command end. So I have created an assert instead of the check.
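For illustration, the revised decode branch could then look roughly
like this (a sketch based on the quoted hunk, not the actual change):

case XLOG_XACT_INVALIDATIONS:
    {
        TransactionId xid = XLogRecGetXid(r);
        xl_xact_invalidations *invals;

        invals = (xl_xact_invalidations *) XLogRecGetData(r);

        /* invalidations are logged at command end, so an xid is assigned */
        Assert(TransactionIdIsValid(xid));

        ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
                                     invals->nmsgs, invals->msgs);
        break;
    }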
> One more point about this patch: the commit message needs to be updated:
>
> > The new invalidations are written to WAL immediately, without any
> > such caching. Perhaps it would be possible to add similar caching,
> > e.g. at the command level, or something like that?
>
> I think the above part of the commit message is no longer right, as the
> patch now does such caching at the command level.
Right, I have removed that.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-14 10:27:26 |
Message-ID: | CAA4eK1KeEYTC=MLcv1k1S2onvVu7tsj92NRepM-3YHBTQVJcYQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
>
> >
> > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > > XLogRecordBuffer *buf)
> > > }
> > > case XLOG_XACT_ASSIGNMENT:
> > > break;
> > > + case XLOG_XACT_INVALIDATIONS:
> > > + {
> > > + TransactionId xid;
> > > + xl_xact_invalidations *invals;
> > > +
> > > + xid = XLogRecGetXid(r);
> > > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > > +
> > > + if (!TransactionIdIsValid(xid))
> > > + break;
> > > +
> > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > > + invals->nmsgs, invals->msgs);
> > >
> > > Why should we insert a WAL record for such cases?
> > >
> >
> > Right, if there is any such case, we should avoid it.
>
> I think we don't have any such case because we are logging at the
> command end. So I have created an assert instead of the check.
>
Have you tried to ensure this in some way? One idea could be to add
an Assert (to check if transaction id is assigned) in the new code
where you are writing WAL for this action, and then run make
check-world and/or make installcheck-world.
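For instance, a hypothetical assertion at the point where the
XLOG_XACT_INVALIDATIONS record is assembled, verifying that the
transaction already has an xid assigned:

Assert(TransactionIdIsValid(GetCurrentTransactionIdIfAny()));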
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-14 10:36:56 |
Message-ID: | CAFiTN-umwH=pN=UdTeT_vzk3QTCQp7qRY72qVyNR3S8ZTpOK5g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 14, 2020 at 3:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> >
> > >
> > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > > > XLogRecordBuffer *buf)
> > > > }
> > > > case XLOG_XACT_ASSIGNMENT:
> > > > break;
> > > > + case XLOG_XACT_INVALIDATIONS:
> > > > + {
> > > > + TransactionId xid;
> > > > + xl_xact_invalidations *invals;
> > > > +
> > > > + xid = XLogRecGetXid(r);
> > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > > > +
> > > > + if (!TransactionIdIsValid(xid))
> > > > + break;
> > > > +
> > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > > > + invals->nmsgs, invals->msgs);
> > > >
> > > > Why should we insert a WAL record for such cases?
> > > >
> > >
> > > Right, if there is any such case, we should avoid it.
> >
> > I think we don't have any such case because we are logging at the
> > command end. So I have created an assert instead of the check.
> >
>
> Have you tried to ensure this in some way? One idea could be to add
> an Assert (to check if transaction id is assigned) in the new code
> where you are writing WAL for this action and then run make
> check-world and/or make installcheck-world.
Yeah, I had already tested that.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-14 15:44:26 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-04-14 12:10, Dilip Kumar wrote:
> v14-0001-Immediately-WAL-log-assignments.patch +
> v14-0002-Issue-individual-invalidations-with.patch +
> v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
> v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> v14-0007-Track-statistics-for-streaming.patch +
> v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
> v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
applied on top of 8128b0c (a few hours ago)
Hi,
I haven't followed this thread and maybe this instability is
known/expected; just thought I'd let you know.
When running a pgbench run over logical replication (cascading
down two replicas), I get this segmentation fault.
2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions
committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found
consistent point at 0/5FA2A00
2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running
transactions.
2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was
terminated by signal 11: Segmentation fault
2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running:
COMMIT
2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active
server processes
2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection
because of crash of another server process
2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be
able to reconnect to the database and repeat your command.
This error happens somewhat buried away in my test-stuff; I can dig it
out and make it into a repeatable test if you need it. (debian
stretch/gcc 9.3.0)
Erik Rijkers
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-15 04:44:06 |
Message-ID: | CAFiTN-tcFqePVwmtrx_+ycRgYswAcBKNutdKLAMMJWzh21JJRg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
> On 2020-04-14 12:10, Dilip Kumar wrote:
>
> > v14-0001-Immediately-WAL-log-assignments.patch +
> > v14-0002-Issue-individual-invalidations-with.patch +
> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> > v14-0007-Track-statistics-for-streaming.patch +
> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>
> applied on top of 8128b0c (a few hours ago)
>
> Hi,
>
> I haven't followed this thread and maybe this instability is
> known/expected; just thought I'd let you know.
>
> When running a pgbench run over logical replication (cascading
> down two replicas), I get this segmentation fault.
Thanks for the testing. Is it possible to share the call stack?
>
> 2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions
> committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
> 2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found
> consistent point at 0/5FA2A00
> 2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running
> transactions.
> 2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was
> terminated by signal 11: Segmentation fault
> 2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running:
> COMMIT
> 2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active
> server processes
> 2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection
> because of crash of another server process
> 2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has
> commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be
> able to reconnect to the database and repeat your command.
>
>
> This error happens somewhat buried away in my test-stuff; I can dig it
> out and make it into a repeatable test if you need it. (debian
> stretch/gcc 9.3.0)
Yeah, that will be great.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-16 09:33:18 |
Message-ID: | CAFiTN-tz+jjGHk4dkW=v==fJQZidobB7vp=_yhrT_ij0OLkiBA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
> On 2020-04-14 12:10, Dilip Kumar wrote:
>
> > v14-0001-Immediately-WAL-log-assignments.patch +
> > v14-0002-Issue-individual-invalidations-with.patch +
> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> > v14-0007-Track-statistics-for-streaming.patch +
> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>
> applied on top of 8128b0c (a few hours ago)
Hi Erik,
While setting up the cascading replication I have hit one issue on
base code[1]. After fixing that I have got one crash with streaming
on patch. I am not sure whether you are facing any of these 2 issues
or any other issue. If your issue is not any of these, then please
share the callstack and steps to reproduce.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
bugfix_in_schema_sent.patch | application/octet-stream | 416 bytes |
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-16 09:46:24 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-04-16 11:33, Dilip Kumar wrote:
> On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>>
>> On 2020-04-14 12:10, Dilip Kumar wrote:
>>
>> > v14-0001-Immediately-WAL-log-assignments.patch +
>> > v14-0002-Issue-individual-invalidations-with.patch +
>> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
>> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
>> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
>> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
>> > v14-0007-Track-statistics-for-streaming.patch +
>> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
>> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
>> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>>
>> applied on top of 8128b0c (a few hours ago)
>
I've added your new patch
[bugfix_replica_identity_full_on_subscriber.patch]
on top of all those above but the crash (apparently the same crash) that
I had earlier still occurs (and pretty soon).
server process (PID 1721) was terminated by signal 11: Segmentation
fault
I'll try to isolate it better and get a stacktrace
> Hi Erik,
>
> While setting up the cascading replication I have hit one issue on
> base code[1]. After fixing that I have got one crash with streaming
> on patch. I am not sure whether you are facing any of these 2 issues
> or any other issue. If your issue is not any of these, then please
> share the callstack and steps to reproduce.
>
> [1]
> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
From: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-16 20:10:09 |
Message-ID: | CAGz5QCKUDwAJCv39KNMFxCyfkbz7fdTDeQ=gpGynP71ksNMKug@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
A few review comments from 0006-Add-support-for-streaming*.patch
+ subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
lseek can return a negative value in case of error, right?
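A sketch of the kind of check being suggested (the error message
wording is illustrative only):

off_t       off;

off = lseek(stream_fd, 0, SEEK_END);
if (off < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not seek to end of streaming file: %m")));
subxacts[nsubxacts].offset = off;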
+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ *
+ * Don't check for error from mkdir; it could fail if the directory
+ * already exists (maybe someone else just did the same thing). If
+ * it doesn't work then we'll bomb out when opening the file
+ */
+ mkdir(tempdirpath, S_IRWXU);
If that's the only reason, perhaps we can use something like the following:
if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not create directory \"%s\": %m", tempdirpath)));
+
+ CloseTransientFile(stream_fd);
This might fail to close the file; we should handle that case.
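Again only as a sketch, the close could be checked along these lines:

if (CloseTransientFile(stream_fd) != 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not close streamed transaction file: %m")));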
Also, I think we need some implementation in dumpSubscription() to
dump the (streaming = 'on') option.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-16 21:25:04 |
Message-ID: | 20200416212504.2z77wtkjftqioghx@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 13, 2020 at 05:20:39PM +0530, Dilip Kumar wrote:
>On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>>
>> On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> >
>> > I have rebased the patch on the latest head. I haven't yet changed
>> > anything for xid assignment thing because it is not yet concluded.
>> >
>> Some review comments from 0001-Immediately-WAL-log-*.patch,
>>
>> +bool
>> +IsSubTransactionAssignmentPending(void)
>> +{
>> + if (!XLogLogicalInfoActive())
>> + return false;
>> +
>> + /* we need to be in a transaction state */
>> + if (!IsTransactionState())
>> + return false;
>> +
>> + /* it has to be a subtransaction */
>> + if (!IsSubTransaction())
>> + return false;
>> +
>> + /* the subtransaction has to have a XID assigned */
>> + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
>> + return false;
>> +
>> + /* and it needs to have 'assigned' */
>> + return !CurrentTransactionState->assigned;
>> +
>> +}
>> IMHO, it's important to reduce the complexity of this function, since
>> it's called for every WAL insertion. During the lifespan of a
>> transaction, any of these if conditions will only be evaluated if the
>> previous conditions are true. So, we could maintain some state machine
>> to avoid multiple evaluations of a condition inside a transaction. But,
>> if the overhead is not much, it's not worth it, I guess.
>
>Yeah, maybe; in some cases we can avoid checking multiple conditions by
>maintaining that state. But that state will have to be at the
>transaction level, and I am not sure how worthwhile it would be to add
>one extra condition to skip a few if checks; it would also add code
>complexity. And, in cases where logical decoding is not enabled, it
>may add one extra check? I mean, first check the state, and that will
>take you to the first if check.
>
Perhaps. I think we should only do that if we can demonstrate it's an
issue in practice. Otherwise it's just unnecessary complexity.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-18 09:07:48 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-04-16 11:46, Erik Rijkers wrote:
> On 2020-04-16 11:33, Dilip Kumar wrote:
>> On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>>>
>>> On 2020-04-14 12:10, Dilip Kumar wrote:
>>>
>>> > v14-0001-Immediately-WAL-log-assignments.patch +
>>> > v14-0002-Issue-individual-invalidations-with.patch +
>>> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
>>> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
>>> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch +
>>> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
>>> > v14-0007-Track-statistics-for-streaming.patch +
>>> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
>>> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch +
>>> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>>>
>>> applied on top of 8128b0c (a few hours ago)
>>
>
> I've added your new patch
>
> [bugfix_replica_identity_full_on_subscriber.patch]
>
> on top of all those above but the crash (apparently the same crash)
> that I had earlier still occurs (and pretty soon).
>
> server process (PID 1721) was terminated by signal 11: Segmentation
> fault
>
> I'll try to isolate it better and get a stacktrace
>
>
>> Hi Erik,
>>
>> While setting up the cascading replication I have hit one issue on
>> base code[1]. After fixing that I have got one crash with streaming
>> on patch. I am not sure whether you are facing any of these 2 issues
>> or any other issue. If your issue is not any of these, then please
>> share the callstack and steps to reproduce.
I figured out a few things about this. Attached is a bash script
test.sh, to reproduce:
There is a variable CRASH_IT that determines whether the whole thing
will fail (with a segmentation fault) or not. As attached it has
CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1, then
it will crash. It turns out that this just depends on a short wait
state (3 seconds, on my machine) between setting up the replication and
the running of pgbench. It's possible that on very fast machines it
does not occur; we've had such differences between hardware before.
This is an i5-3330S.
It deletes files, so look it over before you run it. It may also depend
on some of my local set-up, but I guess that should be easily fixed.
Can you let me know if you can reproduce the problem with this?
thanks,
Erik Rijkers
>>
>> [1]
>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>>
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-18 09:10:50 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-04-18 11:07, Erik Rijkers wrote:
>>> Hi Erik,
>>>
>>> While setting up the cascading replication I have hit one issue on
>>> base code[1]. After fixing that I have got one crash with streaming
>>> on patch. I am not sure whether you are facing any of these 2 issues
>>> or any other issue. If your issue is not any of these, then please
>>> share the callstack and steps to reproduce.
>
> I figured out a few things about this. Attached is a bash script
> test.sh, to reproduce:
And the attached file, test.sh. (sorry)
> There is a variable CRASH_IT that determines whether the whole thing
> will fail (with a segmentation fault) or not. As attached it has
> CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1,
> then it will crash. It turns out that this just depends on a short
> wait state (3 seconds, on my machine) between setting up the
> replication and the running of pgbench. It's possible that on very
> fast machines it does not occur; we've had such differences
> between hardware before. This is an i5-3330S.
>
> It deletes files so look it over before you run it. It may also
> depend on some of my local set-up but I guess that should be easily
> fixed.
>
> Can you let me know if you can reproduce the problem with this?
>
> thanks,
>
> Erik Rijkers
>
>
>
>>>
>>> [1]
>>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>>>
>>>
>>> --
>>> Regards,
>>> Dilip Kumar
>>> EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
test.sh | text/x-shellscript | 8.3 KB |
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-18 12:42:32 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-04-18 11:10, Erik Rijkers wrote:
> On 2020-04-18 11:07, Erik Rijkers wrote:
>>>> Hi Erik,
>>>>
>>>> While setting up the cascading replication I have hit one issue on
>>>> base code[1]. After fixing that I have got one crash with streaming
>>>> on patch. I am not sure whether you are facing any of these 2
>>>> issues
>>>> or any other issue. If your issue is not any of these, then please
>>>> share the callstack and steps to reproduce.
>>
>> I figured out a few things about this. Attached is a bash script
>> test.sh, to reproduce:
>
> And the attached file, test.sh. (sorry)
It turns out I must have been mistaken somewhere. (I probably missed
bugfix_in_schema_sent.patch.)
I have just now rebuilt all the instances on top of master with these
patches:
> [v14-0001-Immediately-WAL-log-assignments.patch]
> [v14-0002-Issue-individual-invalidations-with.patch]
> [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> [v14-0007-Track-statistics-for-streaming.patch]
> [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> [bugfix_in_schema_sent.patch]
(by the way: this build's regression tests 'ddl', 'toast', and
'spill' fail)
I seem now able to run all my test programs on these instances without
errors.
Sorry, I seem to have raised a false alarm (although there was initially
certainly a problem).
Erik Rijkers
>> There is a variable CRASH_IT that determines whether the whole thing
>> will fail (with a segmentation fault) or not. As attached it has
>> CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1,
>> then it will crash. It turns out that this just depends on a short
>> wait state (3 seconds, on my machine) between setting up the
>> replication and the running of pgbench. It's possible that on very
>> fast machines it does not occur; we've had such differences
>> between hardware before. This is an i5-3330S.
>>
>> It deletes files so look it over before you run it. It may also
>> depend on some of my local set-up but I guess that should be easily
>> fixed.
>>
>> Can you let me know if you can reproduce the problem with this?
>>
>> thanks,
>>
>> Erik Rijkers
>>
>>
>>
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Dilip Kumar
>>>> EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-18 14:01:37 |
Message-ID: | CAFiTN-u5oVS3C_d8QiTMHZHWn++BqqYtPe-i79DRJmW9u5jKtA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
> On 2020-04-18 11:10, Erik Rijkers wrote:
> > On 2020-04-18 11:07, Erik Rijkers wrote:
> >>>> Hi Erik,
> >>>>
> >>>> While setting up the cascading replication I have hit one issue on
> >>>> base code[1]. After fixing that I have got one crash with streaming
> >>>> on patch. I am not sure whether you are facing any of these 2
> >>>> issues
> >>>> or any other issue. If your issue is not any of these, then please
> >>>> share the callstack and steps to reproduce.
> >>
> >> I figured out a few things about this. Attached is a bash script
> >> test.sh, to reproduce:
> >
> > And the attached file, test.sh. (sorry)
>
> It turns out I must have been mistaken somewhere. (I probably missed
> bugfix_in_schema_sent.patch.)
>
> I have just now rebuilt all the instances on top of master with these
> patches:
>
> > [v14-0001-Immediately-WAL-log-assignments.patch]
> > [v14-0002-Issue-individual-invalidations-with.patch]
> > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > [v14-0007-Track-statistics-for-streaming.patch]
> > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > [bugfix_in_schema_sent.patch]
>
> (by the way: this build's regression tests 'ddl', 'toast', and
> 'spill' fail)
>
> I seem now able to run all my test programs on these instances without
> errors.
>
> Sorry, I seem to have raised a false alarm (although there was initially
> certainly a problem).
No problem, thanks for confirming.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-21 12:00:35 |
Message-ID: | CAFiTN-ugQ90mZ7UgvUjfgURa3H3YqPi2tnhx6MArmA1WuTgFww@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
> On 2020-04-18 11:10, Erik Rijkers wrote:
> > On 2020-04-18 11:07, Erik Rijkers wrote:
> >>>> Hi Erik,
> >>>>
> >>>> While setting up the cascading replication I have hit one issue on
> >>>> base code[1]. After fixing that I have got one crash with streaming
> >>>> on patch. I am not sure whether you are facing any of these 2
> >>>> issues
> >>>> or any other issue. If your issue is not any of these, then please
> >>>> share the callstack and steps to reproduce.
> >>
> >> I figured out a few things about this. Attached is a bash script
> >> test.sh, to reproduce:
> >
> > And the attached file, test.sh. (sorry)
>
> It turns out I must have been mistaken somewhere. (I probably missed
> bugfix_in_schema_sent.patch.)
>
> I have just now rebuilt all the instances on top of master with these
> patches:
>
> > [v14-0001-Immediately-WAL-log-assignments.patch]
> > [v14-0002-Issue-individual-invalidations-with.patch]
> > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > [v14-0007-Track-statistics-for-streaming.patch]
> > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > [bugfix_in_schema_sent.patch]
>
> (by the way: this build's regression tests 'ddl', 'toast', and
> 'spill' fail)
Yeah, this is a known issue: while streaming the transaction, the
output message is changed. I have a plan to work on this part.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-22 14:49:28 |
Message-ID: | CAFiTN-tE7FNneDfPURWn_d5902WsrbWi_fZoXNjTx+5Fp+OxXw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> >
> > On 2020-04-18 11:10, Erik Rijkers wrote:
> > > On 2020-04-18 11:07, Erik Rijkers wrote:
> > >>>> Hi Erik,
> > >>>>
> > >>>> While setting up the cascading replication I have hit one issue on
> > >>>> the base code[1]. After fixing that, I got one crash with streaming
> > >>>> on the patch. I am not sure whether you are facing either of these
> > >>>> 2 issues or some other issue. If your issue is not one of these,
> > >>>> then please share the callstack and steps to reproduce.
> > >>
> > >> I figured out a few things about this. Attached is a bash script
> > >> test.sh, to reproduce:
> > >
> > > And the attached file, test.sh. (sorry)
> >
> > It turns out I must have been mistaken somewhere. I probably missed
> > bugfix_in_schema_sent.patch.
> >
> > I have just now rebuilt all the instances on top of master with these
> > patches:
> >
> > > [v14-0001-Immediately-WAL-log-assignments.patch]
> > > [v14-0002-Issue-individual-invalidations-with.patch]
> > > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > > [v14-0007-Track-statistics-for-streaming.patch]
> > > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > > [bugfix_in_schema_sent.patch]
> >
> > (by the way: this build's regression tests 'ddl', 'toast', and
> > 'spill' fail)
>
> Yeah, this is a known issue: while streaming the transaction, the
> output message is changed. I have a plan to work on this part.
I have fixed this part. Basically, I have now created a separate
function, 'pg_logical_slot_get_streaming_changes', to get the
streaming changes. So the default function pg_logical_slot_get_changes
will work as it is, and the test_decoding test cases will not fail.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-22 16:01:23 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-04-22 16:49, Dilip Kumar wrote:
> On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
>>
>> >
>> > (by the way: this build's regression tests 'ddl', 'toast', and
>> > 'spill' fail)
>>
>> Yeah, this is a known issue: while streaming the transaction, the
>> output message is changed. I have a plan to work on this part.
>
> I have fixed this part. Basically, I have now created a separate
> function, 'pg_logical_slot_get_streaming_changes', to get the
> streaming changes. So the default function pg_logical_slot_get_changes
> will work as it is, and the test_decoding test cases will not fail.
The 'ddl' one is apparently not quite fixed - I get this from (cd
contrib; make check) (in both assert-enabled and non-assert-enabled
builds)
grep -A7 -B7 make.check_contrib.out:
contrib/make.check_contrib.out-============== initializing database system ==============
contrib/make.check_contrib.out-============== starting postmaster ==============
contrib/make.check_contrib.out-running on port 64464 with PID 9175
contrib/make.check_contrib.out-============== creating database "contrib_regression" ==============
contrib/make.check_contrib.out-CREATE DATABASE
contrib/make.check_contrib.out-ALTER DATABASE
contrib/make.check_contrib.out-============== running regression test queries ==============
contrib/make.check_contrib.out:test ddl                   ... FAILED      840 ms
contrib/make.check_contrib.out-test xact                  ... ok           24 ms
contrib/make.check_contrib.out-test rewrite               ... ok          187 ms
contrib/make.check_contrib.out-test toast                 ... ok          851 ms
contrib/make.check_contrib.out-test permissions           ... ok           26 ms
contrib/make.check_contrib.out-test decoding_in_xact      ... ok           31 ms
contrib/make.check_contrib.out-test decoding_into_rel     ... ok           25 ms
contrib/make.check_contrib.out-test binary                ... ok           12 ms
Otherwise, the patches apply and build OK, so I will go run some tests...
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-23 03:24:41 |
Message-ID: | CAFiTN-sb9tCNSc=-dpjMD4iOXEf4BGj+hmWZ2hBa7iNLob_akA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
> On 2020-04-22 16:49, Dilip Kumar wrote:
> > On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> > wrote:
> >>
> >> >
> >> > (by the way: this build's regression tests 'ddl', 'toast', and
> >> > 'spill' fail)
> >>
> >> Yeah, this is a known issue: while streaming the transaction, the
> >> output message is changed. I have a plan to work on this part.
> >
> > I have fixed this part. Basically, I have now created a separate
> > function, 'pg_logical_slot_get_streaming_changes', to get the
> > streaming changes. So the default function pg_logical_slot_get_changes
> > will work as it is, and the test_decoding test cases will not fail.
>
> The 'ddl' one is apparently not quite fixed - I get this from (cd
> contrib; make check) (in both assert-enabled and non-assert-enabled
> builds)
Can you send me the contrib/test_decoding/regression.diffs file?
> grep -A7 -B7 make.check_contrib.out:
>
> contrib/make.check_contrib.out-============== initializing database system ==============
> contrib/make.check_contrib.out-============== starting postmaster ==============
> contrib/make.check_contrib.out-running on port 64464 with PID 9175
> contrib/make.check_contrib.out-============== creating database "contrib_regression" ==============
> contrib/make.check_contrib.out-CREATE DATABASE
> contrib/make.check_contrib.out-ALTER DATABASE
> contrib/make.check_contrib.out-============== running regression test queries ==============
> contrib/make.check_contrib.out:test ddl                   ... FAILED      840 ms
> contrib/make.check_contrib.out-test xact                  ... ok           24 ms
> contrib/make.check_contrib.out-test rewrite               ... ok          187 ms
> contrib/make.check_contrib.out-test toast                 ... ok          851 ms
> contrib/make.check_contrib.out-test permissions           ... ok           26 ms
> contrib/make.check_contrib.out-test decoding_in_xact      ... ok           31 ms
> contrib/make.check_contrib.out-test decoding_into_rel     ... ok           25 ms
> contrib/make.check_contrib.out-test binary                ... ok           12 ms
>
> Otherwise, the patches apply and build OK, so I will go run some tests...
Thanks
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-23 08:58:26 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-04-23 05:24, Dilip Kumar wrote:
> On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>>
>> The 'ddl' one is apparently not quite fixed - I get this from (cd
>> contrib; make check) (in both assert-enabled and non-assert-enabled
>> builds)
>
> Can you send me the contrib/test_decoding/regression.diffs file?
Attached.
Below is the patch list, in case that was unclear:
20200422/v15-0001-Immediately-WAL-log-assignments.patch
20200422/v15-0002-Issue-individual-invalidations-with-wal_level-lo.patch
20200422/v15-0003-Extend-the-output-plugin-API-with-stream-methods.patch
20200422/v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
20200422/v15-0005-Implement-streaming-mode-in-ReorderBuffer.patch
20200422/v15-0006-Add-support-for-streaming-to-built-in-replicatio.patch
20200422/v15-0007-Track-statistics-for-streaming.patch
20200422/v15-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
20200422/v15-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
20200422/v15-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
20200422/v15-0011-Provide-new-api-to-get-the-streaming-changes.patch
20200414/bugfix_in_schema_sent.patch
>> grep -A7 -B7 make.check_contrib.out:
>>
>> contrib/make.check_contrib.out-============== initializing database system ==============
>> contrib/make.check_contrib.out-============== starting postmaster ==============
>> contrib/make.check_contrib.out-running on port 64464 with PID 9175
>> contrib/make.check_contrib.out-============== creating database "contrib_regression" ==============
>> contrib/make.check_contrib.out-CREATE DATABASE
>> contrib/make.check_contrib.out-ALTER DATABASE
>> contrib/make.check_contrib.out-============== running regression test queries ==============
>> contrib/make.check_contrib.out:test ddl                   ... FAILED      840 ms
>> contrib/make.check_contrib.out-test xact                  ... ok           24 ms
>> contrib/make.check_contrib.out-test rewrite               ... ok          187 ms
>> contrib/make.check_contrib.out-test toast                 ... ok          851 ms
>> contrib/make.check_contrib.out-test permissions           ... ok           26 ms
>> contrib/make.check_contrib.out-test decoding_in_xact      ... ok           31 ms
>> contrib/make.check_contrib.out-test decoding_into_rel     ... ok           25 ms
>> contrib/make.check_contrib.out-test binary                ... ok           12 ms
>>
>> Otherwise, the patches apply and build OK, so I will go run some tests...
>
> Thanks
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
regression.diffs | text/x-diff | 347.3 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-24 06:24:46 |
Message-ID: | CAFiTN-tXmq1w6qmhRdc9PN50nW1wBAFMub7Ze53EYUxp85Qxwg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
> On 2020-04-23 05:24, Dilip Kumar wrote:
> > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> >>
> >> The 'ddl' one is apparently not quite fixed - I get this from (cd
> >> contrib; make check) (in both assert-enabled and non-assert-enabled
> >> builds)
> >
> > Can you send me the contrib/test_decoding/regression.diffs file?
>
> Attached.
So from regression.diffs, it appears that it is failing in a memory
allocation (+ERROR: invalid memory alloc request size
94119198201896). My colleague tried to reproduce this in a different
environment, but with no success so far. One more thing that surprises
me is that after
(v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
it should never take the streaming path at all. However, we can not
ignore the fact that some of the changes might impact the
non-streaming path as well. Is it possible for you to somehow stop or
break the code and send the stack trace? One idea: from the log we can
see where the error is raised, i.e. MemoryContextAlloc or palloc or
some other similar function. Once we know that, we can convert that
error to an assert and find the call stack.
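To make that last idea concrete, here is a minimal sketch (purely
illustrative, a debugging hack rather than a proposed change) of what
that could look like at the reported error site in
src/backend/utils/mmgr/mcxt.c:

	/*
	 * Debugging hack only: in an assert-enabled build, abort right at the
	 * bogus allocation request so that the core file (or an attached gdb)
	 * shows the full call stack of whoever passed the garbage size.
	 */
	if (!AllocSizeIsValid(size))
	{
		Assert(false);		/* capture the caller's stack here */
		elog(ERROR, "invalid memory alloc request size %zu", size);
	}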
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-27 10:34:55 |
Message-ID: | CAFiTN-vnnrk580ucZVYnub_UQ-ayROew8fQ2Yn5aFYMeF0U03w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Apr 17, 2020 at 1:40 AM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
>
> A few review comments on 0006-Add-support-for-streaming*.patch:
>
> + subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
> lseek can return a negative value in case of error, right?
>
> + /*
> + * We might need to create the tablespace's tempfile directory, if no
> + * one has yet done so.
> + *
> + * Don't check for error from mkdir; it could fail if the directory
> + * already exists (maybe someone else just did the same thing). If
> + * it doesn't work then we'll bomb out when opening the file
> + */
> + mkdir(tempdirpath, S_IRWXU);
> If that's the only reason, perhaps we can use something like the following:
>
> if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
> throw error;
Done.
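For reference, the pattern suggested above, written as a small
standalone helper, might look like this (the helper name and error
wording are illustrative, not taken from the patch):

#include "postgres.h"

#include <errno.h>
#include <sys/stat.h>

/*
 * Create the tablespace's tempfile directory if no one has done so yet.
 * EEXIST from a concurrent creator is fine; any other mkdir() failure
 * is reported and aborts the operation.
 */
static void
ensure_stream_tempdir(const char *tempdirpath)
{
	if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create directory \"%s\": %m",
						tempdirpath)));
}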
>
> +
> + CloseTransientFile(stream_fd);
> This might fail to close the file. We should handle that case.
Changed.
Still, one place is pending, because I don't have the filename there to
report in the error message. One option is to just give the error
without the filename. I will think some more about this part.
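The handled places follow the usual pattern, roughly like this sketch
(variable names illustrative):

	if (CloseTransientFile(stream_fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close file \"%s\": %m", path)));

For the pending place, the option mentioned above would simply drop the
\"%s\" and the path argument from the message.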
> Also, I think we need some implementations in dumpSubscription() to
> dump the (streaming = 'on') option.
Right; I have created another patch for that and attached it.
I have also fixed a couple of bugs internally reported by my colleague
Neha Sharma.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-27 10:43:03 |
Message-ID: | CAA4eK1+tFqKuHrtoOMvChD7aWC3EQUnOHxn=gDGWzkJFkGHe3Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have also fixed a couple of bugs internally reported by my colleague
> Neha Sharma.
>
I think it would be good if you could briefly explain what the bugs
were and how you fixed them.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-27 11:26:33 |
Message-ID: | CAFiTN-si3LGf+9B=LQ82pegiRsZ2OXs6S=isz5nX1vtwx9Ds4w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 27, 2020 at 4:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have also fixed a couple of bugs internally reported by my colleague
> > Neha Sharma.
> >
>
> I think it would be good if you could briefly explain what the bugs
> were and how you fixed them.
Issue 1: If the concurrent transaction was aborted, then in the CATCH
block we were not freeing the memory of the toast_hash, and that was
tripping the assert that txn->size is 0 after the stream is complete.

Issue 2: After streaming is complete we set txn->final_lsn, remembering
it in a local variable. But mistakenly the variable was local to the
TRY block, so on a concurrent abort the CATCH block always saw its
value as zero. As a result, final_lsn after streaming became 0, and
that was tripping an assert as well.
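To illustrate Issue 2 in isolation, here is a self-contained sketch
(plain C, hypothetical names; PG_TRY/PG_CATCH are built on
sigsetjmp/siglongjmp) of the scoping pitfall: a value the CATCH path
must observe has to live in the enclosing scope, and be volatile if it
is modified inside the TRY block.

#include <setjmp.h>
#include <stdio.h>

static sigjmp_buf env;

int
main(void)
{
	/* enclosing scope + volatile: survives the longjmp and is visible
	 * in the "catch" branch -- the fix described above */
	volatile long final_lsn = 0;

	if (sigsetjmp(env, 0) == 0)
	{
		/* the buggy version declared its variable here, inside the
		 * "try" block, so the catch path could never see it */
		final_lsn = 12345;		/* remembered while streaming */
		siglongjmp(env, 1);		/* simulate a concurrent abort */
	}
	else
	{
		/* the "catch" path: prints 12345; with the buggy placement
		 * it would have observed 0 */
		printf("final_lsn in catch: %ld\n", final_lsn);
	}
	return 0;
}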
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-28 09:41:27 |
Message-ID: | CAA4eK1Jp_SEhLyt9KzNR2iS5oDp6zQFR5su_gVpbdL2OrcqjUA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
[latest patches]
v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
- Any actions leading to transaction ID assignment are prohibited. That, among others,
+ Note that access to user catalog tables or regular system catalog tables
+ in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+ Access via the <literal>heap_*</literal> scan APIs will error out.
+ Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
 	bool		valid;
 
+	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+				 !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		elog(ERROR, "unexpected heap_fetch call during logical decoding");
+
I think the comments and the code don't match. The comment says that,
in output plugins, access to user catalog tables or regular system
catalog tables won't be allowed via the heap_* APIs, but the code
doesn't seem to reflect that. I feel only
TransactionIdIsValid(CheckXidAlive) is sufficient here. See the
original discussion about this point [1] (refer to "I think it'd also
be good to add assertions to codepaths not going through systable_*
asserting that ...").
Isn't it better to block scans of user catalog tables or regular
system catalog tables in the tableam scan APIs rather than at the heap
level? There might be some APIs like heap_getnext where such a check
is still required, but I guess it is still better to block at the
tableam level.
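As a rough sketch of that suggestion (illustrative only: the wrapper
mirrors table_tuple_fetch_row_version() from tableam.h, and the added
Assert and the CheckXidAlive extern are hypothetical here):

#include "postgres.h"

#include "access/tableam.h"
#include "access/transam.h"
#include "utils/rel.h"

/* CheckXidAlive comes from the 0004 patch; declared here only to keep
 * the sketch self-contained */
extern TransactionId CheckXidAlive;

/*
 * Guard placed once in the table AM wrapper instead of in every heap_*
 * routine underneath it.
 */
static inline bool
table_tuple_fetch_row_version_guarded(Relation rel, ItemPointer tid,
									  Snapshot snapshot,
									  TupleTableSlot *slot)
{
	/* no direct fetches while decoding an in-progress transaction */
	Assert(!TransactionIdIsValid(CheckXidAlive));

	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot,
													slot);
}

(Note: as discussed later in the thread, a bare assert at this level
also trips for systable_* access, which reaches the heap through these
same wrappers.)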
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-28 10:25:20 |
Message-ID: | CAFiTN-uCw3q_0oUODHD4NcjGEo0GRP87Zf08xN6n6wooPtQwWA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> [latest patches]
>
> v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> - Any actions leading to transaction ID assignment are prohibited. That, among others,
> + Note that access to user catalog tables or regular system catalog tables
> + in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
> + Access via the <literal>heap_*</literal> scan APIs will error out.
> + Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
> ..
> @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
>  	bool		valid;
> 
> +	/*
> +	 * We don't expect direct calls to heap_fetch with valid
> +	 * CheckXidAlive for regular tables. Track that below.
> +	 */
> +	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> +				 !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> +		elog(ERROR, "unexpected heap_fetch call during logical decoding");
> +
>
> I think the comments and the code don't match. The comment says that,
> in output plugins, access to user catalog tables or regular system
> catalog tables won't be allowed via the heap_* APIs, but the code
> doesn't seem to reflect that. I feel only
> TransactionIdIsValid(CheckXidAlive) is sufficient here. See the
> original discussion about this point [1] (refer to "I think it'd also
> be good to add assertions to codepaths not going through systable_*
> asserting that ...").
Right. So I think we can just add an assert in these functions, i.e.
Assert(!TransactionIdIsValid(CheckXidAlive))?
>
> Isn't it better to block scans of user catalog tables or regular
> system catalog tables in the tableam scan APIs rather than at the heap
> level? There might be some APIs like heap_getnext where such a check
> is still required, but I guess it is still better to block at the
> tableam level.
>
> [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
Okay, let me analyze this part. In some places we would have to keep
the check at the heap level, like heap_getnext, and in other places at
the tableam level, so it seems a bit inconsistent. Also, I think the
number of checks might increase, because some of the heap functions,
like heap_hot_search_buffer, are called from multiple tableam calls,
so we would need to put the check in every such place.

Another point is that I feel some of the checks we have today might
not be required; heap_finish_speculative, for example, is not fetching
any tuple for us, so why do we need to care about that function?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-29 03:18:42 |
Message-ID: | CAA4eK1+XLKmV9u06zK1T6SchD-a39hHqg3iQ1c4uoHQk3J5DGA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > [latest patches]
> >
> > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > - Any actions leading to transaction ID assignment are prohibited. That, among others,
> > + Note that access to user catalog tables or regular system catalog tables
> > + in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
> > + Access via the <literal>heap_*</literal> scan APIs will error out.
> > + Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
> > ..
> > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> >  	bool		valid;
> > 
> > +	/*
> > +	 * We don't expect direct calls to heap_fetch with valid
> > +	 * CheckXidAlive for regular tables. Track that below.
> > +	 */
> > +	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > +				 !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > +		elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > +
> >
> > I think the comments and the code don't match. The comment says that,
> > in output plugins, access to user catalog tables or regular system
> > catalog tables won't be allowed via the heap_* APIs, but the code
> > doesn't seem to reflect that. I feel only
> > TransactionIdIsValid(CheckXidAlive) is sufficient here. See the
> > original discussion about this point [1] (refer to "I think it'd also
> > be good to add assertions to codepaths not going through systable_*
> > asserting that ...").
>
> Right. So I think we can just add an assert in these functions, i.e.
> Assert(!TransactionIdIsValid(CheckXidAlive))?
>
I am fine with an Assertion, but update the documentation accordingly.
However, I think you should cross-verify whether there are any output
plugins that are already using such APIs. There is a list of "Logical
Decoding Plugins" on the wiki [1]; just look through those once.
> >
> > Isn't it better to block scans of user catalog tables or regular
> > system catalog tables in the tableam scan APIs rather than at the heap
> > level? There might be some APIs like heap_getnext where such a check
> > is still required, but I guess it is still better to block at the
> > tableam level.
> >
> > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
>
> Okay, let me analyze this part. In some places we would have to keep
> the check at the heap level, like heap_getnext, and in other places at
> the tableam level, so it seems a bit inconsistent. Also, I think the
> number of checks might increase, because some of the heap functions,
> like heap_hot_search_buffer, are called from multiple tableam calls,
> so we would need to put the check in every such place.
>
> Another point is that I feel some of the checks we have today might
> not be required; heap_finish_speculative, for example, is not fetching
> any tuple for us, so why do we need to care about that function?
>
Yeah, I don't see the need for such a check (or Assertion) in
heap_finish_speculative.
One additional comment:
---------------------------------------
- Any actions leading to transaction ID assignment are prohibited.
That, among others,
+ Note that access to user catalog tables or regular system catalog tables
+ in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+ Access via the <literal>heap_*</literal> scan APIs will error out.
+ Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,
The above text doesn't seem to be aligned properly, and you will need
to update it if we change the error to an Assertion for the heap APIs.
[1] - https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Mahendra Singh Thalor <mahi6run(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Erik Rijkers <er(at)xs4all(dot)nl>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-29 05:45:36 |
Message-ID: | CAKYtNAoq7bDZ5Aqzi-MGUj8oMaic4E=3+k_kW6Zx1_hV8dMByQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> >
> > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> > >>
> > >> The 'ddl' one is apparently not quite fixed - I get this from (cd
> > >> contrib; make check) (in both assert-enabled and non-assert-enabled
> > >> builds)
> > >
> > > Can you send me the contrib/test_decoding/regression.diffs file?
> >
> > Attached.
>
> So from regression.diffs, it appears that it is failing in a memory
> allocation (+ERROR: invalid memory alloc request size
> 94119198201896). My colleague tried to reproduce this in a different
> environment, but with no success so far. One more thing that surprises
> me is that after
> (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> it should never take the streaming path at all. However, we can not
> ignore the fact that some of the changes might impact the
> non-streaming path as well. Is it possible for you to somehow stop or
> break the code and send the stack trace? One idea: from the log we can
> see where the error is raised, i.e. MemoryContextAlloc or palloc or
> some other similar function. Once we know that, we can convert that
> error to an assert and find the call stack.
>
> --
Thanks, Erik, for reporting this issue.
I am able to reproduce this issue (+ERROR: invalid memory alloc
request size) on top of the v16 patch set. I applied all patches (12
patches) of the v16 series and then ran "make check -i" from the
"contrib/test_decoding" folder. Below is the stack trace of the error:
#0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
size=94605581787992) at mcxt.c:806
#1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
(rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
reorderbuffer.c:3680
#2 0x0000560b130f0662 in ReorderBufferRestoreChanges
(rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
segno=0x560b1418ad20) at reorderbuffer.c:3564
#3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
#4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
command_id=0, streaming=false)
at reorderbuffer.c:1785
#5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
xid=508, commit_lsn=25986584, end_lsn=25989088,
commit_time=641449268431600, origin_id=0, origin_lsn=0)
at reorderbuffer.c:2315
#6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
#7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0) at decode.c:261
#8 0x0000560b130cf99a in LogicalDecodingProcessRecord
(ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
#9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
(fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
at logicalfuncs.c:285
#10 0x0000560b130dbe71 in pg_logical_slot_get_changes
(fcinfo=0x560b1417ee50) at logicalfuncs.c:354
#11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
(setexpr=0x560b14177838, econtext=0x560b14177748,
argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
randomAccess=false) at execSRF.c:234
#12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
nodeFunctionscan.c:94
#13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:133
#14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:199
#15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
nodeFunctionscan.c:270
#16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
execProcnode.c:450
#17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
../../../src/include/executor/executor.h:245
#18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
at nodeAgg.c:566
#19 0x0000560b12e4398f in agg_fill_hash_table
(aggstate=0x560b14176f40) at nodeAgg.c:2518
#20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
#21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
execProcnode.c:450
#22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
../../../src/include/executor/executor.h:245
#23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
#24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
execProcnode.c:450
#25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
../../../src/include/executor/executor.h:245
#26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
planstate=0x560b14176d28, use_parallel_mode=false,
operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x560b1419d188,
execute_once=true) at execMain.c:1646
#27 0x0000560b12e11a19 in standard_ExecutorRun
(queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:364
#28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:308
#29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
#30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x560b1419d188, altdest=0x560b1419d188,
qc=0x7ffef18b2350) at pquery.c:756
#31 0x0000560b131e550b in exec_simple_query (
query_string=0x560b14076720 "/* display results, but hide most of the
output */\nSELECT count(*), min(data), max(data)\nFROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
postgres.c:1239
#32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
"mahendrathalor") at postgres.c:4315
#33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
#34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
postmaster.c:4202
#35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
#36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
at postmaster.c:1400
#37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
I have an Ubuntu setup. I think this is reproducing only on Ubuntu. I
am looking into this issue with Dilip.
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
regression.diffs | application/octet-stream | 347.3 KB |
From: | Mahendra Singh Thalor <mahi6run(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Erik Rijkers <er(at)xs4all(dot)nl>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-29 07:07:19 |
Message-ID: | CAKYtNArya2K5kWKP57=fQMs6+NAwVJk7CDWRkWn6jxD4BM6nig@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run(at)gmail(dot)com> wrote:
>
> On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> > >
> > > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> > > >>
> > > >> The 'ddl' one is apparently not quite fixed - I get this from (cd
> > > >> contrib; make check) (in both assert-enabled and non-assert-enabled
> > > >> builds)
> > > >
> > > > Can you send me the contrib/test_decoding/regression.diffs file?
> > >
> > > Attached.
> >
> > So from regression.diffs, it appears that it is failing in a memory
> > allocation (+ERROR: invalid memory alloc request size
> > 94119198201896). My colleague tried to reproduce this in a different
> > environment, but with no success so far. One more thing that surprises
> > me is that after
> > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> > it should never take the streaming path at all. However, we can not
> > ignore the fact that some of the changes might impact the
> > non-streaming path as well. Is it possible for you to somehow stop or
> > break the code and send the stack trace? One idea: from the log we can
> > see where the error is raised, i.e. MemoryContextAlloc or palloc or
> > some other similar function. Once we know that, we can convert that
> > error to an assert and find the call stack.
> >
> > --
>
> Thanks, Erik, for reporting this issue.
>
> I am able to reproduce this issue (+ERROR: invalid memory alloc
> request size) on top of the v16 patch set. I applied all patches (12
> patches) of the v16 series and then ran "make check -i" from the
> "contrib/test_decoding" folder. Below is the stack trace of the error:
>
> #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
> size=94605581787992) at mcxt.c:806
> #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
> (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
> reorderbuffer.c:3680
> #2 0x0000560b130f0662 in ReorderBufferRestoreChanges
> (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
> segno=0x560b1418ad20) at reorderbuffer.c:3564
> #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
> txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
> #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
> txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
> command_id=0, streaming=false)
> at reorderbuffer.c:1785
> #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
> xid=508, commit_lsn=25986584, end_lsn=25989088,
> commit_time=641449268431600, origin_id=0, origin_lsn=0)
> at reorderbuffer.c:2315
> #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
> buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
> #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
> buf=0x7ffef18b19b0) at decode.c:261
> #8 0x0000560b130cf99a in LogicalDecodingProcessRecord
> (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
> #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
> (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
> at logicalfuncs.c:285
> #10 0x0000560b130dbe71 in pg_logical_slot_get_changes
> (fcinfo=0x560b1417ee50) at logicalfuncs.c:354
> #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
> (setexpr=0x560b14177838, econtext=0x560b14177748,
> argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
> randomAccess=false) at execSRF.c:234
> #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
> nodeFunctionscan.c:94
> #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
> accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> <FunctionRecheck>) at execScan.c:133
> #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
> accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> <FunctionRecheck>) at execScan.c:199
> #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
> nodeFunctionscan.c:270
> #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
> execProcnode.c:450
> #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
> ../../../src/include/executor/executor.h:245
> #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
> at nodeAgg.c:566
> #19 0x0000560b12e4398f in agg_fill_hash_table
> (aggstate=0x560b14176f40) at nodeAgg.c:2518
> #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
> #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
> execProcnode.c:450
> #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
> ../../../src/include/executor/executor.h:245
> #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
> #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
> execProcnode.c:450
> #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
> ../../../src/include/executor/executor.h:245
> #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
> planstate=0x560b14176d28, use_parallel_mode=false,
> operation=CMD_SELECT, sendTuples=true, numberTuples=0,
> direction=ForwardScanDirection, dest=0x560b1419d188,
> execute_once=true) at execMain.c:1646
> #27 0x0000560b12e11a19 in standard_ExecutorRun
> (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
> execute_once=true) at execMain.c:364
> #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
> direction=ForwardScanDirection, count=0, execute_once=true) at
> execMain.c:308
> #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
> forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
> #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
> count=9223372036854775807, isTopLevel=true, run_once=true,
> dest=0x560b1419d188, altdest=0x560b1419d188,
> qc=0x7ffef18b2350) at pquery.c:756
> #31 0x0000560b131e550b in exec_simple_query (
> query_string=0x560b14076720 "/* display results, but hide most of the
> output */\nSELECT count(*), min(data), max(data)\nFROM
> pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
> postgres.c:1239
> #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
> dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
> "mahendrathalor") at postgres.c:4315
> #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
> #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
> postmaster.c:4202
> #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
> #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
> at postmaster.c:1400
> #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
>
> I have an Ubuntu setup. I think this is reproducing only on Ubuntu. I
> am looking into this issue with Dilip.
This error is due to an invalid size.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index eed9a5048b..487c1b4252 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change->data.inval.invalidations =
 				MemoryContextAlloc(rb->context,
-								   change->data.msg.message_size);
+								   inval_size);
 
 			/* read the message */
 			memcpy(change->data.inval.invalidations, data, inval_size);
 			data += inval_size;
The above change fixes the error. Thanks, Dilip, for helping.
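The underlying bug class is easy to demonstrate in isolation. Below is
a self-contained sketch (plain C, hypothetical struct shapes loosely
mirroring ReorderBufferChange): reading the size through the wrong
union member interprets unrelated bytes -- here a pointer -- as a size,
which is consistent with the huge, pointer-like allocation requests in
the reports above.

#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t message_size; char *message; } MsgPart;
typedef struct { char *invalidations; } InvalPart;

typedef struct
{
	union
	{
		MsgPart		msg;
		InvalPart	inval;
	}			data;
} Change;

int
main(void)
{
	Change		change = {0};
	size_t		inval_size = 64;	/* the size actually read from disk */

	/* populate the union as an invalidation, not a message */
	change.data.inval.invalidations = malloc(inval_size);

	/* BUG: msg.message_size overlays the pointer bytes -> garbage size */
	printf("wrong size: %zu\n", change.data.msg.message_size);

	/* FIX: use the independently tracked inval_size */
	printf("right size: %zu\n", inval_size);

	free(change.data.inval.invalidations);
	return 0;
}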
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Mahendra Singh Thalor <mahi6run(at)gmail(dot)com> |
Cc: | Erik Rijkers <er(at)xs4all(dot)nl>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-29 07:20:22 |
Message-ID: | CAFiTN-uY3by6E6pFjN6RuDRMKMiNZQiRL81u5ppzWQ7N3VVBAQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Apr 29, 2020 at 12:37 PM Mahendra Singh Thalor
<mahi6run(at)gmail(dot)com> wrote:
>
> On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run(at)gmail(dot)com> wrote:
> >
> > On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> > > >
> > > > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> > > > >>
> > > > >> The 'ddl' one is apparently not quite fixed - I get this from (cd
> > > > >> contrib; make check) (in both assert-enabled and non-assert-enabled
> > > > >> builds)
> > > > >
> > > > > Can you send me the contrib/test_decoding/regression.diffs file?
> > > >
> > > > Attached.
> > >
> > > So from regression.diffs, it appears that it is failing in a memory
> > > allocation (+ERROR: invalid memory alloc request size
> > > 94119198201896). My colleague tried to reproduce this in a different
> > > environment, but with no success so far. One more thing that surprises
> > > me is that after
> > > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> > > it should never take the streaming path at all. However, we can not
> > > ignore the fact that some of the changes might impact the
> > > non-streaming path as well. Is it possible for you to somehow stop or
> > > break the code and send the stack trace? One idea: from the log we can
> > > see where the error is raised, i.e. MemoryContextAlloc or palloc or
> > > some other similar function. Once we know that, we can convert that
> > > error to an assert and find the call stack.
> > >
> > > --
> >
> > Thanks, Erik, for reporting this issue.
> >
> > I am able to reproduce this issue (+ERROR: invalid memory alloc
> > request size) on top of the v16 patch set. I applied all patches (12
> > patches) of the v16 series and then ran "make check -i" from the
> > "contrib/test_decoding" folder. Below is the stack trace of the error:
> >
> > #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
> > size=94605581787992) at mcxt.c:806
> > #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
> > (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
> > reorderbuffer.c:3680
> > #2 0x0000560b130f0662 in ReorderBufferRestoreChanges
> > (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
> > segno=0x560b1418ad20) at reorderbuffer.c:3564
> > #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
> > txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
> > #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
> > txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
> > command_id=0, streaming=false)
> > at reorderbuffer.c:1785
> > #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
> > xid=508, commit_lsn=25986584, end_lsn=25989088,
> > commit_time=641449268431600, origin_id=0, origin_lsn=0)
> > at reorderbuffer.c:2315
> > #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
> > buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
> > #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
> > buf=0x7ffef18b19b0) at decode.c:261
> > #8 0x0000560b130cf99a in LogicalDecodingProcessRecord
> > (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
> > #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
> > (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
> > at logicalfuncs.c:285
> > #10 0x0000560b130dbe71 in pg_logical_slot_get_changes
> > (fcinfo=0x560b1417ee50) at logicalfuncs.c:354
> > #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
> > (setexpr=0x560b14177838, econtext=0x560b14177748,
> > argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
> > randomAccess=false) at execSRF.c:234
> > #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
> > nodeFunctionscan.c:94
> > #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
> > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> > <FunctionRecheck>) at execScan.c:133
> > #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
> > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> > <FunctionRecheck>) at execScan.c:199
> > #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
> > nodeFunctionscan.c:270
> > #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
> > execProcnode.c:450
> > #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
> > ../../../src/include/executor/executor.h:245
> > #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
> > at nodeAgg.c:566
> > #19 0x0000560b12e4398f in agg_fill_hash_table
> > (aggstate=0x560b14176f40) at nodeAgg.c:2518
> > #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
> > #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
> > execProcnode.c:450
> > #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
> > ../../../src/include/executor/executor.h:245
> > #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
> > #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
> > execProcnode.c:450
> > #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
> > ../../../src/include/executor/executor.h:245
> > #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
> > planstate=0x560b14176d28, use_parallel_mode=false,
> > operation=CMD_SELECT, sendTuples=true, numberTuples=0,
> > direction=ForwardScanDirection, dest=0x560b1419d188,
> > execute_once=true) at execMain.c:1646
> > #27 0x0000560b12e11a19 in standard_ExecutorRun
> > (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
> > execute_once=true) at execMain.c:364
> > #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
> > direction=ForwardScanDirection, count=0, execute_once=true) at
> > execMain.c:308
> > #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
> > forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
> > #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
> > count=9223372036854775807, isTopLevel=true, run_once=true,
> > dest=0x560b1419d188, altdest=0x560b1419d188,
> > qc=0x7ffef18b2350) at pquery.c:756
> > #31 0x0000560b131e550b in exec_simple_query (
> > query_string=0x560b14076720 "/* display results, but hide most of the
> > output */\nSELECT count(*), min(data), max(data)\nFROM
> > pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> > 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
> > postgres.c:1239
> > #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
> > dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
> > "mahendrathalor") at postgres.c:4315
> > #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
> > #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
> > postmaster.c:4202
> > #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
> > #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
> > at postmaster.c:1400
> > #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
> >
> > I have an Ubuntu setup. I think this is reproducing only on Ubuntu. I
> > am looking into this issue with Dilip.
>
> This error is due to an invalid size.
>
> diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
> index eed9a5048b..487c1b4252 100644
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
>  			change->data.inval.invalidations =
>  				MemoryContextAlloc(rb->context,
> -								   change->data.msg.message_size);
> +								   inval_size);
>  
>  			/* read the message */
>  			memcpy(change->data.inval.invalidations, data, inval_size);
>  			data += inval_size;
>
> The above change fixes the error. Thanks, Dilip, for helping.
Thanks, Mahendra, for reproducing and helping to fix this. I will
include this change in my next patch set.
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-29 09:26:49 |
Message-ID: | CAFiTN-uqdp-WoJrWj8_Ftio6Kw+AJ4+BPHyX_M5hwWhchr6-Lg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > [latest patches]
> >
> > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > - Any actions leading to transaction ID assignment are prohibited. That, among others,
> > + Note that access to user catalog tables or regular system catalog tables
> > + in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
> > + Access via the <literal>heap_*</literal> scan APIs will error out.
> > + Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
> > ..
> > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> >  	bool		valid;
> > 
> > +	/*
> > +	 * We don't expect direct calls to heap_fetch with valid
> > +	 * CheckXidAlive for regular tables. Track that below.
> > +	 */
> > +	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > +				 !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > +		elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > +
> >
> > I think the comments and the code don't match. The comment says that,
> > in output plugins, access to user catalog tables or regular system
> > catalog tables won't be allowed via the heap_* APIs, but the code
> > doesn't seem to reflect that. I feel only
> > TransactionIdIsValid(CheckXidAlive) is sufficient here. See the
> > original discussion about this point [1] (refer to "I think it'd also
> > be good to add assertions to codepaths not going through systable_*
> > asserting that ...").
>
> Right. So I think we can just add an assert in these functions, i.e.
> Assert(!TransactionIdIsValid(CheckXidAlive))?
>
> >
> > Isn't it better to block scans of user catalog tables or regular
> > system catalog tables in the tableam scan APIs rather than at the heap
> > level? There might be some APIs like heap_getnext where such a check
> > is still required, but I guess it is still better to block at the
> > tableam level.
> >
> > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
>
> Okay, let me analyze this part. Because someplace we have to keep at
> heap level like heap_getnext and other places at tableam level so it
> seems a bit inconsistent. Also, I think the number of checks might
> going to increase because some of the heap functions like
> heap_hot_search_buffer are being called from multiple tableam calls,
> so we need to put check at every place.
>
> Another point is that I feel some of the checks what we have today
> might not be required like heap_finish_speculative, is not fetching
> any tuple for us so why do we need to care about this function?
While testing these changes, I have noticed that the systable_* APIs
internally call the tableam APIs, so if we just put
Assert(!TransactionIdIsValid(CheckXidAlive)) there, it will always be
hit, whether we put the assert in the heap APIs or in the tableam
APIs, because systable_* always accesses the heap through the tableam
APIs. Refer to the callstack below:
#0 table_index_fetch_tuple (scan=0x2392558, tid=0x2392270,
snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276,
all_dead=0x7fff4b6cc89e)
at ../../../../src/include/access/tableam.h:1035
#1 0x00000000005100b6 in index_fetch_heap (scan=0x2392210,
slot=0x2391f60) at indexam.c:577
#2 0x00000000005101ea in index_getnext_slot (scan=0x2392210,
direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637
#3 0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474
#4 0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0,
relfilenode=16593) at relfilenodemap.c:213
#5 0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0,
txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168,
command_id=0, streaming=false)
at reorderbuffer.c:1823
#6 0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518,
commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448,
origin_id=0, origin_lsn=0)
at reorderbuffer.c:2315
#7 0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0,
buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654
#8 0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0,
buf=0x7fff4b6cce30) at decode.c:261
#9 0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0,
record=0x22e19a0) at decode.c:130
So basically, the problem is that we cannot distinguish whether the
tableam/heap routine is called directly or via systable_*.
Now I understand that the current code was actually raising the error
for user tables, not system tables, on the assumption that system
tables reach this function only via systable_*, while only user tables
can reach it directly. So if the relation is not a system table, i.e.
we got here directly, we error out. But if the check never applies to
system tables, then I am not sure what the purpose of throwing that
error is.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-29 09:49:15 |
Message-ID: | CAFiTN-uOFXXNbVwyEk-syUUrMSbPKXDeK-1aJNBhLzqDx2xGNw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > [latest patches]
> > >
> > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > - Any actions leading to transaction ID assignment are prohibited.
> > > That, among others,
> > > + Note that access to user catalog tables or regular system catalog tables
> > > + in the output plugins has to be done via the
> > > <literal>systable_*</literal> scan APIs only.
> > > + Access via the <literal>heap_*</literal> scan APIs will error out.
> > > + Additionally, any actions leading to transaction ID assignment
> > > are prohibited. That, among others,
> > > ..
> > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > bool valid;
> > >
> > > /*
> > > + * We don't expect direct calls to heap_fetch with valid
> > > + * CheckXidAlive for regular tables. Track that below.
> > > + */
> > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > +
> > >
> > > I think comments and code don't match. In the comment, we are saying
> > > that via output plugins access to user catalog tables or regular
> > > system catalog tables won't be allowed via heap_* APIs but code
> > > doesn't seem to reflect it. I feel only
> > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the
> > > original discussion about this point [1] (Refer "I think it'd also be
> > > good to add assertions to codepaths not going through systable_*
> > > asserting that ...").
> >
> > Right, So I think we can just add an assert in these function that
> > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> >
> > >
> > > Isn't it better to block the scan to user catalog tables or regular
> > > system catalog tables for tableam scan APIs rather than at the heap
> > > level? There might be some APIs like heap_getnext where such a check
> > > might still be required but I guess it is still better to block at
> > > tableam level.
> > >
> > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> >
> > Okay, let me analyze this part. Because someplace we have to keep at
> > heap level like heap_getnext and other places at tableam level so it
> > seems a bit inconsistent. Also, I think the number of checks might
> > going to increase because some of the heap functions like
> > heap_hot_search_buffer are being called from multiple tableam calls,
> > so we need to put check at every place.
> >
> > Another point is that I feel some of the checks what we have today
> > might not be required like heap_finish_speculative, is not fetching
> > any tuple for us so why do we need to care about this function?
>
> While testing these changes, I have noticed that the systable_* APIs
> internally, calls tableam apis and so if we just put assert
> Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> that assert. Whether we put these assert in heap APIs or the tableam
> APIs because systable_ always access heap through tableam APIs.
>
> Refer below callstack
> #0 table_index_fetch_tuple (scan=0x2392558, tid=0x2392270,
> snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276,
> all_dead=0x7fff4b6cc89e)
> at ../../../../src/include/access/tableam.h:1035
> #1 0x00000000005100b6 in index_fetch_heap (scan=0x2392210,
> slot=0x2391f60) at indexam.c:577
> #2 0x00000000005101ea in index_getnext_slot (scan=0x2392210,
> direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637
> #3 0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474
> #4 0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0,
> relfilenode=16593) at relfilenodemap.c:213
> #5 0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0,
> txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168,
> command_id=0, streaming=false)
> at reorderbuffer.c:1823
> #6 0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518,
> commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448,
> origin_id=0, origin_lsn=0)
> at reorderbuffer.c:2315
> #7 0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0,
> buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654
> #8 0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0,
> buf=0x7fff4b6cce30) at decode.c:261
> #9 0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0,
> record=0x22e19a0) at decode.c:130
>
> So basically, the problem is that we can not distinguish whether the
> tableam/heap routine is called directly or via systable_*.
>
> Now I understand the current code was actually giving error for the
> user table not the system table with the assumption that the system
> table will come to this function only via systable_*. Only user table
> can come directly. So if this is not a system table i.e. we reach
> here directly so error out. Now, I am not sure if it is not for the
> system table then what is the purpose of throwing that error?
Putting some more thought into this, I am wondering why we really
want any such check at all, because we always get the relation
descriptor from the reorder buffer code, not from the pgoutput
plugin. And our main issue with a concurrent abort is that we must
not get a wrong catalog entry for decoding our tuple. So if we always
get our relation entry using RelationIdGetRelation, why should we
bother about how the output plugin accesses system/user relations?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-04-30 07:00:49 |
Message-ID: | CAA4eK1KQUoFyxh0WO0NZJ2pQj+ze9M35VS7RZKU4wbJfk1Mv0g@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > >
> > > > [latest patches]
> > > >
> > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > > - Any actions leading to transaction ID assignment are prohibited.
> > > > That, among others,
> > > > + Note that access to user catalog tables or regular system catalog tables
> > > > + in the output plugins has to be done via the
> > > > <literal>systable_*</literal> scan APIs only.
> > > > + Access via the <literal>heap_*</literal> scan APIs will error out.
> > > > + Additionally, any actions leading to transaction ID assignment
> > > > are prohibited. That, among others,
> > > > ..
> > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > > bool valid;
> > > >
> > > > /*
> > > > + * We don't expect direct calls to heap_fetch with valid
> > > > + * CheckXidAlive for regular tables. Track that below.
> > > > + */
> > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > > +
> > > >
> > > > I think comments and code don't match. In the comment, we are saying
> > > > that via output plugins access to user catalog tables or regular
> > > > system catalog tables won't be allowed via heap_* APIs but code
> > > > doesn't seem to reflect it. I feel only
> > > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the
> > > > original discussion about this point [1] (Refer "I think it'd also be
> > > > good to add assertions to codepaths not going through systable_*
> > > > asserting that ...").
> > >
> > > Right, So I think we can just add an assert in these function that
> > > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> > >
> > > >
> > > > Isn't it better to block the scan to user catalog tables or regular
> > > > system catalog tables for tableam scan APIs rather than at the heap
> > > > level? There might be some APIs like heap_getnext where such a check
> > > > might still be required but I guess it is still better to block at
> > > > tableam level.
> > > >
> > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> > >
> > > Okay, let me analyze this part. Because someplace we have to keep at
> > > heap level like heap_getnext and other places at tableam level so it
> > > seems a bit inconsistent. Also, I think the number of checks might
> > > going to increase because some of the heap functions like
> > > heap_hot_search_buffer are being called from multiple tableam calls,
> > > so we need to put check at every place.
> > >
> > > Another point is that I feel some of the checks what we have today
> > > might not be required like heap_finish_speculative, is not fetching
> > > any tuple for us so why do we need to care about this function?
> >
> > While testing these changes, I have noticed that the systable_* APIs
> > internally, calls tableam apis and so if we just put assert
> > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> > that assert. Whether we put these assert in heap APIs or the tableam
> > APIs because systable_ always access heap through tableam APIs.
> >
..
..
>
> Putting some more thought upon this, I am just wondering what do we
> really want any such check because, we are always getting relation
> description from the reorder buffer code, not from the pgoutput
> plugin.
>
But can't they access other catalogs like pg_publication*? I think
the basic thing we want to ensure here is that all historic accesses
always use the systable_* APIs to access catalogs. We can ensure that
by having Asserts (or elog(ERROR, ...)) in the heap/tableam APIs.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-01 15:10:51 |
Message-ID: | CAFiTN-t7W_BTz4kESw5NA+F_9SwhPc5KMUhxTvswqQ4goz_fvw@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > > >
> > > > > [latest patches]
> > > > >
> > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > > > - Any actions leading to transaction ID assignment are prohibited.
> > > > > That, among others,
> > > > > + Note that access to user catalog tables or regular system catalog tables
> > > > > + in the output plugins has to be done via the
> > > > > <literal>systable_*</literal> scan APIs only.
> > > > > + Access via the <literal>heap_*</literal> scan APIs will error out.
> > > > > + Additionally, any actions leading to transaction ID assignment
> > > > > are prohibited. That, among others,
> > > > > ..
> > > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > > > bool valid;
> > > > >
> > > > > /*
> > > > > + * We don't expect direct calls to heap_fetch with valid
> > > > > + * CheckXidAlive for regular tables. Track that below.
> > > > > + */
> > > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > > > +
> > > > >
> > > > > I think comments and code don't match. In the comment, we are saying
> > > > > that via output plugins access to user catalog tables or regular
> > > > > system catalog tables won't be allowed via heap_* APIs but code
> > > > > doesn't seem to reflect it. I feel only
> > > > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the
> > > > > original discussion about this point [1] (Refer "I think it'd also be
> > > > > good to add assertions to codepaths not going through systable_*
> > > > > asserting that ...").
> > > >
> > > > Right, So I think we can just add an assert in these function that
> > > > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> > > >
> > > > >
> > > > > Isn't it better to block the scan to user catalog tables or regular
> > > > > system catalog tables for tableam scan APIs rather than at the heap
> > > > > level? There might be some APIs like heap_getnext where such a check
> > > > > might still be required but I guess it is still better to block at
> > > > > tableam level.
> > > > >
> > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> > > >
> > > > Okay, let me analyze this part. Because someplace we have to keep at
> > > > heap level like heap_getnext and other places at tableam level so it
> > > > seems a bit inconsistent. Also, I think the number of checks might
> > > > going to increase because some of the heap functions like
> > > > heap_hot_search_buffer are being called from multiple tableam calls,
> > > > so we need to put check at every place.
> > > >
> > > > Another point is that I feel some of the checks what we have today
> > > > might not be required like heap_finish_speculative, is not fetching
> > > > any tuple for us so why do we need to care about this function?
> > >
> > > While testing these changes, I have noticed that the systable_* APIs
> > > internally, calls tableam apis and so if we just put assert
> > > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> > > that assert. Whether we put these assert in heap APIs or the tableam
> > > APIs because systable_ always access heap through tableam APIs.
> > >
> ..
> ..
> >
> > Putting some more thought upon this, I am just wondering what do we
> > really want any such check because, we are always getting relation
> > description from the reorder buffer code, not from the pgoutput
> > plugin.
> >
>
> But can't they access other catalogs like pg_publication*? I think
> the basic thing we want to ensure here is that all historic accesses
> always use systable* APIs to access catalogs. We can ensure that via
> having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
Yeah, it can. So I have changed it now: along with CheckXidAlive, I
have kept one more flag, which we set whenever CheckXidAlive is set
and we pass through systable_beginscan. So while accessing the tableam
APIs we check that if CheckXidAlive is set then the other flag must
also be set; otherwise we throw an error.

Apart from this, I have also fixed one defect raised by my colleague
Neha Sharma. The issue was that the incomplete-toast-tuple flag was
not reset when the main table tuple was inserted through a speculative
insert; because of that, data was not streamed even when we later got
the speculative confirm, since the flag was never reset. This patch
also includes the fix for the issue raised by Erik.
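
To make the mechanism concrete, here is a minimal sketch of the
handshake described above, assuming the variable names from the v17
patch (sysbegin_called, later renamed bsysscan); this is only an
illustration, not the patch code itself:

/* In snapmgr.c: set while decoding a possibly-aborting transaction */
TransactionId CheckXidAlive = InvalidTransactionId;
bool          sysbegin_called = false;

/* In systable_beginscan(): remember that we came in via systable_* */
if (TransactionIdIsValid(CheckXidAlive))
    sysbegin_called = true;

/* In the tableam scan wrappers: direct access during decoding errors out */
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
    elog(ERROR, "unexpected tableam scan call during logical decoding");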
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-04 11:46:44 |
Message-ID: | CAA4eK1L9cxSYzrzx3t1mwkp6raB1Cb+jwJM_kQ5R1Yg3XoZ32g@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > But can't they access other catalogs like pg_publication*? I think
> > the basic thing we want to ensure here is that all historic accesses
> > always use systable* APIs to access catalogs. We can ensure that via
> > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
>
> Yeah, it can. So I have changed it now, actually along with
> CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> set and we pass through systable_beginscan we will set that flag. So
> while accessing the tableam API we will set if CheckXidLive is set
> then another flag must also be set otherwise we through an error.
>
Okay, I have reviewed these changes and below are my comments:
Review of v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
--------------------------------------------------------------------
1.
+ /*
+ * If CheckXidAlive is set then set a flag that this call is passed through
+ * systable_beginscan. See detailed comments at snapmgr.c where these
+ * variables are declared.
+ */
+ if (TransactionIdIsValid(CheckXidAlive))
+ sysbegin_called = true;
a. How about calling this variable bsysscan or sysscan instead of
sysbegin_called?
b. There is an extra space between "detailed" and "comments". A
similar change is required at the other places where this comment is
used.
c. How about writing the first line as "If CheckXidAlive is set then
set a flag to indicate that system table scan is in-progress."
2.
- Any actions leading to transaction ID assignment are prohibited.
That, among others,
- includes writing to tables, performing DDL changes, and
- calling <literal>pg_current_xact_id()</literal>.
+ Note that access to user catalog tables or regular system
catalog tables in
+ the output plugins has to be done via the
<literal>systable_*</literal> scan
+ APIs only. The user tables should not be accesed in the output
plugins anyways.
+ Access via the <literal>heap_*</literal> scan APIs will error out.
The line "The user tables should not be accesed in the output plugins
anyways." seems a bit of out of place. I don't think this is required
here. If you read the previous paragraph in the same document it is
written: "Read only access to relations is permitted as long as only
relations are accessed that either have been created by
<command>initdb</command> in the <literal>pg_catalog</literal> schema,
or have been marked as user provided catalog tables using ...". I
think that is sufficient to convey the information that the newly
added line by you is trying to convey.
3.
+ /*
+ * We don't expect direct calls to this routine when CheckXidAlive is a
+ * valid transaction id, this should only come through systable_* call.
+ * CheckXidAlive is set during logical decoding of a transactions.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+ elog(ERROR, "unexpected heap_getnext call during logical decoding");
How about changing this comment as "We don't expect direct calls to
heap_getnext with valid CheckXidAlive for catalog or regular tables.
See detailed comments at snapmgr.c where these variables are
declared."? Change the similar comment used in other places in the
patch.
For this specific API, we can also say "Normally we have such a check
at tableam level API but this is called from many places so we need to
ensure it here."
4.
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out. We can't directly use TransactionIdDidAbort as after crash such
+ * transaction might not have been marked as aborted. See detailed comments
+ * at snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()
Can we change the comments as "Error out, if CheckXidAlive is aborted.
We can't directly use TransactionIdDidAbort as after crash such
transaction might not have been marked as aborted."
After this add one empty line and then we can say something like:
"This is a special API to check if CheckXidAlive is aborted in system
table scan APIs. See detailed comments at snapmgr.c where the
variable is declared."
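
To illustrate the check being discussed in point 4, here is a minimal
sketch of what such a routine can look like, assuming the approach
described above (skip TransactionIdDidAbort and treat "neither in
progress nor committed" as aborted); this is a sketch, not the actual
patch code:

static inline void
HandleConcurrentAbort(void)
{
    /*
     * If the xid being decoded is no longer in progress and never
     * committed, it must have aborted; TransactionIdDidAbort cannot be
     * used because a crashed transaction may never be marked aborted.
     */
    if (TransactionIdIsValid(CheckXidAlive) &&
        !TransactionIdIsInProgress(CheckXidAlive) &&
        !TransactionIdDidCommit(CheckXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during system catalog scan")));
}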
5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?
6.
/*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing. If CheckXidAlive is set
+ * then we will set sysbegin_called flag when we call systable_beginscan. This
+ * is to ensure that from the pgoutput plugin we should never directly access
+ * the tableam or heap apis because we are checking for the concurrent abort
+ * only in systable_* apis.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool sysbegin_called = false;
Can we change the above comment as "CheckXidAlive is a xid value
pointing to a possibly ongoing (sub)transaction. Currently, it is
used in logical decoding. It's possible that such transactions can
get aborted while the decoding is ongoing in which case we skip
decoding that particular transaction. To ensure that we check whether
the CheckXidAlive is aborted after fetching the tuple from system
tables. We also ensure that during logical decoding we never directly
access the tableam or heap APIs because we are checking for the
concurrent aborts only in systable_* APIs."
> Apart from this, I have also fixed one defect raised by my colleague
> Neha Sharma. That issue is the incomplete toast tuple flag was not
> reset when the main table tuple was inserted through speculative
> insert and due to that data was not streamed even if later we were
> getting speculative confirm because incomplete toast flag was never
> reset. This patch also includes the fix for the issue raised by Erik.
>
It would be better if you could mention which patches contain the
changes, as that will make the fix easier to review.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-05 03:57:06 |
Message-ID: | CAFiTN-t22MFzikfPi6vw4JJFds-JrTPs_GeS90rDUou3JzauiQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> 5. Shouldn't we add a check in table_scan_sample_next_block and
> table_scan_sample_next_tuple APIs as well?
I am not sure that we need to do that, because generally we want to
avoid getting a wrong system table tuple that we might use for making
some decision or for decoding a tuple. But I don't think that
table_scan_sample falls under that category.
> > Apart from this, I have also fixed one defect raised by my colleague
> > Neha Sharma. That issue is the incomplete toast tuple flag was not
> > reset when the main table tuple was inserted through speculative
> > insert and due to that data was not streamed even if later we were
> > getting speculative confirm because incomplete toast flag was never
> > reset. This patch also includes the fix for the issue raised by Erik.
> >
>
> It would be better if you can mention which all patches contain the
> changes as it will be easier to review the fix.
Fix1: v17-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Fix2: v17-0002-Issue-individual-invalidations-with-wal_level-lo.patch
I will work on other comments and send the updated patch.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-05 04:54:42 |
Message-ID: | CAA4eK1L+3zpVMH_dF=fYBsOE-6+BYGY--UhSW5_0HCyKBh2sLw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
>
> > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > table_scan_sample_next_tuple APIs as well?
>
> I am not sure that we need to do that, Because generally, we want to
> avoid getting any wrong system table tuple which we can use for taking
> some decision or decode tuple. But, I don't think that
> table_scan_sample falls under that category.
>
Hmm, I am asking for a check similar to what you have in the function
table_scan_bitmap_next_block(); can't we have that one? BTW, I
noticed the spurious line removal below in the patch we are talking
about.
+/*
* These are updated by GetSnapshotData. We initialize them this way
* for the convenience of TransactionIdIsInProgress: even in bootstrap
* mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
historic_snapshot, HTAB *tuplecids)
tuplecid_data = tuplecids;
}
-
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-05 05:01:12 |
Message-ID: | CAFiTN-u7jWrgE-uH_X0SSM6SohV+kqEwAxX_L60p25m-DeRzxw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> >
> > > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > > table_scan_sample_next_tuple APIs as well?
> >
> > I am not sure that we need to do that, Because generally, we want to
> > avoid getting any wrong system table tuple which we can use for taking
> > some decision or decode tuple. But, I don't think that
> > table_scan_sample falls under that category.
> >
>
> Hmm, I am asking a check similar to what you have in function
> table_scan_bitmap_next_block(), can't we have that one?
Yeah, we can put that in and there is no harm in it, but my point is
that table_scan_bitmap_next_block and the other functions where I have
put the check are used for fetching tuples that can be used for
decoding a tuple or making some decision, whereas IMHO
table_scan_sample_next_tuple is only used for analyzing the table. So
do we really need to do that? Am I missing something here?
> BTW, I
> noticed a below spurious line removal in the patch we are talking
> about.
>
> +/*
> * These are updated by GetSnapshotData. We initialize them this way
> * for the convenience of TransactionIdIsInProgress: even in bootstrap
> * mode, we don't want it to say that BootstrapTransactionId is in progress.
> @@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
> historic_snapshot, HTAB *tuplecids)
> tuplecid_data = tuplecids;
> }
>
> -
Okay, I will take care of this.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-05 05:31:29 |
Message-ID: | CAA4eK1K3_AQemGNAksFYOCTz-6i5ErZM072VyW2tVweopeHKYg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 5, 2020 at 10:31 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > > >
> > >
> > > > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > > > table_scan_sample_next_tuple APIs as well?
> > >
> > > I am not sure that we need to do that, Because generally, we want to
> > > avoid getting any wrong system table tuple which we can use for taking
> > > some decision or decode tuple. But, I don't think that
> > > table_scan_sample falls under that category.
> > >
> >
> > Hmm, I am asking a check similar to what you have in function
> > table_scan_bitmap_next_block(), can't we have that one?
>
> Yeah we can put that and there is no harm in that, but my point is
> the table_scan_bitmap_next_block and other functions where I have put
> the check are used for fetching the tuple which can be used for
> decoding tuple or taking some decision, but IMHO,
> table_scan_sample_next_tuple is only used for analyzing the table.
>
These will be used in a TABLESAMPLE scan. Try something like "select c1
from t1 TABLESAMPLE BERNOULLI(30);". So I guess these APIs can also
be used to fetch tuples.
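
For illustration, a sketch of the kind of guard being discussed for
the sample-scan wrapper, mirroring the check quoted earlier for the
other tableam APIs (an assumed shape, not the actual patch hunk):

static inline bool
table_scan_sample_next_block(TableScanDesc scan,
                             struct SampleScanState *scanstate)
{
    /* Direct calls with a valid CheckXidAlive must come via systable_*. */
    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
        elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");

    return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
}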
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-05 10:36:17 |
Message-ID: | CAFiTN-v57CCRVhjOHw7Ouqce8NW=1dxaJVsaf6XrkUsPRqZZzw@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > But can't they access other catalogs like pg_publication*? I think
> > > the basic thing we want to ensure here is that all historic accesses
> > > always use systable* APIs to access catalogs. We can ensure that via
> > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
> >
> > Yeah, it can. So I have changed it now, actually along with
> > CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> > set and we pass through systable_beginscan we will set that flag. So
> > while accessing the tableam API we will set if CheckXidLive is set
> > then another flag must also be set otherwise we through an error.
> >
>
> Okay, I have reviewed these changes and below are my comments:
>
> Review of v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> --------------------------------------------------------------------
> 1.
> + /*
> + * If CheckXidAlive is set then set a flag that this call is passed through
> + * systable_beginscan. See detailed comments at snapmgr.c where these
> + * variables are declared.
> + */
> + if (TransactionIdIsValid(CheckXidAlive))
> + sysbegin_called = true;
>
> a. How about calling this variable as bsysscan or sysscan instead of
> sysbegin_called?
Done
> b. There is an extra space between detailed and comments. A similar
> change is required at other place where this comment is used.
Done
> c. How about writing the first line as "If CheckXidAlive is set then
> set a flag to indicate that system table scan is in-progress."
>
> 2.
> - Any actions leading to transaction ID assignment are prohibited.
> That, among others,
> - includes writing to tables, performing DDL changes, and
> - calling <literal>pg_current_xact_id()</literal>.
> + Note that access to user catalog tables or regular system
> catalog tables in
> + the output plugins has to be done via the
> <literal>systable_*</literal> scan
> + APIs only. The user tables should not be accesed in the output
> plugins anyways.
> + Access via the <literal>heap_*</literal> scan APIs will error out.
>
> The line "The user tables should not be accesed in the output plugins
> anyways." seems a bit of out of place. I don't think this is required
> here. If you read the previous paragraph in the same document it is
> written: "Read only access to relations is permitted as long as only
> relations are accessed that either have been created by
> <command>initdb</command> in the <literal>pg_catalog</literal> schema,
> or have been marked as user provided catalog tables using ...". I
> think that is sufficient to convey the information that the newly
> added line by you is trying to convey.
Right.
>
> 3.
> + /*
> + * We don't expect direct calls to this routine when CheckXidAlive is a
> + * valid transaction id, this should only come through systable_* call.
> + * CheckXidAlive is set during logical decoding of a transactions.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
> + elog(ERROR, "unexpected heap_getnext call during logical decoding");
>
> How about changing this comment as "We don't expect direct calls to
> heap_getnext with valid CheckXidAlive for catalog or regular tables.
> See detailed comments at snapmgr.c where these variables are
> declared."? Change the similar comment used in other places in the
> patch.
>
> For this specific API, we can also say "Normally we have such a check
> at tableam level API but this is called from many places so we need to
> ensure it here."
Done
>
> 4.
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
> + * out. We can't directly use TransactionIdDidAbort as after crash such
> + * transaction might not have been marked as aborted. See detailed comments
> + * at snapmgr.c where the variable is declared.
> + */
> +static inline void
> +HandleConcurrentAbort()
>
> Can we change the comments as "Error out, if CheckXidAlive is aborted.
> We can't directly use TransactionIdDidAbort as after crash such
> transaction might not have been marked as aborted."
>
> After this add one empty line and then we can say something like:
> "This is a special API to check if CheckXidAlive is aborted in system
> table scan APIs. See detailed comments at snapmgr.c where the
> variable is declared."
>
> 5. Shouldn't we add a check in table_scan_sample_next_block and
> table_scan_sample_next_tuple APIs as well?
Done
> 6.
> /*
> + * An xid value pointing to a possibly ongoing (sub)transaction.
> + * Currently used in logical decoding. It's possible that such transactions
> + * can get aborted while the decoding is ongoing. If CheckXidAlive is set
> + * then we will set sysbegin_called flag when we call systable_beginscan. This
> + * is to ensure that from the pgoutput plugin we should never directly access
> + * the tableam or heap apis because we are checking for the concurrent abort
> + * only in systable_* apis.
> + */
> +TransactionId CheckXidAlive = InvalidTransactionId;
> +bool sysbegin_called = false;
>
> Can we change the above comment as "CheckXidAlive is a xid value
> pointing to a possibly ongoing (sub)transaction. Currently, it is
> used in logical decoding. It's possible that such transactions can
> get aborted while the decoding is ongoing in which case we skip
> decoding that particular transaction. To ensure that we check whether
> the CheckXidAlive is aborted after fetching the tuple from system
> tables. We also ensure that during logical decoding we never directly
> access the tableam or heap APIs because we are checking for the
> concurrent aborts only in systable_* APIs."
Done
I have also fixed one issue in the patch
v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.
Basically, the check in ReorderBufferLargestTopTXN for selecting the
largest top transaction was incorrect, so I have fixed that.
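
For context, a sketch of what selecting the largest toplevel
transaction looks like, assuming the per-transaction size accounting
(txn->total_size) introduced by this patch series; this is not the
exact patch code:

static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
    dlist_iter  iter;
    ReorderBufferTXN *largest = NULL;

    /* Consider only toplevel transactions, in LSN order. */
    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /* total_size covers the transaction and all its subtransactions */
        if (largest == NULL || txn->total_size > largest->total_size)
            largest = txn;
    }

    return largest;
}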
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-05 13:43:47 |
Message-ID: | CAFiTN-sOqY_femHNQDAZK7QaWN9WG_vfHhtSTtL5w+3niQCKjQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 5, 2020 at 4:06 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > >
> > > > But can't they access other catalogs like pg_publication*? I think
> > > > the basic thing we want to ensure here is that all historic accesses
> > > > always use systable* APIs to access catalogs. We can ensure that via
> > > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
> > >
> > > Yeah, it can. So I have changed it now, actually along with
> > > CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> > > set and we pass through systable_beginscan we will set that flag. So
> > > while accessing the tableam API we will set if CheckXidLive is set
> > > then another flag must also be set otherwise we through an error.
> > >
> >
> > Okay, I have reviewed these changes and below are my comments:
> >
> > Review of v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > --------------------------------------------------------------------
> > 1.
> > + /*
> > + * If CheckXidAlive is set then set a flag that this call is passed through
> > + * systable_beginscan. See detailed comments at snapmgr.c where these
> > + * variables are declared.
> > + */
> > + if (TransactionIdIsValid(CheckXidAlive))
> > + sysbegin_called = true;
> >
> > a. How about calling this variable as bsysscan or sysscan instead of
> > sysbegin_called?
>
> Done
>
> > b. There is an extra space between detailed and comments. A similar
> > change is required at other place where this comment is used.
>
> Done
>
> > c. How about writing the first line as "If CheckXidAlive is set then
> > set a flag to indicate that system table scan is in-progress."
> >
> > 2.
> > - Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > - includes writing to tables, performing DDL changes, and
> > - calling <literal>pg_current_xact_id()</literal>.
> > + Note that access to user catalog tables or regular system
> > catalog tables in
> > + the output plugins has to be done via the
> > <literal>systable_*</literal> scan
> > + APIs only. The user tables should not be accesed in the output
> > plugins anyways.
> > + Access via the <literal>heap_*</literal> scan APIs will error out.
> >
> > The line "The user tables should not be accesed in the output plugins
> > anyways." seems a bit of out of place. I don't think this is required
> > here. If you read the previous paragraph in the same document it is
> > written: "Read only access to relations is permitted as long as only
> > relations are accessed that either have been created by
> > <command>initdb</command> in the <literal>pg_catalog</literal> schema,
> > or have been marked as user provided catalog tables using ...". I
> > think that is sufficient to convey the information that the newly
> > added line by you is trying to convey.
>
> Right.
>
> >
> > 3.
> > + /*
> > + * We don't expect direct calls to this routine when CheckXidAlive is a
> > + * valid transaction id, this should only come through systable_* call.
> > + * CheckXidAlive is set during logical decoding of a transactions.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
> > + elog(ERROR, "unexpected heap_getnext call during logical decoding");
> >
> > How about changing this comment as "We don't expect direct calls to
> > heap_getnext with valid CheckXidAlive for catalog or regular tables.
> > See detailed comments at snapmgr.c where these variables are
> > declared."? Change the similar comment used in other places in the
> > patch.
> >
> > For this specific API, we can also say "Normally we have such a check
> > at tableam level API but this is called from many places so we need to
> > ensure it here."
>
> Done
>
> >
> > 4.
> > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
> > + * out. We can't directly use TransactionIdDidAbort as after crash such
> > + * transaction might not have been marked as aborted. See detailed comments
> > + * at snapmgr.c where the variable is declared.
> > + */
> > +static inline void
> > +HandleConcurrentAbort()
> >
> > Can we change the comments as "Error out, if CheckXidAlive is aborted.
> > We can't directly use TransactionIdDidAbort as after crash such
> > transaction might not have been marked as aborted."
> >
> > After this add one empty line and then we can say something like:
> > "This is a special API to check if CheckXidAlive is aborted in system
> > table scan APIs. See detailed comments at snapmgr.c where the
> > variable is declared."
> >
> > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > table_scan_sample_next_tuple APIs as well?
>
> Done
>
> > 6.
> > /*
> > + * An xid value pointing to a possibly ongoing (sub)transaction.
> > + * Currently used in logical decoding. It's possible that such transactions
> > + * can get aborted while the decoding is ongoing. If CheckXidAlive is set
> > + * then we will set sysbegin_called flag when we call systable_beginscan. This
> > + * is to ensure that from the pgoutput plugin we should never directly access
> > + * the tableam or heap apis because we are checking for the concurrent abort
> > + * only in systable_* apis.
> > + */
> > +TransactionId CheckXidAlive = InvalidTransactionId;
> > +bool sysbegin_called = false;
> >
> > Can we change the above comment as "CheckXidAlive is a xid value
> > pointing to a possibly ongoing (sub)transaction. Currently, it is
> > used in logical decoding. It's possible that such transactions can
> > get aborted while the decoding is ongoing in which case we skip
> > decoding that particular transaction. To ensure that we check whether
> > the CheckXidAlive is aborted after fetching the tuple from system
> > tables. We also ensure that during logical decoding we never directly
> > access the tableam or heap APIs because we are checking for the
> > concurrent aborts only in systable_* APIs."
>
> Done
>
> I have also fixed one issue in the patch
> v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.
>
> Basically, the check, in ReorderBufferLargestTopTXN for selecting the
> largest top transaction was incorrect so I have fixed that.
There was one more unrelated bug, reported offlist by Neha Sharma,
fixed in the v18-0010 patch, so I am sending the updated version.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-07 12:46:44 |
Message-ID: | CAFiTN-uQeYHgyia7+XPhTDAwe7NBeXzt2cT2N6gBghHXA=XT9Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
I have fixed one more issue in the 0010 patch. The issue was that
once a transaction had been serialized due to an incomplete toast
tuple, the serialized store was not cleaned up after streaming, so the
same tuples were streamed multiple times.
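
A rough sketch of the shape of such a cleanup, under the assumption
that streaming must discard both the in-memory changes and any spilled
files (function names as in reorderbuffer.c; this is not the actual
fix):

/* At the end of streaming a transaction's current set of changes */
ReorderBufferTruncateTXN(rb, txn);      /* drop streamed in-memory changes */

/*
 * If the changes had been spilled to disk, remove the serialized files
 * as well, so the next stream does not send the same tuples again.
 */
if (rbtxn_is_serialized(txn))
    ReorderBufferRestoreCleanup(rb, txn);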
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-12 11:09:35 |
Message-ID: | CAA4eK1JMx9aWZiEqfAFdavj7YxriDstR0rdkazM5b6eV0zoMLQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> I have fixed one more issue in 0010 patch. The issue was that once
> the transaction is serialized due to the incomplete toast after
> streaming the serialized store was not cleaned up so it was streaming
> the same tuple multiple times.
>
I have reviewed a few patches (003, 004, and 005) and below are my comments.
v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+ int nrelations, Relation relations[],
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
In the above and similar APIs, there are parameters like relation
which are not used. I think you should add some comments atop these
APIs to explain why that is so. I guess it is because we want to keep
them similar to the non-stream versions of the APIs, and we can't
display the relation or other information while the transaction is
still in progress.
2.
+ <para>
+ Similar to spill-to-disk behavior, streaming is triggered when the total
+ amount of changes decoded from the WAL (for all in-progress transactions)
+ exceeds limit defined by
<varname>logical_decoding_work_mem</varname> setting.
+ At that point the largest toplevel transaction (measured by
amount of memory
+ currently used for decoded changes) is selected and streamed.
+ </para>
I think we need to explain here the cases/exceptions where we need to
spill even when streaming is enabled, and check whether this matches
the latest implementation; otherwise, update it.
3.
+ * To support streaming, we require change/commit/abort callbacks. The
+ * message callback is optional, similarly to regular output plugins.
/similarly/similar
4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */
Why can't we supply the report_location here? I think we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn for any subsequent one.
5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */
Can't we report txn->final_lsn here?
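
A sketch of what the two wrappers could report, per the two comments
above; first_stream here is a hypothetical flag used only for
illustration:

/* In stream_start_cb_wrapper: the very first stream of a transaction
 * starts at its first LSN; later streams continue up to the current
 * final LSN. */
state.report_location = first_stream ? txn->first_lsn : txn->final_lsn;

/* In stream_stop_cb_wrapper: report how far this stream has read. */
state.report_location = txn->final_lsn;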
6. I think it would be good if we could provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. We could
also explain there why the user is not expected to see the actual
data in the stream.
v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
----------------------------------------------------------------------------------------
7.
+ /*
+ * We don't expect direct calls to table_tuple_get_latest_tid with valid
+ * CheckXidAlive for catalog or regular tables.
There is an extra space between 'CheckXidAlive' and 'for'. I can see
similar problems in other places where this comment is used; please
fix those as well.
8.
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction. Currently, it is used in logical decoding. It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction. To ensure that we
+ * check whether the CheckXidAlive is aborted after fetching the tuple from
+ * system tables. We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */
In this comment, there is an inconsistency in the spacing after
sentences. In the part "transaction. To", a single space
is used, whereas at other places two spaces are used after a full stop.
v20-0005-Implement-streaming-mode-in-ReorderBuffer
-----------------------------------------------------------------------------
9.
Implement streaming mode in ReorderBuffer
Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.
I think the above part of the commit message needs to be updated.
10.
Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).
I don't think this part of the commit message is correct as we
sometimes need to spill even during streaming. Please check the
entire commit message and update according to the latest
implementation.
11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
*/
static void
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
dlist_iter iter;
HASHCTL hash_ctl;
- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-
I don't understand this change. Why would "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID? The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears to be very risky, as the
behavior could depend on whether we are streaming the changes
for an in-progress xact or at the commit of a transaction. We might want
to generate a test to validate this behavior.
Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.
12.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access. Also reset the
+ * sysbegin_called flag.
*/
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
{
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
}
In the comment, the flag name 'sysbegin_called' should be bsysscan.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-13 06:04:42 |
Message-ID: | CAFiTN-v6rzMAmtyh=JuaM9eT8QmPeRn1=x30s9x-i_p=z7rPeQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have fixed one more issue in 0010 patch. The issue was that once
> > the transaction is serialized due to the incomplete toast after
> > streaming the serialized store was not cleaned up so it was streaming
> > the same tuple multiple times.
> >
>
> I have reviewed a few patches (003, 004, and 005) and below are my comments.
Thanks for the review. I am replying to some of the comments where I have
confusion; the others are fine.
>
> v20-0003-Extend-the-output-plugin-API-with-stream-methods
> ----------------------------------------------------------------------------------------
> 1.
> +static void
> +pg_decode_stream_change(LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + Relation relation,
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
> +
> +static void
> +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> + int nrelations, Relation relations[],
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
>
> In the above and similar APIs, there are parameters like relation
> which are not used. I think you should add some comments atop these
> APIs to explain why it is so? I guess it is because we want to keep
> them similar to non-stream version of APIs and we can't display
> relation or other information as the transaction is still in-progress.
I think the interfaces are designed that way because other
decoding plugins might need those parameters, e.g. in pgoutput we need
change and relation, but not here. We have other similar examples as
well, e.g. pg_decode_message has the parameter txn but doesn't use it.
Do you think we still need to add comments?
> 4.
> +static void
> +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_start";
> + /* state.report_location = apply_lsn; */
>
> Why can't we supply the report_location here? I think here we need to
> report txn->first_lsn if this is the very first stream and
> txn->final_lsn if it is any consecutive one.
I am not sure about this, because for the very first stream we would
report the location of the first lsn of the stream, and for a
subsequent stream we would report the last lsn in the stream.
>
> 11.
> - * HeapTupleSatisfiesHistoricMVCC.
> + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> + *
> + * We do build the hash table even if there are no CIDs. That's
> + * because when streaming in-progress transactions we may run into
> + * tuples with the CID before actually decoding them. Think e.g. about
> + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> + * yet when applying the INSERT. So we build a hash table so that
> + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> + *
> + * XXX We might limit this behavior to streaming mode, and just bail
> + * out when decoding transaction at commit time (at which point it's
> + * guaranteed to see all CIDs).
> */
> static void
> ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> *rb, ReorderBufferTXN *txn)
> dlist_iter iter;
> HASHCTL hash_ctl;
>
> - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> - return;
> -
>
> I don't understand this change. Why would "INSERT followed by
> TRUNCATE" could lead to a tuple which can come for decode before its
> CID?
Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the system table by the next
operation. E.g. while we are streaming the INSERT it is possible that
the TRUNCATE has already deleted that tuple and set cmax for the
tuple. Before the streaming patch, we replayed the INSERT
only on commit, so by that time we had seen all the operations that
performed DDL and we would have already built the tuple CID hash.
The patch has made changes based on this assumption in
> HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> behavior could be dependent on whether we are streaming the changes
> for in-progress xact or at the commit of a transaction. We might want
> to generate a test to once validate this behavior.
We have already added a test case for this: 011_stream_ddl.pl in
test/subscription.
> Also, the comment refers to tqual.c which is wrong as this API is now
> in heapam_visibility.c.
Ok, will fix.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-13 11:20:36 |
Message-ID: | CAA4eK1Jim-3BnTH_Vt=nvuzLGKacyuHKFmPdssoJ5Go0MnqRBg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > ----------------------------------------------------------------------------------------
> > 1.
> > +static void
> > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > + ReorderBufferTXN *txn,
> > + Relation relation,
> > + ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> > +
> > +static void
> > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > + int nrelations, Relation relations[],
> > + ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> >
> > In the above and similar APIs, there are parameters like relation
> > which are not used. I think you should add some comments atop these
> > APIs to explain why it is so? I guess it is because we want to keep
> > them similar to non-stream version of APIs and we can't display
> > relation or other information as the transaction is still in-progress.
>
> I think because the interfaces are designed that way because other
> decoding plugins might need it e.g. in pgoutput we need change and
> relation but not here. We have other similar examples also e.g.
> pg_decode_message has the parameter txn but not used. Do you think we
> still need to add comments?
>
In that case, we can leave it as is, but let's ensure that we are not
exposing any parameter which is not used; if there is one for some
reason, we should document it. I will also look into this.
> > 4.
> > +static void
> > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_start";
> > + /* state.report_location = apply_lsn; */
> >
> > Why can't we supply the report_location here? I think here we need to
> > report txn->first_lsn if this is the very first stream and
> > txn->final_lsn if it is any consecutive one.
>
> I am not sure about this, Because for the very first stream we will
> report the location of the first lsn of the stream and for the
> consecutive stream we will report the last lsn in the stream.
>
Yeah, that doesn't seem to be consistent. How about if we get it as an
additional parameter? The caller can pass the lsn of the very first
change it is trying to decode in this stream.
> >
> > 11.
> > - * HeapTupleSatisfiesHistoricMVCC.
> > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > + *
> > + * We do build the hash table even if there are no CIDs. That's
> > + * because when streaming in-progress transactions we may run into
> > + * tuples with the CID before actually decoding them. Think e.g. about
> > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > + * yet when applying the INSERT. So we build a hash table so that
> > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > + *
> > + * XXX We might limit this behavior to streaming mode, and just bail
> > + * out when decoding transaction at commit time (at which point it's
> > + * guaranteed to see all CIDs).
> > */
> > static void
> > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > *rb, ReorderBufferTXN *txn)
> > dlist_iter iter;
> > HASHCTL hash_ctl;
> >
> > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > - return;
> > -
> >
> > I don't understand this change. Why would "INSERT followed by
> > TRUNCATE" could lead to a tuple which can come for decode before its
> > CID?
>
> Actually, even if we haven't decoded the DDL operation but in the
> actual system table the tuple might have been deleted from the next
> operation. e.g. while we are streaming the INSERT it is possible that
> the truncate has already deleted that tuple and set the max for the
> tuple. So before streaming patch, we were only streaming the INSERT
> only on commit so by that time we had got all the operation which has
> done DDL and we would have already prepared tuple CID hash.
>
Okay, but in that case, how good is it that we always allow the CID
hash table to be built even if there are no catalog changes in the TXN
(see changes in ReorderBufferBuildTupleCidHash)? Can't we detect that
while resolving the cmin/cmax?
Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
----------------------------------------------------------------------------------------------------------------
1.
/*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
*/
static int
ReorderBufferIterCompare(Datum a, Datum b, void *arg)
It seems to me the above comment change is not required as per the latest patch.
2.
* For subtransactions, we only mark them as streamed when there are
+ * any changes in them.
+ *
+ * We do it this way because of aborts - we don't want to send aborts
+ * for XIDs the downstream is not aware of. And of course, it always
+ * knows about the toplevel xact (we send the XID in all messages),
+ * but we never stream XIDs of empty subxacts.
+ */
+ if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ txn->txn_flags |= RBTXN_IS_STREAMED;
/when there are any changes in them/when there are changes in them. I
think we don't need 'any' in the above sentence.
3.
And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error that we can ignore. We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)
In the above comment, I don't think it is right to say that we ignore
the error raised due to the aborted transaction. We need to say that
we discard the already streamed changes on such an error.
4.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
/*
- * If this transaction has no snapshot, it didn't make any changes to the
- * database, so there's nothing to decode. Note that
- * ReorderBufferCommitChild will have transferred any snapshots from
- * subtransactions if there were any.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access. Also reset the
+ * sysbegin_called flag.
*/
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
{
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
}
I think this function is inline because it needs to be called for each
change. If that is the case (and even otherwise), isn't it better
to check whether the passed xid is the same as CheckXidAlive before
checking TransactionIdDidCommit? TransactionIdDidCommit can be costly,
and calling it for each change might not be a good idea.
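Something like this (a sketch only; the actual patch may differ):

static inline void
SetupCheckXidLive(TransactionId xid)
{
    /*
     * If CheckXidAlive is already set for this xid, there is nothing to
     * do, and we skip the potentially costly clog lookup below.
     */
    if (TransactionIdEquals(CheckXidAlive, xid))
        return;

    /*
     * Setup CheckXidAlive if the xid is not committed yet.  We don't
     * check if the xid is aborted; that will happen during catalog
     * access.  Also, reset the bsysscan flag.
     */
    if (!TransactionIdDidCommit(xid))
    {
        CheckXidAlive = xid;
        bsysscan = false;
    }
}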
5.
setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access. Also reset the
+ * sysbegin_called flag.
/if the xid aborted/if the xid is aborted. missing comma after Also.
6.
ReorderBufferProcessTXN()
{
..
- /* build data to be able to lookup the CommandIds of catalog tuples */
+ /*
+ * build data to be able to lookup the CommandIds of catalog tuples
+ */
ReorderBufferBuildTupleCidHash(rb, txn);
..
}
Is there a need to change the formatting of the comment?
7.
ReorderBufferProcessTXN()
{
..
if (using_subtxn)
- BeginInternalSubTransaction("replay");
+ BeginInternalSubTransaction("stream");
else
StartTransactionCommand();
..
}
I am not sure changing unconditionally "replay" to "stream" is a good
idea. How about something like BeginInternalSubTransaction(streaming
? "stream" : "replay");?
8.
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* use as a normal record. It'll be cleaned up at the end
* of INSERT processing.
*/
- if (specinsert == NULL)
- elog(ERROR, "invalid ordering of speculative insertion changes");
You have removed this check, but all other handling of specinsert is
the same as far as this patch is concerned. Why so?
9.
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
* freed/reused while restoring spooled data from
* disk.
*/
- Assert(change->data.tp.newtuple != NULL);
-
dlist_delete(&change->node);
Why is this Assert removed?
10.
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
relations[nrelations++] = relation;
}
- rb->apply_truncate(rb, txn, nrelations, relations, change);
+ if (streaming)
+ {
+ rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+ /* Remember that we have sent some data. */
+ change->txn->any_data_sent = true;
+ }
+ else
+ rb->apply_truncate(rb, txn, nrelations, relations, change);
Can we encapsulate this in a separate function like
ReorderBufferApplyTruncate or something like that? Basically, rather
than having the streaming check in this function, let's do it in some
other internal function. And we can likewise do it for all the streaming
checks in this function, or at least wherever it is feasible. That
will make this function look clean.
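For instance (a sketch using the name suggested above; it assumes the
streaming flag is simply passed down as a parameter):

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation relations[],
                           ReorderBufferChange *change, bool streaming)
{
    if (streaming)
    {
        rb->stream_truncate(rb, txn, nrelations, relations, change);

        /* Remember that we have sent some data. */
        change->txn->any_data_sent = true;
    }
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}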
11.
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
..
I think the above comment needs to be updated after this patch. This
API can now be used while decoding both an in-progress and a
committed transaction.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-13 15:46:26 |
Message-ID: | CAFiTN-vNqFv=PFReoQK0+dZ6irEmYK_hRVe3t7m=jS4C1NiH8g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > > ----------------------------------------------------------------------------------------
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > > +
> > > +static void
> > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > > + int nrelations, Relation relations[],
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > In the above and similar APIs, there are parameters like relation
> > > which are not used. I think you should add some comments atop these
> > > APIs to explain why it is so? I guess it is because we want to keep
> > > them similar to non-stream version of APIs and we can't display
> > > relation or other information as the transaction is still in-progress.
> >
> > I think because the interfaces are designed that way because other
> > decoding plugins might need it e.g. in pgoutput we need change and
> > relation but not here. We have other similar examples also e.g.
> > pg_decode_message has the parameter txn but not used. Do you think we
> > still need to add comments?
> >
>
> In that case, we can leave but lets ensure that we are not exposing
> any parameter which is not used and if there is any due to some
> reason, we should document it. I will also look into this.
>
> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here? I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > I am not sure about this, Because for the very first stream we will
> > report the location of the first lsn of the stream and for the
> > consecutive stream we will report the last lsn in the stream.
> >
>
> Yeah, that doesn't seem to be consistent. How about if get it as an
> additional parameter? The caller can pass the lsn of the very first
> change it is trying to decode in this stream.
Hmm, I think we need to call ReorderBufferIterTXNInit and
ReorderBufferIterTXNNext to get the first change of the stream; only
after that shall we call stream start, and then we can find out the
first LSN of the stream. I will see how to do this so that it doesn't
look awkward. Basically, as of now, our code has this layout:
1. stream_start;
2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
stream changes
}
3. stream stop
So if we want to know the first lsn of this stream, then we shall do
something like this:
1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
2. if first_change
stream_start;
stream changes
}
3. stream stop
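In C it would look roughly like this (only a sketch; it assumes
stream_start takes the additional LSN parameter you suggested, which
the current patch does not have):

    ReorderBufferChange *change;
    ReorderBufferIterTXNState *iterstate;
    bool        stream_started = false;

    ReorderBufferIterTXNInit(rb, txn, &iterstate);
    while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
    {
        /* Emit stream_start lazily, once the first change's LSN is known. */
        if (!stream_started)
        {
            rb->stream_start(rb, txn, change->lsn);
            stream_started = true;
        }

        /* ... stream the change ... */
    }

    /* Send stream_stop only if we actually started a stream. */
    if (stream_started)
        rb->stream_stop(rb, txn);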
> > >
> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > > */
> > > static void
> > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > > dlist_iter iter;
> > > HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change. Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?
> >
> > Actually, even if we haven't decoded the DDL operation but in the
> > actual system table the tuple might have been deleted from the next
> > operation. e.g. while we are streaming the INSERT it is possible that
> > the truncate has already deleted that tuple and set the max for the
> > tuple. So before streaming patch, we were only streaming the INSERT
> > only on commit so by that time we had got all the operation which has
> > done DDL and we would have already prepared tuple CID hash.
> >
>
> Okay, but I think for that case how good is that we always allow CID
> hash table to be built even if there are no catalog changes in TXN
> (see changes in ReorderBufferBuildTupleCidHash). Can't we detect that
> while resolving the cmin/cmax?
Maybe in ResolveCminCmaxDuringDecoding we can check whether tuplecid_data
is NULL; if so, we can return as unresolved, and then the caller can take
a call based on that.
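Something like this (only a sketch; I still need to check how all the
callers would handle the unresolved case):

bool
ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data, Snapshot snapshot,
                              HeapTuple htup, Buffer buffer,
                              CommandId *cmin, CommandId *cmax)
{
    /*
     * If there is no CID hash (e.g. no catalog changes were decoded yet
     * for this transaction), report cmin/cmax as unresolved and let the
     * caller decide what to do.
     */
    if (tuplecid_data == NULL)
        return false;

    /* ... existing hash lookup ... */
}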
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-14 05:41:46 |
Message-ID: | CAA4eK1J0eufoNPayZqkSfzCMozphWfKR-g=92dZh6B=K7qeAXw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, May 13, 2020 at 9:16 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > > > 4.
> > > > +static void
> > > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > > +{
> > > > + LogicalDecodingContext *ctx = cache->private_data;
> > > > + LogicalErrorCallbackState state;
> > > > + ErrorContextCallback errcallback;
> > > > +
> > > > + Assert(!ctx->fast_forward);
> > > > +
> > > > + /* We're only supposed to call this when streaming is supported. */
> > > > + Assert(ctx->streaming);
> > > > +
> > > > + /* Push callback + info on the error context stack */
> > > > + state.ctx = ctx;
> > > > + state.callback_name = "stream_start";
> > > > + /* state.report_location = apply_lsn; */
> > > >
> > > > Why can't we supply the report_location here? I think here we need to
> > > > report txn->first_lsn if this is the very first stream and
> > > > txn->final_lsn if it is any consecutive one.
> > >
> > > I am not sure about this, Because for the very first stream we will
> > > report the location of the first lsn of the stream and for the
> > > consecutive stream we will report the last lsn in the stream.
> > >
> >
> > Yeah, that doesn't seem to be consistent. How about if get it as an
> > additional parameter? The caller can pass the lsn of the very first
> > change it is trying to decode in this stream.
>
> Hmm, I think we need to call ReorderBufferIterTXNInit and
> ReorderBufferIterTXNNext and get the first change of the stream after
> that we shall call stream start then we can find out the first LSN of
> the stream. I will see how to do so that it doesn't look awkward.
> Basically, as of now, our code is of this layout.
>
> 1. stream_start;
> 2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
> while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
> {
> stream changes
> }
> 3. stream stop
>
> So if we want to know the first lsn of this stream then we shall do
> something like this
>
> 1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
> while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
> {
> 2. if first_change
> stream_start;
>
> stream changes
> }
> 3. stream stop
>
Yeah, something like that would work. I think you need to check that it
is the first change only in 'streaming' mode.
> > > >
> > > > 11.
> > > > - * HeapTupleSatisfiesHistoricMVCC.
> > > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > > + *
> > > > + * We do build the hash table even if there are no CIDs. That's
> > > > + * because when streaming in-progress transactions we may run into
> > > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > > + * yet when applying the INSERT. So we build a hash table so that
> > > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > > + *
> > > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > > + * out when decoding transaction at commit time (at which point it's
> > > > + * guaranteed to see all CIDs).
> > > > */
> > > > static void
> > > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > > *rb, ReorderBufferTXN *txn)
> > > > dlist_iter iter;
> > > > HASHCTL hash_ctl;
> > > >
> > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > > - return;
> > > > -
> > > >
> > > > I don't understand this change. Why would "INSERT followed by
> > > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > > CID?
> > >
> > > Actually, even if we haven't decoded the DDL operation but in the
> > > actual system table the tuple might have been deleted from the next
> > > operation. e.g. while we are streaming the INSERT it is possible that
> > > the truncate has already deleted that tuple and set the max for the
> > > tuple. So before streaming patch, we were only streaming the INSERT
> > > only on commit so by that time we had got all the operation which has
> > > done DDL and we would have already prepared tuple CID hash.
> > >
> >
> > Okay, but I think for that case how good is that we always allow CID
> > hash table to be built even if there are no catalog changes in TXN
> > (see changes in ReorderBufferBuildTupleCidHash). Can't we detect that
> > while resolving the cmin/cmax?
>
> Maybe in ResolveCminCmaxDuringDecoding we can see if tuplecid_data is
> NULL then we can return as unresolved and then caller can take a call
> based on that.
>
Yeah, and add appropriate comments about why we are doing so and in
what kind of scenario that can happen.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-15 09:17:27 |
Message-ID: | CAFiTN-vBmgbv0wRjupuYHZOh_4ubPi8FdTX=MDeMmn4U0ZZYGQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > I have fixed one more issue in 0010 patch. The issue was that once
> > the transaction is serialized due to the incomplete toast after
> > streaming the serialized store was not cleaned up so it was streaming
> > the same tuple multiple times.
> >
>
> I have reviewed a few patches (003, 004, and 005) and below are my comments.
>
> v20-0003-Extend-the-output-plugin-API-with-stream-methods
> ----------------------------------------------------------------------------------------
> 2.
> + <para>
> + Similar to spill-to-disk behavior, streaming is triggered when the total
> + amount of changes decoded from the WAL (for all in-progress transactions)
> + exceeds limit defined by
> <varname>logical_decoding_work_mem</varname> setting.
> + At that point the largest toplevel transaction (measured by
> amount of memory
> + currently used for decoded changes) is selected and streamed.
> + </para>
>
> I think we need to explain here the cases/exception where we need to
> spill even when stream is enabled and check if this is per latest
> implementation, otherwise, update it.
Done
> 3.
> + * To support streaming, we require change/commit/abort callbacks. The
> + * message callback is optional, similarly to regular output plugins.
>
> /similarly/similar
Done
> 4.
> +static void
> +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_start";
> + /* state.report_location = apply_lsn; */
>
> Why can't we supply the report_location here? I think here we need to
> report txn->first_lsn if this is the very first stream and
> txn->final_lsn if it is any consecutive one.
Done
> 5.
> +static void
> +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_stop";
> + /* state.report_location = apply_lsn; */
>
> Can't we report txn->final_lsn here
We were already setting this to txn->final_lsn in the 0006 patch, but I
have moved it into this patch now.
> 6. I think it will be good if we can provide an example of streaming
> changes via test_decoding at
> https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> can also explain there why the user is not expected to see the actual
> data in the stream.
I have a few problems to solve here.
- For streaming transactions as well, shall we show the actual values,
or shall we keep what is currently in the patch
(appendStringInfo(ctx->out, "streaming change for TXN %u",
txn->xid);)? I think we should show the actual values instead of what
we are doing now.
- In the documentation we cannot show a real example, because to show
the changes of an in-progress transaction we might have to insert a lot
of tuples. I think we can show partial output?
> v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> ----------------------------------------------------------------------------------------
> 7.
> + /*
> + * We don't expect direct calls to table_tuple_get_latest_tid with valid
> + * CheckXidAlive for catalog or regular tables.
>
> There is an extra space between 'CheckXidAlive' and 'for'. I can see
> similar problems in other places as well where this comment is used,
> fix those as well.
Done
> 8.
> +/*
> + * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
> + * transaction. Currently, it is used in logical decoding. It's possible
> + * that such transactions can get aborted while the decoding is ongoing in
> + * which case we skip decoding that particular transaction. To ensure that we
> + * check whether the CheckXidAlive is aborted after fetching the tuple from
> + * system tables. We also ensure that during logical decoding we never
> + * directly access the tableam or heap APIs because we are checking for the
> + * concurrent aborts only in systable_* APIs.
> + */
>
> In this comment, there is an inconsistency in the space used after
> completing the sentence. In the part "transaction. To", single space
> is used whereas at other places two spaces are used after a full stop.
Done
> v20-0005-Implement-streaming-mode-in-ReorderBuffer
> -----------------------------------------------------------------------------
> 9.
> Implement streaming mode in ReorderBuffer
>
> Instead of serializing the transaction to disk after reaching the
> maximum number of changes in memory (4096 changes), we consume the
> changes we have in memory and invoke new stream API methods. This
> happens in ReorderBufferStreamTXN() using about the same logic as
> in ReorderBufferCommit() logic.
>
> I think the above part of the commit message needs to be updated.
Done
> 10.
> Theoretically, we could get rid of the k-way merge, and append the
> changes to the toplevel xact directly (and remember the position
> in the list in case the subxact gets aborted later).
>
> I don't think this part of the commit message is correct as we
> sometimes need to spill even during streaming. Please check the
> entire commit message and update according to the latest
> implementation.
Done
> 11.
> - * HeapTupleSatisfiesHistoricMVCC.
> + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> + *
> + * We do build the hash table even if there are no CIDs. That's
> + * because when streaming in-progress transactions we may run into
> + * tuples with the CID before actually decoding them. Think e.g. about
> + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> + * yet when applying the INSERT. So we build a hash table so that
> + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> + *
> + * XXX We might limit this behavior to streaming mode, and just bail
> + * out when decoding transaction at commit time (at which point it's
> + * guaranteed to see all CIDs).
> */
> static void
> ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> *rb, ReorderBufferTXN *txn)
> dlist_iter iter;
> HASHCTL hash_ctl;
>
> - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> - return;
> -
>
> I don't understand this change. Why would "INSERT followed by
> TRUNCATE" could lead to a tuple which can come for decode before its
> CID? The patch has made changes based on this assumption in
> HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> behavior could be dependent on whether we are streaming the changes
> for in-progress xact or at the commit of a transaction. We might want
> to generate a test to once validate this behavior.
>
> Also, the comment refers to tqual.c which is wrong as this API is now
> in heapam_visibility.c.
Done.
> 12.
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access. Also reset the
> + * sysbegin_called flag.
> */
> - if (txn->base_snapshot == NULL)
> + if (!TransactionIdDidCommit(xid))
> {
> - Assert(txn->ninvalidations == 0);
> - ReorderBufferCleanupTXN(rb, txn);
> - return;
> + CheckXidAlive = xid;
> + bsysscan = false;
> }
>
> In the comment, the flag name 'sysbegin_called' should be bsysscan.
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-15 09:17:57 |
Message-ID: | CAFiTN-v_ydjaCksAA3obA67LaC5imN4mbH4J+vr+NBb6YPvmrA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > > ----------------------------------------------------------------------------------------
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > > +
> > > +static void
> > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > > + int nrelations, Relation relations[],
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > In the above and similar APIs, there are parameters like relation
> > > which are not used. I think you should add some comments atop these
> > > APIs to explain why it is so? I guess it is because we want to keep
> > > them similar to non-stream version of APIs and we can't display
> > > relation or other information as the transaction is still in-progress.
> >
> > I think because the interfaces are designed that way because other
> > decoding plugins might need it e.g. in pgoutput we need change and
> > relation but not here. We have other similar examples also e.g.
> > pg_decode_message has the parameter txn but not used. Do you think we
> > still need to add comments?
> >
>
> In that case, we can leave but lets ensure that we are not exposing
> any parameter which is not used and if there is any due to some
> reason, we should document it. I will also look into this.
Ok
> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here? I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > I am not sure about this, Because for the very first stream we will
> > report the location of the first lsn of the stream and for the
> > consecutive stream we will report the last lsn in the stream.
> >
>
> Yeah, that doesn't seem to be consistent. How about if get it as an
> additional parameter? The caller can pass the lsn of the very first
> change it is trying to decode in this stream.
Done
> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > > */
> > > static void
> > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > > dlist_iter iter;
> > > HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change. Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?
> >
> > Actually, even if we haven't decoded the DDL operation but in the
> > actual system table the tuple might have been deleted from the next
> > operation. e.g. while we are streaming the INSERT it is possible that
> > the truncate has already deleted that tuple and set the max for the
> > tuple. So before streaming patch, we were only streaming the INSERT
> > only on commit so by that time we had got all the operation which has
> > done DDL and we would have already prepared tuple CID hash.
> >
>
> Okay, but I think for that case how good is that we always allow CID
> hash table to be built even if there are no catalog changes in TXN
> (see changes in ReorderBufferBuildTupleCidHash). Can't we detect that
> while resolving the cmin/cmax?
Done
>
> Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
> ----------------------------------------------------------------------------------------------------------------
> 1.
> /*
> - * Binary heap comparison function.
> + * Binary heap comparison function (regular non-streaming iterator).
> */
> static int
> ReorderBufferIterCompare(Datum a, Datum b, void *arg)
>
> It seems to me the above comment change is not required as per the latest patch.
Done
> 2.
> * For subtransactions, we only mark them as streamed when there are
> + * any changes in them.
> + *
> + * We do it this way because of aborts - we don't want to send aborts
> + * for XIDs the downstream is not aware of. And of course, it always
> + * knows about the toplevel xact (we send the XID in all messages),
> + * but we never stream XIDs of empty subxacts.
> + */
> + if ((!txn->toptxn) || (txn->nentries_mem != 0))
> + txn->txn_flags |= RBTXN_IS_STREAMED;
>
> /when there are any changes in them/when there are changes in them. I
> think we don't need 'any' in the above sentence.
Done
> 3.
> And, during catalog scan we can check the status of the xid and
> + * if it is aborted we will report a specific error that we can ignore. We
> + * might have already streamed some of the changes for the aborted
> + * (sub)transaction, but that is fine because when we decode the abort we will
> + * stream abort message to truncate the changes in the subscriber.
> + */
> +static inline void
> +SetupCheckXidLive(TransactionId xid)
>
> In the above comment, I don't think it is right to say that we ignore
> the error raised due to the aborted transaction. We need to say that
> we discard the already streamed changes on such an error.
Done.
> 4.
> +static inline void
> +SetupCheckXidLive(TransactionId xid)
> +{
> /*
> - * If this transaction has no snapshot, it didn't make any changes to the
> - * database, so there's nothing to decode. Note that
> - * ReorderBufferCommitChild will have transferred any snapshots from
> - * subtransactions if there were any.
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access. Also reset the
> + * sysbegin_called flag.
> */
> - if (txn->base_snapshot == NULL)
> + if (!TransactionIdDidCommit(xid))
> {
> - Assert(txn->ninvalidations == 0);
> - ReorderBufferCleanupTXN(rb, txn);
> - return;
> + CheckXidAlive = xid;
> + bsysscan = false;
> }
>
> I think this function is inline as it needs to be called for each
> change. If that is the case and otherwise also, isn't it better that
> we check if passed xid is the same as CheckXidAlive before checking
> TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> calling it for each change might not be a good idea?
Done. Also, I think it is better to check TransactionIdIsInProgress
instead of !TransactionIdDidCommit. I have changed that as well.
> 5.
> setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access. Also reset the
> + * sysbegin_called flag.
>
> /if the xid aborted/if the xid is aborted. missing comma after Also.
Done
> 6.
> ReorderBufferProcessTXN()
> {
> ..
> - /* build data to be able to lookup the CommandIds of catalog tuples */
> + /*
> + * build data to be able to lookup the CommandIds of catalog tuples
> + */
> ReorderBufferBuildTupleCidHash(rb, txn);
> ..
> }
>
> Is there a need to change the formatting of the comment?
No need; changed it back.
>
> 7.
> ReorderBufferProcessTXN()
> {
> ..
> if (using_subtxn)
> - BeginInternalSubTransaction("replay");
> + BeginInternalSubTransaction("stream");
> else
> StartTransactionCommand();
> ..
> }
>
> I am not sure changing unconditionally "replay" to "stream" is a good
> idea. How about something like BeginInternalSubTransaction(streaming
> ? "stream" : "replay");?
Done
> 8.
> @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> * use as a normal record. It'll be cleaned up at the end
> * of INSERT processing.
> */
> - if (specinsert == NULL)
> - elog(ERROR, "invalid ordering of speculative insertion changes");
>
> You have removed this check but all other handling of specinsert is
> same as far as this patch is concerned. Why so?
Seems like a merge issue, or a leftover from the old design of the
toast handling where we were streaming with the partial tuple.
Fixed now.
> 9.
> @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> * freed/reused while restoring spooled data from
> * disk.
> */
> - Assert(change->data.tp.newtuple != NULL);
> -
> dlist_delete(&change->node);
>
> Why is this Assert removed?
Same cause as above so fixed.
> 10.
> @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> relations[nrelations++] = relation;
> }
>
> - rb->apply_truncate(rb, txn, nrelations, relations, change);
> + if (streaming)
> + {
> + rb->stream_truncate(rb, txn, nrelations, relations, change);
> +
> + /* Remember that we have sent some data. */
> + change->txn->any_data_sent = true;
> + }
> + else
> + rb->apply_truncate(rb, txn, nrelations, relations, change);
>
> Can we encapsulate this in a separate function like
> ReorderBufferApplyTruncate or something like that? Basically, rather
> than having streaming check in this function, lets do it in some other
> internal function. And we can likewise do it for all the streaming
> checks in this function or at least whereever it is feasible. That
> will make this function look clean.
Done for truncate and change. I think we can create a few more such
functions for start/stop and cleanup handling on error. I will work on
that.
> 11.
> + * We currently can only decode a transaction's contents when its commit
> + * record is read because that's the only place where we know about cache
> + * invalidations. Thus, once a toplevel commit is read, we iterate over the top
> + * and subtransactions (using a k-way merge) and replay the changes in lsn
> + * order.
> + */
> +void
> +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> {
> ..
>
> I think the above comment needs to be updated after this patch. This
> API can now be used during the decode of both a in-progress and a
> committed transaction.
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-15 10:34:18 |
Message-ID: | CAA4eK1JUyU32AMYNYT=xD5y4P1zq8uyWNbHz=aQ8Be5sVp0UBw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> > 6. I think it will be good if we can provide an example of streaming
> > changes via test_decoding at
> > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > can also explain there why the user is not expected to see the actual
> > data in the stream.
>
> I have a few problems to solve here.
> - With streaming transaction also shall we show the actual values or
> we shall do like it is currently in the patch
> (appendStringInfo(ctx->out, "streaming change for TXN %u",
> txn->xid);). I think we should show the actual values instead of what
> we are doing now.
>
I think the reason we don't want to display the tuple at this stage is
that it is not clear by this time whether the transaction will commit or
abort. I am not sure if displaying the contents of aborted
transactions is a good idea, but if there is a reason for doing so, we
can do it later as well.
> - In the example we can not show a real example, because of the
> in-progress transaction to show the changes, we might have to
> implement a lot of tuple. I think we can show the partial output?
>
I think we can display what the API will actually display; what is the
confusion here?
I have a few more comments on the previous version of the patch,
v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have already
fixed any of them, then leave those and fix the others.
Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
}
case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+ change->data.msg.prefix,
+ change->data.msg.message_size,
+ change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+ change->data.msg.prefix,
+ change->data.msg.message_size,
+ change->data.msg.message);
Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?
2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }
I am not sure if it is good to use final_lsn for this purpose. See
comments for this variable in reorderbuffer.h. Basically, it is used
for a specific purpose on different occasions. Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well. Can we pass an
additional parameter to stream_stop() instead?
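As a sketch of that alternative (the callback signature here is
illustrative, not the committed API), the stream-stop path could carry
the LSN explicitly:

/* Let the callback receive the end-of-stream LSN explicitly. */
typedef void (*ReorderBufferStreamStopCB) (ReorderBuffer *rb,
                                           ReorderBufferTXN *txn,
                                           XLogRecPtr last_lsn);

/* ... and the caller would pass prev_lsn instead of final_lsn: */
if (streaming)
    rb->stream_stop(rb, txn, prev_lsn);

That keeps final_lsn reserved for its existing uses (serialization and
cleanup).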
3.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+ txn, command_id);
This code is used in two different places; can we try to keep it in
a single function?
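A small helper along these lines could fold the duplicated logic into
one place (a sketch; the helper name is made up for illustration):

static void
ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
                             Snapshot snapshot_now, CommandId command_id)
{
    /* Remember the command ID and snapshot for the streaming run. */
    txn->command_id = command_id;

    /* Avoid copying if it's already copied. */
    if (snapshot_now->copied)
        txn->snapshot_now = snapshot_now;
    else
        txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
                                                  txn, command_id);
}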
4.
In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
the try and catch block. If there is an error after calling it in a
try block, we might call it again via catch. I think that will lead
to sending a stop message twice. Won't that be a problem? See the
usage of iterstate in the catch block, we have made it safe from a
similar problem.
5.
+ if (streaming)
+ {
+ /* Discard the changes that we just streamed. */
+ ReorderBufferTruncateTXN(rb, txn);
- PG_RE_THROW();
+ /* Re-throw only if it's not an abort. */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ {
+ FlushErrorState();
+ FreeErrorData(errdata);
+ errdata = NULL;
+
I think here we can write a few comments on why we are doing
error-code specific handling; basically, explain a bit about
concurrent abort handling and/or refer to the part of the comments
where it is explained.
6.
PG_CATCH();
{
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData *errdata = CopyErrorData();
I don't understand the usage of memory context in this part of the
code. Basically, you are switching to CurrentMemoryContext here, do
some error handling and then again reset back to some random context
before rethrowing the error. If there is some purpose for it, then it
might be better if you can write a few comments to explain the same.
7.
+ReorderBufferCommit()
{
..
+ /*
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
+ *
+ * XXX Called after everything (origin ID and LSN, ...) is stored in the
+ * transaction, so we don't pass that directly.
+ *
+ * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+ */
+ if (rbtxn_is_streamed(txn))
+ {
+ ReorderBufferStreamCommit(rb, txn);
+ return;
+ }
+
..
}
"XXX Somewhat hackish redirection, perhaps needs to be refactored?"
What kind of refactoring can we do here? To me, it looks okay.
8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
}
Why are we marking top transaction here?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-15 10:50:20 |
Message-ID: | CAFiTN-vDb_wB9GdaK2BKHyhoZ9SwOwmTEg_UCy2mq=pMgb4J4Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > > 6. I think it will be good if we can provide an example of streaming
> > > changes via test_decoding at
> > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > > can also explain there why the user is not expected to see the actual
> > > data in the stream.
> >
> > I have a few problems to solve here.
> > - With streaming transactions too, shall we show the actual values, or
> > shall we do what is currently in the patch
> > (appendStringInfo(ctx->out, "streaming change for TXN %u",
> > txn->xid);)? I think we should show the actual values instead of what
> > we are doing now.
> >
>
> I think the reason we don't want to display the tuple at this stage is
> that it is not yet clear whether the transaction will commit or abort.
> I am not sure displaying the contents of aborted transactions is a
> good idea, but if there is a reason for doing so, we can do it later
> as well.
Ok.
>
> > - In the example we cannot show a real example: to get an in-progress
> > transaction to stream its changes, we might have to insert a lot of
> > tuples. I think we can show partial output?
> >
>
> I think we can display what the API will actually display; what is the
> confusion here?
What I meant is that even with logical_decoding_work_mem=64kB, we need
quite a few changes in a transaction to stream it, so the example
output will be quite big. So I said we might not show the real output
but instead show just a few lines and cut the rest. But I got your
point: we can just show how it will look.
>
> I have a few more comments on the previous version of patch
> v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
> any, then leave those and fix others.
>
> Review comments:
> ------------------------------
> 1.
> @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> TransactionId xid,
> }
>
> case REORDER_BUFFER_CHANGE_MESSAGE:
> - rb->message(rb, txn, change->lsn, true,
> - change->data.msg.prefix,
> - change->data.msg.message_size,
> - change->data.msg.message);
> + if (streaming)
> + rb->stream_message(rb, txn, change->lsn, true,
> + change->data.msg.prefix,
> + change->data.msg.message_size,
> + change->data.msg.message);
> + else
> + rb->message(rb, txn, change->lsn, true,
> + change->data.msg.prefix,
> + change->data.msg.message_size,
> + change->data.msg.message);
>
> Don't we need to set any_data_sent flag while streaming messages as we
> do for other types of changes?
Actually, the pgoutput plugin doesn't send any data on stream_message.
But I agree it is unclear how other plugins will handle this. I will
analyze this part again; maybe we have to keep such a flag at the
plugin level, and whether a stop is sent or not can also be handled at
the plugin level.
> 2.
> + if (streaming)
> + {
> + /*
> + * Set the last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + if (!XLogRecPtrIsInvalid(prev_lsn))
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> I am not sure if it is good to use final_lsn for this purpose. See
> comments for this variable in reorderbuffer.h. Basically, it is used
> for a specific purpose on different occasions. Now, if we want to
> start using it for a new purpose, we need to study its interaction
> with all other places and update the comments as well. Can we pass an
> additional parameter to stream_stop() instead?
I think it was in sync with the spill code, right? I mean, the last
change we spill is set as the final_lsn, and the same is done here.
The other comments look fine, so I will work on them and reply separately.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-15 11:05:32 |
Message-ID: | CAA4eK1KLFynUU6jqB0nYJRDZ6GGdkUv=ydJ1pycKbv=92zsJpA@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
>
> >
> > > - In the example we cannot show a real example: to get an in-progress
> > > transaction to stream its changes, we might have to insert a lot of
> > > tuples. I think we can show partial output?
> > >
> >
> > I think we can display what the API will actually display; what is the
> > confusion here?
>
> What I meant is that even with logical_decoding_work_mem=64kB, we need
> quite a few changes in a transaction to stream it, so the example
> output will be quite big. So I said we might not show the real output
> but instead show just a few lines and cut the rest. But I got your
> point: we can just show how it will look.
>
Right.
> >
> > I have a few more comments on the previous version of patch
> > v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
> > any, then leave those and fix others.
> >
> > Review comments:
> > ------------------------------
> > 1.
> > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > TransactionId xid,
> > }
> >
> > case REORDER_BUFFER_CHANGE_MESSAGE:
> > - rb->message(rb, txn, change->lsn, true,
> > - change->data.msg.prefix,
> > - change->data.msg.message_size,
> > - change->data.msg.message);
> > + if (streaming)
> > + rb->stream_message(rb, txn, change->lsn, true,
> > + change->data.msg.prefix,
> > + change->data.msg.message_size,
> > + change->data.msg.message);
> > + else
> > + rb->message(rb, txn, change->lsn, true,
> > + change->data.msg.prefix,
> > + change->data.msg.message_size,
> > + change->data.msg.message);
> >
> > Don't we need to set any_data_sent flag while streaming messages as we
> > do for other types of changes?
>
> Actually, the pgoutput plugin doesn't send any data on stream_message.
> But I agree it is unclear how other plugins will handle this. I will
> analyze this part again; maybe we have to keep such a flag at the
> plugin level, and whether a stop is sent or not can also be handled at
> the plugin level.
>
Okay, lets discuss this after your analysis.
> > 2.
> > + if (streaming)
> > + {
> > + /*
> > + * Set the last of the stream as the final lsn before calling
> > + * stream stop.
> > + */
> > + if (!XLogRecPtrIsInvalid(prev_lsn))
> > + txn->final_lsn = prev_lsn;
> > + rb->stream_stop(rb, txn);
> > + }
> >
> > I am not sure if it is good to use final_lsn for this purpose. See
> > comments for this variable in reorderbuffer.h. Basically, it is used
> > for a specific purpose on different occasions. Now, if we want to
> > start using it for a new purpose, we need to study its interaction
> > with all other places and update the comments as well. Can we pass an
> > additional parameter to stream_stop() instead?
>
> I think it was in sync with the spill code, right? I mean, the last
> change we spill is set as the final_lsn, and the same is done here.
>
But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
changes. Now, if in some case we first do serialization, then perform
streaming, and then try to call ReorderBufferRestoreCleanup(), it
might not work as intended. This might not happen today, but I don't
think we have any protection to avoid it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-15 11:50:55 |
Message-ID: | CAFiTN-teGOW6wwXxfR6A2v6Rg6cWnLvCk-6thwghcycpwrbWUQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2020 at 4:35 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> >
> > >
> > > > - In the example we cannot show a real example: to get an in-progress
> > > > transaction to stream its changes, we might have to insert a lot of
> > > > tuples. I think we can show partial output?
> > > >
> > >
> > > I think we can display what the API will actually display; what is the
> > > confusion here?
> >
> > What I meant is that even with logical_decoding_work_mem=64kB, we need
> > quite a few changes in a transaction to stream it, so the example
> > output will be quite big. So I said we might not show the real output
> > but instead show just a few lines and cut the rest. But I got your
> > point: we can just show how it will look.
> >
>
> Right.
>
> > >
> > > I have a few more comments on the previous version of patch
> > > v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
> > > any, then leave those and fix others.
> > >
> > > Review comments:
> > > ------------------------------
> > > 1.
> > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > TransactionId xid,
> > > }
> > >
> > > case REORDER_BUFFER_CHANGE_MESSAGE:
> > > - rb->message(rb, txn, change->lsn, true,
> > > - change->data.msg.prefix,
> > > - change->data.msg.message_size,
> > > - change->data.msg.message);
> > > + if (streaming)
> > > + rb->stream_message(rb, txn, change->lsn, true,
> > > + change->data.msg.prefix,
> > > + change->data.msg.message_size,
> > > + change->data.msg.message);
> > > + else
> > > + rb->message(rb, txn, change->lsn, true,
> > > + change->data.msg.prefix,
> > > + change->data.msg.message_size,
> > > + change->data.msg.message);
> > >
> > > Don't we need to set any_data_sent flag while streaming messages as we
> > > do for other types of changes?
> >
> > Actually, the pgoutput plugin doesn't send any data on stream_message.
> > But I agree it is unclear how other plugins will handle this. I will
> > analyze this part again; maybe we have to keep such a flag at the
> > plugin level, and whether a stop is sent or not can also be handled at
> > the plugin level.
> >
>
> Okay, lets discuss this after your analysis.
>
> > > 2.
> > > + if (streaming)
> > > + {
> > > + /*
> > > + * Set the last of the stream as the final lsn before calling
> > > + * stream stop.
> > > + */
> > > + if (!XLogRecPtrIsInvalid(prev_lsn))
> > > + txn->final_lsn = prev_lsn;
> > > + rb->stream_stop(rb, txn);
> > > + }
> > >
> > > I am not sure if it is good to use final_lsn for this purpose. See
> > > comments for this variable in reorderbuffer.h. Basically, it is used
> > > for a specific purpose on different occasions. Now, if we want to
> > > start using it for a new purpose, we need to study its interaction
> > > with all other places and update the comments as well. Can we pass an
> > > additional parameter to stream_stop() instead?
> >
> > I think it was in sync with the spill code, right? I mean, the last
> > change we spill is set as the final_lsn, and the same is done here.
> >
>
> But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
> changes. Now, if in some case we first do serialization, then perform
> streaming, and then try to call ReorderBufferRestoreCleanup(), it
> might not work as intended. This might not happen today, but I don't
> think we have any protection to avoid it.
If streaming is complete then we will remove the serialized flag, so
it will not cause any issue. However, we can avoid setting final_lsn
here and instead pass the last LSN of the stream as a parameter to
stream_stop.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-17 07:10:46 |
Message-ID: | CAFiTN-sqo+UaEzP0y5sAvaLd60awR55fgt6HFN+RR1MojRLV1w@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > > 6. I think it will be good if we can provide an example of streaming
> > > changes via test_decoding at
> > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > > can also explain there why the user is not expected to see the actual
> > > data in the stream.
> >
> > I have a few problems to solve here.
> > - With streaming transactions too, shall we show the actual values, or
> > shall we do what is currently in the patch
> > (appendStringInfo(ctx->out, "streaming change for TXN %u",
> > txn->xid);)? I think we should show the actual values instead of what
> > we are doing now.
> >
>
> I think the reason we don't want to display the tuple at this stage is
> that it is not yet clear whether the transaction will commit or abort.
> I am not sure displaying the contents of aborted transactions is a
> good idea, but if there is a reason for doing so, we can do it later
> as well.
>
> > - In the example we cannot show a real example: to get an in-progress
> > transaction to stream its changes, we might have to insert a lot of
> > tuples. I think we can show partial output?
> >
>
> I think we can display what the API will actually display; what is the
> confusion here?
Added an example in the v22-0011 patch, where I have added the API to
get the streaming changes.
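For reference, an abridged example could look roughly like this (the
function call, option name, and the stream open/close lines are
assumptions for illustration; only the per-change line follows the
format string quoted above):

postgres=# SELECT data FROM pg_logical_slot_get_changes('slot', NULL,
               NULL, 'stream-changes', '1');
                     data
 --------------------------------------------------
  opening a streamed block for transaction TXN 508
  streaming change for TXN 508
  streaming change for TXN 508
  ...
  closing a streamed block for transaction TXN 508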
> I have a few more comments on the previous version of patch
> v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed
> any, then leave those and fix others.
>
> Review comments:
> ------------------------------
> 1.
> @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> TransactionId xid,
> }
>
> case REORDER_BUFFER_CHANGE_MESSAGE:
> - rb->message(rb, txn, change->lsn, true,
> - change->data.msg.prefix,
> - change->data.msg.message_size,
> - change->data.msg.message);
> + if (streaming)
> + rb->stream_message(rb, txn, change->lsn, true,
> + change->data.msg.prefix,
> + change->data.msg.message_size,
> + change->data.msg.message);
> + else
> + rb->message(rb, txn, change->lsn, true,
> + change->data.msg.prefix,
> + change->data.msg.message_size,
> + change->data.msg.message);
>
> Don't we need to set any_data_sent flag while streaming messages as we
> do for other types of changes?
I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is incomplete, as the
output plugin can also decide not to send anything. So I think this
should not be done as part of this patch and can be done separately.
I think there is already a thread for handling the same[1]
> 2.
> + if (streaming)
> + {
> + /*
> + * Set the last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + if (!XLogRecPtrIsInvalid(prev_lsn))
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> I am not sure if it is good to use final_lsn for this purpose. See
> comments for this variable in reorderbuffer.h. Basically, it is used
> for a specific purpose on different occasions. Now, if we want to
> start using it for a new purpose, we need to study its interaction
> with all other places and update the comments as well. Can we pass an
> additional parameter to stream_stop() instead?
Done
> 3.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> +
> + /* Avoid copying if it's already copied. */
> + if (snapshot_now->copied)
> + txn->snapshot_now = snapshot_now;
> + else
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> + txn, command_id);
>
> This code is used in two different places; can we try to keep it in
> a single function?
Done
> 4.
> In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> the try and catch block. If there is an error after calling it in a
> try block, we might call it again via catch. I think that will lead
> to sending a stop message twice. Won't that be a problem? See the
> usage of iterstate in the catch block, we have made it safe from a
> similar problem.
IMHO, we don't need that, because we only call stream_stop in the
catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if
we have already stopped the stream in the TRY block, then we should
not get that error. I have added comments to that effect.
> 5.
> + if (streaming)
> + {
> + /* Discard the changes that we just streamed. */
> + ReorderBufferTruncateTXN(rb, txn);
>
> - PG_RE_THROW();
> + /* Re-throw only if it's not an abort. */
> + if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
> + {
> + MemoryContextSwitchTo(ecxt);
> + PG_RE_THROW();
> + }
> + else
> + {
> + FlushErrorState();
> + FreeErrorData(errdata);
> + errdata = NULL;
> +
>
> I think here we can write a few comments on why we are doing
> error-code specific handling; basically, explain a bit about
> concurrent abort handling and/or refer to the part of the comments
> where it is explained.
Done
> 6.
> PG_CATCH();
> {
> + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> + ErrorData *errdata = CopyErrorData();
>
> I don't understand the usage of memory context in this part of the
> code. Basically, you are switching to CurrentMemoryContext here, do
> some error handling and then again reset back to some random context
> before rethrowing the error. If there is some purpose for it, then it
> might be better if you can write a few comments to explain the same.
Basically, ccxt is the CurrentMemoryContext when we started the
streaming, and ecxt is the context when we catch the error. So,
before this change, it would rethrow in the context in which we caught
the error, i.e. ecxt. What we are trying to do is switch back to the
normal context (ccxt) and copy the error data in the normal context.
And if we are not handling it gracefully, then we put it back in the
context it was in, and rethrow.
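In other words, the intended flow is roughly the following (a
commented sketch of the pattern being described, assuming ccxt was
saved before entering the PG_TRY block):

PG_CATCH();
{
    /*
     * Switch back to the pre-error context, so the copied ErrorData
     * is allocated somewhere that survives error cleanup.
     */
    MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
    ErrorData  *errdata = CopyErrorData();

    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
    {
        /* Concurrent abort: handle it gracefully and carry on. */
        FlushErrorState();
        FreeErrorData(errdata);
    }
    else
    {
        /* Not ours to handle: restore the error context and rethrow. */
        MemoryContextSwitchTo(ecxt);
        PG_RE_THROW();
    }
}
PG_END_TRY();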
>
> 7.
> +ReorderBufferCommit()
> {
> ..
> + /*
> + * If the transaction was (partially) streamed, we need to commit it in a
> + * 'streamed' way. That is, we first stream the remaining part of the
> + * transaction, and then invoke stream_commit message.
> + *
> + * XXX Called after everything (origin ID and LSN, ...) is stored in the
> + * transaction, so we don't pass that directly.
> + *
> + * XXX Somewhat hackish redirection, perhaps needs to be refactored?
> + */
> + if (rbtxn_is_streamed(txn))
> + {
> + ReorderBufferStreamCommit(rb, txn);
> + return;
> + }
> +
> ..
> }
>
> "XXX Somewhat hackish redirection, perhaps needs to be refactored?"
> What kind of refactoring can we do here? To me, it looks okay.
I think it looks fine to me also. So I have removed this comment.
> 8.
> @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> *rb, TransactionId xid,
> txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
>
> txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> +
> + /*
> + * TOCHECK: Mark toplevel transaction as having catalog changes too
> + * if one of its children has.
> + */
> + if (txn->toptxn != NULL)
> + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> }
>
> Why are we marking top transaction here?
We need to mark the top transaction to decide whether to build the
tuplecid hash or not. In non-streaming mode, we only send at commit
time, and at commit time we know whether the top transaction has any
catalog changes based on the invalidation messages, so we mark the top
transaction there, in DecodeCommit. Since here we are not waiting
until commit, we need to mark the top transaction as soon as we mark
any of its child transactions.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-18 10:40:40 |
Message-ID: | CAA4eK1JpRrES8EAqJtBQfvo-RapA2pSnLJ-SeS+3GS39zNG0xA@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > Review comments:
> > ------------------------------
> > 1.
> > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > TransactionId xid,
> > }
> >
> > case REORDER_BUFFER_CHANGE_MESSAGE:
> > - rb->message(rb, txn, change->lsn, true,
> > - change->data.msg.prefix,
> > - change->data.msg.message_size,
> > - change->data.msg.message);
> > + if (streaming)
> > + rb->stream_message(rb, txn, change->lsn, true,
> > + change->data.msg.prefix,
> > + change->data.msg.message_size,
> > + change->data.msg.message);
> > + else
> > + rb->message(rb, txn, change->lsn, true,
> > + change->data.msg.prefix,
> > + change->data.msg.message_size,
> > + change->data.msg.message);
> >
> > Don't we need to set any_data_sent flag while streaming messages as we
> > do for other types of changes?
>
> I think any_data_sent was added to avoid sending an abort to the
> subscriber if we haven't sent any data, but this is incomplete, as the
> output plugin can also decide not to send anything. So I think this
> should not be done as part of this patch and can be done separately.
> I think there is already a thread for handling the same[1]
>
Hmm, but prior to this patch we never used to send (empty) aborts,
whereas now that will be possible. It is probably okay to deal with
that in the other patch you mentioned, but I felt at least
any_data_sent would work for some cases. OTOH, it appears to be a
half-baked solution, so we should probably refrain from adding it.
BTW, how does the pgoutput plugin deal with it? I see that
apply_handle_stream_abort will unconditionally try to unlink the file,
and that will probably fail. Have you tested this scenario after your
latest changes?
>
> > 4.
> > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> > the try and catch block. If there is an error after calling it in a
> > try block, we might call it again via catch. I think that will lead
> > to sending a stop message twice. Won't that be a problem? See the
> > usage of iterstate in the catch block, we have made it safe from a
> > similar problem.
>
> IMHO, we don't need that, because we only call stream_stop in the
> catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if
> we have already stopped the stream in the TRY block, then we should
> not get that error. I have added comments to that effect.
>
I am still slightly nervous about it, as I don't see any solid
guarantee for it. You are right as the code stands today, but code
added in the future might not keep that true. I feel it is better to
have an Assert here to ensure that stream_stop won't be called a
second time. I don't see any good way of doing that other than
maintaining a flag or some state, but I think it will be good to
ensure this.
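Something as simple as a local flag could provide that guarantee (a
sketch of the suggested Assert, not code from the patch):

bool        stream_stopped = false;

/* In the TRY block, after streaming the last batch of changes: */
rb->stream_stop(rb, txn, prev_lsn);
stream_stopped = true;

/* In the CATCH block, on a concurrent abort: */
Assert(!stream_stopped);
rb->stream_stop(rb, txn, prev_lsn);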
>
> > 6.
> > PG_CATCH();
> > {
> > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> > + ErrorData *errdata = CopyErrorData();
> >
> > I don't understand the usage of memory context in this part of the
> > code. Basically, you are switching to CurrentMemoryContext here, do
> > some error handling and then again reset back to some random context
> > before rethrowing the error. If there is some purpose for it, then it
> > might be better if you can write a few comments to explain the same.
>
> Basically, ccxt is the CurrentMemoryContext when we started the
> streaming, and ecxt is the context when we catch the error. So,
> before this change, it would rethrow in the context in which we caught
> the error, i.e. ecxt. What we are trying to do is switch back to the
> normal context (ccxt) and copy the error data in the normal context.
> And if we are not handling it gracefully, then we put it back in the
> context it was in, and rethrow.
>
Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
we need to clean up the reorderbuffer by calling
ReorderBufferCleanupTXN? If so, then you can try to combine it with
the not-streaming else loop.
>
> > 8.
> > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > *rb, TransactionId xid,
> > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> >
> > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > +
> > + /*
> > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > + * if one of its children has.
> > + */
> > + if (txn->toptxn != NULL)
> > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > }
> >
> > Why are we marking top transaction here?
>
> We need to mark the top transaction to decide whether to build the
> tuplecid hash or not. In non-streaming mode, we only send at commit
> time, and at commit time we know whether the top transaction has any
> catalog changes based on the invalidation messages, so we mark the top
> transaction there, in DecodeCommit. Since here we are not waiting
> until commit, we need to mark the top transaction as soon as we mark
> any of its child transactions.
>
But how does it help? We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn, which is
anyway done in DecodeCommit, and that too after setting this flag for
the top transaction if required. So how does setting it while
processing a subxid help? Also, even if we have to do it, won't it
needlessly add the xid to the builder->committed.xip array?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-18 12:27:28 |
Message-ID: | CAA4eK1J0GLCqX6ab3u3Z_rxak1k4LchPptZmn23H9CGDJJ0hYA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
1.
+ /*
+ * If this is a toast insert then set the corresponding bit. Otherwise, if
+ * we have toast insert bit set and this is insert/update then clear the
+ * bit.
+ */
+ if (toast_insert)
+ toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
Here, it might be better to add a comment on why we expect only an
insert/update. Also, it might be better to add an assert for other
operations.
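For example, roughly like this (a sketch of the suggestion; whether
other change types can really never intervene here is exactly what
such a comment should justify):

if (toast_insert)
    toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
else if (rbtxn_has_toast_insert(txn))
{
    /*
     * A run of toast-chunk inserts must be followed by the INSERT or
     * UPDATE on the main table that completes the tuple.
     */
    Assert(ChangeIsInsertOrUpdate(change->action));
    toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
}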
2.
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
* disk.
*/
dlist_delete(&change->node);
- ReorderBufferToastAppendChunk(rb, txn, relation,
- change);
+ ReorderBufferToastAppendChunk(rb, txn, relation,
+ change);
}
This seems to be a spurious change.
3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple. So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {
This comment is just saying what you are doing in the if-check. I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose, how about naming it as
'change_complete' or something like that. The check has many
conditions, can we move it to a separate function to make the code
here look clean?
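For instance, the multi-condition check could be folded into a
predicate like this (a sketch; the function name and the exact set of
conditions are illustrative):

/*
 * Can this serialized transaction be streamed now that its latest
 * change has made the queued changes complete?
 */
static bool
ReorderBufferCanStreamSerialized(ReorderBuffer *rb,
                                 ReorderBufferTXN *txn,
                                 ReorderBufferTXN *toptxn,
                                 bool change_complete)
{
    return ReorderBufferCanStream(rb) &&
        change_complete &&
        rbtxn_is_serialized(toptxn) &&
        !rbtxn_has_toast_insert(txn) &&
        !rbtxn_has_spec_insert(txn);
}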
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-19 09:04:33 |
Message-ID: | CAA4eK1JN23VBcfDXRMawwUaqU=9zDrZ6PuaR01kHLskJz+jJbw@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>
> 3.
> + /*
> + * If streaming is enable and we have serialized this transaction because
> + * it had incomplete tuple. So if now we have got the complete tuple we
> + * can stream it.
> + */
> + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> + {
>
> This comment is just saying what you are doing in the if-check. I
> think you need to explain the rationale behind it. I don't like the
> variable name 'can_stream' because it matches ReorderBufferCanStream
> whereas it is for a different purpose, how about naming it as
> 'change_complete' or something like that. The check has many
> conditions, can we move it to a separate function to make the code
> here look clean?
>
Do we really need this? Immediately after this check, we are calling
ReorderBufferCheckMemoryLimit which will anyway stream the changes if
required. Can we move the changes related to the detection of
incomplete data to a separate function?
Some more comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
+ toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+ can_stream = true;
+ }
..
+#define ChangeIsInsertOrUpdate(action) \
+ (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+ ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+ ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
How can we clear the RBTXN_HAS_TOAST_INSERT flag on
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
IIUC, the basic idea used to handle incomplete changes (which is
possible in case of toast tuples and speculative inserts) is to mark
such TXNs as containing incomplete changes and then while finding the
largest top-level TXN for streaming, we ignore such TXN's and move to
next largest TXN. If none of the TXNs have complete changes then we
choose the largest (sub)transaction and spill the same to make the
in-memory changes below logical_decoding_work_mem threshold. This
idea can work but the strategy to choose the transaction is suboptimal
for cases where TXNs have some changes which are complete followed by
an incomplete toast or speculative tuple. I was having an offlist
discussion with Robert on this problem and he suggested that it would
be better if we track the complete part of changes separately and then
we can avoid the drawback mentioned above. I have thought about this
and I think it can work if we track the size and LSN of completed
changes. I think we need to ensure that if there is concurrent abort
then we discard all changes for current (sub)transaction not only up
to completed changes LSN whereas if the streaming is successful then
we can truncate the changes only up to completed changes LSN. What do
you think?
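One way to track that would be two extra per-transaction fields,
roughly as follows (a sketch of the idea under discussion, not actual
patch code; the field names are made up):

/* In ReorderBufferTXN: remember how much of the change queue is complete. */
Size        complete_size;  /* total size up to the last complete change */
XLogRecPtr  complete_lsn;   /* LSN of the last complete change */

On successful streaming we would then truncate the queue only up to
complete_lsn, while a concurrent abort would still discard all changes
of the (sub)transaction.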
I wonder why you have done this as 0010 in the patch series; it should
be 0006, after the
0005-Implement-streaming-mode-in-ReorderBuffer.patch. If we can do it
that way, it would be easier for me to review. Is there a reason for
not doing so?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-19 10:00:55 |
Message-ID: | CAFiTN-urcPsyoOJAuhhsH=s02jgxWg8Cyr3b+s_kuXChoMKJUg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > 3.
> > + /*
> > + * If streaming is enable and we have serialized this transaction because
> > + * it had incomplete tuple. So if now we have got the complete tuple we
> > + * can stream it.
> > + */
> > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > + {
> >
> > This comment is just saying what you are doing in the if-check. I
> > think you need to explain the rationale behind it. I don't like the
> > variable name 'can_stream' because it matches ReorderBufferCanStream
> > whereas it is for a different purpose, how about naming it as
> > 'change_complete' or something like that. The check has many
> > conditions, can we move it to a separate function to make the code
> > here look clean?
> >
>
> Do we really need this? Immediately after this check, we are calling
> ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> required.
Actually, ReorderBufferCheckMemoryLimit is only meant for checking
whether we need to stream the changes due to the memory limit. But
suppose that when the memory limit was exceeded we could not stream
the transaction because there was an incomplete toast insert, so we
serialized it. Now, when we get the tuple that makes the changes
complete, we are no longer crossing the memory limit, because the
changes were already serialized. So I am not sure whether it is a
good idea to stream the transaction as soon as we get the complete
changes, or whether we should wait until the next time the memory
limit is exceeded and select a suitable candidate then. Ideally, if
we are in streaming mode and the transaction is serialized, it was
already a candidate for streaming but could not be streamed due to the
incomplete changes, so shouldn't we stream it immediately as soon as
its changes are complete, even though we are now within the memory
limit? Because our target is to stream, not spill, we should try to
stream the spilled changes at the first opportunity.
> Can we move the changes related to the detection of
> incomplete data to a separate function?
Ok.
>
> Some more comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
>
> + else if (rbtxn_has_toast_insert(txn) &&
> + ChangeIsInsertOrUpdate(change->action))
> + {
> + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> + can_stream = true;
> + }
> ..
> +#define ChangeIsInsertOrUpdate(action) \
> + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
>
> How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
A partial toast insert means we have inserted into the toast table but
not into the main table. So even if it is a spec insert we can form
the complete tuple; however, we still cannot stream it because we
haven't got the spec_confirm, and for that we are marking another
flag. So if the insert is a spec insert, the toast inserts will also
be spec inserts, and as part of those toast spec inserts we are
marking the partial tuple, so clearing that flag should happen when
the spec insert is done for the main table, right?
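Put as code, the flag lifecycle being described is roughly this (a
sketch; the change-action names follow the quoted macro, and the exact
clearing points are an assumption for illustration):

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    /* The toast chunks are complete now, even for a spec insert... */
    toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
    /* ...but the tuple stays incomplete until spec_confirm arrives. */
    toptxn->txn_flags |= RBTXN_HAS_SPEC_INSERT;
    break;

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    /* The speculative insert is now complete on the main table. */
    toptxn->txn_flags &= ~RBTXN_HAS_SPEC_INSERT;
    break;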
> IIUC, the basic idea used to handle incomplete changes (which is
> possible in case of toast tuples and speculative inserts) is to mark
> such TXNs as containing incomplete changes and then while finding the
> largest top-level TXN for streaming, we ignore such TXN's and move to
> next largest TXN. If none of the TXNs have complete changes then we
> choose the largest (sub)transaction and spill the same to make the
> in-memory changes below logical_decoding_work_mem threshold. This
> idea can work but the strategy to choose the transaction is suboptimal
> for cases where TXNs have some changes which are complete followed by
> an incomplete toast or speculative tuple. I was having an offlist
> discussion with Robert on this problem and he suggested that it would
> be better if we track the complete part of changes separately and then
> we can avoid the drawback mentioned above. I have thought about this
> and I think it can work if we track the size and LSN of completed
> changes. I think we need to ensure that if there is concurrent abort
> then we discard all changes for current (sub)transaction not only up
> to completed changes LSN whereas if the streaming is successful then
> we can truncate the changes only up to completed changes LSN. What do
> you think?
>
> I wonder why you have done this as 0010 in the patch series; it should
> be 0006, after the
> 0005-Implement-streaming-mode-in-ReorderBuffer.patch. If we can do it
> that way, it would be easier for me to review. Is there a reason for
> not doing so?
No reason, I can do that. Actually, we can later merge the changes
into 0005 itself; I kept them separate for review. Anyway, in the
next version I will make it 0006.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-19 11:02:57 |
Message-ID: | CAA4eK1K=WoWrUh2fZVWzWB+Hg7Qs28o_qYQtfkGg0nNznZBaxg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > 3.
> > > + /*
> > > + * If streaming is enable and we have serialized this transaction because
> > > + * it had incomplete tuple. So if now we have got the complete tuple we
> > > + * can stream it.
> > > + */
> > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > > + {
> > >
> > > This comment is just saying what you are doing in the if-check. I
> > > think you need to explain the rationale behind it. I don't like the
> > > variable name 'can_stream' because it matches ReorderBufferCanStream
> > > whereas it is for a different purpose, how about naming it as
> > > 'change_complete' or something like that. The check has many
> > > conditions, can we move it to a separate function to make the code
> > > here look clean?
> > >
> >
> > Do we really need this? Immediately after this check, we are calling
> > ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> > required.
>
> Actually, ReorderBufferCheckMemoryLimit is only meant for checking
> whether we need to stream the changes due to the memory limit. But
> suppose that when the memory limit was exceeded we could not stream
> the transaction because there was an incomplete toast insert, so we
> serialized it. Now, when we get the tuple that makes the changes
> complete, we are no longer crossing the memory limit, because the
> changes were already serialized. So I am not sure whether it is a
> good idea to stream the transaction as soon as we get the complete
> changes, or whether we should wait until the next time the memory
> limit is exceeded and select a suitable candidate then.
>
I think it is better to wait until the next time we exceed the memory
threshold.
> Ideally, if we are in streaming mode and the transaction is
> serialized, it was already a candidate for streaming but could not be
> streamed due to the incomplete changes, so shouldn't we stream it
> immediately as soon as its changes are complete, even though we are
> now within the memory limit?
>
The only time we need to stream or spill is when we exceed the memory
threshold. In the above case, it is possible that next time there is
some other candidate transaction that we can stream.
> >
> > Some more comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
> >
> > + else if (rbtxn_has_toast_insert(txn) &&
> > + ChangeIsInsertOrUpdate(change->action))
> > + {
> > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> > + can_stream = true;
> > + }
> > ..
> > +#define ChangeIsInsertOrUpdate(action) \
> > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
> >
> > How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
>
> A partial toast insert means we have inserted into the toast table but
> not into the main table. So even if it is a spec insert we can form
> the complete tuple; however, we still cannot stream it because we
> haven't got the spec_confirm, and for that we are marking another
> flag. So if the insert is a spec insert, the toast inserts will also
> be spec inserts, and as part of those toast spec inserts we are
> marking the partial tuple, so clearing that flag should happen when
> the spec insert is done for the main table, right?
>
Sounds reasonable.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-19 12:04:04 |
Message-ID: | CAA4eK1+DjQA1c69w-RAypVz9VtYbMP0kYBtWu50wjhFWYPh8Dg@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
>
> > 4.
> > +static void
> > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_start";
> > + /* state.report_location = apply_lsn; */
> >
> > Why can't we supply the report_location here? I think here we need to
> > report txn->first_lsn if this is the very first stream and
> > txn->final_lsn if it is any consecutive one.
>
> Done
>
Now, after your change in stream_start_cb_wrapper, we assign
report_location from the first_lsn passed as input to the function,
but write_location is still txn->first_lsn. Shouldn't we assign the
passed-in first_lsn to write_location as well? It seems assigning
txn->first_lsn won't be correct for streams other than the first one.
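So the wrapper would use the passed-in LSN for both locations, roughly
like this (a sketch of the suggested fix, assuming the wrapper now
receives the stream's first LSN as a parameter):

/* Report and write the LSN of this particular stream, not the */
/* transaction's overall first_lsn.                             */
state.report_location = first_lsn;
ctx->write_location = first_lsn;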
> > 5.
> > +static void
> > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_stop";
> > + /* state.report_location = apply_lsn; */
> >
> > Can't we report txn->final_lsn here
>
> We are already setting this to txn->final_lsn in the 0006 patch, but I
> have moved it into this patch now.
>
Similar to the previous point, here also I think we need to assign the
report and write locations from the last_lsn passed to this API.
>
>
> > v20-0005-Implement-streaming-mode-in-ReorderBuffer
> > -----------------------------------------------------------------------------
> > 10.
> > Theoretically, we could get rid of the k-way merge, and append the
> > changes to the toplevel xact directly (and remember the position
> > in the list in case the subxact gets aborted later).
> >
> > I don't think this part of the commit message is correct as we
> > sometimes need to spill even during streaming. Please check the
> > entire commit message and update according to the latest
> > implementation.
>
> Done
>
You seem to have forgotten to remove the other part of the message
("This adds a second iterator for the streaming case ..."), which is
not relevant now.
> > 11.
> > - * HeapTupleSatisfiesHistoricMVCC.
> > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > + *
> > + * We do build the hash table even if there are no CIDs. That's
> > + * because when streaming in-progress transactions we may run into
> > + * tuples with the CID before actually decoding them. Think e.g. about
> > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > + * yet when applying the INSERT. So we build a hash table so that
> > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > + *
> > + * XXX We might limit this behavior to streaming mode, and just bail
> > + * out when decoding transaction at commit time (at which point it's
> > + * guaranteed to see all CIDs).
> > */
> > static void
> > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > *rb, ReorderBufferTXN *txn)
> > dlist_iter iter;
> > HASHCTL hash_ctl;
> >
> > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > - return;
> > -
> >
> > I don't understand this change. Why would "INSERT followed by
> > TRUNCATE" could lead to a tuple which can come for decode before its
> > CID? The patch has made changes based on this assumption in
> > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> > behavior could be dependent on whether we are streaming the changes
> > for in-progress xact or at the commit of a transaction. We might want
> > to generate a test to once validate this behavior.
> >
> > Also, the comment refers to tqual.c which is wrong as this API is now
> > in heapam_visibility.c.
>
> Done.
>
+ * INSERT. So in such cases we assume the CIDs is from the future command
+ * and return as unresolve.
+ */
+ if (tuplecid_data == NULL)
+ return false;
+
Here let's reword the last line of the comment as ". So in such cases
we assume the CID is from the future command."
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-19 12:31:36 |
Message-ID: | CAA4eK1+do3WbgGWKvzpFwGEeTw4qFj8ZhEw75rDMNEJgLRk0SQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
>
> > 3.
> > And, during catalog scan we can check the status of the xid and
> > + * if it is aborted we will report a specific error that we can ignore. We
> > + * might have already streamed some of the changes for the aborted
> > + * (sub)transaction, but that is fine because when we decode the abort we will
> > + * stream abort message to truncate the changes in the subscriber.
> > + */
> > +static inline void
> > +SetupCheckXidLive(TransactionId xid)
> >
> > In the above comment, I don't think it is right to say that we ignore
> > the error raised due to the aborted transaction. We need to say that
> > we discard the already streamed changes on such an error.
>
> Done.
>
In the same comment, there is a typo (/messageto/message to).
> > 4.
> > +static inline void
> > +SetupCheckXidLive(TransactionId xid)
> > +{
> > /*
> > - * If this transaction has no snapshot, it didn't make any changes to the
> > - * database, so there's nothing to decode. Note that
> > - * ReorderBufferCommitChild will have transferred any snapshots from
> > - * subtransactions if there were any.
> > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > + * aborted. That will happen during catalog access. Also reset the
> > + * sysbegin_called flag.
> > */
> > - if (txn->base_snapshot == NULL)
> > + if (!TransactionIdDidCommit(xid))
> > {
> > - Assert(txn->ninvalidations == 0);
> > - ReorderBufferCleanupTXN(rb, txn);
> > - return;
> > + CheckXidAlive = xid;
> > + bsysscan = false;
> > }
> >
> > I think this function is inline as it needs to be called for each
> > change. If that is the case and otherwise also, isn't it better that
> > we check if passed xid is the same as CheckXidAlive before checking
> > TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> > calling it for each change might not be a good idea?
>
> Done, Also I think it is good the check the TransactionIdIsInProgress
> instead of !TransactionIdDidCommit. I have changed that as well.
>
What if it is aborted just before this check? I think the decode API
won't be able to detect that and sys* API won't care to check because
CheckXidAlive won't be set for that case.
> > 5.
> > setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > + * aborted. That will happen during catalog access. Also reset the
> > + * sysbegin_called flag.
> >
> > /if the xid aborted/if the xid is aborted. missing comma after Also.
>
> Done
>
You forgot to change as per the second part of the comment (missing
comma after Also).
>
> > 8.
> > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > * use as a normal record. It'll be cleaned up at the end
> > * of INSERT processing.
> > */
> > - if (specinsert == NULL)
> > - elog(ERROR, "invalid ordering of speculative insertion changes");
> >
> > You have removed this check but all other handling of specinsert is
> > same as far as this patch is concerned. Why so?
>
> Seems like a merge issue, or the leftover from the old design of the
> toast handling where we were streaming with the partial tuple.
> fixed now.
>
> > 9.
> > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > * freed/reused while restoring spooled data from
> > * disk.
> > */
> > - Assert(change->data.tp.newtuple != NULL);
> > -
> > dlist_delete(&change->node);
> >
> > Why is this Assert removed?
>
> Same cause as above so fixed.
>
> > 10.
> > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > relations[nrelations++] = relation;
> > }
> >
> > - rb->apply_truncate(rb, txn, nrelations, relations, change);
> > + if (streaming)
> > + {
> > + rb->stream_truncate(rb, txn, nrelations, relations, change);
> > +
> > + /* Remember that we have sent some data. */
> > + change->txn->any_data_sent = true;
> > + }
> > + else
> > + rb->apply_truncate(rb, txn, nrelations, relations, change);
> >
> > Can we encapsulate this in a separate function like
> > ReorderBufferApplyTruncate or something like that? Basically, rather
> > than having streaming check in this function, lets do it in some other
> > internal function. And we can likewise do it for all the streaming
> > checks in this function or at least whereever it is feasible. That
> > will make this function look clean.
>
> Done for truncate and change. I think we can create a few more such
> functions for
> start/stop and cleanup handling on error. I will work on that.
>
Yeah, I think that would be better.
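For illustration, a minimal sketch of such a wrapper (the name is the
one suggested above; the exact signature here is my assumption):

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation *relations,
                           ReorderBufferChange *change, bool streaming)
{
    if (streaming)
    {
        rb->stream_truncate(rb, txn, nrelations, relations, change);

        /* Remember that we have sent some data. */
        change->txn->any_data_sent = true;
    }
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}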
One minor comment change suggestion:
/*
+ * start stream or begin the transaction. If this is the first
+ * change in the current stream.
+ */
We can write the above comment as "Start the stream or begin the
transaction for the first change in the current stream."
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-22 06:24:42 |
Message-ID: | CAA4eK1+aHfEafWOtjsbx8M6LacPCZtPTyfFrsk7kjZG854yB3Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
I have further reviewed v22 and below are my comments:
v22-0005-Implement-streaming-mode-in-ReorderBuffer
--------------------------------------------------------------------------
1.
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
The above 'Note' is not correct as per the latest implementation.
v22-0006-Add-support-for-streaming-to-built-in-replicatio
----------------------------------------------------------------------------
2.
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
*
*-------------------------------------------------------------------------
*/
-
#include "postgres.h"
Spurious line removal.
3.
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
+
+ Assert(TransactionIdIsValid(txn->xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, txn->xid);
The part of the comment "we're starting to stream, so must be valid"
is not correct as we are not at the start of the stream here. The
patch has used the same incorrect sentence in a few places; kindly fix
those as well.
4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..
For this and other places in the patch, like in function
stream_open_file(), instead of using TopMemoryContext, can we consider
using a new memory context LogicalStreamingContext or something like
that? We can create LogicalStreamingContext under TopMemoryContext. I
don't see any need for using TopMemoryContext directly here.
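For illustration, a minimal sketch of what I have in mind (only the
context name is the one proposed; the helper is made up):

static MemoryContext LogicalStreamingContext = NULL;

/* Sketch: create the streaming context lazily, under TopMemoryContext. */
static void
ensure_streaming_context(void)
{
    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext =
            AllocSetContextCreate(TopMemoryContext,
                                  "LogicalStreamingContext",
                                  ALLOCSET_DEFAULT_SIZES);
}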
5.
+static void
+subxact_info_add(TransactionId xid)
This function assumes valid values for global variables like
stream_fd and stream_xid. I think it is better to have Asserts for
those in this function before using them. The Asserts for those are
present in handle_streamed_transaction, but I feel they should be in
subxact_info_add.
6.
+subxact_info_add(TransactionId xid)
/*
+ * In most cases we're checking the same subxact as we've already seen in
+ * the last call, so make ure just ignore it (this change comes later).
+ */
+ if (subxact_last == xid)
+ return;
Typo and minor correction, /ure just/sure to
7.
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * But we free the memory allocated for subxact info. There might be one
+ * exceptional transaction with many subxacts, and we don't want to keep
+ * the memory allocated forewer.
+ *
+ */
a. Typo, /forewer/forever
b. The extra line at the end of the comment is not required.
8.
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
Do we really need to have a checksum for temporary files? I have
checked a few other similar cases like the SharedFileSet stuff for
parallel hash join but didn't find them using checksums. Can you also
look at other usages of temporary files and then let us decide if we
see any reason to have checksums for this?
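For reference, the checksum in question amounts to something like this
(a sketch; SubXactInfo is the struct name I am assuming from the patch):

/* Sketch: compute CRC32C over the in-memory subxact array. */
static pg_crc32c
subxact_info_checksum(SubXactInfo *subxacts, uint32 nsubxacts)
{
    pg_crc32c   checksum;

    INIT_CRC32C(checksum);
    COMP_CRC32C(checksum, subxacts, nsubxacts * sizeof(SubXactInfo));
    FIN_CRC32C(checksum);

    return checksum;
}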
Another point is that we don't seem to be doing this for the 'changes'
file, see stream_write_change. So I am not sure there is any sense in
writing a checksum for the subxact file.
Tomas, do you see any reason for the same?
9.
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ */
+ if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ tempdirpath)));
+
+ snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+ tempdirpath, subid, xid);
+}
Temporary files created in PGDATA/base/pgsql_tmp follow a certain
naming convention (see docs[1]) which is not followed here. You can
also refer to SharedFileSetPath and OpenTemporaryFile. I think we can
just try to follow that convention and then additionally append subid,
xid and .subxacts. Also, a similar change is required for
changes_filename. I would like to know if there is a reason why we
want to use a different naming convention here.
10.
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)
The comment seems to be wrong. I think this can be only called at
stream end, so it should be "This can only be called at the end of a
"streaming" block, i.e. at stream_stop message from the upstream."
11.
+ * the order the transactions are sent in. So streamed trasactions are
+ * handled separately by using schema_sent flag in ReorderBufferTXN.
+ *
* For partitions, 'pubactions' considers not only the table's own
* publications, but also those of all of its ancestors.
*/
typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
-
+ TransactionId xid; /* transaction that created the record */
/*
* Did we send the schema? If ancestor relid is set, its schema must also
* have been sent for this to be true.
*/
bool schema_sent;
+ List *streamed_txns; /* streamed toplevel transactions with this
+ * schema */
The part of comment "So streamed trasactions are handled separately by
using schema_sent flag in ReorderBufferTXN." doesn't seem to match
with what we are doing in the latest version of the patch.
12.
maybe_send_schema()
{
..
+ if (in_streaming)
+ {
+ /*
+ * TOCHECK: We have to send schema after each catalog change and it may
+ * occur when streaming already started, so we have to track new catalog
+ * changes somehow.
+ */
+ schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
..
..
}
I think it is good to verify/test what this comment says, but as per
the code we should be sending the schema after each catalog change, as
we invalidate the streamed_txns list in rel_sync_cache_relation_cb,
which must be called during relcache invalidation. Do we see any
problem with that mechanism?
13.
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * it's subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
This comment is copied from pgoutput_stream_abort, so doesn't match
what this function is doing.
[1] - https://www.postgresql.org/docs/devel/storage-file-layout.html
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-22 11:16:42 |
Message-ID: | CAA4eK1Jr=p9v9Wqaph1nrCkqJxA=bHawypsn7oTEE_SvcwhjRg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> v22-0006-Add-support-for-streaming-to-built-in-replicatio
> ----------------------------------------------------------------------------
>
Few more comments on v22-0006 patch:
1.
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+ int i;
+ char path[MAXPGPATH];
+ bool found = false;
+
+ subxact_filename(path, subid, xid);
+
+ if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
Here, we have unlinked the files containing information of subxacts
but don't we need to free the corresponding memory (memory for
subxacts) as well?
2.
apply_handle_stream_abort()
{
..
+ subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+
+ return;
..
}
Like the previous comment, it seems here also we need to free subxacts
memory and additionally we forgot to adjust the xids array as well.
3.
apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return;
..
}
Is it possible that we don't find the xid in the subxacts array? If
so, I think we should mention that in the comments; otherwise, we
should have an assert for 'found'.
4.
apply_handle_stream_abort()
{
..
+ changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (truncate(path, subxacts[subidx].offset))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m", path)));
..
}
Will truncate work on Windows? I see in the code we use ftruncate,
which is defined as chsize in win32.h and win32_port.h. I have not
tested this so I am not very sure about it. I got the below warning
when I tried to compile this code on Windows. I think it is better to
use ftruncate as it is used at other places in the code as well.
worker.c(798): warning C4013: 'truncate' undefined; assuming extern
returning int
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-22 12:51:36 |
Message-ID: | CAFiTN-t8jD_LMtm60kjbezhT3GouwO-WWTRwuWpPLj7F7iStRQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
>
> Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> 1.
> + /*
> + * If this is a toast insert then set the corresponding bit. Otherwise, if
> + * we have toast insert bit set and this is insert/update then clear the
> + * bit.
> + */
> + if (toast_insert)
> + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> + else if (rbtxn_has_toast_insert(txn) &&
> + ChangeIsInsertOrUpdate(change->action))
> + {
>
> Here, it might better to add a comment on why we expect only
> Insert/Update? Also, it might be better that we add an assert for
> other operations.
I have added comments explaining why we clear the flag on
Insert/Update. But I don't think we only expect insert/update; we
might get a toast delete too, right? Because for a toast update we
will do a toast delete + toast insert. So when we get a toast delete
we just don't want to do anything.
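In other words, the flag handling now looks roughly like this (a
sketch based on the snippet above; the toast-delete case deliberately
falls through doing nothing):

if (toast_insert)
    toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
else if (rbtxn_has_toast_insert(txn) &&
         ChangeIsInsertOrUpdate(change->action))
{
    /* The insert/update completing a pending toast insert has arrived. */
    toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
}
/* A toast delete (the first half of a toast update) is ignored here. */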
>
> 2.
> @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
> * disk.
> */
> dlist_delete(&change->node);
> - ReorderBufferToastAppendChunk(rb, txn, relation,
> - change);
> + ReorderBufferToastAppendChunk(rb, txn, relation,
> + change);
> }
>
> This seems to be a spurious change.
Done
> 3.
> + /*
> + * If streaming is enable and we have serialized this transaction because
> + * it had incomplete tuple. So if now we have got the complete tuple we
> + * can stream it.
> + */
> + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> + {
>
> This comment is just saying what you are doing in the if-check. I
> think you need to explain the rationale behind it. I don't like the
> variable name 'can_stream' because it matches ReorderBufferCanStream
> whereas it is for a different purpose, how about naming it as
> 'change_complete' or something like that. The check has many
> conditions, can we move it to a separate function to make the code
> here look clean?
As per the other comments we have removed this part in the latest patch set.
Apart from these comment fixes, there are 2 more changes:
1. Handling of the toast tuple is changed as per the offlist
discussion with you. Basically, now, instead of not streaming the txn
with the incomplete tuple, we stream it up to the last complete lsn.
So if the txn has incomplete changes but its complete size is the
largest, then we will stream it, and after streaming we will truncate
the transaction up to the last complete lsn.
2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-22 12:51:50 |
Message-ID: | CAFiTN-sC=igkzewbcPUROXd4xP_bsFRe1=VVeRGOHB-jB2sssw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > Review comments:
> > > ------------------------------
> > > 1.
> > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > TransactionId xid,
> > > }
> > >
> > > case REORDER_BUFFER_CHANGE_MESSAGE:
> > > - rb->message(rb, txn, change->lsn, true,
> > > - change->data.msg.prefix,
> > > - change->data.msg.message_size,
> > > - change->data.msg.message);
> > > + if (streaming)
> > > + rb->stream_message(rb, txn, change->lsn, true,
> > > + change->data.msg.prefix,
> > > + change->data.msg.message_size,
> > > + change->data.msg.message);
> > > + else
> > > + rb->message(rb, txn, change->lsn, true,
> > > + change->data.msg.prefix,
> > > + change->data.msg.message_size,
> > > + change->data.msg.message);
> > >
> > > Don't we need to set any_data_sent flag while streaming messages as we
> > > do for other types of changes?
> >
> > I think any_data_sent, was added to avoid sending abort to the
> > subscriber if we haven't sent any data, but this is not complete as
> > the output plugin can also take the decision not to send. So I think
> > this should not be done as part of this patch and can be done
> > separately. I think there is already a thread for handling the
> > same[1]
> >
>
> Hmm, but prior to this patch, we never use to send (empty) aborts but
> now that will be possible. It is probably okay to deal that with
> another patch mentioned by you but I felt at least any_data_sent will
> work for some cases. OTOH, it appears to be half-baked solution, so
> we should probably refrain from adding it. BTW, how do the pgoutput
> plugin deal with it? I see that apply_handle_stream_abort will
> unconditionally try to unlink the file and it will probably fail.
> Have you tested this scenario after your latest changes?
Yeah, I see; I think this is a problem, but it exists without my
latest change as well: if pgoutput ignores some changes because they
are not published, then we will see a similar error. Shall we handle
the ENOENT error case from unlink? I think the best idea is that we
should track the empty transaction.
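i.e., something like this in apply_handle_stream_abort (a sketch; it
just treats a missing file as a streamed-empty transaction):

/* Tolerate a missing file: we may never have streamed any data. */
if (unlink(path) < 0 && errno != ENOENT)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not remove file \"%s\": %m", path)));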
> > > 4.
> > > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> > > the try and catch block. If there is an error after calling it in a
> > > try block, we might call it again via catch. I think that will lead
> > > to sending a stop message twice. Won't that be a problem? See the
> > > usage of iterstate in the catch block, we have made it safe from a
> > > similar problem.
> >
> > IMHO, we don't need that, because we only call stream_stop in the
> > catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if
> > in TRY block we have already stopped the stream then we should not get
> > that error. I have added the comments for the same.
> >
>
> I am still slightly nervous about it as I don't see any solid
> guarantee for the same. You are right as the code stands today but
> due to any code that gets added in the future, it might not remain
> true. I feel it is better to have an Assert here to ensure that
> stream_stop won't be called the second time. I don't see any good way
> of doing it other than by maintaining flag or some state but I think
> it will be good to ensure this.
Done
> > > 6.
> > > PG_CATCH();
> > > {
> > > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> > > + ErrorData *errdata = CopyErrorData();
> > >
> > > I don't understand the usage of memory context in this part of the
> > > code. Basically, you are switching to CurrentMemoryContext here, do
> > > some error handling and then again reset back to some random context
> > > before rethrowing the error. If there is some purpose for it, then it
> > > might be better if you can write a few comments to explain the same.
> >
> > Basically, the ccxt is the CurrentMemoryContext when we started the
> > streaming and ecxt it the context when we catch the error. So
> > ideally, before this change, it will rethrow in the context when we
> > catch the error i.e. ecxt. So what we are trying to do is put it back
> > to normal context (ccxt) and copy the error data in the normal
> > context. And, if we are not handling it gracefully then put it back
> > to the context it was in, and rethrow.
> >
>
> Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
> we need to clean up the reorderbuffer by calling
> ReorderBufferCleanupTXN? If so, then you can try to combine it with
> the not-streaming else loop.
Done
> > > 8.
> > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > *rb, TransactionId xid,
> > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > >
> > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > +
> > > + /*
> > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > + * if one of its children has.
> > > + */
> > > + if (txn->toptxn != NULL)
> > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > }
> > >
> > > Why are we marking top transaction here?
> >
> > We need to mark top transaction to decide whether to build tuplecid
> > hash or not. In non-streaming mode, we are only sending during the
> > commit time, and during commit time we know whether the top
> > transaction has any catalog changes or not based on the invalidation
> > message so we are marking the top transaction there in DecodeCommit.
> > Since here we are not waiting till commit so we need to mark the top
> > transaction as soon as we mark any of its child transactions.
> >
>
> But how does it help? We use this flag (via
> ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> anyway done in DecodeCommit and that too after setting this flag for
> the top transaction if required. So, how will it help in setting it
> while processing for subxid. Also, even if we have to do it won't it
> add the xid needlessly in builder->committed.xip array?
In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
or not to build the tuplecid hash, based on whether the transaction
has catalog changes.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-22 12:52:08 |
Message-ID: | CAFiTN-t6usE6=yxNEsyduevCRY5OSgaedun8o_MYN+_vgNxOUg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 19, 2020 at 4:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > >
> > > > 3.
> > > > + /*
> > > > + * If streaming is enable and we have serialized this transaction because
> > > > + * it had incomplete tuple. So if now we have got the complete tuple we
> > > > + * can stream it.
> > > > + */
> > > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > > > + {
> > > >
> > > > This comment is just saying what you are doing in the if-check. I
> > > > think you need to explain the rationale behind it. I don't like the
> > > > variable name 'can_stream' because it matches ReorderBufferCanStream
> > > > whereas it is for a different purpose, how about naming it as
> > > > 'change_complete' or something like that. The check has many
> > > > conditions, can we move it to a separate function to make the code
> > > > here look clean?
> > > >
> > >
> > > Do we really need this? Immediately after this check, we are calling
> > > ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> > > required.
> >
> > Actually, ReorderBufferCheckMemoryLimit is only meant for checking
> > whether we need to stream the changes due to the memory limit. But
> > suppose when memory limit exceeds that time we could not stream the
> > transaction because there was only incomplete toast insert so we
> > serialized. Now, when we get the tuple which makes the changes
> > complete but now it is not crossing the memory limit as changes were
> > already serialized. So I am not sure whether it is a good idea to
> > stream the transaction as soon as we get the complete changes or we
> > shall wait till next time memory limit exceed and that time we select
> > the suitable candidate.
> >
>
> I think it is better to wait till next time we exceed the memory threshold.
Okay, done this way.
> > Ideally, we were are in streaming more and
> > the transaction is serialized means it was already a candidate for
> > streaming but could not stream due to the incomplete changes so
> > shouldn't we stream it immediately as soon as its changes are complete
> > even though now we are in memory limit.
> >
>
> The only time we need to stream or spill is when we exceed memory
> threshold. In the above case, it is possible that next time there is
> some other candidate transaction that we can stream.
>
> > >
> > > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
> > >
> > > + else if (rbtxn_has_toast_insert(txn) &&
> > > + ChangeIsInsertOrUpdate(change->action))
> > > + {
> > > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> > > + can_stream = true;
> > > + }
> > > ..
> > > +#define ChangeIsInsertOrUpdate(action) \
> > > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> > > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> > > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
> > >
> > > How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> > > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
> >
> > Partial toast insert means we have inserted in the toast but not in
> > the main table. So even if it is spec insert we can form the complete
> > tuple, however, we can still not stream it because we haven't got
> > spec_confirm but for that, we are marking another flag. So if the
> > insert is aspect insert the toast insert will also be spec insert and
> > as part of that toast, spec inserts we are marking partial tuple so
> > cleaning that flag should happen when the spec insert is done for the
> > main table right?
> >
>
> Sounds reasonable.
ok
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-22 12:52:22 |
Message-ID: | CAFiTN-tezRJkT770iz4pdjb5_hsH9X=ZFfqPLLzU4EcziqCNgw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 19, 2020 at 5:34 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> >
> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here? I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > Done
> >
>
> Now after your change in stream_start_cb_wrapper, we assign
> report_location as first_lsn passed as input to function but
> write_location is still txn->first_lsn. Shouldn't we assing passed in
> first_lsn to write_location? It seems assigning txn->first_lsn won't
> be correct for streams other than first-one.
Done
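i.e., in stream_start_cb_wrapper both locations now come from the
first_lsn passed to the wrapper (a sketch of the two assignments):

state.report_location = first_lsn;

/* write_location tracks this stream chunk, not txn->first_lsn. */
ctx->write_location = first_lsn;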
>
> > > 5.
> > > +static void
> > > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_stop";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Can't we report txn->final_lsn here
> >
> > We are already setting this to the txn->final_ls in 0006 patch, but I
> > have moved it into this patch now.
> >
>
> Similar to previous point, here also, I think we need to assign report
> and write location as last_lsn passed to this API.
Done
> >
> > > v20-0005-Implement-streaming-mode-in-ReorderBuffer
> > > -----------------------------------------------------------------------------
> > > 10.
> > > Theoretically, we could get rid of the k-way merge, and append the
> > > changes to the toplevel xact directly (and remember the position
> > > in the list in case the subxact gets aborted later).
> > >
> > > I don't think this part of the commit message is correct as we
> > > sometimes need to spill even during streaming. Please check the
> > > entire commit message and update according to the latest
> > > implementation.
> >
> > Done
> >
>
> You seem to forgot about removing the other part of message ("This
> adds a second iterator for the streaming case...." which is not
> relavant now.
Done
> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > > */
> > > static void
> > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > > dlist_iter iter;
> > > HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change. Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID? The patch has made changes based on this assumption in
> > > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> > > behavior could be dependent on whether we are streaming the changes
> > > for in-progress xact or at the commit of a transaction. We might want
> > > to generate a test to once validate this behavior.
> > >
> > > Also, the comment refers to tqual.c which is wrong as this API is now
> > > in heapam_visibility.c.
> >
> > Done.
> >
>
> + * INSERT. So in such cases we assume the CIDs is from the future command
> + * and return as unresolve.
> + */
> + if (tuplecid_data == NULL)
> + return false;
> +
>
> Here let's reword the last line of the comment as ". So in such cases we
> assume the CID is from the future command."
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-22 12:52:41 |
Message-ID: | CAFiTN-smjthT3fD6gipK9YaEvRRLgvL1P1zA-0jOY6cO8oVOUg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> >
> > > 3.
> > > And, during catalog scan we can check the status of the xid and
> > > + * if it is aborted we will report a specific error that we can ignore. We
> > > + * might have already streamed some of the changes for the aborted
> > > + * (sub)transaction, but that is fine because when we decode the abort we will
> > > + * stream abort message to truncate the changes in the subscriber.
> > > + */
> > > +static inline void
> > > +SetupCheckXidLive(TransactionId xid)
> > >
> > > In the above comment, I don't think it is right to say that we ignore
> > > the error raised due to the aborted transaction. We need to say that
> > > we discard the already streamed changes on such an error.
> >
> > Done.
> >
>
> In the same comment, there is a typo (/messageto/message to).
Done
> > > 4.
> > > +static inline void
> > > +SetupCheckXidLive(TransactionId xid)
> > > +{
> > > /*
> > > - * If this transaction has no snapshot, it didn't make any changes to the
> > > - * database, so there's nothing to decode. Note that
> > > - * ReorderBufferCommitChild will have transferred any snapshots from
> > > - * subtransactions if there were any.
> > > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > > + * aborted. That will happen during catalog access. Also reset the
> > > + * sysbegin_called flag.
> > > */
> > > - if (txn->base_snapshot == NULL)
> > > + if (!TransactionIdDidCommit(xid))
> > > {
> > > - Assert(txn->ninvalidations == 0);
> > > - ReorderBufferCleanupTXN(rb, txn);
> > > - return;
> > > + CheckXidAlive = xid;
> > > + bsysscan = false;
> > > }
> > >
> > > I think this function is inline as it needs to be called for each
> > > change. If that is the case and otherwise also, isn't it better that
> > > we check if passed xid is the same as CheckXidAlive before checking
> > > TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> > > calling it for each change might not be a good idea?
> >
> > Done, Also I think it is good the check the TransactionIdIsInProgress
> > instead of !TransactionIdDidCommit. I have changed that as well.
> >
>
> What if it is aborted just before this check? I think the decode API
> won't be able to detect that and sys* API won't care to check because
> CheckXidAlive won't be set for that case.
Yeah, that's the problem; I think it should be TransactionIdDidCommit only.
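For reference, the check then stays as in the earlier version of the
patch (sketch):

static inline void
SetupCheckXidLive(TransactionId xid)
{
    /*
     * If the xid aborts just after this check, CheckXidAlive is still
     * set, so the catalog access will still detect the abort; with
     * TransactionIdIsInProgress that window would be missed.
     */
    if (!TransactionIdDidCommit(xid))
    {
        CheckXidAlive = xid;
        bsysscan = false;
    }
}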
> > > 5.
> > > setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > > + * aborted. That will happen during catalog access. Also reset the
> > > + * sysbegin_called flag.
> > >
> > > /if the xid aborted/if the xid is aborted. missing comma after Also.
> >
> > Done
> >
>
> You forgot to change as per the second part of the comment (missing
> comma after Also).
Done
> > > 8.
> > > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > * use as a normal record. It'll be cleaned up at the end
> > > * of INSERT processing.
> > > */
> > > - if (specinsert == NULL)
> > > - elog(ERROR, "invalid ordering of speculative insertion changes");
> > >
> > > You have removed this check but all other handling of specinsert is
> > > same as far as this patch is concerned. Why so?
> >
> > Seems like a merge issue, or the leftover from the old design of the
> > toast handling where we were streaming with the partial tuple.
> > fixed now.
> >
> > > 9.
> > > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > * freed/reused while restoring spooled data from
> > > * disk.
> > > */
> > > - Assert(change->data.tp.newtuple != NULL);
> > > -
> > > dlist_delete(&change->node);
> > >
> > > Why is this Assert removed?
> >
> > Same cause as above so fixed.
> >
> > > 10.
> > > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > relations[nrelations++] = relation;
> > > }
> > >
> > > - rb->apply_truncate(rb, txn, nrelations, relations, change);
> > > + if (streaming)
> > > + {
> > > + rb->stream_truncate(rb, txn, nrelations, relations, change);
> > > +
> > > + /* Remember that we have sent some data. */
> > > + change->txn->any_data_sent = true;
> > > + }
> > > + else
> > > + rb->apply_truncate(rb, txn, nrelations, relations, change);
> > >
> > > Can we encapsulate this in a separate function like
> > > ReorderBufferApplyTruncate or something like that? Basically, rather
> > > than having streaming check in this function, lets do it in some other
> > > internal function. And we can likewise do it for all the streaming
> > > checks in this function or at least whereever it is feasible. That
> > > will make this function look clean.
> >
> > Done for truncate and change. I think we can create a few more such
> > functions for
> > start/stop and cleanup handling on error. I will work on that.
> >
>
> Yeah, I think that would be better.
I have done some refactoring, please look into the latest version.
> One minor comment change suggestion:
> /*
> + * start stream or begin the transaction. If this is the first
> + * change in the current stream.
> + */
>
> We can write the above comment as "Start the stream or begin the
> transaction for the first change in the current stream."
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-25 14:37:36 |
Message-ID: | CAFiTN-tW35oa_xNNMTKJ7LPLtaanjSwaEC15LrAixHi-i=DBmw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
>
> I have further reviewed v22 and below are my comments:
>
> v22-0005-Implement-streaming-mode-in-ReorderBuffer
> --------------------------------------------------------------------------
> 1.
> + * Note: We never do both stream and serialize a transaction (we only spill
> + * to disk when streaming is not supported by the plugin), so only one of
> + * those two flags may be set at any given time.
> + */
> +#define rbtxn_is_streamed(txn) \
> +( \
> + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
> +)
>
> The above 'Note' is not correct as per the latest implementation.
That is removed in 0010 in the latest version; you can see it in 0006.
> v22-0006-Add-support-for-streaming-to-built-in-replicatio
> ----------------------------------------------------------------------------
> 2.
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -14,7 +14,6 @@
> *
> *-------------------------------------------------------------------------
> */
> -
> #include "postgres.h"
>
> Spurious line removal.
Fixed
> 3.
> +void
> +logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
> + XLogRecPtr commit_lsn)
> +{
> + uint8 flags = 0;
> +
> + pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
> +
> + Assert(TransactionIdIsValid(txn->xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, txn->xid);
>
> The part of the comment "we're starting to stream, so must be valid"
> is not correct as we are not at the start of the stream here. The
> patch has used the same incorrect sentence in a few places; kindly fix
> those as well.
I have removed that part of the comment.
> 4.
> + * XXX Do we need to allocate it in TopMemoryContext?
> + */
> +static void
> +subxact_info_add(TransactionId xid)
> {
> ..
>
> For this and other places in the patch, like in function
> stream_open_file(), instead of using TopMemoryContext, can we consider
> using a new memory context LogicalStreamingContext or something like
> that? We can create LogicalStreamingContext under TopMemoryContext. I
> don't see any need for using TopMemoryContext directly here.
But when will we delete/reset the LogicalStreamingContext? Because we
are planning to keep this memory as long as the worker is alive, it is
supposed to be in the top memory context. If we create any other
context with the same life span as TopMemoryContext, then what is the
point? Am I missing something?
> 5.
> +static void
> +subxact_info_add(TransactionId xid)
>
> This function assumes valid values for global variables like
> stream_fd and stream_xid. I think it is better to have Asserts for
> those in this function before using them. The Asserts for those are
> present in handle_streamed_transaction, but I feel they should be in
> subxact_info_add.
Done
> 6.
> +subxact_info_add(TransactionId xid)
> /*
> + * In most cases we're checking the same subxact as we've already seen in
> + * the last call, so make ure just ignore it (this change comes later).
> + */
> + if (subxact_last == xid)
> + return;
>
> Typo and minor correction, /ure just/sure to
Done
> 7.
> +subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + /*
> + * But we free the memory allocated for subxact info. There might be one
> + * exceptional transaction with many subxacts, and we don't want to keep
> + * the memory allocated forewer.
> + *
> + */
>
> a. Typo, /forewer/forever
> b. The extra line at the end of the comment is not required.
Done
> 8.
> + * XXX Maybe we should only include the checksum when the cluster is
> + * initialized with checksums?
> + */
> +static void
> +subxact_info_write(Oid subid, TransactionId xid)
>
> Do we really need to have a checksum for temporary files? I have
> checked a few other similar cases like the SharedFileSet stuff for
> parallel hash join but didn't find them using checksums. Can you also
> look at other usages of temporary files and then let us decide if we
> see any reason to have checksums for this?
Yeah, I can also see that checksums are not used in other similar places.
>
> Another point is that we don't seem to be doing this for the 'changes'
> file, see stream_write_change. So I am not sure there is any sense in
> writing a checksum for the subxact file.
I can see there is a comment atop this function:
* XXX The subxact file includes CRC32C of the contents. Maybe we should
* include something like that here too, but doing so will not be as
* straighforward, because we write the file in chunks.
>
> Tomas, do you see any reason for the same?
> 9.
> +subxact_filename(char *path, Oid subid, TransactionId xid)
> +{
> + char tempdirpath[MAXPGPATH];
> +
> + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
> +
> + /*
> + * We might need to create the tablespace's tempfile directory, if no
> + * one has yet done so.
> + */
> + if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not create directory \"%s\": %m",
> + tempdirpath)));
> +
> + snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
> + tempdirpath, subid, xid);
> +}
>
> Temporary files created in PGDATA/base/pgsql_tmp follow a certain
> naming convention (see docs[1]) which is not followed here. You can
> also refer to SharedFileSetPath and OpenTemporaryFile. I think we can
> just try to follow that convention and then additionally append subid,
> xid and .subxacts. Also, a similar change is required for
> changes_filename. I would like to know if there is a reason why we
> want to use a different naming convention here.
I have changed it to this: pgsql_tmpPID-subid-xid.subxacts.
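i.e., roughly (a sketch; PG_TEMP_FILE_PREFIX is "pgsql_tmp" and
MyProcPid is the backend's PID):

snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
         tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);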
> 10.
> + * This can only be called at the beginning of a "streaming" block, i.e.
> + * between stream_start/stream_stop messages from the upstream.
> + */
> +static void
> +stream_close_file(void)
>
> The comment seems to be wrong. I think this can be only called at
> stream end, so it should be "This can only be called at the end of a
> "streaming" block, i.e. at stream_stop message from the upstream."
Right, I have fixed it.
> 11.
> + * the order the transactions are sent in. So streamed trasactions are
> + * handled separately by using schema_sent flag in ReorderBufferTXN.
> + *
> * For partitions, 'pubactions' considers not only the table's own
> * publications, but also those of all of its ancestors.
> */
> typedef struct RelationSyncEntry
> {
> Oid relid; /* relation oid */
> -
> + TransactionId xid; /* transaction that created the record */
> /*
> * Did we send the schema? If ancestor relid is set, its schema must also
> * have been sent for this to be true.
> */
> bool schema_sent;
> + List *streamed_txns; /* streamed toplevel transactions with this
> + * schema */
>
> The part of comment "So streamed trasactions are handled separately by
> using schema_sent flag in ReorderBufferTXN." doesn't seem to match
> with what we are doing in the latest version of the patch.
Yeah, it's wrong, I have fixed it.
> 12.
> maybe_send_schema()
> {
> ..
> + if (in_streaming)
> + {
> + /*
> + * TOCHECK: We have to send schema after each catalog change and it may
> + * occur when streaming already started, so we have to track new catalog
> + * changes somehow.
> + */
> + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> ..
> ..
> }
>
> I think it is good to verify/test what this comment says, but as per
> the code we should be sending the schema after each catalog change, as
> we invalidate the streamed_txns list in rel_sync_cache_relation_cb,
> which must be called during relcache invalidation. Do we see any
> problem with that mechanism?
I have tested this; I think we are already sending the schema after
each catalog change.
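For reference, the lookup itself is just a membership test on the
per-entry list (a sketch, assuming the toplevel xids are stored as
ints in the list):

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        if ((TransactionId) lfirst_int(lc) == xid)
            return true;
    }

    return false;
}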
> 13.
> +/*
> + * Notify downstream to discard the streamed transaction (along with all
> + * it's subtransactions, if it's a toplevel transaction).
> + */
> +static void
> +pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + XLogRecPtr commit_lsn)
>
> This comment is copied from pgoutput_stream_abort, so doesn't match
> what this function is doing.
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v24.tar | application/x-tar | 577.0 KB |
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-25 14:37:49 |
Message-ID: | CAFiTN-sts45LRr_Aj=e0McLTug0ooqG+u9i_d02xYjq9Q0zhJw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 22, 2020 at 4:46 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > v22-0006-Add-support-for-streaming-to-built-in-replicatio
> > ----------------------------------------------------------------------------
> >
> Few more comments on v22-0006 patch:
>
> 1.
> +stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
> +{
> + int i;
> + char path[MAXPGPATH];
> + bool found = false;
> +
> + subxact_filename(path, subid, xid);
> +
> + if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not remove file \"%s\": %m", path)));
>
> Here, we have unlinked the files containing information of subxacts
> but don't we need to free the corresponding memory (memory for
> subxacts) as well?
Basically, stream_cleanup_files is used for:
1) cleaning up the files on worker exit;
2) ensuring, while writing the first segment for an xid, that there
are no orphaned files with the same xid;
3) cleaning up the file after we apply the commit.
Whereas the subxacts memory is only used between stream start and
stream stop; as soon as we see stream stop, we write the subxact
changes to the file and free the memory. So there is no case where we
can have subxact memory at stream_cleanup_files, except on worker
exit, but there we are already exiting the worker. IMHO we don't need
to free the memory there.
> 2.
> apply_handle_stream_abort()
> {
> ..
> + subxact_filename(path, MyLogicalRepWorker->subid, xid);
> +
> + if (unlink(path) < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not remove file \"%s\": %m", path)));
> +
> + return;
> ..
> }
>
> Like the previous comment, it seems here also we need to free subxacts
> memory and additionally we forgot to adjust the xids array as well.
Here, we allocate the memory in subxact_info_read, but we then call
subxact_info_write, which frees the memory.
> 3.
> apply_handle_stream_abort()
> {
> ..
> + /* XXX optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> +
> + if (!found)
> + return;
> ..
> }
>
> Is it possible that we don't find the xid in the subxacts array? If
> so, I think we should mention that in the comments; otherwise, we
> should have an assert for 'found'.
We may not find it due to an empty transaction; I have changed the comments.
> 4.
> apply_handle_stream_abort()
> {
> ..
> + changes_filename(path, MyLogicalRepWorker->subid, xid);
> +
> + if (truncate(path, subxacts[subidx].offset))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not truncate file \"%s\": %m", path)));
> ..
> }
>
> Will truncate work on Windows? I see in the code we use ftruncate,
> which is defined as chsize in win32.h and win32_port.h. I have not
> tested this so I am not very sure about it. I got the below warning
> when I tried to compile this code on Windows. I think it is better to
> use ftruncate as it is used at other places in the code as well.
>
> worker.c(798): warning C4013: 'truncate' undefined; assuming extern
> returning int
I have changed it to ftruncate.
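i.e., something like this (a sketch using OpenTransientFile; the
actual patch may open the file differently):

int     fd;

fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
if (fd < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not open file \"%s\": %m", path)));

/* Discard everything written after the aborted subxact's offset. */
if (ftruncate(fd, subxacts[subidx].offset) != 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not truncate file \"%s\": %m", path)));

CloseTransientFile(fd);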
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Erik Rijkers <er(at)xs4all(dot)nl> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-25 15:18:49 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2020-05-25 16:37, Dilip Kumar wrote:
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>>
>> On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>> >
>> > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> > >
>>
>> I have further reviewed v22 and below are my comments:
>>
>> [v24.tar]
Hi,
I am not able to extract all files correctly from this tar.
The first file v24-0001-* seems to have some 'binary' junk at the top.
(The other 11 files seem normally readable.)
Erik Rijkers
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-26 02:15:19 |
Message-ID: | CAFiTN-tOvoSMQLMNxHQtF2Ri3-W8ABKSPJGN5f9u-pHhv4nDEw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>
> Hi,
>
> I am not able to extract all files correctly from this tar.
>
> The first file v24-0001-* seems to have some 'binary' junk at the top.
>
> (The other 11 files seem normally readable.)
Okay, sending again.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v24.tar | application/x-tar | 311.0 KB |
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-26 04:57:27 |
Message-ID: | CAA4eK1Lg-ysApsF70MnazJJC9MydTrSnwMAtV0RT-aF=Re0qRw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> > 1.
> > + /*
> > + * If this is a toast insert then set the corresponding bit. Otherwise, if
> > + * we have toast insert bit set and this is insert/update then clear the
> > + * bit.
> > + */
> > + if (toast_insert)
> > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> > + else if (rbtxn_has_toast_insert(txn) &&
> > + ChangeIsInsertOrUpdate(change->action))
> > + {
> >
> > Here, it might be better to add a comment on why we expect only
> > Insert/Update. Also, it might be better to add an assert for
> > other operations.
>
> I have added comments explaining why we clear the flag on Insert/Update.
> But I don't think we only expect insert/update; we might get a
> toast delete too, right? In a toast update we do a toast delete +
> toast insert, so when we get a toast delete we just don't want to do
> anything.
>
Okay, that makes sense.
> >
> > 2.
> > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn,
> > * disk.
> > */
> > dlist_delete(&change->node);
> > - ReorderBufferToastAppendChunk(rb, txn, relation,
> > - change);
> > + ReorderBufferToastAppendChunk(rb, txn, relation,
> > + change);
> > }
> >
> > This seems to be a spurious change.
>
> Done
>
> 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> was 0006).
>
The code changes look fine, but it is not clear what the exact issue
was. Can you explain?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-26 06:30:12 |
Message-ID: | CAA4eK1KWZZY3rnfMCv7ywcUju-7aPPV6ktjvRg=8SSaRDGh1_g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > >
> > > > Review comments:
> > > > ------------------------------
> > > > 1.
> > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > > TransactionId xid,
> > > > }
> > > >
> > > > case REORDER_BUFFER_CHANGE_MESSAGE:
> > > > - rb->message(rb, txn, change->lsn, true,
> > > > - change->data.msg.prefix,
> > > > - change->data.msg.message_size,
> > > > - change->data.msg.message);
> > > > + if (streaming)
> > > > + rb->stream_message(rb, txn, change->lsn, true,
> > > > + change->data.msg.prefix,
> > > > + change->data.msg.message_size,
> > > > + change->data.msg.message);
> > > > + else
> > > > + rb->message(rb, txn, change->lsn, true,
> > > > + change->data.msg.prefix,
> > > > + change->data.msg.message_size,
> > > > + change->data.msg.message);
> > > >
> > > > Don't we need to set any_data_sent flag while streaming messages as we
> > > > do for other types of changes?
> > >
> > > I think any_data_sent was added to avoid sending an abort to the
> > > subscriber if we haven't sent any data, but this is not complete, as
> > > the output plugin can also decide not to send. So I think
> > > this should not be done as part of this patch and can be done
> > > separately. I think there is already a thread for handling the
> > > same [1]
> > >
> >
> > Hmm, but prior to this patch we never used to send (empty) aborts, but
> > now that will be possible. It is probably okay to deal with that in the
> > other patch you mentioned, but I felt at least any_data_sent would
> > work for some cases. OTOH, it appears to be a half-baked solution, so
> > we should probably refrain from adding it. BTW, how does the pgoutput
> > plugin deal with it? I see that apply_handle_stream_abort will
> > unconditionally try to unlink the file, and that will probably fail.
> > Have you tested this scenario after your latest changes?
>
> Yeah, I see; I think this is a problem, but it exists without my
> latest change as well: if pgoutput ignores some changes because the
> table is not published, we will see a similar error. Shall we handle the
> ENOENT error case for unlink?
>
Isn't this problem only for the subxact file, as we anyway create the
changes file as part of the start stream message, which must have come
before the abort? If so, can't we detect whether the subxact file
exists, perhaps by using nsubxacts or something like that? Can you
please try to reproduce this scenario once, to ensure that we are not
missing anything?
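For reference, if we did handle ENOENT as suggested, it would be a
sketch along these lines (based on the patch's apply_handle_stream_abort
fragment quoted earlier; the errno check is the new, hypothetical part):

    char        path[MAXPGPATH];

    subxact_filename(path, MyLogicalRepWorker->subid, xid);

    /*
     * Tolerate a missing subxact file: if the aborted transaction never
     * wrote any subxact info, there is nothing to clean up.
     */
    if (unlink(path) < 0 && errno != ENOENT)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));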
>
>
> > > > 8.
> > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > *rb, TransactionId xid,
> > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > >
> > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > +
> > > > + /*
> > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > + * if one of its children has.
> > > > + */
> > > > + if (txn->toptxn != NULL)
> > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > }
> > > >
> > > > Why are we marking top transaction here?
> > >
> > > We need to mark the top transaction to decide whether to build the
> > > tuplecid hash or not. In non-streaming mode, we send only at
> > > commit time, and at commit time we know whether the top
> > > transaction has any catalog changes based on the invalidation
> > > messages, so we mark the top transaction there in DecodeCommit.
> > > Since here we are not waiting till commit, we need to mark the top
> > > transaction as soon as we mark any of its child transactions.
> > >
> >
> > But how does it help? We use this flag (via
> > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn, which is
> > anyway done in DecodeCommit, and that too after setting this flag for
> > the top transaction if required. So how does it help to set it
> > while processing the subxid? Also, even if we have to do it, won't it
> > add the xid needlessly to the builder->committed.xip array?
>
> In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> to build the tuplecid hash or not based on whether it has catalog
> changes or not.
>
Okay, but you haven't answered the second part of the question: "won't
it add the xid of the top transaction needlessly to the
builder->committed.xip array, see function SnapBuildCommitTxn?" IIUC,
this can happen without the patch as well, because DecodeCommit also
sets the flags just based on invalidation messages, irrespective of
whether the messages are generated by the top transaction or not; is
that right? If this is correct, please explain in the comments why we
are doing so.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-26 09:13:59 |
Message-ID: | CAFiTN-sO6VAUy5kaft4=EK-yu52Fc2s_QoxSiVFm1H7erM6g3g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> > > 1.
> > > + /*
> > > + * If this is a toast insert then set the corresponding bit. Otherwise, if
> > > + * we have toast insert bit set and this is insert/update then clear the
> > > + * bit.
> > > + */
> > > + if (toast_insert)
> > > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> > > + else if (rbtxn_has_toast_insert(txn) &&
> > > + ChangeIsInsertOrUpdate(change->action))
> > > + {
> > >
> > > Here, it might be better to add a comment on why we expect only
> > > Insert/Update. Also, it might be better to add an assert for
> > > other operations.
> >
> > I have added comments explaining why we clear the flag on Insert/Update.
> > But I don't think we only expect insert/update; we might get a
> > toast delete too, right? In a toast update we do a toast delete +
> > toast insert, so when we get a toast delete we just don't want to do
> > anything.
> >
>
> Okay, that makes sense.
>
> > >
> > > 2.
> > > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn,
> > > * disk.
> > > */
> > > dlist_delete(&change->node);
> > > - ReorderBufferToastAppendChunk(rb, txn, relation,
> > > - change);
> > > + ReorderBufferToastAppendChunk(rb, txn, relation,
> > > + change);
> > > }
> > >
> > > This seems to be a spurious change.
> >
> > Done
> >
> > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > was 0006).
> >
>
> The code changes look fine, but it is not clear what the exact issue
> was. Can you explain?
Basically, in the case of an empty subtransaction, we were reading the
subxacts info, but when we could not find the subxid in the subxacts
info we were not releasing the memory. So the next subxact_info_read
expects subxacts to have been freed, but we did not free it in that
!found case.
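In code terms, the fix is roughly the following (a sketch against the
patch's file-local subxact variables, not the exact diff):

    if (!found)
    {
        /* Previously we returned here, leaking the subxacts array. */
        if (subxacts)
            pfree(subxacts);
        subxacts = NULL;
        subxact_last = InvalidTransactionId;
        nsubxacts = 0;
        nsubxacts_max = 0;
        return;
    }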
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-26 09:34:02 |
Message-ID: | CAA4eK1+C=eqqUunV29N=8BcKvLk6wEx41zDQCWWVSx+zT2z-VQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > 4.
> > + * XXX Do we need to allocate it in TopMemoryContext?
> > + */
> > +static void
> > +subxact_info_add(TransactionId xid)
> > {
> > ..
> >
> > For this and other places in a patch like in function
> > stream_open_file(), instead of using TopMemoryContext, can we consider
> > using a new memory context LogicalStreamingContext or something like
> > that. We can create LogicalStreamingContext under TopMemoryContext. I
> > don't see any need of using TopMemoryContext here.
>
> But when will we delete/reset the LogicalStreamingContext?
>
Why can't we reset it at each stream stop message?
> Because
> we are planning to keep this memory until the worker is alive, it is
> supposed to be in the top memory context.
>
Which part of the allocation do we want to keep till the worker is
alive? Why do we need the memory related to subxacts till the worker is
alive? As the code stands now, after reading the subxact info
(subxact_info_read), we need to ensure that it is freed after its usage,
due to which we need to remember it and perform pfree at various places.
I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the stop stream
message. That might help in avoiding
MemoryContextSwitchTo(TopMemoryContext) at various places.
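To be concrete, what I have in mind is a sketch like this (the context
name and the exact placement are of course up for discussion):

static MemoryContext LogicalStreamingContext = NULL;

static void
apply_handle_stream_start(StringInfo s)
{
    /* Created lazily under TopMemoryContext, once per worker. */
    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext =
            AllocSetContextCreate(TopMemoryContext,
                                  "LogicalStreamingContext",
                                  ALLOCSET_DEFAULT_SIZES);

    /* ... allocate the subxact bookkeeping in this context ... */
}

static void
apply_handle_stream_stop(StringInfo s)
{
    /* ... write the subxact info to its file first, then: */
    MemoryContextReset(LogicalStreamingContext);
}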
> If we create any other context
> with the same life span as TopMemoryContext then what is the point?
>
It is helpful for debugging. It is recommended that we don't use the
top memory context unless it is really required. Read about it in
src/backend/utils/mmgr/README.
>
> > 8.
> > + * XXX Maybe we should only include the checksum when the cluster is
> > + * initialized with checksums?
> > + */
> > +static void
> > +subxact_info_write(Oid subid, TransactionId xid)
> >
> > Do we really need a checksum for temporary files? I have
> > checked a few other similar cases, like the SharedFileSet stuff for
> > parallel hash join, but didn't find them using checksums. Can you also
> > look at the other usages of temporary files, and then let us decide if
> > we see any reason to have checksums for this?
>
> Yeah, I can also see that checksums are not used in other places.
>
So, unless someone speaks up before you are ready for the next version
of the patch, can we remove it?
> >
> > Another point is we don't seem to be doing this for the 'changes' file
> > (see stream_write_change), so I am not sure there is any sense in
> > writing a checksum for the subxact file.
>
> I can see there is a comment atop this function:
>
> * XXX The subxact file includes CRC32C of the contents. Maybe we should
> * include something like that here too, but doing so will not be as
> * straighforward, because we write the file in chunks.
>
You can remove this comment as well. I don't know how advantageous it
is to checksum temporary files. We can anyway add it later if there
is a reason for doing so.
>
>
> > 12.
> > maybe_send_schema()
> > {
> > ..
> > + if (in_streaming)
> > + {
> > + /*
> > + * TOCHECK: We have to send schema after each catalog change and it may
> > + * occur when streaming already started, so we have to track new catalog
> > + * changes somehow.
> > + */
> > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> > ..
> > ..
> > }
> >
> > I think it is good to verify/test once what this comment says, but as
> > per the code we should be sending the schema after each catalog change,
> > as we invalidate the streamed_txns list in rel_sync_cache_relation_cb,
> > which must be called during relcache invalidation. Do we see any
> > problem with that mechanism?
>
> I have tested this; I think we are already sending the schema after
> each catalog change.
>
Then remove "TOCHECK" from the above comment.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-26 11:16:01 |
Message-ID: | CAA4eK1JP2a=hE1J6iXEkEMyxVQWdgh-73y+WyHAJ5umX2-Y9tg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > >
> > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > was 0006).
> > >
> >
> > The code changes look fine, but it is not clear what the exact issue
> > was. Can you explain?
>
> Basically, in case of an empty subtransaction, we were reading the
> subxacts info but when we could not find the subxid in the subxacts
> info we were not releasing the memory. So on next subxact_info_read
> it will expect that subxacts should be freed but we did not free it in
> that !found case.
>
Okay, on looking at it again, the same code exists in
subxact_info_write as well; it would be better to have a function for it.
Can we have a structure like SubXactContext for all the variables used
for subxacts? As mentioned earlier, I find the allocation/deallocation
of subxacts a bit ad hoc, so there will always be a chance that we
forget to free it. Having it allocated in a memory context which we can
reset later might reduce that risk. One idea could be to have a special
memory context for the start and stop messages which can be used to
allocate the subxacts there. In case of commit/abort, we can allow the
subxacts information to be allocated in ApplyMessageContext, which is
reset at the end of each protocol message.
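The structure itself could be a sketch like this (the fields mirror the
variables currently scattered around worker.c in the patch):

typedef struct SubXactContext
{
    SubXactInfo    *subxacts;       /* per-subxact offset information */
    uint32          nsubxacts;      /* number of subxacts tracked */
    uint32          nsubxacts_max;  /* allocated length of subxacts */
    TransactionId   subxact_last;   /* xid of the last subxact seen */
} SubXactContext;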
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Mahendra Singh Thalor <mahi6run(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date: | 2020-05-27 11:49:04 |
Message-ID: | CAKYtNAof+vVNQApsedWYQ2orvwWknmresB9=6-pNwHX60cjxEQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, 26 May 2020 at 16:46, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > >
> > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > > was 0006).
> > > >
> > >
> > > The code changes look fine, but it is not clear what the exact issue
> > > was. Can you explain?
> >
> > Basically, in the case of an empty subtransaction, we were reading the
> > subxacts info, but when we could not find the subxid in the subxacts
> > info we were not releasing the memory. So the next subxact_info_read
> > expects subxacts to have been freed, but we did not free it in that
> > !found case.
> >
>
> Okay, on looking at it again, the same code exists in
> subxact_info_write as well; it would be better to have a function for it.
> Can we have a structure like SubXactContext for all the variables used
> for subxacts? As mentioned earlier, I find the allocation/deallocation
> of subxacts a bit ad hoc, so there will always be a chance that we
> forget to free it. Having it allocated in a memory context which we can
> reset later might reduce that risk. One idea could be to have a special
> memory context for the start and stop messages which can be used to
> allocate the subxacts there. In case of commit/abort, we can allow the
> subxacts information to be allocated in ApplyMessageContext, which is
> reset at the end of each protocol message.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
Hi all,
On top of the v16 patch set [1], I did some testing of DDLs and DMLs to
measure WAL size and performance. Below is the testing summary.
*Test parameters:*
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum= 'off'
checkpoint_timeout= '1d'
*Test results:*
Each row gives, per operation (SN., operation name), the LSN diff (in
bytes), time (in sec), and % LSN change, for three cases: CREATE index
operations, Add col int(date) operations, and Add col text operations.