Get rid of WALBufMappingLock

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Get rid of WALBufMappingLock
Date: 2025-01-19 00:11:32
Message-ID: [email protected]
Lists: pgsql-hackers

Good day, hackers.

During the discussion of Increasing NUM_XLOGINSERT_LOCKS [1], Andres Freund
used a benchmark which creates WAL records very intensively. While I think
it is not completely fair (1MB log records are really rare), it pushed
me to analyze the write-side waiting of the XLog machinery.

First I tried to optimize WaitXLogInsertionsToFinish, but without great
success (yet).

While profiling, I found that a lot of time is spent clearing memory
under the global WALBufMappingLock:

MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

It is an obvious scalability bottleneck.

So "challenge was accepted".

Certainly, backends should initialize pages without the exclusive lock. But
how do we then ensure pages were initialized? In other words, how do we
ensure XLogCtl->InitializedUpTo is correct?

I tried to play around with WALBufMappingLock, holding it for a short
time and spinning on XLogCtl->xlblocks[nextidx]. But in the end I found
WALBufMappingLock is not needed at all.

Instead of holding a lock, it is better to let backends cooperate (a rough
sketch follows below):
- I bound a ConditionVariable to each xlblocks entry,
- every backend now checks that every block it requires, starting from
InitializedUpTo, was successfully initialized, or sleeps on that block's
condvar,
- once a backend is sure a block is initialized, it tries to advance
InitializedUpTo and signals the condition variable.
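
Here is a minimal sketch of the waiting side (illustrative only, not the
patch code). It assumes a hypothetical XLBlocks layout with a per-buffer
condvar, and an assumed wait event name WAIT_EVENT_WAL_BUFFER_INIT:

    typedef struct XLBlocks
    {
        pg_atomic_uint64 bound;    /* first byte past the page in this slot */
        ConditionVariable condvar; /* broadcast once the page is initialized */
    } XLBlocks;

    /* Sleep until the buffer in slot 'idx' covers WAL up to 'bound'. */
    static void
    WaitBufferInitialized(XLBlocks *xlblocks, int idx, XLogRecPtr bound)
    {
        ConditionVariablePrepareToSleep(&xlblocks[idx].condvar);
        while (pg_atomic_read_u64(&xlblocks[idx].bound) < bound)
            ConditionVariableSleep(&xlblocks[idx].condvar,
                                   WAIT_EVENT_WAL_BUFFER_INIT);
        ConditionVariableCancelSleep();
    }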

Andres's benchmark looks like:

c=100 && install/bin/psql -c checkpoint -c 'select pg_switch_wal()' postgres && \
install/bin/pgbench -n -M prepared -c$c -j$c \
  -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 1024*1024));") \
  -P1 -T45 postgres

So it generates 1MB records as fast as possible for 45 seconds.

The test machine is a Ryzen 5825U (8 cores/16 threads) limited to 2GHz.
Config:

max_connections = 1000
shared_buffers = 1024MB
fsync = off
wal_sync_method = fdatasync
full_page_writes = off
wal_buffers = 1024MB
checkpoint_timeout = 1d

Results are: "average for 45 sec" / "1 second max outlier".

Results for master @ d3d098316913 :
25 clients: 2908 /3230
50 clients: 2759 /3130
100 clients: 2641 /2933
200 clients: 2419 /2707
400 clients: 1928 /2377
800 clients: 1689 /2266

With v0-0001-Get-rid-of-WALBufMappingLock.patch :
25 clients: 3103 /3583
50 clients: 3183 /3706
100 clients: 3106 /3559
200 clients: 2902 /3427
400 clients: 2303 /2717
800 clients: 1925 /2329

Combined with v0-0002-several-attempts-to-lock-WALInsertLocks.patch

No WALBufMappingLock + attempts on XLogInsertLock:
25 clients: 3518 /3750
50 clients: 3355 /3548
100 clients: 3226 /3460
200 clients: 3092 /3299
400 clients: 2575 /2801
800 clients: 1946 /2341

These results are with the default NUM_XLOGINSERT_LOCKS == 8.

[1]
http://postgr.es/m/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

PS.
Increasing NUM_XLOGINSERT_LOCKS to 64 gives:
25 clients: 3457 /3624
50 clients: 3215 /3500
100 clients: 2750 /3000
200 clients: 2535 /2729
400 clients: 2163 /2400
800 clients: 1700 /2060

While the same increase on master gives:
25 clients: 2645 /2953
50 clients: 2562 /2968
100 clients: 2364 /2756
200 clients: 2266 /2564
400 clients: 1868 /2228
800 clients: 1527 /2133

So, the patched version with increased NUM_XLOGINSERT_LOCKS looks no worse
than the unpatched version without increasing the number of locks.

-------
regards
Yura Sokolov aka funny-falcon

Attachment Content-Type Size
v0-0001-Get-rid-of-WALBufMappingLock.patch text/x-patch 17.0 KB
v0-0002-several-attempts-to-lock-WALInsertLocks.patch text/x-patch 3.0 KB

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-01-19 00:31:20
Message-ID: [email protected]
Lists: pgsql-hackers

19.01.2025 03:11, Yura Sokolov wrote:
> [...]
> So, the patched version with increased NUM_XLOGINSERT_LOCKS looks no worse
> than the unpatched version without increasing the number of locks.

I'm too brave... or too sleepy (it's 3:30am)...
But I took the risk of sending the patch to the commitfest:
https://commitfest.postgresql.org/52/5511/

------
regards
Yura Sokolov aka funny-falcon


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-06 22:26:04
Message-ID: CAPpHfdvDXFqPR-TzzOth50QjBwMx7f=p1s_Z5bf_dc9DK=968w@mail.gmail.com
Lists: pgsql-hackers

Hi!

On Sun, Jan 19, 2025 at 2:11 AM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>
> Instead of holding a lock, it is better to let backends cooperate:
> - I bound a ConditionVariable to each xlblocks entry,
> - every backend now checks that every block it requires, starting from
> InitializedUpTo, was successfully initialized, or sleeps on that block's
> condvar,
> - once a backend is sure a block is initialized, it tries to advance
> InitializedUpTo and signals the condition variable.

Looks reasonable to me, but having a ConditionVariable per xlog buffer
seems like overkill. Find attached a revision where I've implemented
advancing InitializedUpTo without per-buffer ConditionVariables. After the
initialization of each buffer there is an attempt to CAS InitializedUpTo
in a loop. So multiple processes will try to advance InitializedUpTo; they
can hijack the initiative from each other, but there is always a leader
that will finish the work.

There is only one ConditionVariable to wait on for InitializedUpTo to be
advanced.
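
For illustration, the advance loop could look roughly like this (a sketch
under assumptions, not the patch itself; it presumes xlblocks[] stores each
page's end pointer as pg_atomic_uint64 and that XLogRecPtrToBufIdx() maps
an end pointer to its buffer slot, as in xlog.c):

    /* Called after zeroing a page; try to advance InitializedUpTo. */
    for (;;)
    {
        XLogRecPtr  upto = pg_atomic_read_u64(&XLogCtl->InitializedUpTo);
        int         idx = XLogRecPtrToBufIdx(upto + XLOG_BLCKSZ);

        /* Stop if the page right after 'upto' is not initialized yet. */
        if (pg_atomic_read_u64(&XLogCtl->xlblocks[idx]) != upto + XLOG_BLCKSZ)
            break;

        /* Advance; on failure another process advanced it, so re-check. */
        if (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo,
                                           &upto, upto + XLOG_BLCKSZ))
            ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
    }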

I didn't benchmark my version, just checked that tests passed.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v1-0002-several-attempts-to-lock-WALInsertLocks.patch application/octet-stream 3.0 KB
v1-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 16.6 KB

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-07 11:02:48
Message-ID: [email protected]
Lists: pgsql-hackers

07.02.2025 01:26, Alexander Korotkov wrote:
> Looks reasonable to me, but having a ConditionVariable per xlog buffer
> seems like overkill. Find attached a revision where I've implemented
> advancing InitializedUpTo without per-buffer ConditionVariables. After the
> initialization of each buffer there is an attempt to CAS InitializedUpTo
> in a loop. So multiple processes will try to advance InitializedUpTo; they
> can hijack the initiative from each other, but there is always a leader
> that will finish the work.
>
> There is only one ConditionVariable to wait on for InitializedUpTo to be
> advanced.
>
> I didn't benchmark my version, just checked that tests passed.

Good day, Alexander.

I've got mixed but quite close results for both approaches (single or many
ConditionVariables) on my notebook. Since I have no access to a larger
machine, I can't prove "many" is way better (or discover it is worse).

Given that the patch after cleanup looks a bit smaller and clearer, I agree
to keep just a single condition variable.

Cleaned version is attached.

I've changed the condition for broadcast a bit ("less than" instead of
"not equal"):
- the buffer's border may already have gone into the future,
- and then another backend will reach a not-yet-initialized buffer and will
broadcast.

-------
regards
Yura Sokolov aka funny-falcon

Attachment Content-Type Size
v2-0001-Get-rid-of-WALBufMappingLock.patch text/x-patch 12.8 KB
v2-0002-several-attempts-to-lock-WALInsertLocks.patch text/x-patch 3.0 KB

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-07 11:24:49
Message-ID: [email protected]
Lists: pgsql-hackers

07.02.2025 14:02, Yura Sokolov wrote:

Seems I was mistaken twice.

> I've got mixed but quite close results for both approaches (single or many
> ConditionVariables) on my notebook. Since I have no access to a larger
> machine, I can't prove "many" is way better (or discover it is worse).

1. "many condvars" (my variant) is strictly worse with num locks = 64 and
when pg_logical_emit_message emits just 1kB instead of 1MB.

Therefore, "single condvar" is strictly better.

> Given that the patch after cleanup looks a bit smaller and clearer, I agree
> to keep just a single condition variable.
>
> Cleaned version is attached.
>
> I've changed the condition for broadcast a bit ("less than" instead of
> "not equal"):
> - the buffer's border may already have gone into the future,
> - and then another backend will reach a not-yet-initialized buffer and will
> broadcast.

2. I inserted an abort for the case where the "buffer's border went into
the future", and wasn't able to trigger it.

So I returned the update loop's body to your variant.

-------
regards
Yura Sokolov aka funny-falcon

Attachment Content-Type Size
v3-0001-Get-rid-of-WALBufMappingLock.patch text/x-patch 12.8 KB
v3-0002-several-attempts-to-lock-WALInsertLocks.patch text/x-patch 3.0 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-07 11:39:09
Message-ID: CAPpHfduc9YMmGcBocD=03F+rbMV0XpA2KoHQC9bQN1QEpPjvPw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 7, 2025 at 1:24 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> Seems I was mistaken twice.
>
> 1. "many condvars" (my variant) is strictly worse with num locks = 64 and
> when pg_logical_emit_message emits just 1kB instead of 1MB.
>
> Therefore, "single condvar" is strictly better.
>
> 2. I inserted an abort for the case where the "buffer's border went into
> the future", and wasn't able to trigger it.
>
> So I returned the update loop's body to your variant.

Good, thank you. I think the 0001 patch is generally good, but it needs
some further polishing, e.g. more comments explaining how it works.

Regarding the 0002 patch, it looks generally reasonable. But are 2
attempts always optimal? Are there cases of regression, or cases where
more attempts are even better? Could we have some self-adjusting mechanism
there, like what we have for spinlocks?

------
Regards,
Alexander Korotkov
Supabase


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-08 10:07:16
Message-ID: CAPpHfduHe3cHkEgRAGjG6qLrsFuS2M0vD5fniEgQBF9kBipBFA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> Good, thank you. I think the 0001 patch is generally good, but it needs
> some further polishing, e.g. more comments explaining how it works.

Two things come to mind worth rechecking about 0001.
1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
XLogCtl->xlblocks always page-aligned? The algorithm seems to be sensitive
to that. If so, I would propose to comment on that explicitly and add
corresponding asserts (see the sketch after this list).
2) Check whether there are concurrency issues between
AdvanceXLInsertBuffer() and switching to a new WAL file.
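
For point 1, such an assert could look roughly like this (illustrative
only; it assumes the patch makes InitializedUpTo a pg_atomic_uint64):

    /* InitializedUpTo must always sit exactly on a WAL page boundary. */
    Assert(pg_atomic_read_u64(&XLogCtl->InitializedUpTo) % XLOG_BLCKSZ == 0);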

------
Regards,
Alexander Korotkov
Supabase


From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-12 18:16:12
Message-ID: [email protected]
Lists: pgsql-hackers

08.02.2025 13:07, Alexander Korotkov wrote:
> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> Good, thank you. I think the 0001 patch is generally good, but it needs
>> some further polishing, e.g. more comments explaining how it works.

I tried to add more comments. I'm not good at it, so recommendations are welcome.

> Two things come to mind worth rechecking about 0001.
> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
> XLogCtl->xlblocks always page-aligned? The algorithm seems to be sensitive
> to that. If so, I would propose to comment on that explicitly and add
> corresponding asserts.

They're certainly page-aligned, since they are page borders.
I added an assert on the alignment of InitializeReserved for sanity.

> 2) Check whether there are concurrency issues between
> AdvanceXLInsertBuffer() and switching to a new WAL file.

There are no issues:
1. CopyXLogRecordToWAL for isLogSwitch follows the same protocol, i.e. uses
GetXLogBuffer to zero out the WAL page.
2. WALINSERT_SPECIAL_SWITCH forces an exclusive lock on all insertion
locks, so switching WAL is not concurrent. (Although there is no need for
this exclusiveness, imho.)

> Regarding the 0002 patch, it looks generally reasonable. But are 2
> attempts always optimal? Are there cases of regression, or cases where
> more attempts are even better? Could we have some self-adjusting mechanism
> there, like what we have for spinlocks?

Well, I chose to perform 3 probes (2 conditional attempts + 1
unconditional) based on intuition. I have some experience in building hash
tables, and cuckoo-hashing theory says 2 probes are usually enough to reach
a 50% fill rate, and 3 probes are enough for a ~75% fill rate. Since each
probe is a cache miss, it is hardly sensible to do more probes.
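
A minimal sketch of the probing idea (illustrative, not the 0002 patch
itself; WALInsertLocks, MyLockNo and LWLockConditionalAcquire() are the
existing xlog.c/lwlock.c names, while the function name is made up):

    /* Probe a few insertion locks conditionally before blocking. */
    static int
    WALInsertLockAcquireWithProbes(int probes)
    {
        int     lockno = MyLockNo;

        while (probes-- > 1)
        {
            if (LWLockConditionalAcquire(&WALInsertLocks[lockno].l.lock,
                                         LW_EXCLUSIVE))
                return lockno;      /* got a free lock without sleeping */
            lockno = (lockno + 1) % NUM_XLOGINSERT_LOCKS;
        }
        /* last probe is unconditional: sleep if the lock is busy */
        LWLockAcquire(&WALInsertLocks[lockno].l.lock, LW_EXCLUSIVE);
        return lockno;
    }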

3 probes did better than 2 in another benchmark [1], although
NUM_XLOGINSERT_LOCKS was increased there.

Excuse me for not benchmarking different choices here. I'll try to do
measurements in the next few days.

[1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

-------
regards
Yura Sokolov aka funny-falcon

Attachment Content-Type Size
v4-0001-Get-rid-of-WALBufMappingLock.patch text/x-patch 13.9 KB
v4-0002-several-attempts-to-lock-WALInsertLocks.patch text/x-patch 2.9 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-13 09:34:38
Message-ID: CAPpHfdtr0Gj0K88yBjhk8Rq-zT6WFSxC+eSCmMmywNc4pgr=rQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> I tried to add more comments. I'm not good at it, so recommendations are
> welcome.
>
> They're certainly page-aligned, since they are page borders.
> I added an assert on the alignment of InitializeReserved for sanity.
>
> There are no issues:
> 1. CopyXLogRecordToWAL for isLogSwitch follows the same protocol, i.e. uses
> GetXLogBuffer to zero out the WAL page.
> 2. WALINSERT_SPECIAL_SWITCH forces an exclusive lock on all insertion
> locks, so switching WAL is not concurrent. (Although there is no need for
> this exclusiveness, imho.)

Good, thank you. I've also revised the commit message and comments.

But I see another issue with this patch. In the worst case, we do
XLogWrite() by ourselves, and it could potentially error out. Without the
patch, that would cause WALBufMappingLock to be released and
XLogCtl->InitializedUpTo not advanced. With the patch, that would cause
other processes to wait infinitely until we finish the initialization.

A possible solution would be to save the position of the page to be
initialized, and set it back to XLogCtl->InitializeReserved on error
(everywhere we do LWLockReleaseAll()). We must also check that on error we
only move XLogCtl->InitializeReserved backwards, because there could be
multiple concurrent failures. We also need to broadcast
XLogCtl->InitializedUpToCondVar to wake up the waiters.
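
A rough sketch of such an error-cleanup path (hypothetical helper name; it
assumes InitializeReserved is a pg_atomic_uint64 in the patch):

    /* On error, move the reservation back to 'saved', but only backwards. */
    static void
    ResetInitializeReserved(XLogRecPtr saved)
    {
        uint64  cur = pg_atomic_read_u64(&XLogCtl->InitializeReserved);

        /* several backends may fail concurrently; keep the smallest value */
        while (saved < cur &&
               !pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
                                               &cur, saved))
            ;                   /* 'cur' was re-read by the failed CAS */

        ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
    }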

> Well, I chose to perform 3 probes (2 conditional attempts + 1
> unconditional) based on intuition. I have some experience in building hash
> tables, and cuckoo-hashing theory says 2 probes are usually enough to reach
> a 50% fill rate, and 3 probes are enough for a ~75% fill rate. Since each
> probe is a cache miss, it is hardly sensible to do more probes.
>
> 3 probes did better than 2 in another benchmark [1], although
> NUM_XLOGINSERT_LOCKS was increased there.
>
> Excuse me for not benchmarking different choices here. I'll try to do
> measurements in the next few days.
>
> [1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

Ok, let's wait for your measurements.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v5-0002-several-attempts-to-lock-WALInsertLocks.patch application/octet-stream 3.0 KB
v5-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 16.6 KB

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-13 09:45:26
Message-ID: [email protected]
Lists: pgsql-hackers

13.02.2025 12:34, Alexander Korotkov wrote:
> But I see another issue with this patch. In the worst case, we do
> XLogWrite() by ourselves, and it could potentially error out. Without the
> patch, that would cause WALBufMappingLock to be released and
> XLogCtl->InitializedUpTo not advanced. With the patch, that would cause
> other processes to wait infinitely until we finish the initialization.
>
> A possible solution would be to save the position of the page to be
> initialized, and set it back to XLogCtl->InitializeReserved on error
> (everywhere we do LWLockReleaseAll()). We must also check that on error we
> only move XLogCtl->InitializeReserved backwards, because there could be
> multiple concurrent failures. We also need to broadcast
> XLogCtl->InitializedUpToCondVar to wake up the waiters.

The single place where AdvanceXLInsertBuffer is called outside of a
critical section is in XLogBackgroundFlush. All other call stacks will
trigger a server restart if XLogWrite raises an error inside
AdvanceXLInsertBuffer.

XLogBackgroundFlush explicitly avoids writing buffers by passing the
opportunistic=true parameter.

Therefore, an error in XLogWrite will not cause a hang in
AdvanceXLInsertBuffer, since the server will shutdown/restart.

Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the
call to XLogWrite here?

-------
regards
Yura Sokolov aka funny-falcon


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-13 10:08:13
Message-ID: CAPpHfdtKBZwHX9vFqyJWgjrFFrZ0wT6k3pp8=j0UqdEbNtKU7w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> The single place where AdvanceXLInsertBuffer is called outside of a
> critical section is in XLogBackgroundFlush. All other call stacks will
> trigger a server restart if XLogWrite raises an error inside
> AdvanceXLInsertBuffer.
>
> XLogBackgroundFlush explicitly avoids writing buffers by passing the
> opportunistic=true parameter.
>
> Therefore, an error in XLogWrite will not cause a hang in
> AdvanceXLInsertBuffer, since the server will shutdown/restart.
>
> Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the
> call to XLogWrite here?

You're correct. I just reflected this in the next revision of the patch.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v6-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 16.6 KB
v6-0002-several-attempts-to-lock-WALInsertLocks.patch application/octet-stream 3.0 KB

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-13 16:39:32
Message-ID: CALT9ZEELhvTOD=Cs=QOG=T42yWAdmD53M5ELqJF53tefffuHrQ@mail.gmail.com
Lists: pgsql-hackers

Hi, Yura and Alexander!

On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> You're correct. I just reflected this in the next revision of the patch.

I've looked at the patchset v6.

The overall goal of avoiding locking looks good. Both patches look right
to me. The only thing I'm slightly concerned about is whether the patchset
can demonstrate performance differences in any real workload. I appreciate
it as a beautiful optimization anyway.

I have the following proposals to change the code and comments:

For patch 0001:

- Struct XLBlocks contains just one pg_atomic_uint64 member. Is it still
needed as a struct? This accounts for a significant volume of changes in
the patch while being a noop. Maybe it was inherited from v1 and is not
needed anymore.

- Furthermore, since xlblocks became a struct, comments like:
> and xlblocks values certainly do. xlblocks values are changed
need to be updated to say xlblocks.bound. This could be avoided by
changing xlblocks back from type XLBlocks * to pg_atomic_uint64 *.

- It's worth commenting InitializedUpTo/InitializedUpToCondVar in more
detail than:
+ * It is updated to successfully initialized buffer's
identities, perhaps
+ * waiting on conditional variable bound to buffer.

"perhaps waiting" could also be in the style "maybe/even while AAA waits BBB"

"lock-free with cooperation with" -> "lock-free accompanied by changes to..."

- The comment inside typedef struct XLogCtlData:
/* 1st byte ptr-s + XLOG_BLCKSZ (and condvar * for) */
needs to be returned back to:
/* 1st byte ptr-s + XLOG_BLCKSZ */

- Commented-out code should be cleaned up in the final patch:
//ConditionVariableBroadcast(&XLogCtl->xlblocks[nextidx].condvar);

- In AdvanceXLInsertBuffer(), npages is initialized to 0 but is never
incremented anymore, so the block under
> if (XLOG_DEBUG && npages > 0)
became unreachable.

- (InvalidXLogRecPtr + 1) is essentially 0 + 1, and IMO this semantically
calls for adding #define FirstValidXLogRecPtr 1.

Typo in the commit message: %s/usign/using/g

For patch 0002:

I think Yura's explanation from above in this thread needs to find a place
in the commit message and in a comment for this:
> int attempts = 2;

The comments around "try another lock next time" could be modified to
reflect that we retry twice.

Kind regards,
Pavel Borisov
Supabase


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-13 20:59:06
Message-ID: CAPpHfdsMQ_Rv1rH3vN0uQdLP-u5b2BMCqzrKEasS=H3nQU+KsQ@mail.gmail.com
Lists: pgsql-hackers

Hi, Pavel!

On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com> wrote:
> I've looked at the patchset v6.

Oh, sorry, I really did this wrong. I ran git format-patch on the wrong
local branch for v5 and v6. The patches I've sent for v5 and v6 are
actually the same as my v1. This is a real pity. Please find the right
version of the patchset attached.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v7-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 14.5 KB
v7-0002-several-attempts-to-lock-WALInsertLocks.patch application/octet-stream 2.9 KB

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-14 09:45:44
Message-ID: CALT9ZEF_qeGxVkCesJxwK10BQdESTp7C6MtcG=pwqxinhNUD-w@mail.gmail.com
Lists: pgsql-hackers

Hi, Alexander!

On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> Oh, sorry, I really did this wrong. I ran git format-patch on the wrong
> local branch for v5 and v6. The patches I've sent for v5 and v6 are
> actually the same as my v1. This is a real pity. Please find the right
> version of the patchset attached.

I've rechecked v7. The proposal from [1] was not reflected in v6; now it
has landed in v7.

The other changes do not concern code behavior. The things from my
previous review that still apply to v7:

For 0001:

Proposed comment change:
"lock-free with cooperation with" -> "lock-free accompanied by changes
to..." (or another variant)

I propose a new define:
#define FirstValidXLogRecPtr 1
While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true, in the code it
has no semantic meaning, and it's better to avoid using direct arithmetic
to derive the meaning of FirstValidXLogRecPtr from InvalidXLogRecPtr.

For 0002, both comment proposals from my message applied to v6 apply to
v7 as well.

[1] https://www.postgresql.org/message-id/d6799557-e352-42c8-80cc-ed36e3b8893c%40postgrespro.ru

Regards,
Pavel Borisov
Supabase


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-14 10:24:15
Message-ID: CAPpHfdto+M6uTcOh1Snw2isHUs0oZwA_ywioY7KW_R26nu7Wzw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com> wrote:
> On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com> wrote:
> > > On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > > >
> > > > On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> > > > > 13.02.2025 12:34, Alexander Korotkov пишет:
> > > > > > On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> > > > > >> 08.02.2025 13:07, Alexander Korotkov пишет:
> > > > > >>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > > > > >>>> Good, thank you. I think 0001 patch is generally good, but needs some
> > > > > >>>> further polishing, e.g. more comments explaining how does it work.
> > > > > >>
> > > > > >> I tried to add more comments. I'm not good at, so recommendations are welcome.
> > > > > >>
> > > > > >>> Two things comes to my mind worth rechecking about 0001.
> > > > > >>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
> > > > > >>> XLogCtl->xlblocks always page-aligned? Because algorithm seems to be
> > > > > >>> sensitive to that. If so, I would propose to explicitly comment that
> > > > > >>> and add corresponding asserts.
> > > > > >>
> > > > > >> They're certainly page aligned, since they are page borders.
> > > > > >> I added assert on alignment of InitializeReserved for the sanity.
> > > > > >>
> > > > > >>> 2) Check if there are concurrency issues between
> > > > > >>> AdvanceXLInsertBuffer() and switching to the new WAL file.
> > > > > >>
> > > > > >> There are no issues:
> > > > > >> 1. CopyXLogRecordToWAL for isLogSwitch follows same protocol, ie uses
> > > > > >> GetXLogBuffer to zero-out WAL page.
> > > > > >> 2. WALINSERT_SPECIAL_SWITCH forces exclusive lock on all insertion locks,
> > > > > >> so switching wal is not concurrent. (Although, there is no need in this
> > > > > >> exclusiveness, imho.)
> > > > > >
> > > > > > Good, thank you. I've also revised commit message and comments.
> > > > > >
> > > > > > But I see another issue with this patch. In the worst case, we do
> > > > > > XLogWrite() by ourselves, and it could potentially could error out.
> > > > > > Without patch, that would cause WALBufMappingLock be released and
> > > > > > XLogCtl->InitializedUpTo not advanced. With the patch, that would
> > > > > > cause other processes infinitely waiting till we finish the
> > > > > > initialization.
> > > > > >
> > > > > > A possible solution would be to save the position of the page to be
> > > > > > initialized, and set it back to XLogCtl->InitializeReserved on error
> > > > > > (everywhere we do LWLockReleaseAll()). We also must check that on
> > > > > > error we only set XLogCtl->InitializeReserved to the past, because
> > > > > > there could be multiple concurrent failures. Also, we need to
> > > > > > broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.
> > > > >
> > > > > The single place where AdvanceXLInsertBuffer is called outside of a critical
> > > > > section is in XLogBackgroundFlush. All other call stacks will cause a server
> > > > > restart if XLogWrite raises an error inside of AdvanceXLInsertBuffer.
> > > > >
> > > > > XLogBackgroundFlush explicitly avoids writing buffers by passing the
> > > > > opportunistic=true parameter.
> > > > >
> > > > > Therefore, an error in XLogWrite will not cause a hang in AdvanceXLInsertBuffer,
> > > > > since the server will shutdown/restart.
> > > > >
> > > > > Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the call
> > > > > to XLogWrite here?
> > > >
> > > > You're correct. I just reflected this in the next revision of the patch.
> > >
> > > I've looked at the patchset v6.
> >
> > Oh, sorry, I really did this wrong. I've done git format-patch for the wrong
> > local branch for v5 and v6. The patches I've sent for v5 and v6 are
> > actually the same as my v1. This is a real pity. Please find the
> > right version of the patchset attached.
>
> I've rechecked v7. In v6 a proposal from [1] was not reflected; now it
> has landed in v7.
>
> The other changes don't concern code behavior. The things from my
> previous review that still apply to v7:
>
> For 0001:
>
> Comment change proposed:
> "lock-free with cooperation with" -> "lock-free accompanied by changes
> to..." (or maybe another variant)

Good catch. I've rephrased this comment even more.

> I propose a new define:
> #define FirstValidXLogRecPtr 1
> While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true in the code,
> that has no semantic meaning, and it's better to avoid using direct
> arithmetic to derive the meaning of FirstValidXLogRecPtr from
> InvalidXLogRecPtr.

Makes sense, but I'm not sure if this change is required at all. I've
reverted this to the state of master, and everything seems to work.

> For 0002, both comment proposals from my message on v6 apply
> to v7 as well.

Thank you for pointing that out. For now, I'm concentrating on improvements to
0001. Probably Yura could work on your notes to 0002.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v8-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 14.2 KB
v8-0002-several-attempts-to-lock-WALInsertLocks.patch application/octet-stream 2.9 KB

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-14 14:09:18
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

14.02.2025 13:24, Alexander Korotkov writes:
> [...]
>
> Thank you for pointing that out. For now, I'm concentrating on
> improvements to 0001. Probably Yura could work on your notes to 0002.

I wrote a good commit message for 0002 with calculated probabilities and a
simple Ruby program which computes them, to explain the choice of 2
conditional attempts. (At least I hope the message is good.) And added a
simple comment before `int attempts = 2;`.

Also I simplified 0002 a bit to look prettier (i.e. without goto), and
added a static assert on NUM_XLOGINSERT_LOCKS being a power of 2.

(The 0001 patch is the same as for v8.)
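
To illustrate the idea, here is a minimal sketch of such an acquire loop
(hedged: this is not the patch itself; the function name and starting-slot
choice are hypothetical, while WALInsertLocks and NUM_XLOGINSERT_LOCKS are
the existing xlog.c definitions):

/* Power-of-2 size makes "& (N - 1)" a cheap modulo for slot wrapping. */
StaticAssertDecl((NUM_XLOGINSERT_LOCKS & (NUM_XLOGINSERT_LOCKS - 1)) == 0,
                 "NUM_XLOGINSERT_LOCKS must be a power of 2");

static void
WALInsertLockAcquireSketch(void)
{
    int     attempts = 2;   /* conditional attempts before blocking */
    uint32  slot = MyProcPid & (NUM_XLOGINSERT_LOCKS - 1);  /* hypothetical start */

    while (attempts-- > 0)
    {
        /* Try to grab a lock without sleeping. */
        if (LWLockConditionalAcquire(&WALInsertLocks[slot].l.lock, LW_EXCLUSIVE))
            return;
        /* Busy: step to the next slot and try again. */
        slot = (slot + 1) & (NUM_XLOGINSERT_LOCKS - 1);
    }
    /* All conditional attempts failed; block on the current slot. */
    LWLockAcquire(&WALInsertLocks[slot].l.lock, LW_EXCLUSIVE);
}

The point of the two conditional attempts is that a backend only blocks after
a couple of cheap tries on different locks, which keeps the probability of
sleeping low without scanning all slots.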

-------
regards
Yura Sokolov aka funny-falcon

Attachment Content-Type Size
v9-0001-Get-rid-of-WALBufMappingLock.patch text/x-patch 14.2 KB
v9-0002-Several-attempts-to-lock-WALInsertLocks.patch text/x-patch 3.9 KB

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-14 14:11:14
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

14.02.2025 17:09, Yura Sokolov writes:
> [...]
>
> Also I simplified 0002 a bit to look prettier (i.e. without goto), and
> added a static assert on NUM_XLOGINSERT_LOCKS being a power of 2.
>
> (The 0001 patch is the same as for v8.)

Oops, forgot to add StaticAssert into v9-0002.

Attachment Content-Type Size
v10-0001-Get-rid-of-WALBufMappingLock.patch text/x-patch 14.2 KB
v10-0002-Several-attempts-to-lock-WALInsertLocks.patch text/x-patch 4.0 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-15 09:25:00
Message-ID: CAPpHfds==kM3YJoTX64rV-CVsan8ARk3w32tgdCZ69BzD01i_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On Fri, Feb 14, 2025 at 4:11 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> [...]
>
> Oops, forgot to add StaticAssert into v9-0002.

Thank you. I'm planning to push 0001 if there are no objections. And
I'm planning to do more review/revision of 0002.

------
Regards,
Alexander Korotkov
Supabase


From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 05:19:07
Message-ID: CALdSSPgtaG7wBin1p6p+w79BuGtYtqUQrLvDW=9syTZ+Y88JiQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!
I spotted a typo in v10:

+ /*
+ * Page at nextidx wasn't initialized yet, so we cann't move
+ * InitializedUpto further. It will be moved by backend which
+ * will initialize nextidx.
+ */

cann't -> can't

moved by backend -> moved by the backend

--
Best regards,
Kirill Reshke


From: Victor Yegorov <vyegorov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 08:46:57
Message-ID: CAGnEbogrHOrXw8vaR3QP4hkEABF=L4VZaaViE3f5Ci+gP2cTeg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hey.

I find the “Get rid of WALBufMappingLock” commit message misleading, 'cos the
lock is actually being replaced by a CV.

Should the subject be changed to “Replace WALBufMappingLock with
ConditionVariable” instead?

--
Victor Yegorov


From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 09:20:09
Message-ID: CALT9ZEH11gZJdrN=FnEVLtL7RjjAzN8HzL38mC477MbX32HiXg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, Victor!

On Mon, 17 Feb 2025 at 12:47, Victor Yegorov <vyegorov(at)gmail(dot)com> wrote:
>
> Hey.
>
> I find the “Get rid of WALBufMappingLock” commit message misleading, 'cos the lock is actually being replaced by a CV.
>
> Should the subject be changed to “Replace WALBufMappingLock with ConditionVariable” instead?

The patch replaces WALBufMappingLock with a lockless algorithm based
on atomic variables and a CV. Mentioning only the CV in the subject covers
only part of the implementation. Also, the header should reflect what is
done as a whole rather than the implementation details. So I'd rather see
a header like "Replace WALBufMappingLock by lockless algorithm" or
"Initialize WAL buffers concurrently without using WALBufMappingLock"
or something like that.
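
To make that concrete, here is a simplified sketch of the lockless scheme
(hedged: field names follow the thread, GetXLogBufferPage() is a hypothetical
helper, and the committed code additionally handles buffer eviction and
errors that this sketch omits):

XLogRecPtr  reserved;
int         idx;

/* Atomically reserve the next page to initialize; no global lock held. */
reserved = pg_atomic_fetch_add_u64(&XLogCtl->InitializeReserved, XLOG_BLCKSZ);
idx = XLogRecPtrToBufIdx(reserved);

/* Zero the page; many backends can do this for different pages at once. */
MemSet(GetXLogBufferPage(idx), 0, XLOG_BLCKSZ);     /* hypothetical helper */
pg_atomic_write_u64(&XLogCtl->xlblocks[idx], reserved + XLOG_BLCKSZ);

/*
 * Cooperatively advance InitializedUpTo: if our page is the next one in
 * line, push the boundary forward and wake up any waiters.
 */
if (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo,
                                   &reserved, reserved + XLOG_BLCKSZ))
    ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);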

Kind regards,
Pavel Borisov
Supabase


From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 09:24:06
Message-ID: CALT9ZEHTOmtkZfuLZ=oQBVR+ihtiEiS-fE311=mMay6qmPKhzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 17 Feb 2025 at 13:20, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com> wrote:
>
> [...]
> So I'd rather see a header like "Replace WALBufMappingLock by lockless
> algorithm" or "Initialize WAL buffers concurrently without using
> WALBufMappingLock" or something like that.
Update: I see the patch is already committed, so we're late with the
naming proposals. I don't see a problem with the existing commit message,
TBH.

Kind regards,
Pavel Borisov


From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 09:40:43
Message-ID: CALT9ZEGMpMS7wURqgR+8CCcMOudPDRPa4eM2fuARuPOx8949WA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, Kirill!
Per your report, I revised the comment to fix the typos, along with some
small grammar changes.

Kind regards,
Pavel Borisov

Attachment Content-Type Size
0001-Fix-typo-and-grammar-in-comment-introduced-by-6a2275.patch application/octet-stream 1.2 KB

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 09:44:25
Message-ID: CALT9ZEF7DZinXVRtc=52P7K1WvL89LqxhU=8K0K2BRpMxvt8wA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Oops, I sent the wrong patch as a fix.
The right one is attached.

Pavel

On Mon, 17 Feb 2025 at 13:40, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com> wrote:
>
> Hi, Kirill!
> Per your report, I revised the comment to fix the typos, along with some
> small grammar changes.
>
> Kind regards,
> Pavel Borisov

Attachment Content-Type Size
v2-0001-Fix-typo-and-grammar-in-comment-introduced-by-6a2.patch application/octet-stream 1.2 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
Cc: Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 10:34:47
Message-ID: CAPpHfdtxiT03yXGsc0_gpKV7jhLsmrBKHb2n_Pr6-Bx_JAqZ4g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 17, 2025 at 11:44 AM Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com> wrote:
> Oops, I send wrong patch as a fix.
> The right one is attached.
>
> Pavel

I've spotted the failure on the buildfarm.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2025-02-17%2008%3A05%3A03
I can't quickly guess the reason. I'm going to revert the patch for now;
then we can investigate.

------
Regards,
Alexander Korotkov
Supabase


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-17 16:25:05
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> I've spotted the failure on the buildfarm.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2025-02-17%2008%3A05%3A03
> I can't quickly guess the reason. I'm going to revert the patch for now;
> then we can investigate.

This timeout failure on hachi looks suspicious as well:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03

Might be relevant that they are both aarch64?

regards, tom lane


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-18 00:21:13
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote:
> This timeout failure on hachi looks suspicious as well:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03
>
> Might be relevant that they are both aarch64?

Just logged into the host. The logs of the timed-out run are still
around, and the last information I can see is from lastcommand.log,
which seems to have frozen in time when the timed-out run began its
vacuuming work:
ok 73 + index_including_gist 353 ms
# parallel group (16 tests): create_cast errors create_aggregate drop_if_exists infinite_recurse

gokiburi is on the same host, and it is currently frozen in time when
trying to fetch a WAL buffer. One of the stack traces:
#2 0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0,
cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at
latch.c:1571
#3 WaitEventSetWait (set=0xd34ce0, timeout=-1,
occurred_events=occurred_events(at)entry=0xffffffffadd8,
nevents=nevents(at)entry=1, wait_event_info=<optimized out>,
wait_event_info(at)entry=134217781) at latch.c:1519
#4 0x000000000084e964 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents(at)entry=33, timeout=timeout(at)entry=-1,
wait_event_info=wait_event_info(at)entry=134217781) at latch.c:538
#5 0x000000000085d2f8 in ConditionVariableTimedSleep
(cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at
condition_variable.c:163
#6 0x000000000085d1ec in ConditionVariableSleep
(cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98
#7 0x000000000055f4f4 in AdvanceXLInsertBuffer
(upto=upto(at)entry=112064880, tli=tli(at)entry=1, opportunistic=false) at
xlog.c:2224
#8 0x0000000000568398 in GetXLogBuffer (ptr=ptr(at)entry=112064880,
tli=tli(at)entry=1) at xlog.c:1710
#9 0x000000000055c650 in CopyXLogRecordToWAL (write_len=80,
isLogSwitch=false, rdata=0xcc49b0 <hdr_rdt>, StartPos=<optimized out>,
EndPos=<optimized out>, tli=1) at xlog.c:1245
#10 XLogInsertRecord (rdata=rdata(at)entry=0xcc49b0 <hdr_rdt>,
fpw_lsn=fpw_lsn(at)entry=112025520, flags=0 '\000', num_fpi=<optimized
out>, num_fpi(at)entry=0, topxid_included=false) at xlog.c:928
#11 0x000000000056b870 in XLogInsert (rmid=rmid(at)entry=16 '\020',
info=<optimized out>, info(at)entry=16 '\020') at xloginsert.c:523
#12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950,
state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>,
parent=<optimized out>,

So, yes, something looks really wrong with this patch. It sounds
plausible to me that some other buildfarm animals could be stuck
without their owners knowing about it. It's proving to be a good idea
to force a timeout value in the configuration file of these animals.
--
Michael


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-18 00:29:50
Message-ID: CAPpHfdsLtV5LwFp7UBv0Tt0oYbVv+QV8TnaCg-RBSMJNqTvyTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> [...]
>
> So, yes, something looks really wrong with this patch. It sounds
> plausible to me that some other buildfarm animals could be stuck
> without their owners knowing about it. It's proving to be a good idea
> to force a timeout value in the configuration file of these animals.

Tom, Michael, thank you for the information.
This patch will be better tested before the next attempt.

------
Regards,
Alexander Korotkov
Supabase


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-25 15:19:29
Message-ID: CAPpHfdsXCRg-arg3CFAYR4PXWTc7G1qY=2V7Yap3j83_5qe+Hg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 18, 2025 at 2:29 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> [...]
>
> Tom, Michael, thank you for the information.
> This patch will be better tested before the next attempt.

It seems that I managed to reproduce the issue on my Raspberry Pi 4.
After running our test suite in a loop for 2 days I found one timeout.

I have a hypothesis on why it might happen. We don't have protection
against two backends in parallel getting ReservedPtr mapped to a single
XLog buffer. I've talked to Yura off-list about that. He pointed out
that XLogWrite() should issue a PANIC in that case, which we didn't
observe. However, I'm not sure this analysis is complete.

One way or another, we need protection against this situation anyway.
The updated patch is attached. Now, after acquiring ReservedPtr, it
waits till OldPageRqstPtr gets initialized. Additionally, I had to
implement a more accurate calculation of OldPageRqstPtr. I'm running tests
with the new patch on my Raspberry Pi in a loop. Let's see how it goes.
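
For reference, a hedged sketch of the added wait (names follow the thread;
the OldPageRqstPtr computation shown here is a simplification, not the
committed code):

/*
 * The page that previously occupied the buffer slot we are about to
 * reuse ends one full ring of WAL buffers behind our reserved position.
 */
OldPageRqstPtr = ReservedPtr - (XLogRecPtr) XLOG_BLCKSZ * XLOGbuffers;

/* Wait until that old page has itself been fully initialized. */
while (pg_atomic_read_u64(&XLogCtl->InitializedUpTo) < OldPageRqstPtr)
    ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar,
                           WAIT_EVENT_WAL_BUFFER_INIT);
ConditionVariableCancelSleep();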

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v11-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 14.6 KB

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-26 01:03:57
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote:
> It seems that I managed to reproduce the issue on my Raspberry Pi 4.
> After running our test suite in a loop for 2 days I found one timeout.

Hmm. It's surprising not to see a higher occurrence. My buildfarm
host caught that on its first run after the patch, for two
different animals which are both on the same machine.

> One way or another, we need protection against this situation anyway.
> The updated patch is attached. Now, after acquiring ReservedPtr, it
> waits till OldPageRqstPtr gets initialized. Additionally, I had to
> implement a more accurate calculation of OldPageRqstPtr. I'm running tests
> with the new patch on my Raspberry Pi in a loop. Let's see how it goes.

Perhaps you'd prefer that I do more tests with your patch? This is
time-consuming for you. This is not a review of the internals of the
patch, and I cannot give you access to the host, but if my stuff is
the only place where we have good reproducibility of the issue, I'm
OK to grab some time and run a couple of checks to avoid another
freeze of the buildfarm.
--
Michael


From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Kirill Reshke <reshke(at)yandex-team(dot)ru>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-26 08:52:17
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On 25 Feb 2025, at 20:19, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
>

Hi!

One little piece of code looks suspicious to me. But I was not raising a concern because I see similar code everywhere in the codebase. But now Kirill has asked me to explain what is going on, and I cannot.

This seems to be relevant… so.

+ while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
// Assume ConditionVariableBroadcast() happened here, but before next line
+ ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT);
+ ConditionVariableCancelSleep();

Won’t this sleep wait forever?

I see about 20 other occurrences of similar code, so perhaps everything is fine. But I would greatly appreciate a few pointers on why it works.

Best regards, Andrey Borodin.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-26 11:48:47
Message-ID: CAPpHfdv9GrPDB6tziSzFgO3B-xdfrT4GrZt3aR_1=Zxtu6y8iw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, Michael!

On Wed, Feb 26, 2025 at 3:04 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> [...]
>
> Perhaps you'd prefer that I do more tests with your patch? This is
> time-consuming for you. This is not a review of the internals of the
> patch, and I cannot give you access to the host, but if my stuff is
> the only place where we have good reproducibility of the issue, I'm
> OK to grab some time and run a couple of checks to avoid another
> freeze of the buildfarm.

Thank you for offering to help. An updated version of the patch is attached
(I've added one memory barrier there just in case). I would
appreciate it if you could run it on batta, hachi, or similar hardware.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v12-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 14.6 KB

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-28 07:27:29
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 26, 2025 at 01:48:47PM +0200, Alexander Korotkov wrote:
> Thank you for offering to help. An updated version of the patch is attached
> (I've added one memory barrier there just in case). I would
> appreciate it if you could run it on batta, hachi, or similar hardware.

Doing a revert of the revert done in 3fb58625d18f proves that
reproducing the error is not really difficult. I've done a make
installcheck-world USE_MODULE_DB=1 -j N without assertions, and the
point that saw a failure quickly is one of the pgbench tests:
PANIC: could not find WAL buffer for 0/19D366

This one happened for the "concurrent OID generation" test and CREATE
TYPE. Of course, as it is a race condition, it is random, but it takes
me only a couple of minutes to see the original issue on my
buildfarm host. With assertions enabled, same story, and the same
failure from the pgbench TAP test.

That said, I have also done similar tests with your v12 for a couple
of hours and it looks stable under installcheck-world. I can see
that you've reworked quite a bit the surroundings of InitializedFrom
in this one. If you apply that once again at some point, the
buildfarm will be the judge in the long term, but I am rather confident
in saying that the situation looks better here, at least.

One thing I would consider doing if you want to gain confidence is
running tests like the one I saw causing problems with pgbench, with DDL
patterns stressing specific paths like this CREATE TYPE case.
--
Michael


From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Kirill Reshke <reshke(at)yandex-team(dot)ru>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-28 12:12:36
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

26.02.2025 11:52, Andrey Borodin wrote:
> [...]
>
> + while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
> // Assume ConditionVariableBroadcast() happened here, but before next line
> + ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT);
> + ConditionVariableCancelSleep();
>
> Won’t this sleep wait forever?

Because ConditionVariableSleep doesn't sleep the first time.
It just performs ConditionVariablePrepareToSleep and immediately returns.
So the actual condition of the `while` loop is checked at least twice before
going to sleep.
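
Annotated, the loop from the patch looks like this (a sketch; the comments
spell out the first-iteration behavior of condition_variable.c):

while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
{
    /*
     * On the first iteration, ConditionVariableSleep() merely calls
     * ConditionVariablePrepareToSleep(), which registers our latch with
     * the CV and returns without blocking; the while condition is then
     * re-checked.  From the second iteration on we really sleep, and any
     * ConditionVariableBroadcast() issued after our registration sets the
     * latch, so a wakeup cannot be lost.
     */
    ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar,
                           WAIT_EVENT_WAL_BUFFER_INIT);
}
ConditionVariableCancelSleep();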

>
> I see about 20 other occurrences of similar code, so perhaps everything is fine. But I would greatly appreciate a few pointers on why it works.

-------
regards
Yura Sokolov aka funny-falcon


From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-28 12:24:14
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

26.02.2025 14:48, Alexander Korotkov writes:
> [...]
>
> Thank you for offering to help. An updated version of the patch is attached
> (I've added one memory barrier there just in case). I would
> appreciate it if you could run it on batta, hachi, or similar hardware.

Good day, Alexander.

Checked your additions to the patch. They're clear and robust.

-------
regards
Yura Sokolov aka funny-falcon


From: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-28 13:13:23
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2025-Feb-28, Michael Paquier wrote:

> That said, I have also done similar tests with your v12 for a couple
> of hours and it looks stable under installcheck-world. I can see
> that you've reworked quite a bit the surroundings of InitializedFrom
> in this one. If you apply that once again at some point, the
> buildfarm will be the judge in the long term, but I am rather confident
> in saying that the situation looks better here, at least.

Heh, no amount of testing can prove lack of bugs; but for sure "it looks
different now, so it must be correct" must be the weakest proof of
correctness I've heard of!

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this." (Fotis)
https://postgr.es/m/[email protected]


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-28 13:43:58
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 28, 2025 at 02:13:23PM +0100, Álvaro Herrera wrote:
> On 2025-Feb-28, Michael Paquier wrote:
>> [...]
>
> Heh, no amount of testing can prove lack of bugs; but for sure "it looks
> different now, so it must be correct" must be the weakest proof of
> correctness I've heard of!

Err, okay. I did use the word "stable" with tests rather than
"correct", and I implied upthread that I did not check the correctness
nor the internals of the patch. If my words held the meaning you
are implying, well, my apologies for the confusion, I guess. I only
tested the patch, and it was stable, though I noticed a few diffs from
the previous version; but I did *not* check its internals at all, nor
do I mean that I endorse its logic. I hope that's clear now.
--
Michael


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-02 11:58:36
Message-ID: CAPpHfdsWd1b7ENyiJZYhhp2_vTCRQm=2np3iDdqA=LmeXUP+jQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> On 2025-Feb-28, Michael Paquier wrote:
>
> > Saying that, I have also done similar tests with your v12 for a couple
> > of hours and this looks stable under installcheck-world. I can see
> > that you've reworked the surroundings of InitializedFrom quite a bit
> > in this one. If you apply that once again at some point, the
> > buildfarm will be the judge in the long term, but I am rather confident
> > in saying that the situation looks better here, at least.
>
> Heh, no amount of testing can prove lack of bugs; but for sure "it looks
> different now, so it must be correct" must be the weakest proof of
> correctness I've heard of!

Michael just volunteered to help Yura and me with testing. He wasn't
intended to be a reviewer. And he reported that the tests look much
more stable now. I think he is absolutely correct about this.

------
Regards,
Alexander Korotkov
Supabase


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-02 11:59:17
Message-ID: CAPpHfdua1SV8EFhQLsZ5WNZ5uywxXrLu9tWthT-=_q97w5kv-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 28, 2025 at 3:44 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Fri, Feb 28, 2025 at 02:13:23PM +0100, Álvaro Herrera wrote:
> > On 2025-Feb-28, Michael Paquier wrote:
> >> Saying that, I have also done similar tests with your v12 for a couple
> >> of hours and this looks stable under installcheck-world. I can see
> >> that you've reworked the surroundings of InitializedFrom quite a bit
> >> in this one. If you apply that once again at some point, the
> >> buildfarm will be the judge in the long term, but I am rather confident
> >> in saying that the situation looks better here, at least.
> >
> > Heh, no amount of testing can prove lack of bugs; but for sure "it looks
> > different now, so it must be correct" must be the weakest proof of
> > correctness I've heard of!
>
> Err, okay. I did use the word "stable" with tests rather than
> "correct", and I implied upthread that I did not check the correctness
> nor the internals of the patch. If my words held the meaning you
> are implying, well, my apologies for the confusion, I guess. I only
> tested the patch, and it was stable, though I noticed a few diffs from
> the previous version; but I did *not* check its internals at all, nor
> do I mean that I endorse its logic. I hope that's clear now.

Got it. Michael, thank you very much for your help.

------
Regards,
Alexander Korotkov
Supabase


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-07 15:08:47
Message-ID: CAPpHfdsWcQb-u-9K=ipneBf8CMhoUuBWKYc+XWJEHVdtONOepQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 2, 2025 at 1:58 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> > On 2025-Feb-28, Michael Paquier wrote:
> >
> > > Saying that, I have also done similar tests with your v12 for a couple
> > > of hours and this looks stable under installcheck-world. I can see
> > > that you've reworked the surroundings of InitializedFrom quite a bit
> > > in this one. If you apply that once again at some point, the
> > > buildfarm will be the judge in the long term, but I am rather confident
> > > in saying that the situation looks better here, at least.
> >
> > Heh, no amount of testing can prove lack of bugs; but for sure "it looks
> > different now, so it must be correct" must be the weakest proof of
> > correctness I've heard of!
>
> Michael just volunteered to help Yura and me with testing. He wasn't
> intended to be a reviewer. And he reported that the tests look much
> more stable now. I think he is absolutely correct about this.

Nevertheless, I don't think the bug is gone in v12. I managed to
reproduce it on my local Raspberry PI 4. The attached version of the
patch fixes the bug for me. It adds memory barriers surrounding
pg_atomic_compare_exchange_u64(). That's certainly not right, given
that this function should already provide full memory barrier
semantics. But my investigation shows it doesn't. I'm going to start
a separate thread about this.
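
For illustration, the pattern is roughly the following (a minimal
sketch, not the actual patch hunk; it assumes the patched
XLogCtl->InitializedUpTo is a pg_atomic_uint64, and the helper name is
mine):

    #include "postgres.h"
    #include "port/atomics.h"

    static bool
    advance_initialized_upto(pg_atomic_uint64 *upto,
                             uint64 expected, uint64 newval)
    {
        bool    success;

        pg_memory_barrier();    /* order earlier page-init stores */
        success = pg_atomic_compare_exchange_u64(upto, &expected, newval);
        pg_memory_barrier();    /* full barrier even if the CAS failed */

        return success;
    }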

Also, the new version of the patch contains a fix for a potential
integer overflow during the OldPageRqstPtr computation, sent to me
off-list by Yura.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v13-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 14.7 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-13 21:37:59
Message-ID: CAPpHfduKqgkOmEq1jKz_fcm6O39i-C=awhg=9Ed0CoTmX+W-eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 7, 2025 at 5:08 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Sun, Mar 2, 2025 at 1:58 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> >
> > On Fri, Feb 28, 2025 at 3:13 PM Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> > > On 2025-Feb-28, Michael Paquier wrote:
> > >
> > > > Saying that, I have also done similar tests with your v12 for a couple
> > > > of hours and this looks stable under installcheck-world. I can see
> > > > that you've reworked the surroundings of InitializedFrom quite a bit
> > > > in this one. If you apply that once again at some point, the
> > > > buildfarm will be the judge in the long term, but I am rather confident
> > > > in saying that the situation looks better here, at least.
> > >
> > > Heh, no amount of testing can prove lack of bugs; but for sure "it looks
> > > different now, so it must be correct" must be the weakest proof of
> > > correctness I've heard of!
> >
> > Michael just volunteered to help Yura and me with testing. He wasn't
> > intended to be a reviewer. And he reported that the tests look much
> > more stable now. I think he is absolutely correct about this.
>
> Nevertheless, I don't think the bug is gone in v12. I managed to
> reproduce it on my local Raspberry PI 4. The attached version of the
> patch fixes the bug for me. It adds memory barriers surrounding
> pg_atomic_compare_exchange_u64(). That's certainly not right, given
> that this function should already provide full memory barrier
> semantics. But my investigation shows it doesn't. I'm going to start
> a separate thread about this.
>
> Also, the new version of the patch contains a fix for a potential
> integer overflow during the OldPageRqstPtr computation, sent to me
> off-list by Yura.

So, as we finally clarified, CAS doesn't guarantee a full memory
barrier on failure. Also, it's not clear which barriers are guaranteed
on success. On ARM without the LSE implementation, a read barrier is
provided before the change of the value and a write barrier after it.
So, it appears that both explicit barriers I've added are required.
This revision also comes with a formal proof of the algorithm.
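
In portable C11 terms, the fix amounts to the following (my
illustration only, not PostgreSQL code): explicit seq_cst fences
provide full-barrier semantics around the CAS regardless of its
outcome, which the CAS alone does not guarantee, in particular on its
failure path.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    static _Atomic uint64_t initialized_upto;

    static bool
    try_advance(uint64_t *expected, uint64_t newval)
    {
        bool    ok;

        atomic_thread_fence(memory_order_seq_cst);
        ok = atomic_compare_exchange_strong(&initialized_upto,
                                            expected, newval);
        atomic_thread_fence(memory_order_seq_cst);
        return ok;
    }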

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v14-0001-Get-rid-of-WALBufMappingLock.patch application/octet-stream 16.7 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-14 14:30:01
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I've briefly looked at this patch this week, and done a bit of testing.
I don't have any comments about the correctness - it does seem correct
to me and I haven't noticed any crashes/issues, but I'm not familiar
enough with the WALBufMappingLock to have insightful opinions.

I have however decided to do a bit of benchmarking, to better understand
the possible benefits of the change. I happen to have access to an Azure
machine with 2x AMD EPYC 9V33X (176 cores in total), and an NVMe SSD
that can do ~1.5GB/s.

The benchmark script (attached) uses the workload mentioned by Andres
some time ago [1]

SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));

with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
results look like this (this is throughput):

| 8 | 64 | 1024
clients | master patched | master patched | master patched
---------------------------------------------------------------------
1 | 11864 12035 | 7419 7345 | 968 940
4 | 26311 26919 | 12414 12308 | 1304 1293
8 | 38742 39651 | 14316 14539 | 1348 1348
16 | 57299 59917 | 15405 15871 | 1304 1279
32 | 74857 82598 | 17589 17126 | 1233 1233
48 | 87596 95495 | 18616 18160 | 1199 1227
64 | 89982 97715 | 19033 18910 | 1196 1221
96 | 92853 103448 | 19694 19706 | 1190 1210
128 | 95392 103324 | 20085 19873 | 1188 1213
160 | 94933 102236 | 20227 20323 | 1180 1214
196 | 95933 103341 | 20448 20513 | 1188 1199

To put this into perspective, this is the throughput relative to master:

clients | 8 64 1024
----------------------------------
1 | 101% 99% 97%
4 | 102% 99% 99%
8 | 102% 102% 100%
16 | 105% 103% 98%
32 | 110% 97% 100%
48 | 109% 98% 102%
64 | 109% 99% 102%
96 | 111% 100% 102%
128 | 108% 99% 102%
160 | 108% 100% 103%
196 | 108% 100% 101%

That does not seem like a huge improvement :-( Yes, there's 1-10%
speedup for the small (8K) size, but for larger chunks it's a wash.

Looking at the pgbench progress, I noticed stuff like this:

...
progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
...

i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
things (it's a virtual SSD, I'm not sure what model is behind the
curtains). So I decided to try running the benchmark on tmpfs, to get
the storage out of the way and get the "best case" results.

This makes the pgbench progress perfectly "smooth" (no jumps like in the
output above), and the comparison looks like this:

| 8 | 64 | 1024
clients | master patched | master patched | master patched
---------|---------------------|--------------------|----------------
1 | 32449 32032 | 19289 20344 | 3108 3081
4 | 68779 69256 | 24585 29912 | 2915 3449
8 | 79787 100655 | 28217 39217 | 3182 4086
16 | 113024 148968 | 42969 62083 | 5134 5712
32 | 125884 170678 | 44256 71183 | 4910 5447
48 | 125571 166695 | 44693 76411 | 4717 5215
64 | 122096 160470 | 42749 83754 | 4631 5103
96 | 120170 154145 | 42696 86529 | 4556 5020
128 | 119204 152977 | 40880 88163 | 4529 5047
160 | 116081 152708 | 42263 88066 | 4512 5000
196 | 115364 152455 | 40765 88602 | 4505 4952

and the comparison to master:

clients 8 64 1024
-----------------------------------------
1 99% 105% 99%
4 101% 122% 118%
8 126% 139% 128%
16 132% 144% 111%
32 136% 161% 111%
48 133% 171% 111%
64 131% 196% 110%
96 128% 203% 110%
128 128% 216% 111%
160 132% 208% 111%
196 132% 217% 110%

Yes, with tmpfs the impact looks much more significant. For 8K the
speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.

That being said, I wonder how big the impact is for practical workloads.
ISTM this workload is pretty narrow / extreme; it'd be much easier if we
had an example of a more realistic workload benefiting from this. Of
course, it may be the case that there are multiple related bottlenecks,
and we'd need to fix all of them - in which case it'd be silly to block
the improvements on the grounds that it alone does not help.

Another thought is that this is testing the "good case". Can anyone
think of a workload that would be made worse by the patch?

regards

--
Tomas Vondra

Attachment Content-Type Size
wal-lock-test.sh application/x-shellscript 976 bytes
patched-tmpfs.csv text/csv 2.1 KB
master-tmpfs.csv text/csv 2.1 KB
master-ssd.csv text/csv 3.4 KB
patched-ssd.csv text/csv 3.4 KB

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-21 11:02:19
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Good day, Tomas

14.03.2025 17:30, Tomas Vondra wrote:
>
> Yes, with tmpfs the impact looks much more significant. For 8K the
> speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
>
>
> That being said, I wonder how big the impact is for practical workloads.
> ISTM this workload is pretty narrow / extreme; it'd be much easier if we
> had an example of a more realistic workload benefiting from this. Of
> course, it may be the case that there are multiple related bottlenecks,
> and we'd need to fix all of them - in which case it'd be silly to block
> the improvements on the grounds that it alone does not help.

Yes, I found this bottleneck when I did experiments with increasing
NUM_XLOGINSERT_LOCKS [1]. For this patch to be more valuable, there
should be more parallel xlog inserters.

That is why I initially paired this patch with a patch that reduces
contention on WALInsertLocks ("0002-Several attempts to lock
WALInsertLock", last version at [2]).

Certainly, the largest bottleneck is WALWriteLock around writing buffers
and especially fsync-ing them after the write. But this intermediate
WALBufMappingLock bottleneck is also worth removing.

[1]
https://postgr.es/m/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
[2] https://postgr.es/m/c31158a3-7c26-4b26-90df-2df8f7bbe736%40postgrespro.ru

-------
regards
Yura Sokolov aka funny-falcon


From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-31 10:42:01
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Good day,

14.03.2025 17:30, Tomas Vondra wrote:
> Hi,
>
> I've briefly looked at this patch this week, and done a bit of testing.
> I don't have any comments about the correctness - it does seem correct
> to me and I haven't noticed any crashes/issues, but I'm not familiar
> enough with the WALBufMappingLock to have insightful opinions.
>
> I have however decided to do a bit of benchmarking, to better understand
> the possible benefits of the change. I happen to have access to an Azure
> machine with 2x AMD EPYC 9V33X (176 cores in total), and an NVMe SSD
> that can do ~1.5GB/s.
>
> The benchmark script (attached) uses the workload mentioned by Andres
> some time ago [1]
>
> SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));
>
> with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
> results look like this (this is throughput):
>
> | 8 | 64 | 1024
> clients | master patched | master patched | master patched
> ---------------------------------------------------------------------
> 1 | 11864 12035 | 7419 7345 | 968 940
> 4 | 26311 26919 | 12414 12308 | 1304 1293
> 8 | 38742 39651 | 14316 14539 | 1348 1348
> 16 | 57299 59917 | 15405 15871 | 1304 1279
> 32 | 74857 82598 | 17589 17126 | 1233 1233
> 48 | 87596 95495 | 18616 18160 | 1199 1227
> 64 | 89982 97715 | 19033 18910 | 1196 1221
> 96 | 92853 103448 | 19694 19706 | 1190 1210
> 128 | 95392 103324 | 20085 19873 | 1188 1213
> 160 | 94933 102236 | 20227 20323 | 1180 1214
> 196 | 95933 103341 | 20448 20513 | 1188 1199
>
> To put this into perspective, this is the throughput relative to master:
>
> clients | 8 64 1024
> ----------------------------------
> 1 | 101% 99% 97%
> 4 | 102% 99% 99%
> 8 | 102% 102% 100%
> 16 | 105% 103% 98%
> 32 | 110% 97% 100%
> 48 | 109% 98% 102%
> 64 | 109% 99% 102%
> 96 | 111% 100% 102%
> 128 | 108% 99% 102%
> 160 | 108% 100% 103%
> 196 | 108% 100% 101%
>
> That does not seem like a huge improvement :-( Yes, there's 1-10%
> speedup for the small (8K) size, but for larger chunks it's a wash.
>
> Looking at the pgbench progress, I noticed stuff like this:
>
> ...
> progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
> progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
> progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
> progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
> progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
> progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
> progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
> progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
> progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
> progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
> progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
> progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
> progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
> progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
> progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
> ...
>
> i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
> things (it's a virtual SSD, I'm not sure what model is behind the
> curtains). So I decided to try running the benchmark on tmpfs, to get
> the storage out of the way and get the "best case" results.
>
> This makes the pgbench progress perfectly "smooth" (no jumps like in the
> output above), and the comparison looks like this:
>
> | 8 | 64 | 1024
> clients | master patched | master patched | master patched
> ---------|---------------------|--------------------|----------------
> 1 | 32449 32032 | 19289 20344 | 3108 3081
> 4 | 68779 69256 | 24585 29912 | 2915 3449
> 8 | 79787 100655 | 28217 39217 | 3182 4086
> 16 | 113024 148968 | 42969 62083 | 5134 5712
> 32 | 125884 170678 | 44256 71183 | 4910 5447
> 48 | 125571 166695 | 44693 76411 | 4717 5215
> 64 | 122096 160470 | 42749 83754 | 4631 5103
> 96 | 120170 154145 | 42696 86529 | 4556 5020
> 128 | 119204 152977 | 40880 88163 | 4529 5047
> 160 | 116081 152708 | 42263 88066 | 4512 5000
> 196 | 115364 152455 | 40765 88602 | 4505 4952
>
> and the comparison to master:
>
> clients 8 64 1024
> -----------------------------------------
> 1 99% 105% 99%
> 4 101% 122% 118%
> 8 126% 139% 128%
> 16 132% 144% 111%
> 32 136% 161% 111%
> 48 133% 171% 111%
> 64 131% 196% 110%
> 96 128% 203% 110%
> 128 128% 216% 111%
> 160 132% 208% 111%
> 196 132% 217% 110%
>
> Yes, with tmpfs the impact looks much more significant. For 8K the
> speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
>
>
> That being said, I wonder how big the impact is for practical workloads.
> ISTM this workload is pretty narrow / extreme; it'd be much easier if we
> had an example of a more realistic workload benefiting from this. Of
> course, it may be the case that there are multiple related bottlenecks,
> and we'd need to fix all of them - in which case it'd be silly to block
> the improvements on the grounds that it alone does not help.
>
> Another thought is that this is testing the "good case". Can anyone
> think of a workload that would be made worse by the patch?

I've made a similar benchmark on a system with two Xeon Gold 5220R CPUs
and two Samsung SSD 970 PRO 1TB drives mirrored by md.

Configuration changes:
wal_sync_method = open_datasync
full_page_writes = off
synchronous_commit = off
checkpoint_timeout = 1d
max_connections = 1000
max_wal_size = 4GB
min_wal_size = 640MB

I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB
and 1GB) and the record size (1kB, 8kB and 64kB).

(I didn't bench 1MB record size, since I don't believe it is critical
for performance.)

Here are the results for 64MB segment size and 1GB wal_buffers:

+---------+---------+------------+--------------+----------+
| recsize | clients | master_tps | nowalbuf_tps | rel_perf |
+---------+---------+------------+--------------+----------+
| 1 | 1 | 47991.0 | 46995.0 | 0.98 |
| 1 | 4 | 171930.0 | 171166.0 | 1.0 |
| 1 | 16 | 491240.0 | 485132.0 | 0.99 |
| 1 | 64 | 514590.0 | 515534.0 | 1.0 |
| 1 | 128 | 547222.0 | 543543.0 | 0.99 |
| 1 | 256 | 543353.0 | 540802.0 | 1.0 |
| 8 | 1 | 40976.0 | 41603.0 | 1.02 |
| 8 | 4 | 89003.0 | 92008.0 | 1.03 |
| 8 | 16 | 90457.0 | 92282.0 | 1.02 |
| 8 | 64 | 89293.0 | 92022.0 | 1.03 |
| 8 | 128 | 92687.0 | 92768.0 | 1.0 |
| 8 | 256 | 91874.0 | 91665.0 | 1.0 |
| 64 | 1 | 11829.0 | 12031.0 | 1.02 |
| 64 | 4 | 11959.0 | 12832.0 | 1.07 |
| 64 | 16 | 11331.0 | 13417.0 | 1.18 |
| 64 | 64 | 11108.0 | 13588.0 | 1.22 |
| 64 | 128 | 11089.0 | 13648.0 | 1.23 |
| 64 | 256 | 10381.0 | 13542.0 | 1.3 |
+---------+---------+------------+--------------+----------+

Numbers for all configurations are in the attached 'improvements.out'.
They show that removing WALBufMappingLock almost never harms performance
and usually gives a measurable gain.

(Numbers are the average of the 4 middle runs out of 6, i.e. I threw
away the minimum and maximum tps of the 6 runs and took the average of
the remaining ones.)
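
In other words, a simple trimmed mean; a trivial sketch in C (my
illustration only, not part of the benchmark scripts):

    #include <stdlib.h>

    static int
    cmp_double(const void *a, const void *b)
    {
        double  da = *(const double *) a;
        double  db = *(const double *) b;

        return (da > db) - (da < db);
    }

    /* drop the min and max of the 6 runs, average the middle 4 */
    static double
    trimmed_mean6(double tps[6])
    {
        double  sum = 0.0;
        int     i;

        qsort(tps, 6, sizeof(double), cmp_double);
        for (i = 1; i < 5; i++)
            sum += tps[i];
        return sum / 4.0;
    }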

Also, an sqlite database with all the results is attached. It also
contains results for the patch "Several attempts to lock WALInsertLock"
(named "attempts") and the cumulative patch ("nowalbuf-attempts").
Surprisingly, "Several attempts" causes a measurable impact in some
configurations with hundreds of clients. So, there are more bottlenecks
ahead ))

Yes, it is still not a "real-world" benchmark. But it at least shows
the patch is harmless.

--
regards
Yura Sokolov aka funny-falcon

Attachment Content-Type Size
improvements.out text/plain 8.6 KB
results.db.zst application/zstd 37.1 KB

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-31 18:18:30
Message-ID: CAPpHfdtOBgoUfGiw9exZPpS7e4EE7SdEBRt9VDfgYJpC2Jd5DA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 31, 2025 at 1:42 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> 14.03.2025 17:30, Tomas Vondra wrote:
> > Hi,
> >
> > I've briefly looked at this patch this week, and done a bit of testing.
> > I don't have any comments about the correctness - it does seem correct
> > to me and I haven't noticed any crashes/issues, but I'm not familiar
> > enough with the WALBufMappingLock to have insightful opinions.
> >
> > I have however decided to do a bit of benchmarking, to better understand
> > the possible benefits of the change. I happen to have access to an Azure
> > machine with 2x AMD EPYC 9V33X (176 cores in total), and an NVMe SSD
> > that can do ~1.5GB/s.
> >
> > The benchmark script (attached) uses the workload mentioned by Andres
> > some time ago [1]
> >
> > SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));
> >
> > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
> > results look like this (this is throughput):
> >
> > | 8 | 64 | 1024
> > clients | master patched | master patched | master patched
> > ---------------------------------------------------------------------
> > 1 | 11864 12035 | 7419 7345 | 968 940
> > 4 | 26311 26919 | 12414 12308 | 1304 1293
> > 8 | 38742 39651 | 14316 14539 | 1348 1348
> > 16 | 57299 59917 | 15405 15871 | 1304 1279
> > 32 | 74857 82598 | 17589 17126 | 1233 1233
> > 48 | 87596 95495 | 18616 18160 | 1199 1227
> > 64 | 89982 97715 | 19033 18910 | 1196 1221
> > 96 | 92853 103448 | 19694 19706 | 1190 1210
> > 128 | 95392 103324 | 20085 19873 | 1188 1213
> > 160 | 94933 102236 | 20227 20323 | 1180 1214
> > 196 | 95933 103341 | 20448 20513 | 1188 1199
> >
> > To put this into perspective, this is the throughput relative to master:
> >
> > clients | 8 64 1024
> > ----------------------------------
> > 1 | 101% 99% 97%
> > 4 | 102% 99% 99%
> > 8 | 102% 102% 100%
> > 16 | 105% 103% 98%
> > 32 | 110% 97% 100%
> > 48 | 109% 98% 102%
> > 64 | 109% 99% 102%
> > 96 | 111% 100% 102%
> > 128 | 108% 99% 102%
> > 160 | 108% 100% 103%
> > 196 | 108% 100% 101%
> >
> > That does not seem like a huge improvement :-( Yes, there's 1-10%
> > speedup for the small (8K) size, but for larger chunks it's a wash.
> >
> > Looking at the pgbench progress, I noticed stuff like this:
> >
> > ...
> > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
> > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
> > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
> > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
> > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
> > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
> > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
> > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
> > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
> > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
> > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
> > ...
> >
> > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
> > things (it's a virtual SSD, I'm not sure what model is behind the
> > curtains). So I decided to try running the benchmark on tmpfs, to get
> > the storage out of the way and get the "best case" results.
> >
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the
> > output above), and the comparison looks like this:
> >
> > | 8 | 64 | 1024
> > clients | master patched | master patched | master patched
> > ---------|---------------------|--------------------|----------------
> > 1 | 32449 32032 | 19289 20344 | 3108 3081
> > 4 | 68779 69256 | 24585 29912 | 2915 3449
> > 8 | 79787 100655 | 28217 39217 | 3182 4086
> > 16 | 113024 148968 | 42969 62083 | 5134 5712
> > 32 | 125884 170678 | 44256 71183 | 4910 5447
> > 48 | 125571 166695 | 44693 76411 | 4717 5215
> > 64 | 122096 160470 | 42749 83754 | 4631 5103
> > 96 | 120170 154145 | 42696 86529 | 4556 5020
> > 128 | 119204 152977 | 40880 88163 | 4529 5047
> > 160 | 116081 152708 | 42263 88066 | 4512 5000
> > 196 | 115364 152455 | 40765 88602 | 4505 4952
> >
> > and the comparison to master:
> >
> > clients 8 64 1024
> > -----------------------------------------
> > 1 99% 105% 99%
> > 4 101% 122% 118%
> > 8 126% 139% 128%
> > 16 132% 144% 111%
> > 32 136% 161% 111%
> > 48 133% 171% 111%
> > 64 131% 196% 110%
> > 96 128% 203% 110%
> > 128 128% 216% 111%
> > 160 132% 208% 111%
> > 196 132% 217% 110%
> >
> > Yes, with tmpfs the impact looks much more significant. For 8K the
> > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
> >
> >
> > That being said, I wonder how big the impact is for practical workloads.
> > ISTM this workload is pretty narrow / extreme; it'd be much easier if we
> > had an example of a more realistic workload benefiting from this. Of
> > course, it may be the case that there are multiple related bottlenecks,
> > and we'd need to fix all of them - in which case it'd be silly to block
> > the improvements on the grounds that it alone does not help.
> >
> > Another thought is that this is testing the "good case". Can anyone
> > think of a workload that would be made worse by the patch?
>
> I've made a similar benchmark on a system with two Xeon Gold 5220R CPUs
> and two Samsung SSD 970 PRO 1TB drives mirrored by md.
>
> Configuration changes:
> wal_sync_method = open_datasync
> full_page_writes = off
> synchronous_commit = off
> checkpoint_timeout = 1d
> max_connections = 1000
> max_wal_size = 4GB
> min_wal_size = 640MB
>
> I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB
> and 1GB) and the record size (1kB, 8kB and 64kB).
>
> (I didn't bench 1MB record size, since I don't believe it is critical
> for performance.)
>
> Here are the results for 64MB segment size and 1GB wal_buffers:
>
> +---------+---------+------------+--------------+----------+
> | recsize | clients | master_tps | nowalbuf_tps | rel_perf |
> +---------+---------+------------+--------------+----------+
> | 1 | 1 | 47991.0 | 46995.0 | 0.98 |
> | 1 | 4 | 171930.0 | 171166.0 | 1.0 |
> | 1 | 16 | 491240.0 | 485132.0 | 0.99 |
> | 1 | 64 | 514590.0 | 515534.0 | 1.0 |
> | 1 | 128 | 547222.0 | 543543.0 | 0.99 |
> | 1 | 256 | 543353.0 | 540802.0 | 1.0 |
> | 8 | 1 | 40976.0 | 41603.0 | 1.02 |
> | 8 | 4 | 89003.0 | 92008.0 | 1.03 |
> | 8 | 16 | 90457.0 | 92282.0 | 1.02 |
> | 8 | 64 | 89293.0 | 92022.0 | 1.03 |
> | 8 | 128 | 92687.0 | 92768.0 | 1.0 |
> | 8 | 256 | 91874.0 | 91665.0 | 1.0 |
> | 64 | 1 | 11829.0 | 12031.0 | 1.02 |
> | 64 | 4 | 11959.0 | 12832.0 | 1.07 |
> | 64 | 16 | 11331.0 | 13417.0 | 1.18 |
> | 64 | 64 | 11108.0 | 13588.0 | 1.22 |
> | 64 | 128 | 11089.0 | 13648.0 | 1.23 |
> | 64 | 256 | 10381.0 | 13542.0 | 1.3 |
> +---------+---------+------------+--------------+----------+
>
> Numbers for all configurations are in the attached 'improvements.out'.
> They show that removing WALBufMappingLock almost never harms performance
> and usually gives a measurable gain.
>
> (Numbers are the average of the 4 middle runs out of 6, i.e. I threw
> away the minimum and maximum tps of the 6 runs and took the average of
> the remaining ones.)
>
> Also, an sqlite database with all the results is attached. It also
> contains results for the patch "Several attempts to lock WALInsertLock"
> (named "attempts") and the cumulative patch ("nowalbuf-attempts").
> Surprisingly, "Several attempts" causes a measurable impact in some
> configurations with hundreds of clients. So, there are more bottlenecks
> ahead ))
>
>
> Yes, it is still not a "real-world" benchmark. But it at least shows
> the patch is harmless.

Thank you for your experiments. Your results show up to 30% speedup
on real hardware, not tmpfs. While this is still a corner case, I
think this is quite a result for a pretty local optimization. At
small connection counts there are some cases above and below 1.0. I
think this is due to statistical error. If we calculate the average
tps ratio across different experiments, it is still above 1.0 even
for low numbers of clients.

sqlite> select clients, avg(ratio)
          from (select walseg, walbuf, recsize, clients,
                       (avg(tps) filter (where branch = 'nowalbuf')) /
                       (avg(tps) filter (where branch = 'master')) as ratio
                  from results
                 where branch in ('master', 'nowalbuf')
                 group by walseg, walbuf, recsize, clients) x
         group by clients;
1|1.00546614169766
4|1.00782085856889
16|1.02257892337757
64|1.04400167838906
128|1.04134006876033
256|1.04627949500578

I'm going to push the first patch ("nowalbuf") if there are no
objections. I think the second one ("Several attempts") still needs
more work, as there are regressions.

------
Regards,
Alexander Korotkov
Supabase