Quick Links

sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum

Lists:	pgsql-hackers

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	"pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-02-03 12:05:30
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Good day,

Investigating some performance issues of a client, our engineers found
msgnumLock to be contended.

Looking closer it is obvious it is not needed at all: it used only as
memory barrier. It is even stated in comment at file start:

* We deal with that by having a spinlock that readers must take for just
* long enough to read maxMsgNum, while writers take it for just long enough
* to write maxMsgNum. (The exact rule is that you need the spinlock to
* read maxMsgNum if you are not holding SInvalWriteLock, and you need the
* spinlock to write maxMsgNum unless you are holding both locks.)
*
* Note: since maxMsgNum is an int and hence presumably atomically readable/
* writable, the spinlock might seem unnecessary. The reason it is needed
* is to provide a memory barrier: we need to be sure that messages written
* to the array are actually there before maxMsgNum is increased, and that
* readers will see that data after fetching maxMsgNum.

So we changed maxMsgNum to be pg_atomic_uint32, and put appropriate memory
barriers:
- pg_write_barrier() before write to maxMsgNum (when only SInvalWriteLock
is held)
- pg_read_barrier() after read of maxMsgNum (when only SInvalReadLock is held)

It improved situation for our client.

Note: pg_(write|read)_barrier() is chosen instead of
pg_atomic_(read|write)_membarrier_u32() because it certainly cheaper at
least on x86_64 where it is translated to just a compiler barrier (empty
asm statement).
At least pg_atomic_read_membarrier_u32() is implemented currently as a
write operation, that's not good for contended place.

-----

regards
Yura Sokolov aka funny-falcon

Attachment	Content-Type	Size
v0-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch	text/x-patch	6.3 KB

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-02-03 16:49:59
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/02/2025 13:05, Yura Sokolov wrote:
> Investigating some performance issues of a client, our engineers found
> msgnumLock to be contended.
>
> Looking closer it is obvious it is not needed at all: it used only as
> memory barrier. It is even stated in comment at file start:
>
> * We deal with that by having a spinlock that readers must take for just
> * long enough to read maxMsgNum, while writers take it for just long enough
> * to write maxMsgNum. (The exact rule is that you need the spinlock to
> * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
> * spinlock to write maxMsgNum unless you are holding both locks.)
> *
> * Note: since maxMsgNum is an int and hence presumably atomically readable/
> * writable, the spinlock might seem unnecessary. The reason it is needed
> * is to provide a memory barrier: we need to be sure that messages written
> * to the array are actually there before maxMsgNum is increased, and that
> * readers will see that data after fetching maxMsgNum.
>
> So we changed maxMsgNum to be pg_atomic_uint32, and put appropriate memory
> barriers:
> - pg_write_barrier() before write to maxMsgNum (when only SInvalWriteLock
> is held)
> - pg_read_barrier() after read of maxMsgNum (when only SInvalReadLock is held)
>
> It improved situation for our client.
>
> Note: pg_(write|read)_barrier() is chosen instead of
> pg_atomic_(read|write)_membarrier_u32() because it certainly cheaper at
> least on x86_64 where it is translated to just a compiler barrier (empty
> asm statement).
> At least pg_atomic_read_membarrier_u32() is implemented currently as a
> write operation, that's not good for contended place.

Makes sense.

> @@ -640,8 +626,12 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
> */
> if (min >= MSGNUMWRAPAROUND)
> {
> - segP->minMsgNum -= MSGNUMWRAPAROUND;
> - segP->maxMsgNum -= MSGNUMWRAPAROUND;
> + /*
> + * we don't need memory barrier here, but using sub_fetch is just
> + * simpler than read_u32+write_u32 pair, and this place is not
> + * contented.
> + */
> + pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
> for (i = 0; i < segP->numProcs; i++)
> segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
> }

Did you lose the 'segP->minMsgNum -= MSGNUMWRAPAROUND;' here? Do we have
any tests for the wraparound?

Now that maxMsgNum is unsigned, should we switch to uint32 for minMsgNum
and nextThreshold for consistency? They still don't need to be atomic
IIRC, they're protected by SInvalReadLock and SInvalWriteLock.

I kind of wonder why we need that MSGNUMWRAPAROUND limit at all; can't
we just let the integer wrap around naturally? (after switching to
unsigned, that is). That doesn't need to be part of this patch though,
it can be done separately, if it's worthwhile..
.
--
Heikki Linnakangas
Neon (https://neon.tech)

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-02-04 10:36:39
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

03.02.2025 19:49, Heikki Linnakangas пишет:
> On 03/02/2025 13:05, Yura Sokolov wrote:
>> Investigating some performance issues of a client, our engineers found
>> msgnumLock to be contended.
>>
>> Looking closer it is obvious it is not needed at all: it used only as
>> memory barrier. It is even stated in comment at file start:
>>
>> * We deal with that by having a spinlock that readers must take for just
>> * long enough to read maxMsgNum, while writers take it for just long enough
>> * to write maxMsgNum. (The exact rule is that you need the spinlock to
>> * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
>> * spinlock to write maxMsgNum unless you are holding both locks.)
>> *
>> * Note: since maxMsgNum is an int and hence presumably atomically readable/
>> * writable, the spinlock might seem unnecessary. The reason it is needed
>> * is to provide a memory barrier: we need to be sure that messages written
>> * to the array are actually there before maxMsgNum is increased, and that
>> * readers will see that data after fetching maxMsgNum.
>>
>> So we changed maxMsgNum to be pg_atomic_uint32, and put appropriate memory
>> barriers:
>> - pg_write_barrier() before write to maxMsgNum (when only SInvalWriteLock
>> is held)
>> - pg_read_barrier() after read of maxMsgNum (when only SInvalReadLock is held)
>>
>> It improved situation for our client.
>>
>> Note: pg_(write|read)_barrier() is chosen instead of
>> pg_atomic_(read|write)_membarrier_u32() because it certainly cheaper at
>> least on x86_64 where it is translated to just a compiler barrier (empty
>> asm statement).
>> At least pg_atomic_read_membarrier_u32() is implemented currently as a
>> write operation, that's not good for contended place.
>
> Makes sense.
>
>
>> @@ -640,8 +626,12 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
>> */
>> if (min >= MSGNUMWRAPAROUND)
>> {
>> - segP->minMsgNum -= MSGNUMWRAPAROUND;
>> - segP->maxMsgNum -= MSGNUMWRAPAROUND;
>> + /*
>> + * we don't need memory barrier here, but using sub_fetch is just
>> + * simpler than read_u32+write_u32 pair, and this place is not
>> + * contented.
>> + */
>> + pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
>> for (i = 0; i < segP->numProcs; i++)
>> segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
>> }
>
> Did you lose the 'segP->minMsgNum -= MSGNUMWRAPAROUND;' here? Do we have
> any tests for the wraparound?

Oops, yes, thanks. I didn't miss it in private version, but lose during
cherry-picking on top of 'master'. Fixed.

Also removed comment above sub_fetch.

> Now that maxMsgNum is unsigned, should we switch to uint32 for minMsgNum
> and nextThreshold for consistency? They still don't need to be atomic
> IIRC, they're protected by SInvalReadLock and SInvalWriteLock.

Ok, I did. And couple of local variables.
There are signed arguments of SIInsertDataEntries, SIGetDataEntries,
SICleanupQueue . Should they be changed to unsigned?

> I kind of wonder why we need that MSGNUMWRAPAROUND limit at all; can't
> we just let the integer wrap around naturally? (after switching to
> unsigned, that is). That doesn't need to be part of this patch though,
> it can be done separately, if it's worthwhile...

There still will be need to handle wraparound, but in comparisons then.
Overall, code will not be simpler, I think. Are there any other benefits
from its removal?

-------

regards,
Yura Sokolov aka funny-falcon

Attachment	Content-Type	Size
v1-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch	text/x-patch	7.8 KB

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-03-21 11:35:16
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Just rebased the patch.

-------
regards
Yura Sokolov aka funny-falcon

Attachment	Content-Type	Size
v2-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch	text/x-patch	7.8 KB

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-03-21 16:33:55
Message-ID:	5ccgykypol3azijw2chqnpg3rhuwjtwsmbfs3pgcqm7fu6laus@wppo6zcfszay
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
> From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
> From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
> Date: Mon, 3 Feb 2025 11:58:33 +0300
> Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
>
> msgnumLock spinlock could be highly contended.
> Comment states it was used as memory barrier.
> Lets use atomic ops with memory barriers directly instead.
>
> Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
> pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
> no full barrier semantic is required, and explicit read/write barriers
> are cheaper at least on x86_64.

Is it actually true that full barriers aren't required? I think we might
actually rely on a stronger ordering.

> @@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
> */
> stateP->hasMessages = false;
>
> - /* Fetch current value of maxMsgNum using spinlock */
> - SpinLockAcquire(&segP->msgnumLock);
> - max = segP->maxMsgNum;
> - SpinLockRelease(&segP->msgnumLock);
> + /* Fetch current value of maxMsgNum using atomic with memory barrier */
> + max = pg_atomic_read_u32(&segP->maxMsgNum);
> + pg_read_barrier();
>
> if (stateP->resetState)
> {
> /*
> * Force reset. We can say we have dealt with any messages added
> * since the reset, as well; and that means we should clear the
> * signaled flag, too.
> */
> stateP->nextMsgNum = max;
> stateP->resetState = false;
> stateP->signaled = false;
> LWLockRelease(SInvalReadLock);
> return -1;
> }

> @@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
> /*
> * Insert new message(s) into proper slot of circular buffer
> */
> - max = segP->maxMsgNum;
> + max = pg_atomic_read_u32(&segP->maxMsgNum);
> while (nthistime-- > 0)
> {
> segP->buffer[max % MAXNUMMESSAGES] = *data++;
> max++;
> }
>
> - /* Update current value of maxMsgNum using spinlock */
> - SpinLockAcquire(&segP->msgnumLock);
> - segP->maxMsgNum = max;
> - SpinLockRelease(&segP->msgnumLock);
> + /* Update current value of maxMsgNum using atomic with memory barrier */
> + pg_write_barrier();
> + pg_atomic_write_u32(&segP->maxMsgNum, max);
>
> /*
> * Now that the maxMsgNum change is globally visible, we give everyone
> * a swift kick to make sure they read the newly added messages.
> * Releasing SInvalWriteLock will enforce a full memory barrier, so
> * these (unlocked) changes will be committed to memory before we exit
> * the function.
> */
> for (i = 0; i < segP->numProcs; i++)
> {
> ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
>
> stateP->hasMessages = true;
> }

On a loosely ordered architecture, the hasMessage=false in SIGetDataEntries()
could be reordered with the read of maxMsgNum. Which, afaict, would lead to
missing messages. That's not prevented by the pg_write_barrier() in
SIInsertDataEntries(). I think there may be other similar dangers.

This could be solved by adding full memory barriers in a few places. But:

I'd also like to know a bit more about the motivation here - I can easily
believe that you hit contention around the shared inval queue, but I find it
somewhat hard to believe that a spinlock that's acquired *once* per batch
("quantum"), covering a single read/write, is going to be the bottleneck,
rather than the much longer held LWLock, that protects iterating over all
procs.

Have you verified that this actually addresses the performance issue?

Greetings,

Andres Freund

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-03-24 10:41:17
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi, Andres

21.03.2025 19:33, Andres Freund wrote:
> Hi,
>
> On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
>> From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
>> From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
>> Date: Mon, 3 Feb 2025 11:58:33 +0300
>> Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
>>
>> msgnumLock spinlock could be highly contended.
>> Comment states it was used as memory barrier.
>> Lets use atomic ops with memory barriers directly instead.
>>
>> Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
>> pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
>> no full barrier semantic is required, and explicit read/write barriers
>> are cheaper at least on x86_64.
>
> Is it actually true that full barriers aren't required? I think we might
> actually rely on a stronger ordering.
>
>
>> @@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
>> */
>> stateP->hasMessages = false;
>>
>> - /* Fetch current value of maxMsgNum using spinlock */
>> - SpinLockAcquire(&segP->msgnumLock);
>> - max = segP->maxMsgNum;
>> - SpinLockRelease(&segP->msgnumLock);
>> + /* Fetch current value of maxMsgNum using atomic with memory barrier */
>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>> + pg_read_barrier();
>>
>> if (stateP->resetState)
>> {
>> /*
>> * Force reset. We can say we have dealt with any messages added
>> * since the reset, as well; and that means we should clear the
>> * signaled flag, too.
>> */
>> stateP->nextMsgNum = max;
>> stateP->resetState = false;
>> stateP->signaled = false;
>> LWLockRelease(SInvalReadLock);
>> return -1;
>> }
>
> vs
>
>> @@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
>> /*
>> * Insert new message(s) into proper slot of circular buffer
>> */
>> - max = segP->maxMsgNum;
>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>> while (nthistime-- > 0)
>> {
>> segP->buffer[max % MAXNUMMESSAGES] = *data++;
>> max++;
>> }
>>
>> - /* Update current value of maxMsgNum using spinlock */
>> - SpinLockAcquire(&segP->msgnumLock);
>> - segP->maxMsgNum = max;
>> - SpinLockRelease(&segP->msgnumLock);
>> + /* Update current value of maxMsgNum using atomic with memory barrier */
>> + pg_write_barrier();
>> + pg_atomic_write_u32(&segP->maxMsgNum, max);
>>
>> /*
>> * Now that the maxMsgNum change is globally visible, we give everyone
>> * a swift kick to make sure they read the newly added messages.
>> * Releasing SInvalWriteLock will enforce a full memory barrier, so
>> * these (unlocked) changes will be committed to memory before we exit
>> * the function.
>> */
>> for (i = 0; i < segP->numProcs; i++)
>> {
>> ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
>>
>> stateP->hasMessages = true;
>> }
>
> On a loosely ordered architecture, the hasMessage=false in SIGetDataEntries()
> could be reordered with the read of maxMsgNum. Which, afaict, would lead to
> missing messages. That's not prevented by the pg_write_barrier() in
> SIInsertDataEntries(). I think there may be other similar dangers.
>
> This could be solved by adding full memory barriers in a few places.

Big thanks for review and suggestion!

I agree, pg_memory_barrier should be added before read of segP->maxMsgNum.
I think, change of stateP->hasMessages to atomic variable is better way,
but it will change sizeof ProcState.

I don't see the need to full barrier after read of maxMsgNum, since other
ProcState fields are protected by SInvalReadLock, aren't they? So I leave
read_barrier there.

I still avoid use of read_membarrier since it is actually write operation.
Although pg_memory_barrier is implemented as write operation as well at
x86_64, but on memory cell on process's stack, so it will not be contended.

And atomic_write_membarrier is used to write maxMsgNum just to simplify
code. If backport is considered, then write_barriers before and after could
be used instead.

Fixes version is attached.

> But:
>
> I'd also like to know a bit more about the motivation here - I can easily
> believe that you hit contention around the shared inval queue, but I find it
> somewhat hard to believe that a spinlock that's acquired *once* per batch
> ("quantum"), covering a single read/write, is going to be the bottleneck,
> rather than the much longer held LWLock, that protects iterating over all
> procs.
>
> Have you verified that this actually addresses the performance issue?

Problem on this spinlock were observed at least by two independent technical
supports. First, some friendly vendor company shared the idea to remove it.
We don't know exactly their situation. But I suppose it was quite similar
to situation out tech support investigated at our client some months later:

(Cite from tech support report:)
> Almost 20% of CPU time is spend at backtraces like:
4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
49c847 SIGetDataEntries
49bf94 ReceiveSharedInvalidMessages
4a14ba LockRelationOid
1671f4 relation_open
1de1cd table_open
5e82aa RelationGetStatExtList
402a01 get_relation_statistics (inlined)
402a01 get_relation_info
407a9e build_simple_rel
3daa1d add_base_rels_to_query
3daa1d add_base_rels_to_query
3dd92b query_planner

Client has many NUMA-nodes in single machine, and software actively
generates invalidation messages (probably, due active usage of temporary
tables).

Since, backtrace is quite obvious and ends at s_lock, the patch have to help.

--
regards
Yura Sokolov aka funny-falcon

Attachment	Content-Type	Size
v3-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch	text/x-patch	7.6 KB

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-03-24 13:08:05
Message-ID:	4yfgfeu3f6f7fayl5h3kggtd5bkvm4gj3a3ryjs2qhlo6g74bt@g3cu2u4pgaiw
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
> 21.03.2025 19:33, Andres Freund wrote:
> > I'd also like to know a bit more about the motivation here - I can easily
> > believe that you hit contention around the shared inval queue, but I find it
> > somewhat hard to believe that a spinlock that's acquired *once* per batch
> > ("quantum"), covering a single read/write, is going to be the bottleneck,
> > rather than the much longer held LWLock, that protects iterating over all
> > procs.
> >
> > Have you verified that this actually addresses the performance issue?
>
> Problem on this spinlock were observed at least by two independent technical
> supports. First, some friendly vendor company shared the idea to remove it.
> We don't know exactly their situation. But I suppose it was quite similar
> to situation out tech support investigated at our client some months later:
>
> (Cite from tech support report:)
> > Almost 20% of CPU time is spend at backtraces like:
> 4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
> 49c847 SIGetDataEntries
> 49bf94 ReceiveSharedInvalidMessages
> 4a14ba LockRelationOid
> 1671f4 relation_open
> 1de1cd table_open
> 5e82aa RelationGetStatExtList
> 402a01 get_relation_statistics (inlined)
> 402a01 get_relation_info
> 407a9e build_simple_rel
> 3daa1d add_base_rels_to_query
> 3daa1d add_base_rels_to_query
> 3dd92b query_planner
>
>
> Client has many NUMA-nodes in single machine, and software actively
> generates invalidation messages (probably, due active usage of temporary
> tables).
>
> Since, backtrace is quite obvious and ends at s_lock, the patch have to help.

I don't believe we have the whole story here. It just doesn't seem plausible
that, with the current code, the spinlock in SIGetDataEntries() would be the
bottleneck, rather than contention on the lwlock. The spinlock just covers a
*single read*. Unless pgpro has modified the relevant code?

One possible explanation for why the spinlock shows up so badly is that it is
due to false sharing. Note that SiSeg->msgnumLock and the start of
SiSeg->buffer[] are on the same cache line.

How was this "Almost 20% of CPU time is spend at backtraces like" determined?

Greetings,

Andres Freund

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-03-25 10:52:00
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Good day, Andres

24.03.2025 16:08, Andres Freund wrote:
> On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
>> 21.03.2025 19:33, Andres Freund wrote:
>>> I'd also like to know a bit more about the motivation here - I can easily
>>> believe that you hit contention around the shared inval queue, but I
find it
>>> somewhat hard to believe that a spinlock that's acquired *once* per batch
>>> ("quantum"), covering a single read/write, is going to be the bottleneck,
>>> rather than the much longer held LWLock, that protects iterating over all
>>> procs.
>>>
>>> Have you verified that this actually addresses the performance issue?
>>
>> Problem on this spinlock were observed at least by two independent technical
>> supports. First, some friendly vendor company shared the idea to remove it.
>> We don't know exactly their situation. But I suppose it was quite similar
>> to situation out tech support investigated at our client some months later:
>>
>> (Cite from tech support report:)
>>> Almost 20% of CPU time is spend at backtraces like:
>> 4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
>> 49c847 SIGetDataEntries
>> 49bf94 ReceiveSharedInvalidMessages
>> 4a14ba LockRelationOid
>> 1671f4 relation_open
>> 1de1cd table_open
>> 5e82aa RelationGetStatExtList
>> 402a01 get_relation_statistics (inlined)
>> 402a01 get_relation_info
>> 407a9e build_simple_rel
>> 3daa1d add_base_rels_to_query
>> 3daa1d add_base_rels_to_query
>> 3dd92b query_planner
>>
>>
>> Client has many NUMA-nodes in single machine, and software actively
>> generates invalidation messages (probably, due active usage of temporary
>> tables).
>>
>> Since, backtrace is quite obvious and ends at s_lock, the patch have to
help.
>
> I don't believe we have the whole story here. It just doesn't seem plausible
> that, with the current code, the spinlock in SIGetDataEntries() would be the
> bottleneck, rather than contention on the lwlock. The spinlock just covers a
> *single read*. Unless pgpro has modified the relevant code?
>
> One possible explanation for why the spinlock shows up so badly is that it is
> due to false sharing. Note that SiSeg->msgnumLock and the start of
> SiSeg->buffer[] are on the same cache line.
>
> How was this "Almost 20% of CPU time is spend at backtraces like" determined?

Excuse me I didn't attached flamegraph collected by our tech support from
client's server during peak of the problem. So I attach it now.

If you open it in browser and search for "SIGetDataEntries", you'll see it
consumes 18.4%. It is not single large bar. Instead there are dozens of
calls to SIGetDataEntries, and every one spend almost all its time in
s_lock. If you search for s_lock, it consumes 16.9%, and almost every call
to s_lock is from SIGetDataEntries.

Looks like, we call to ReceiveSharedInvalidMessages
(AcceptInvalidationMessages, actually) too frequently during planing. And
if there are large stream of invalidation messages, SIGetDataEntries have
some work very frequently. Therefore many backends, which plans their
queries at the moment, start to fight for msgNumLock.

If ReceiveSharedInvalidMessages (and SIGetDataEntries by it) were called
rarely, then you conclusion were right: taking spinlock around just read of
one variable before processing large batch of messages looks to not be
source of problem. But since it is called very frequently, and stream of
messages is high, "there is always few new messages".

As I've said, it is most probably due to use of famous 1C software, which
uses a lot of temporary tables. So it generates high amount of invalidation
messages. We've patched pgpro postgres to NOT SEND most of invalidation
messages generated by temporary tables, but it is difficult to not send all
of such.

--
regards
Yura Sokolov aka funny-falcon

Attachment	Content-Type	Size
perf20250123_1120.lst.bind.svg.gz	application/gzip	343.4 KB

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Date:	2025-03-25 11:45:09
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

25.03.2025 13:52, Yura Sokolov пишет:
> Good day, Andres
>
> 24.03.2025 16:08, Andres Freund wrote:
>> On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
>>> 21.03.2025 19:33, Andres Freund wrote:
>>>> I'd also like to know a bit more about the motivation here - I can easily
>>>> believe that you hit contention around the shared inval queue, but I
> find it
>>>> somewhat hard to believe that a spinlock that's acquired *once* per batch
>>>> ("quantum"), covering a single read/write, is going to be the bottleneck,
>>>> rather than the much longer held LWLock, that protects iterating over all
>>>> procs.
>>>>
>>>> Have you verified that this actually addresses the performance issue?
>>>
>>> Problem on this spinlock were observed at least by two independent technical
>>> supports. First, some friendly vendor company shared the idea to remove it.
>>> We don't know exactly their situation. But I suppose it was quite similar
>>> to situation out tech support investigated at our client some months later:
>>>
>>> (Cite from tech support report:)
>>>> Almost 20% of CPU time is spend at backtraces like:
>>> 4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
>>> 49c847 SIGetDataEntries
>>> 49bf94 ReceiveSharedInvalidMessages
>>> 4a14ba LockRelationOid
>>> 1671f4 relation_open
>>> 1de1cd table_open
>>> 5e82aa RelationGetStatExtList
>>> 402a01 get_relation_statistics (inlined)
>>> 402a01 get_relation_info
>>> 407a9e build_simple_rel
>>> 3daa1d add_base_rels_to_query
>>> 3daa1d add_base_rels_to_query
>>> 3dd92b query_planner
>>>
>>>
>>> Client has many NUMA-nodes in single machine, and software actively
>>> generates invalidation messages (probably, due active usage of temporary
>>> tables).
>>>
>>> Since, backtrace is quite obvious and ends at s_lock, the patch have to
> help.
>>
>> I don't believe we have the whole story here. It just doesn't seem plausible
>> that, with the current code, the spinlock in SIGetDataEntries() would be the
>> bottleneck, rather than contention on the lwlock. The spinlock just covers a
>> *single read*. Unless pgpro has modified the relevant code?
>>
>> One possible explanation for why the spinlock shows up so badly is that it is
>> due to false sharing. Note that SiSeg->msgnumLock and the start of
>> SiSeg->buffer[] are on the same cache line.
>>
>> How was this "Almost 20% of CPU time is spend at backtraces like" determined?
>
> Excuse me I didn't attached flamegraph collected by our tech support from
> client's server during peak of the problem. So I attach it now.
>
> If you open it in browser and search for "SIGetDataEntries", you'll see it
> consumes 18.4%. It is not single large bar. Instead there are dozens of
> calls to SIGetDataEntries, and every one spend almost all its time in
> s_lock. If you search for s_lock, it consumes 16.9%, and almost every call
> to s_lock is from SIGetDataEntries.
>
> Looks like, we call to ReceiveSharedInvalidMessages
> (AcceptInvalidationMessages, actually) too frequently during planing. And
> if there are large stream of invalidation messages, SIGetDataEntries have
> some work very frequently. Therefore many backends, which plans their
> queries at the moment, start to fight for msgNumLock.
>
> If ReceiveSharedInvalidMessages (and SIGetDataEntries by it) were called
> rarely, then you conclusion were right: taking spinlock around just read of
> one variable before processing large batch of messages looks to not be
> source of problem. But since it is called very frequently, and stream of
> messages is high, "there is always few new messages".

And one more reason for contention: ReceiveSharedInvalidMessages calls
SIGetDataEntries for batches sized MAXINVALMSGS (=32). It calls it again
and again until queue is empty. There could be a lot of messages pushed to
a queue.

In fact, we have to increase MAXNUMMESSAGES up to 16k to not fall into
InvalidateSystemCaches (which costs much more), so theoretically
SIGetDataEntries could be called upto 512 times per
ReceiveSharedInvalidMessages.

And it is not so theoretically, since COMMIT of transaction may send
thousands of invalidation messages at once.

Probably, increasing MAXINVALMSGS is a good idea as well.

--
regards
Yura Sokolov aka funny-falcon

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	SpinLockAcquire and SpinLockRelease is broken on ARM/ARM64? (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-05-05 07:30:50
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

21.03.2025 19:33, Andres Freund пишет:
> Hi,
>
> On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
>> From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
>> From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
>> Date: Mon, 3 Feb 2025 11:58:33 +0300
>> Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
>>
>> msgnumLock spinlock could be highly contended.
>> Comment states it was used as memory barrier.
>> Lets use atomic ops with memory barriers directly instead.
>>
>> Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
>> pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
>> no full barrier semantic is required, and explicit read/write barriers
>> are cheaper at least on x86_64.
>
> Is it actually true that full barriers aren't required? I think we might
> actually rely on a stronger ordering.
>
>
>> @@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
>> */
>> stateP->hasMessages = false;
>>
>> - /* Fetch current value of maxMsgNum using spinlock */
>> - SpinLockAcquire(&segP->msgnumLock);
>> - max = segP->maxMsgNum;
>> - SpinLockRelease(&segP->msgnumLock);
>> + /* Fetch current value of maxMsgNum using atomic with memory barrier */
>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>> + pg_read_barrier();
>>
>> if (stateP->resetState)
>> {
>> /*
>> * Force reset. We can say we have dealt with any messages added
>> * since the reset, as well; and that means we should clear the
>> * signaled flag, too.
>> */
>> stateP->nextMsgNum = max;
>> stateP->resetState = false;
>> stateP->signaled = false;
>> LWLockRelease(SInvalReadLock);
>> return -1;
>> }
>
> vs
>
>> @@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
>> /*
>> * Insert new message(s) into proper slot of circular buffer
>> */
>> - max = segP->maxMsgNum;
>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>> while (nthistime-- > 0)
>> {
>> segP->buffer[max % MAXNUMMESSAGES] = *data++;
>> max++;
>> }
>>
>> - /* Update current value of maxMsgNum using spinlock */
>> - SpinLockAcquire(&segP->msgnumLock);
>> - segP->maxMsgNum = max;
>> - SpinLockRelease(&segP->msgnumLock);
>> + /* Update current value of maxMsgNum using atomic with memory barrier */
>> + pg_write_barrier();
>> + pg_atomic_write_u32(&segP->maxMsgNum, max);
>>
>> /*
>> * Now that the maxMsgNum change is globally visible, we give everyone
>> * a swift kick to make sure they read the newly added messages.
>> * Releasing SInvalWriteLock will enforce a full memory barrier, so
>> * these (unlocked) changes will be committed to memory before we exit
>> * the function.
>> */
>> for (i = 0; i < segP->numProcs; i++)
>> {
>> ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
>>
>> stateP->hasMessages = true;
>> }
>
> On a loosely ordered architecture, the hasMessage=false in SIGetDataEntries()
> could be reordered with the read of maxMsgNum. Which, afaict, would lead to
> missing messages. That's not prevented by the pg_write_barrier() in
> SIInsertDataEntries(). I think there may be other similar dangers.
>
> This could be solved by adding full memory barriers in a few places.

It is quite interesting remark, because SpinLockAcquire may be implemented
as `__sync_lock_test_and_set(lock, 1)` [1] on ARM/ARM64, and
`__sync_lock_test_and_set` has only "acquire barrier" semantic as GCC's
documentation says [2] . More over, `__sync_lock_release` used in this case
also provides only "release barrier" semantic.

Therefore, concerning these facts, current code state also doesn't give
guarantee `stateP->hasMessage = false` will not be reordered with `max =
segP->maxMsgNum`. Nor following read of messages are guaranteed to happen
after `max = segP->maxMsgNum`.

Given your expectations for SpinLockAcquire and SpinLockRelease to be full
memory barriers, and probably numerous usages of them as such, their
current implementation on ARM/ARM64 looks to be completely broken, imho.

[1]
https://github.com/postgres/postgres/blob/b3754dcc9ff7aba2268e98ecf4b5546353305cd2/src/include/storage/s_lock.h#L244-L264

[2]
https://gcc.gnu.org/onlinedocs/gcc-15.1.0/gcc/_005f_005fsync-Builtins.html#index-_005f_005fsync_005flock_005ftest_005fand_005fset

--
regards
Yura Sokolov aka funny-falcon

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: SpinLockAcquire and SpinLockRelease is broken on ARM/ARM64? (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-06-02 18:20:33
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Good day, Andres.
Good day, hackers.

05.05.2025 10:30, Yura Sokolov wrote:
> 21.03.2025 19:33, Andres Freund пишет:
>> Hi,
>>
>> On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
>>> From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
>>> From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
>>> Date: Mon, 3 Feb 2025 11:58:33 +0300
>>> Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
>>>
>>> msgnumLock spinlock could be highly contended.
>>> Comment states it was used as memory barrier.
>>> Lets use atomic ops with memory barriers directly instead.
>>>
>>> Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
>>> pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
>>> no full barrier semantic is required, and explicit read/write barriers
>>> are cheaper at least on x86_64.
>>
>> Is it actually true that full barriers aren't required? I think we might
>> actually rely on a stronger ordering.
>>
>>
>>> @@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
>>> */
>>> stateP->hasMessages = false;
>>>
>>> - /* Fetch current value of maxMsgNum using spinlock */
>>> - SpinLockAcquire(&segP->msgnumLock);
>>> - max = segP->maxMsgNum;
>>> - SpinLockRelease(&segP->msgnumLock);
>>> + /* Fetch current value of maxMsgNum using atomic with memory barrier */
>>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>>> + pg_read_barrier();
>>>
>>> if (stateP->resetState)
>>> {
>>> /*
>>> * Force reset. We can say we have dealt with any messages added
>>> * since the reset, as well; and that means we should clear the
>>> * signaled flag, too.
>>> */
>>> stateP->nextMsgNum = max;
>>> stateP->resetState = false;
>>> stateP->signaled = false;
>>> LWLockRelease(SInvalReadLock);
>>> return -1;
>>> }
>>
>> vs
>>
>>> @@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
>>> /*
>>> * Insert new message(s) into proper slot of circular buffer
>>> */
>>> - max = segP->maxMsgNum;
>>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>>> while (nthistime-- > 0)
>>> {
>>> segP->buffer[max % MAXNUMMESSAGES] = *data++;
>>> max++;
>>> }
>>>
>>> - /* Update current value of maxMsgNum using spinlock */
>>> - SpinLockAcquire(&segP->msgnumLock);
>>> - segP->maxMsgNum = max;
>>> - SpinLockRelease(&segP->msgnumLock);
>>> + /* Update current value of maxMsgNum using atomic with memory barrier */
>>> + pg_write_barrier();
>>> + pg_atomic_write_u32(&segP->maxMsgNum, max);
>>>
>>> /*
>>> * Now that the maxMsgNum change is globally visible, we give everyone
>>> * a swift kick to make sure they read the newly added messages.
>>> * Releasing SInvalWriteLock will enforce a full memory barrier, so
>>> * these (unlocked) changes will be committed to memory before we exit
>>> * the function.
>>> */
>>> for (i = 0; i < segP->numProcs; i++)
>>> {
>>> ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
>>>
>>> stateP->hasMessages = true;
>>> }
>>
>> On a loosely ordered architecture, the hasMessage=false in SIGetDataEntries()
>> could be reordered with the read of maxMsgNum. Which, afaict, would lead to
>> missing messages. That's not prevented by the pg_write_barrier() in
>> SIInsertDataEntries(). I think there may be other similar dangers.
>>
>> This could be solved by adding full memory barriers in a few places.
>
> It is quite interesting remark, because SpinLockAcquire may be implemented
> as `__sync_lock_test_and_set(lock, 1)` [1] on ARM/ARM64, and
> `__sync_lock_test_and_set` has only "acquire barrier" semantic as GCC's
> documentation says [2] . More over, `__sync_lock_release` used in this case
> also provides only "release barrier" semantic.
>
> Therefore, concerning these facts, current code state also doesn't give
> guarantee `stateP->hasMessage = false` will not be reordered with `max =
> segP->maxMsgNum`. Nor following read of messages are guaranteed to happen
> after `max = segP->maxMsgNum`.
>
> Given your expectations for SpinLockAcquire and SpinLockRelease to be full
> memory barriers, and probably numerous usages of them as such, their
> current implementation on ARM/ARM64 looks to be completely broken, imho.

I recognize my mistake and apologize for noise.

I'll try to describe how I understand it now.
There 4 operations done in two threads:
1m. write maxMsgNum
1h. set hasMessages
2h. unset hasMessages
2m. read maxMsgNum

Synchronization should forbid following orders of operations:
- 1h, 2h, 2m, 1m
- 1h, 2m, 1m, 2h
- 1h, 2m, 2h, 1m
- 2m, 1m, 1h, 2h
- 2m, 1h, 1m, 2h
- 2m, 1h, 2h, 1m
In other words: if read of maxMsgNum (2m) happened before write of
maxMsgNum (1m), then unset hasMessages (2h) should not happen after set of
hasMessages (1h).

Given 1m and 2m are wrapped in spin lock:
- while 1h could be reordered before 1m (since SpinLockRelease is just
release barrier), it could not be reordered before SpinLockAcquire (because
it is acquire barrier).
- same way while 2h could be reordered after 2m , it could not be reordered
with SpinLockRelease.

Introducing acquire and release operations, wrong orders above looks like:
- 1a( 1h, 2a{ 2h, 2m 2r}, 1m 1r)
- 1a( 1h, 2a{ 2m, 1m 1r), 2m 2r}
- 1a( 1h, 2a{ 2m, 2h 2r}, 1m 1r)
- 2a{ 2m, 1a( 1m, 1h 1r), 2h 2r}
- 2a{ 2m, 1a( 1h, 1m 1r), 2h 2r}
- 2a{ 2m, 1a( 1h, 2h 2r}, 1m 1r)
Therefore it is clear all wrong orders are forbidden by the spin lock
despite SpinLockAcquire and SpinLockRelease provides only acquire and
release semantics respectively.

Excuse me once again for misunderstanding this at first time.

But still problem of spin lock contention is here. And we found another one
place which have similar problem: numerous readers fights for single
slock_t slowing down them selves and writer. It because of
XLogRecoveryCtlData->info_lck. When there are a lot of logical/physical
walsenders, it could become noticeable bottleneck.
Part of this bottleneck were due to incorrect logical WAL senders target,
and were fixed in commit 5231ed82 [1]. But still there are measurable
speedup from mitigating spin lock contention at this place.

So I propose to introduce another spin lock type capable for Exclusive and
Shared lock modes (i.e. Write/Read modes) and use it in this two places.

I've tried to implement it with a bits of fairness: both readers and
writers attempters have "waiter" state, which blocks other side. Ie if
reader entered "waiter" state, it blocks concurrent writers (both fresh and
those in "waiter" state). But if writer entered "waiter" state, it blocks
fresh readers (but doesn't block readers, who entered "waiter" state).
Algorithm is self-made, I didn't found exactly same algorithm in the wild.

Patch is attached.

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5231ed8262c94936a69bce41f64076630bbd99a2

PS. Another possibility would be to use versionned optimistic lock:
- lock will be implemented as an pg_atomic_uint64 "version" storage,
- writer increments it both in LockWrite and in UnlockWrite. Therefore lock
is locked for write if version is odd.
- reader performs following loop:
-- wait untils version is even and remember it
-- perform read operations
-- if version didn't change, read is successful. Otherwise repeat loop.

Benefit of this approach: reader doesn't perform write to shared memory at all.

But it should use memory barriers before and after read loop to immitate
SpinLock semantic. And since we have no separate "acquire" and "release"
memory fences, it have to be pg_memory_barrier(). Or we should introduce
"acq/rel" fences

Other gotchas is only simple read operations allowed since data could be
changed while they are read (both found usecases satisfies this condition).
And read operations are repeated in a loop until version didn't change
around, so some kind of syntax sugar should be introduced.

This is well known approach, but I don't remember exact name and could not
google it fast.

--
regards
Yura Sokolov aka funny-falcon

Attachment	Content-Type	Size
v4-0001-Exclusive-Shared-Read-Write-spin-lock.patch	text/x-patch	17.8 KB

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: SpinLockAcquire and SpinLockRelease is broken on ARM/ARM64? (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-06-03 21:04:55
Message-ID:	ksopwquxxtr27apbzxttm4xkw56vhu5xdh7ogfbqrdw4q6qvm2@etdo37vtfoys
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2025-06-02 21:20:33 +0300, Yura Sokolov wrote:
> But still problem of spin lock contention is here.

I still would like to see a reproducer for this.

> So I propose to introduce another spin lock type capable for Exclusive and
> Shared lock modes (i.e. Write/Read modes) and use it in this two places.

I am vehemently opposed to that. We should work towards getting rid of
spinlocks, not introduce more versions of spinlocks. Userspace spinlocks
largely are a bad idea. We should instead make properly queued locks cheaper
(e.g. by having an exclusive-only lock, which can be cheaper to release on
common platforms).

Greetings,

Andres Freund

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-06-16 14:28:31
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

04.06.2025 00:04, Andres Freund пишет:
> Hi,
>
> On 2025-06-02 21:20:33 +0300, Yura Sokolov wrote:
>> But still problem of spin lock contention is here.
>
> I still would like to see a reproducer for this.

For problem in sinvaladt.c we have no synthetic reproducer. But version
with changed maxMsgNum to atomic solved customer's issue.

For problem on XLogRecoveryCtlData->info_lck we have reproducer which shows
follower with a lot of logical walsenders with patched version lags less
and uses less CPU... but only 4-socket Xeon(R) Gold 6348H. There is no
benefit from patch on 2-socket Intel(R) Xeon(R) Gold 5220R.

>> So I propose to introduce another spin lock type capable for Exclusive and
>> Shared lock modes (i.e. Write/Read modes) and use it in this two places.
>
> I am vehemently opposed to that. We should work towards getting rid of
> spinlocks, not introduce more versions of spinlocks. Userspace spinlocks
> largely are a bad idea. We should instead make properly queued locks cheaper
> (e.g. by having an exclusive-only lock, which can be cheaper to release on
> common platforms).

1. when faster lock will arrive?
2. "exclusive-only" lock will never scale at scenario when there are few
writers and a lot of readers. It is just direct consequence of Amdahl's Law.

To be honestly, our version of XLogRecoveryCtlData->info_lck fix is based
on optimistic read-write lock. I've tried to make more regular read-write
spin lock in previous patch version to make it more familiar.

But after playing a bit with variants using simple C program [1], I found
optimistic lock is only really scalable solution for the problem (aside of
direct use of atomic operations).

So attached version contains optimistic read-write lock used in these two
places.

[1] https://pastebin.com/CwSPkeGi

--
regards
Yura Sokolov aka funny-falcon

Attachment	Content-Type	Size
v5-0001-Read-Write-optimistic-spin-lock.patch	text/x-patch	17.6 KB

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-06-16 14:41:03
Message-ID:	dffbdwhim5ob3cjkyppivd6ll7lncznqrjclpytsdpxqqzyl5o@pejzov3rc25t
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2025-06-16 17:28:31 +0300, Yura Sokolov wrote:
> 04.06.2025 00:04, Andres Freund пишет:
> > Hi,
> >
> > On 2025-06-02 21:20:33 +0300, Yura Sokolov wrote:
> >> But still problem of spin lock contention is here.
> >
> > I still would like to see a reproducer for this.
>
> For problem in sinvaladt.c we have no synthetic reproducer. But version
> with changed maxMsgNum to atomic solved customer's issue.

TBH, I don't see a point in continuing with this thread without something that
others can test. I rather doubt that the right fix here is to just change the
lock model over, but without a repro I can't evaluate that.

Greetings,

Andres Freund

From:	Sergey Shinderuk <s(dot)shinderuk(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-06-25 13:41:46
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 16.06.2025 17:41, Andres Freund wrote:
> TBH, I don't see a point in continuing with this thread without something that
> others can test. I rather doubt that the right fix here is to just change the
> lock model over, but without a repro I can't evaluate that.

Hello,

I think I can reproduce the issue with pgbench on a muti-core server. I
start a regular select-only test with 64 clients, and while it's
running, I start a plpgsql loop creating and dropping temporary tables
from a single psql session. I observe ~25% drop in tps reported by
pgbench until I cancel the query in psql.

$ pgbench -n -S -c64 -j64 -T300 -P1

progress: 10.0 s, 1249724.7 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 11.0 s, 1248289.0 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 12.0 s, 1246001.0 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 13.0 s, 1247832.5 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 14.0 s, 1248205.8 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 15.0 s, 1247737.3 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 16.0 s, 1219444.3 tps, lat 0.052 ms stddev 0.039, 0 failed
progress: 17.0 s, 893943.4 tps, lat 0.071 ms stddev 0.159, 0 failed
progress: 18.0 s, 927861.3 tps, lat 0.069 ms stddev 0.150, 0 failed
progress: 19.0 s, 886317.1 tps, lat 0.072 ms stddev 0.163, 0 failed
progress: 20.0 s, 877200.1 tps, lat 0.073 ms stddev 0.164, 0 failed
progress: 21.0 s, 875424.4 tps, lat 0.073 ms stddev 0.163, 0 failed
progress: 22.0 s, 877693.0 tps, lat 0.073 ms stddev 0.165, 0 failed
progress: 23.0 s, 897202.8 tps, lat 0.071 ms stddev 0.158, 0 failed
progress: 24.0 s, 917853.4 tps, lat 0.070 ms stddev 0.153, 0 failed
progress: 25.0 s, 907865.1 tps, lat 0.070 ms stddev 0.154, 0 failed

Here I started the following loop in psql around 17s and tps dropped by
~25%:

do $$
begin
for i in 1..1000000 loop
create temp table tt1 (a bigserial primary key, b text);
drop table tt1;
commit;
end loop;
end;
$$;

Now, if I simply remove the spinlock in SIGetDataEntries, I see a drop
of just ~6% under concurrent DDL. I think this strongly suggests that
the spinlock is the bottleneck.

Before that, I tried removing `if (!hasMessages) return` optimization in
SIGetDataEntries to stress the spinlock and observed ~35% drop in tps of
select-only with an empty sinval queue (no DDL running in background).
Then I also removed the spinlock in SIGetDataEntries, and the loss was
just ~4%, which may be noise. I think this also suggests that the
spinlock could be the bottleneck.

I'm running this on a 2 socket AMD EPYC 9654 96-Core server with
postgres and pgbench bound to distinct CPUs. PGDATA is placed on tmpfs.
postgres is running with the default settings. pgbench tables are of
scale 1. pgbench is connecting via loopback/127.0.0.1.

Does this sound convincing?

Best regards,

--
Sergey Shinderuk https://postgrespro.com/

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Sergey Shinderuk <s(dot)shinderuk(at)postgrespro(dot)ru>
Cc:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-07-16 14:58:36
Message-ID:	2xifyljaqxwyjqgacdqgc24syz2syb37lbizshcriurne2cngu@olnlnoxny7pn
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2025-06-25 16:41:46 +0300, Sergey Shinderuk wrote:
> On 16.06.2025 17:41, Andres Freund wrote:
> > TBH, I don't see a point in continuing with this thread without something that
> > others can test. I rather doubt that the right fix here is to just change the
> > lock model over, but without a repro I can't evaluate that.

> I think I can reproduce the issue with pgbench on a muti-core server. I
> start a regular select-only test with 64 clients, and while it's running, I
> start a plpgsql loop creating and dropping temporary tables from a single
> psql session. I observe ~25% drop in tps reported by pgbench until I cancel
> the query in psql.

>
> $ pgbench -n -S -c64 -j64 -T300 -P1
>
> progress: 10.0 s, 1249724.7 tps, lat 0.051 ms stddev 0.002, 0 failed
> progress: 11.0 s, 1248289.0 tps, lat 0.051 ms stddev 0.002, 0 failed
> progress: 12.0 s, 1246001.0 tps, lat 0.051 ms stddev 0.002, 0 failed
> progress: 13.0 s, 1247832.5 tps, lat 0.051 ms stddev 0.002, 0 failed
> progress: 14.0 s, 1248205.8 tps, lat 0.051 ms stddev 0.002, 0 failed
> progress: 15.0 s, 1247737.3 tps, lat 0.051 ms stddev 0.002, 0 failed
> progress: 16.0 s, 1219444.3 tps, lat 0.052 ms stddev 0.039, 0 failed
> progress: 17.0 s, 893943.4 tps, lat 0.071 ms stddev 0.159, 0 failed
> progress: 18.0 s, 927861.3 tps, lat 0.069 ms stddev 0.150, 0 failed
> progress: 19.0 s, 886317.1 tps, lat 0.072 ms stddev 0.163, 0 failed
> progress: 20.0 s, 877200.1 tps, lat 0.073 ms stddev 0.164, 0 failed
> progress: 21.0 s, 875424.4 tps, lat 0.073 ms stddev 0.163, 0 failed
> progress: 22.0 s, 877693.0 tps, lat 0.073 ms stddev 0.165, 0 failed
> progress: 23.0 s, 897202.8 tps, lat 0.071 ms stddev 0.158, 0 failed
> progress: 24.0 s, 917853.4 tps, lat 0.070 ms stddev 0.153, 0 failed
> progress: 25.0 s, 907865.1 tps, lat 0.070 ms stddev 0.154, 0 failed
>
> Here I started the following loop in psql around 17s and tps dropped by
> ~25%:
>
> do $$
> begin
> for i in 1..1000000 loop
> create temp table tt1 (a bigserial primary key, b text);
> drop table tt1;
> commit;
> end loop;
> end;
> $$;

Not a particularly interesting use-case, but still good to have.

> Now, if I simply remove the spinlock in SIGetDataEntries, I see a drop of
> just ~6% under concurrent DDL. I think this strongly suggests that the
> spinlock is the bottleneck.

This can be trivially optimized by just using atomics in a slightly more
heavyweight way than Yura's patch upthread - my criticisim in [1] was purely
that the reduced barriers make the patch incorrect. I don't understand why
Yura didn't just move to full memory barriers, instead choosing to write a RW
optimized spinlock.

[1] https://postgr.es/m/5ccgykypol3azijw2chqnpg3rhuwjtwsmbfs3pgcqm7fu6laus%40wppo6zcfszay

Greetings,

Andres Freund

From:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>, Sergey Shinderuk <s(dot)shinderuk(at)postgrespro(dot)ru>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-07-16 15:27:45
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

16.07.2025 17:58, Andres Freund пишет:
> Hi,
>
> On 2025-06-25 16:41:46 +0300, Sergey Shinderuk wrote:
>> On 16.06.2025 17:41, Andres Freund wrote:
>>> TBH, I don't see a point in continuing with this thread without something that
>>> others can test. I rather doubt that the right fix here is to just change the
>>> lock model over, but without a repro I can't evaluate that.
>
>> I think I can reproduce the issue with pgbench on a muti-core server. I
>> start a regular select-only test with 64 clients, and while it's running, I
>> start a plpgsql loop creating and dropping temporary tables from a single
>> psql session. I observe ~25% drop in tps reported by pgbench until I cancel
>> the query in psql.
>
>>
>> $ pgbench -n -S -c64 -j64 -T300 -P1
>>
>> progress: 10.0 s, 1249724.7 tps, lat 0.051 ms stddev 0.002, 0 failed
>> progress: 11.0 s, 1248289.0 tps, lat 0.051 ms stddev 0.002, 0 failed
>> progress: 12.0 s, 1246001.0 tps, lat 0.051 ms stddev 0.002, 0 failed
>> progress: 13.0 s, 1247832.5 tps, lat 0.051 ms stddev 0.002, 0 failed
>> progress: 14.0 s, 1248205.8 tps, lat 0.051 ms stddev 0.002, 0 failed
>> progress: 15.0 s, 1247737.3 tps, lat 0.051 ms stddev 0.002, 0 failed
>> progress: 16.0 s, 1219444.3 tps, lat 0.052 ms stddev 0.039, 0 failed
>> progress: 17.0 s, 893943.4 tps, lat 0.071 ms stddev 0.159, 0 failed
>> progress: 18.0 s, 927861.3 tps, lat 0.069 ms stddev 0.150, 0 failed
>> progress: 19.0 s, 886317.1 tps, lat 0.072 ms stddev 0.163, 0 failed
>> progress: 20.0 s, 877200.1 tps, lat 0.073 ms stddev 0.164, 0 failed
>> progress: 21.0 s, 875424.4 tps, lat 0.073 ms stddev 0.163, 0 failed
>> progress: 22.0 s, 877693.0 tps, lat 0.073 ms stddev 0.165, 0 failed
>> progress: 23.0 s, 897202.8 tps, lat 0.071 ms stddev 0.158, 0 failed
>> progress: 24.0 s, 917853.4 tps, lat 0.070 ms stddev 0.153, 0 failed
>> progress: 25.0 s, 907865.1 tps, lat 0.070 ms stddev 0.154, 0 failed
>>
>> Here I started the following loop in psql around 17s and tps dropped by
>> ~25%:
>>
>> do $$
>> begin
>> for i in 1..1000000 loop
>> create temp table tt1 (a bigserial primary key, b text);
>> drop table tt1;
>> commit;
>> end loop;
>> end;
>> $$;
>
> Not a particularly interesting use-case, but still good to have.
>
>
>
>> Now, if I simply remove the spinlock in SIGetDataEntries, I see a drop of
>> just ~6% under concurrent DDL. I think this strongly suggests that the
>> spinlock is the bottleneck.
>
> This can be trivially optimized by just using atomics in a slightly more
> heavyweight way than Yura's patch upthread - my criticisim in [1] was purely
> that the reduced barriers make the patch incorrect. I don't understand why
> Yura didn't just move to full memory barriers, instead choosing to write a RW
> optimized spinlock.
>
> [1] https://postgr.es/m/5ccgykypol3azijw2chqnpg3rhuwjtwsmbfs3pgcqm7fu6laus%40wppo6zcfszay

Excuse me, but I did at [1]. Not all barriers were changed to full, but one
necessary at the place you'd pointed to.

But you just answered then [2]:
> I don't believe we have the whole story here.

[1]
https://www.postgresql.org/message-id/b798eb5e-35b7-40b5-a245-4170deab56f8%40postgrespro.ru
[2]
https://www.postgresql.org/message-id/4yfgfeu3f6f7fayl5h3kggtd5bkvm4gj3a3ryjs2qhlo6g74bt%40g3cu2u4pgaiw

I moved to RW optimized spinlock because I found second place where it
helps. And patch includes both those places.
I think it is better to have good reusable tool applied in two places
instead of two custom pieces of code. And RWOptSpin is almost as good as
direct use of atomic operations.

--
regards
Yura Sokolov aka funny-falcon

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc:	Sergey Shinderuk <s(dot)shinderuk(at)postgrespro(dot)ru>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-07-16 15:54:52
Message-ID:	ewyrvzzhw2mdg3vol2pfjbrjcotmlvyrkxivkyfnkxtepihjrp@pgdpww3kmso6
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 2025-07-16 18:27:45 +0300, Yura Sokolov wrote:
> 16.07.2025 17:58, Andres Freund пишет:
> >> Now, if I simply remove the spinlock in SIGetDataEntries, I see a drop of
> >> just ~6% under concurrent DDL. I think this strongly suggests that the
> >> spinlock is the bottleneck.
> >
> > This can be trivially optimized by just using atomics in a slightly more
> > heavyweight way than Yura's patch upthread - my criticisim in [1] was purely
> > that the reduced barriers make the patch incorrect. I don't understand why
> > Yura didn't just move to full memory barriers, instead choosing to write a RW
> > optimized spinlock.
> >
> > [1] https://postgr.es/m/5ccgykypol3azijw2chqnpg3rhuwjtwsmbfs3pgcqm7fu6laus%40wppo6zcfszay
>
> Excuse me, but I did at [1]. Not all barriers were changed to full, but one
> necessary at the place you'd pointed to.

Then you went on to focus on a completely different approach.

> But you just answered then [2]:
> > I don't believe we have the whole story here.

That was not in reply to the changed patch, but about the performance numbers
you relayed. We had no repro, and even with the repro that Sergey has now
delivered, we don't see similar levels of what you reported as contention.

I still think that workloads bottlenecked by SI consumption could be improved
a lot more by focusing on higher level improvements than on the spinlock.

> [1]
> https://www.postgresql.org/message-id/b798eb5e-35b7-40b5-a245-4170deab56f8%40postgrespro.ru
> [2]
> https://www.postgresql.org/message-id/4yfgfeu3f6f7fayl5h3kggtd5bkvm4gj3a3ryjs2qhlo6g74bt%40g3cu2u4pgaiw
>
> I moved to RW optimized spinlock because I found second place where it
> helps. And patch includes both those places.
> I think it is better to have good reusable tool applied in two places
> instead of two custom pieces of code. And RWOptSpin is almost as good as
> direct use of atomic operations.

Shrug. I don't think it makes sense here, given that the spinlock literally
just protects a single variable and that the most straightforward conversion
to atomics is like a 10 line change.

If somebody else feels different, they can pursue merging RWOptSpin, but I
won't. I think it's going 100% in the wrong direction.

Greetings,

Andres Freund

From:	Olga Antonova <o(dot)antonova(at)postgrespro(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc:	Sergey Shinderuk <s(dot)shinderuk(at)postgrespro(dot)ru>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date:	2025-08-22 07:40:49
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 7/16/25 18:54, Andres Freund wrote:
> That was not in reply to the changed patch, but about the performance numbers
> you relayed. We had no repro, and even with the repro that Sergey has now
> delivered, we don't see similar levels of what you reported as contention.
We investigated this issue in detail and were able to reproduce the
spinlock contention in SIGetDataEntries. The problem is most evident on
multiprocessor systems with multiple NUMA nodes, but it also occurs on a
single node, albeit less pronounced. This is probably also the case for
high-frequency CPU.

We ran tests on two bare-metal servers:

4 NUMA nodes × 24 CPUs Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz.
PostgreSQL was running on 3 nodes (72 CPUs).

2 NUMA nodes × 32 CPUs Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz.
PostgreSQL was running on a single node (32 CPUs).

and two PostgreSQL builds: from master branch and the with the patch
v5-0001-Read-Write-optimistic-spin-lock.patch.

To generate frequent cache invalidations, we executed a background
workload that repeatedly created and dropped temporary tables with
indexes in a loop.

do $$
begin
   for i in 1..1000000 loop
     create temp table tt1 (
    f0 bigserial primary key,
    f1 int,
    f2 int,
        f3 int,
        f4 int,
        f5 int,
        f6 int,
        f7 int,
        f8 int,
        f9 int,
        f10 int);
     CREATE INDEX ON tt1(f1);
     CREATE INDEX ON tt1(f2);
     CREATE INDEX ON tt1(f3);
     CREATE INDEX ON tt1(f4);
     CREATE INDEX ON tt1(f5);
     CREATE INDEX ON tt1(f6);
     CREATE INDEX ON tt1(f7);
     CREATE INDEX ON tt1(f8);
     CREATE INDEX ON tt1(f9);
     CREATE INDEX ON tt1(f10);
     drop table tt1;
     commit;
   end loop;
end;
$$;

As a benchmark, we used a pgbench select-only scenario with 64 clients:

pgbench -U postgres -c 64 -j 32 -T 200 -s 100 -M prepared -b select-only
postgres -n

For convenience, the test is included as test.sh (attached), with
description and setup instructions provided in the README.

During the test, we ran perf for 10 seconds using the command

perf record -F 99 -a -g --call-graph=dwarf -o perf_data sleep 10.

Аnd then generated flame graphs from the collected data

1. Three NUMA nodes (72 CPUs)

According to the flame graph (fg_3numa_nopatch.xml), about 34% of
exec_bind_message is spent in SIGetDataEntries, >90% of which is
spinlock wait (see fg_3numa_nopatch.xml).

With the patch the share of SIGetDataEntries decreases to ~6.6%, the
main waiting shifts to LWLockAcquire, and RWOptSpinReadStart accounts
for only ~1.1% (fg_3numa_patch.xml). TPS improvement: +6–8% (over 5 runs).

Without patch: TPS = 731171.336542
With patch: TPS = 786077.155196

2. Single NUMA node (32 CPUs)

In this case the problem is less pronounced, but still SIGetDataEntries
takes 10.1% of exec_bind_message, of which 82.3% is spinlock wait
(fg_1numa_nopatch.xml).

With the patch we observed a stable 1.5–2% TPS increase (5 runs).

Without patch: TPS = 518941.051825
With patch: TPS = 528768.641836

The flame graph does not show absolute time, but the relative
distribution confirms contention on the spinlock in SIGetDataEntries.
The problem exists and is a bottleneck under high load, especially on
multiprocessor NUMA systems. The patch mitigates this contention and
improves performance.

---
Best regards,
Olga Antonova

Attachment	Content-Type	Size
test.sh	application/x-shellscript	1.3 KB
README	text/plain	943 bytes
fg_3numa_patch.xml	text/xml	651.1 KB
fg_3numa_nopatch.xml	text/xml	662.6 KB
fg_1numa_nopatch.xml	text/xml	718.7 KB