Random pg_upgrade 004_subscription test failure on drongo

From: vignesh C <vignesh21(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Random pg_upgrade 004_subscription test failure on drongo
Date: 2025-03-13 09:04:15
Message-ID: CALDaNm3tjY44HoSwY84=XGEbTg0ruVfD4hAMTm=TgBqVysH4Qw@mail.gmail.com
Lists: pgsql-hackers

Hi,

In one of the recent buildfarm runs, the pg_upgrade/004_subscription
test failed at [1]. It failed with the following:
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting

Unfortunately, the buildfarm does not capture the pg_upgrade_output.d
contents, so we cannot see the exact reason.
I checked old buildfarm failures and found a similar failure on drongo
in pg_upgrade/003_logical_slots at [2].

This test failure can happen for the reason described at [3] and [6].
I'm requoting the relevant contents below; if you have already read
[3] and [6], you can skip them:
---------------------------------------------
The reason for that earlier failure was found to be that the TRUNCATE
command failed due to a "File exists" error:
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:
could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...

## Analysis
I think it was caused by the STATUS_DELETE_PENDING failure and is not
related to recent updates to pg_upgrade.

The file "base/1/2683" is an index file for
pg_largeobject_loid_pn_index, and the
output meant that file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```
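
For context, here is a simplified sketch of the creation path, based
on my reading of src/backend/storage/smgr/md.c (illustrative only; the
real code differs in detail). The key point is that the file is opened
with O_CREAT | O_EXCL, so a leftover file with the same name turns
into a "File exists" error:

```c
/* Simplified sketch of mdcreate(); not the exact source. */
static void
mdcreate_sketch(SMgrRelation reln, ForkNumber forknum, bool isRedo)
{
	char	   *path = relpath(reln->smgr_rlocator, forknum);
	File		fd;

	/* O_EXCL means any existing file with this name makes open() fail. */
	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
	if (fd < 0)
	{
		int			save_errno = errno;

		/* During WAL replay the file may legitimately exist already. */
		if (isRedo)
			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
		if (fd < 0)
		{
			errno = save_errno;
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not create file \"%s\": %m", path)));
		}
	}

	/* ... remember fd in the md-level state for this fork ... */
}
```

So if the old "base/1/2683" has not actually disappeared from the
directory yet, this path reports exactly the error seen above.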

But this is strange. Before calling mdcreate(), we always unlink any
existing file with the same name. Below is the call trace down to the
unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found that Thomas said in [4] that pgunlink() sometimes cannot
actually remove a file even though it returns OK; in that case the
NTSTATUS is STATUS_DELETE_PENDING. Also, a comment in
pgwin32_open_handle() mentions the same thing:

```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING). In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error. We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```
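
Roughly, the translation that comment describes amounts to something
like the following (an illustrative sketch, not the actual source;
pg_RtlGetLastNtStatus() here stands for whatever helper fetches the
saved NT status, and fileFlags for the open flags the caller passed):

```c
/* Illustrative sketch of the errno translation; not the actual source. */
if (GetLastError() == ERROR_ACCESS_DENIED &&
	pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
{
	/*
	 * The old file is still being deleted.  Report a more Unix-like
	 * error: "file exists" if we were asked to create it, "no such
	 * file" if we were merely trying to open it.
	 */
	errno = (fileFlags & O_CREAT) ? EEXIST : ENOENT;
	return -1;
}
```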

The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
that, open() can indeed fail with STATUS_DELETE_PENDING if a deletion
of the file is still pending while we try to open it.
---------------------------------------------
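
The behaviour is easy to demonstrate outside PostgreSQL with a small
standalone Windows program (hypothetical demo code, not PostgreSQL
code): while a handle opened with FILE_SHARE_DELETE keeps the deleted
file in the delete-pending state, re-creating the same name fails with
ERROR_ACCESS_DENIED.

```c
#include <stdio.h>
#include <windows.h>

int
main(void)
{
	const char *name = "delete_pending_demo.tmp";
	HANDLE		keeper;
	HANDLE		again;

	/* Create the file and keep a handle open, allowing delete sharing. */
	keeper = CreateFileA(name, GENERIC_WRITE,
						 FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
						 NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
	if (keeper == INVALID_HANDLE_VALUE)
		return 1;

	/* The "unlink" succeeds, but the file only becomes delete-pending. */
	if (!DeleteFileA(name))
		printf("DeleteFileA failed: %lu\n", GetLastError());

	/* Re-creating the same name now fails with ERROR_ACCESS_DENIED (5). */
	again = CreateFileA(name, GENERIC_WRITE, 0, NULL, CREATE_NEW,
						FILE_ATTRIBUTE_NORMAL, NULL);
	if (again == INVALID_HANDLE_VALUE)
		printf("CreateFileA failed: %lu\n", GetLastError());
	else
		CloseHandle(again);

	CloseHandle(keeper);		/* only now does the name actually go away */
	return 0;
}
```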

This was fixed by applying the following settings on the target
upgrade nodes:
bgwriter_lru_maxpages = 0
checkpoint_timeout = 1h

Attached is a patch along similar lines for 004_subscription.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
[2] - https://www.postgresql.org/message-id/flat/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3] - https://www.postgresql.org/message-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[4] - https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[5] - https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[6] - https://www.postgresql.org/message-id/f0d303f1-e380-5988-91c7-74b755abd4bb%40gmail.com

Regards,
Vignesh

Attachment Content-Type Size
0001-Fix-an-intermetant-BF-failure-in-004_subscription.patch application/octet-stream 1.6 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Random pg_upgrade 004_subscription test failure on drongo
Date: 2025-03-13 12:19:31
Message-ID: [email protected]
Lists: pgsql-hackers


On 2025-03-13 Th 5:04 AM, vignesh C wrote:
> Hi,
>
> In one of the recent buildfarm runs, the pg_upgrade/004_subscription
> test failed at [1]. It failed with the following:
> Restoring database schemas in the new cluster
> *failure*
>
> Consult the last few lines of
> "C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
> for
> the probable cause of the failure.
> Failure, exiting
>
> Unfortunately, the buildfarm does not capture the pg_upgrade_output.d
> contents, so we cannot see the exact reason.

That's not supposed to happen. I am testing a fix to see if I can make
it collect the logs, but for now we'll have to wait until the next failure ...

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Random pg_upgrade 004_subscription test failure on drongo
Date: 2025-03-13 12:40:50
Message-ID: [email protected]
Lists: pgsql-hackers

On 13/03/2025 11:04, vignesh C wrote:
> ## Analysis
> I think it was caused by the STATUS_DELETE_PENDING failure and is not
> related to recent updates to pg_upgrade.
>
> The file "base/1/2683" is an index file for
> pg_largeobject_loid_pn_index, and the
> output meant that file creation failed. Below is a backtrace.
>
> ```
> pgwin32_open() // <-- this returns -1
> open()
> BasicOpenFilePerm()
> PathNameOpenFilePerm()
> PathNameOpenFile()
> mdcreate()
> smgrcreate()
> RelationCreateStorage()
> RelationSetNewRelfilenumber()
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
>
> But this is strange. Before calling mdcreate(), we always unlink any
> existing file with the same name. Below is the call trace down to the
> unlink.
>
> ```
> pgunlink()
> unlink()
> mdunlinkfork()
> mdunlink()
> smgrdounlinkall()
> RelationSetNewRelfilenumber() // common path with above
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
>
> I found that Thomas said in [4] that pgunlink() sometimes cannot
> actually remove a file even though it returns OK; in that case the
> NTSTATUS is STATUS_DELETE_PENDING. Also, a comment in
> pgwin32_open_handle() mentions the same thing:
>
> ```
> /*
> * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
> * gone (Windows NT status code is STATUS_DELETE_PENDING). In that
> * case, we'd better ask for the NT status too so we can translate it
> * to a more Unix-like error. We hope that nothing clobbers the NT
> * status in between the internal NtCreateFile() call and CreateFile()
> * returning.
> *
> ```
>
> The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
> that, open() can indeed fail with STATUS_DELETE_PENDING if a deletion
> of the file is still pending while we try to open it.
> ---------------------------------------------
>
> This was fixed by applying the following settings on the target
> upgrade nodes:
> bgwriter_lru_maxpages = 0
> checkpoint_timeout = 1h
>
> Attached is a patch along similar lines for 004_subscription.

Hmm, this problem isn't limited to this one pg_upgrade test, right? It
could happen with any pg_upgrade invocation. And perhaps in a running
server too, if a relfilenumber is reused quickly. In dropdb() and
DropTableSpace() we do this:

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.
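
A minimal sketch of what that could look like, assuming the
delete-pending case reaches mdcreate() as EEXIST from the
O_CREAT | O_EXCL open (illustrative only, not a tested patch):

```c
	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);

#ifdef WIN32
	if (fd < 0 && errno == EEXIST)
	{
		/* Make every process close its cached smgr file handles, then retry. */
		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
		fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
	}
#endif

	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m", path)));
```

The barrier should force any process still holding a handle to the old
file to close it, letting Windows complete the pending deletion before
the retry.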

--
Heikki Linnakangas
Neon (https://neon.tech)


From: vignesh C <vignesh21(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Random pg_upgrade 004_subscription test failure on drongo
Date: 2025-03-17 10:19:23
Message-ID: CALDaNm2LzW9iQB1=Q5PhMUvhzvYFeGkJFO+3tD8+r6bKS3whoA@mail.gmail.com
Lists: pgsql-hackers

On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> Hmm, this problem isn't limited to this one pg_upgrade test, right? It
> could happen with any pg_upgrade invocation. And perhaps in a running
> server too, if a relfilenumber is reused quickly.

Yes, it can happen in both of these scenarios.

> In dropdb() and DropTableSpace() we do this:
>
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>
> Should we do the same here? Not sure where exactly to put that; perhaps
> in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

I felt mdcreate would be a good place to fix this issue.

Regards,
Vignesh


From: vignesh C <vignesh21(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Random pg_upgrade 004_subscription test failure on drongo
Date: 2025-03-21 13:24:31
Message-ID: CALDaNm2y+nf-V9tjKwvbPprobZs1t_UrcCpJ0qYD5-KkOUFAyg@mail.gmail.com
Lists: pgsql-hackers

On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
>
> Hmm, this problem isn't limited to this one pg_upgrade test, right? It
> could happen with any pg_upgrade invocation. And perhaps in a running
> server too, if a relfilenumber is reused quickly. In dropdb() and
> DropTableSpace() we do this:
>
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>
> Should we do the same here? Not sure where exactly to put that; perhaps
> in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

How about a patch similar to the attached one? I have run pg_upgrade
tests multiple times, but unfortunately, I was unable to reproduce the
issue or verify these changes.

Regards,
Vignesh

Attachment Content-Type Size
v1-0001-Fix-issue-with-file-handle-retention-during-CREAT.patch application/octet-stream 2.4 KB