From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: exclusion(at)gmail(dot)com
Subject: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-04 11:00:01
Message-ID: [email protected]
Lists: pgsql-bugs
The following bug has been logged on the website:
Bug reference: 18014
Logged by: Alexander Lakhin
Email address: exclusion(at)gmail(dot)com
PostgreSQL version: 16beta2
Operating system: Ubuntu 22.04
Description:
Yesterday's test failure on prion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-07-03%2010%3A13%3A03
made me wonder what's going on there and whether it's yet another issue
with relcache invalidation (bug #17994).
(
SELECT schema_to_xmlschema('testxmlschema', false, true, '');
ERROR: relation with OID 29598 does not exist
CONTEXT: SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE
relnamespace = 29597 AND relkind IN ('r','m','v') AND
pg_catalog.has_table_privilege (oid, 'SELECT') ORDER BY relname;"
Other failures of that kind:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prion&dt=2023-06-20%2001%3A56%3A04&stg=check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prion&dt=2023-04-15%2017%3A17%3A09&stg=check
)
I managed to construct a simple reproducer for the error:
for ((n=1; n<=30; n++)); do
  echo "ITERATION $n"
  numclients=30
  for ((c=1; c<=$numclients; c++)); do
    cat << EOF | psql >psql_$c.log &
CREATE SCHEMA testxmlschema_$c;
SELECT format('CREATE TABLE testxmlschema_$c.test_%s (a int);', g) FROM
generate_series(1, 30) g
\\gexec
SET parallel_setup_cost = 1;
SET min_parallel_table_scan_size = '1kB';
SELECT schema_to_xmlschema('testxmlschema_$c', true, false, '');
SELECT format('DROP TABLE testxmlschema_$c.test_%s', g) FROM
generate_series(1, 30) g
\\gexec
DROP SCHEMA testxmlschema_$c;
EOF
  done
  wait
  grep 'ERROR:' server.log && break
done
With a server compiled as follows:
CPPFLAGS="-O0 -DCATCACHE_FORCE_RELEASE" ./configure -q --enable-debug
--enable-cassert --enable-tap-tests --with-libxml && make ...
(More precisely, "#ifndef CATCACHE_FORCE_RELEASE" in ReleaseCatCache()
does matter here.)
I get errors as in the test in question:
...
ITERATION 9
ITERATION 10
ERROR: relation with OID 59777 does not exist
CONTEXT: parallel worker
SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE relnamespace =
57162 AND relkind IN ('r','m','v') AND pg_catalog.has_table_privilege (oid,
'SELECT') ORDER BY relname;"
2023-07-04 12:48:14.205 MSK [3111661] ERROR: relation with OID 59777 does
not exist
2023-07-04 12:48:14.206 MSK [3111598] ERROR: relation with OID 59777 does
not exist
With debug logging added in src/backend/utils/adt/acl.c, I see that
SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)) returns true in
has_table_privilege_id(), but later, in
pg_class_aclcheck()/pg_class_aclmask_ext(),
SearchSysCache1(RELOID, ObjectIdGetDatum(table_oid)) returns NULL.
This is reproduced on REL_10_STABLE .. master.
The first commit that demonstrates the issue is 61c2e1a95, which improved
access to parallelism for SPI users; one such user is
schema_to_xmlschema_internal() (see also schema_get_xml_visible_tables()).
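The sequence reported above amounts to a check-then-use race: the existence probe and the later fetch are separate cache lookups, and invalidation messages can be absorbed between them. A minimal, self-contained C sketch of that shape (a one-slot toy cache with hypothetical names, not the real catcache API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One-slot toy cache standing in for the catcache; all names hypothetical. */
static int cached_oid = 0;          /* 0 = empty slot */

static bool cache_exists(int oid)  { return cached_oid == oid; }
static int *cache_fetch(int oid)   { return cached_oid == oid ? &cached_oid : NULL; }
static void cache_invalidate(void) { cached_oid = 0; }
static void cache_fill(int oid)    { cached_oid = oid; }

/* Mirrors the SearchSysCacheExists1()-then-SearchSysCache1() pattern:
 * returns true only when the second, separate fetch still finds the entry. */
static bool check_then_fetch(int oid, bool inval_in_between)
{
    if (!cache_exists(oid))
        return false;                /* corresponds to PG_RETURN_NULL() */
    if (inval_in_between)
        cache_invalidate();          /* inval absorbed between the two calls */
    return cache_fetch(oid) != NULL; /* NULL here is the reported ERROR path */
}
```

With inval_in_between set, the probe succeeds but the fetch fails, which is the shape of the "relation with OID ... does not exist" failure.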
From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-16 20:00:01
Message-ID: [email protected]
Lists: pgsql-bugs
04.07.2023 14:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 18014
> Logged by: Alexander Lakhin
> Email address: exclusion(at)gmail(dot)com
> PostgreSQL version: 16beta2
> Operating system: Ubuntu 22.04
> Description:
>
> Yesterday's test failure on prion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-07-03%2010%3A13%3A03
> made me wonder, what's going on there and whether it's yet another issue
> with invalidating relcache (bug #17994).
> (
> SELECT schema_to_xmlschema('testxmlschema', false, true, '');
> ERROR: relation with OID 29598 does not exist
> CONTEXT: SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE
> relnamespace = 29597 AND relkind IN ('r','m','v') AND
> pg_catalog.has_table_privilege (oid, 'SELECT') ORDER BY relname;"
I investigated this case and would like to share my findings.
I added in has_table_privilege_id(), just below

    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
        PG_RETURN_NULL();

the following loop:

    for (int i = 0; i < 100; i++)
    {
        bool sce = SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid));

        if (!sce)
        {
            elog(LOG, "has_table_privilege_id(): no syscache entry on iteration %d", i);
            break;
        }
    }
and discovered that when the reproducing script uses parallel worker(s), the
syscache entry disappears while this loop executes. But that does not happen
when the query "SELECT oid FROM pg_catalog.pg_class WHERE ..." is executed
in a regular backend.
AFAICS, the difference is in LockRelationOid():
    res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

    /*
     * Now that we have the lock, check for invalidation messages, so that we
     * will update or flush any stale relcache entry before we try to use it.
     * RangeVarGetRelid() specifically relies on us for this.  We can skip
     * this in the not-uncommon case that we already had the same type of lock
     * being requested, since then no one else could have modified the
     * relcache entry in an undesirable way.  (In the case where our own xact
     * modifies the rel, the relcache update happens via
     * CommandCounterIncrement, not here.)
     *
     * However, in corner cases where code acts on tables (usually catalogs)
     * recursively, we might get here while still processing invalidation
     * messages in some outer execution of this function or a sibling.  The
     * "cleared" status of the lock tells us whether we really are done
     * absorbing relevant inval messages.
     */
    if (res != LOCKACQUIRE_ALREADY_CLEAR)
    {
        AcceptInvalidationMessages();
        MarkLockClear(locallock);
    }
when LockRelationOid() is called for pg_class_oid_index inside
SearchCatCacheMiss() -> systable_beginscan() -> index_open() -> relation_open().
The parallel worker doesn't have a lock on pg_class_oid_index before
executing the query, so it gets the lock with res == LOCKACQUIRE_OK (not
LOCKACQUIRE_ALREADY_CLEAR as in the regular backend case), after that it
processes invalidation messages (which can make the backend use a newer
catalog snapshot), and at the end it does systable_endscan() ->
index_close() -> UnlockRelationId() -> LockRelease()...
Thus, on the next iteration it gets the lock anew, again with res ==
LOCKACQUIRE_OK, and the whole ceremony is repeated.
It's not clear to me whether this parallel worker behavior is expected
and, if so, what to fix to avoid the test failure.
Best regards,
Alexander
From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: exclusion(at)gmail(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-20 07:28:02
Message-ID: [email protected]
Lists: pgsql-bugs
At Sun, 16 Jul 2023 23:00:01 +0300, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote in
> The parallel worker doesn't have a lock on pg_class_oid_index before
> executing the query, so it gets the lock and res == LOCKACQUIRE_OK
> (not
> LOCKACQUIRE_ALREADY_CLEAR as in a regular backend case), after that it
> processes invalidation messages (this can make the backend use a newer
> catalog snapshot), and at the end it does systable_endscan() ->
> index_close() -> UnlockRelationId() -> LockRelease()...
> Thus, on a next iteration it gets the lock anew, with the res ==
> LOCKACQUIRE_OK again, and all that ceremony repeated.
>
> It's not clear to me, whether this parallel worker behavior is
> expected and if so, what to fix to avoid the test failure.
That is, the function is not parallel-safe. In fact, it is marked as
'r' in pg_proc.proparallel. So the real question appears to be how it
ended up running in a parallel worker.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: exclusion(at)gmail(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-20 07:44:59
Message-ID: [email protected]
Lists: pgsql-bugs
At Thu, 20 Jul 2023 16:28:02 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> That is, the function is not parallel-safe. In fact it is marked as
> 'r' in pg_proc.proparallel. So, the real question appears to be how it
> ended up running in a paralell worker.
Stupid. What we should do here would be ensuring the function doesn't
invoke parallel workers, maybe.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: exclusion(at)gmail(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-20 08:29:24
Message-ID: [email protected]
Lists: pgsql-bugs
At Thu, 20 Jul 2023 16:44:59 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> Stupid. What we should do here would be ensuring the function doesn't
> invoke parallel workers, maybe.
So, for testing: I no longer see the error after applying the attached patch.
There are other SPI calls, but I didn't examine them at all.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment: inhibit_parallel_in_xmlfuncs_poc.diff (text/x-patch, 911 bytes)
From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-21 09:00:01
Message-ID: [email protected]
Lists: pgsql-bugs
Hello Horiguchi-san,
20.07.2023 11:29, Kyotaro Horiguchi wrote:
> At Thu, 20 Jul 2023 16:44:59 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
>> Stupid. What we should do here would be ensuring the function doesn't
>> invoke parallel workers, maybe.
> So, for testing, I didn't see the error with applying the attached patch.
> There are other SPI calls but I didn't see them at all.
>
Thank you for looking into it!
I think we need to determine at which level the problem should be fixed:
1) test xmlmap fails sporadically due to the catalog changes caused by
parallel tests activity
2) schema_to_xmlschemaX() can fail when parallel workers are used
3) has_table_privilegeX() can fail sporadically when executed within a
parallel worker
4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
when repeated in a parallel worker
The proposed patch fixes the issue at level 2, but the error can still be
seen at level 3. For example, you can replace
SELECT schema_to_xmlschema('testxmlschema_$c', true, false, '');
in the above script with
SELECT oid FROM pg_catalog.pg_class WHERE relnamespace = 1 AND relkind IN ('r', 'm', 'v') AND
pg_catalog.has_table_privilege (oid, 'SELECT');
Probably, similar errors can also occur with
"SELECT * FROM information_schema.tables" or with other
information_schema views that call has_table_privilege().
But if it's okay to have sporadic errors at the lower levels, then maybe just
fix the issue at level 1. Please look at the attached patch, which isolates
schema_get_xml_visible_tables(Oid nspid) from changes in other namespaces.
Best regards,
Alexander
Attachment: isolate-schema_get_xml_visible_tables.patch (text/x-patch, 884 bytes)
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-21 17:20:55
Message-ID: [email protected]
Lists: pgsql-bugs
Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
> I think that we need to determine the level where the problem that should
> be fixed is:
> 1) test xmlmap fails sporadically due to the catalog changes caused by
> parallel tests activity
> 2) schema_to_xmlschemaX() can fail when parallel workers are used
> 3) has_table_privilegeX() can fail sporadically when executed within a
> parallel worker
> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
> when repeated in a parallel worker
Yeah, that's not immediately obvious. IIUC, the situation we are
looking at is that SearchSysCacheExists can succeed even though the
tuple we found is already dead at the instant that the function
exits (thanks to absorption of inval messages during relation_close).
The fact that that only happens in parallel workers is pure chance
really. It is not okay for has_table_privilegeX to depend on the
fact that the surrounding query already has some lock on pg_class.
So this means that the approach has_table_privilegeX uses of
assuming that successful SearchSysCacheExists means it can call
pg_class_aclcheck without fear is just broken.
If we suppose that that assumption is only being made in the
has_foo_privilege functions, then one way we could fix it is to extend
the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
and get rid of the separate SearchSysCacheExists check. However,
I can't avoid the suspicion that we have other places assuming the
same thing. So I think what we really ought to be doing is one
of two things:
1. Hack SearchSysCacheExists to account for this issue, by making it
loop if it finds a syscache entry but sees that the entry is already
dead. (We have to loop, not just return false, in case the row was
updated rather than deleted.) Maybe all the syscache lookup
functions need to do likewise; it's certainly not intuitively
reasonable for them to return already-known-stale entries.
2. Figure out how come we are executing a cache inval on the way
out of syscache entry creation, and stop that from happening.
I like #2 better if it's not hard to do cleanly. However, I'm not
quite sure how we are getting to an inval during relation close;
maybe that's not something we want to prevent.
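As a sketch only, option 1 might look like the following retry shape over a toy entry (hypothetical names, not the real syscache code): keep looping while the entry we found is already dead, and return false only when no entry exists at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: an entry with a "dead" flag, standing in for a catcache tuple
 * that an invalidation has marked stale.  All names are hypothetical. */
typedef struct
{
    bool present;
    bool dead;
} ToyEntry;

static ToyEntry *toy_lookup(ToyEntry *e)
{
    if (!e->present)
        return (ToyEntry *) 0;
    return e;
}

/* Sketch of option 1: loop if we find an entry but it is already dead,
 * rather than trusting the first hit.  We must loop, not just return false,
 * because the row may have been updated rather than deleted. */
static bool exists_with_retry(ToyEntry *e)
{
    for (;;)
    {
        ToyEntry *hit = toy_lookup(e);

        if (hit == (ToyEntry *) 0)
            return false;       /* row really is gone */
        if (!hit->dead)
            return true;        /* live entry: safe to report existence */
        hit->dead = false;      /* simulate re-reading the updated row */
    }
}
```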
regards, tom lane
From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-21 19:00:01
Message-ID: [email protected]
Lists: pgsql-bugs
Hello Tom,
Thank you for your considerations!
21.07.2023 20:20, Tom Lane wrote:
> Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
>> I think that we need to determine the level where the problem that should
>> be fixed is:
>> 1) test xmlmap fails sporadically due to the catalog changes caused by
>> parallel tests activity
>> 2) schema_to_xmlschemaX() can fail when parallel workers are used
>> 3) has_table_privilegeX() can fail sporadically when executed within a
>> parallel worker
>> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>> when repeated in a parallel worker
> Yeah, that's not immediately obvious. IIUC, the situation we are
> looking at is that SearchSysCacheExists can succeed even though the
> tuple we found is already dead at the instant that the function
> exits (thanks to absorption of inval messages during relation_close).
> The fact that that only happens in parallel workers is pure chance
> really. It is not okay for has_table_privilegeX to depend on the
> fact that the surrounding query already has some lock on pg_class.
> So this means that the approach has_table_privilegeX uses of
> assuming that successful SearchSysCacheExists means it can call
> pg_class_aclcheck without fear is just broken.
>
> If we suppose that that assumption is only being made in the
> has_foo_privilege functions, then one way we could fix it is to extend
> the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
> and get rid of the separate SearchSysCacheExists check. However,
> I can't avoid the suspicion that we have other places assuming the
> same thing.
Running through SearchSysCacheExistsX() calls, I've found an interesting
(and optimistic) comment in src/backend/catalog/namespace.c:
* ... The underlying IsVisible functions
* always use an up-to-date snapshot and so might see the object as already
* gone when it's still visible to the transaction snapshot. (There is no race
* condition in the current coding because we don't accept sinval messages
* between the SearchSysCacheExists test and the subsequent lookup.)
*/
Datum
pg_table_is_visible(PG_FUNCTION_ARGS)
{
    Oid         oid = PG_GETARG_OID(0);

    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(oid)))
        PG_RETURN_NULL();

    PG_RETURN_BOOL(RelationIsVisible(oid));
}
...
bool
RelationIsVisible(Oid relid)
{
    HeapTuple   reltup;
    Form_pg_class relform;
    Oid         relnamespace;
    bool        visible;

    reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(reltup))
        elog(ERROR, "cache lookup failed for relation %u", relid);
So that's exactly the same coding pattern; thus, fixing pg_class_aclcheck
alone is apparently not an option.
> So I think what we really ought to be doing is one
> of two things:
>
> 1. Hack SearchSysCacheExists to account for this issue, by making it
> loop if it finds a syscache entry but sees that the entry is already
> dead. (We have to loop, not just return false, in case the row was
> updated rather than deleted.) Maybe all the syscache lookup
> functions need to do likewise; it's certainly not intuitively
> reasonable for them to return already-known-stale entries.
>
> 2. Figure out how come we are executing a cache inval on the way
> out of syscache entry creation, and stop that from happening.
I wrote about LockRelationOid() before, and I still think that invalidation
messages are processed in that call (reached via SearchCatCacheMiss() ->
systable_beginscan() -> index_open() -> relation_open(), not via
relation_close()). So it seems that SearchSysCacheX calls find an entry,
that is not dead, but that entry can be dead (not found) for the next call
if invalidation messages are processed.
> I like #2 better if it's not hard to do cleanly. However, I'm not
> quite sure how we are getting to an inval during relation close;
> maybe that's not something we want to prevent.
Yes, there is a detailed comment in LockRelationOid() that explains why
AcceptInvalidationMessages() is called. (I tried removing that call, just
for testing, and got 6 test failures during `make check`.)
Best regards,
Alexander
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-21 19:21:46
Message-ID: [email protected]
Lists: pgsql-bugs
Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
> 21.07.2023 20:20, Tom Lane wrote:
>> I like #2 better if it's not hard to do cleanly. However, I'm not
>> quite sure how we are getting to an inval during relation close;
>> maybe that's not something we want to prevent.
> Yes, there is a detailed comment in LockRelationOid(), that explains why
> AcceptInvalidationMessages() is called. (I've tried to remove that call
> now, just for testing, and get 6 tests failed during `make check`.)
Yes, we certainly want to do that during LockRelationOid. But what
seems to be happening here is an inval while we are closing/unlocking
the catalog we got the syscache entry from. That is, the expected
behavior here is:
SearchSysCacheExists:
* is entry present-and-valid?
No, so...
* open and lock relevant catalog (with possible inval)
* scan catalog, find desired row, create valid syscache entry
* close and unlock catalog
* return success
SearchSysCache1 (from pg_class_aclmask_ext):
* is entry present-and-valid?
Yes, so increment its refcount and return it
There is no inval in the entry-already-present code path in syscache
lookup. So if we are seeing this failure, ISTM it must mean that an
inval is happening during "close and unlock catalog", which seems like
something that we don't want. But I've not traced exactly how that
happens.
regards, tom lane
From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-25 10:00:00
Message-ID: [email protected]
Lists: pgsql-bugs
Hi Tom,
21.07.2023 22:21, Tom Lane wrote:
> Yes, we certainly want to do that during LockRelationOid. But what
> seems to be happening here is an inval while we are closing/unlocking
> the catalog we got the syscache entry from. That is, the expected
> behavior here is:
>
> SearchSysCacheExists:
>
> * is entry present-and-valid?
> No, so...
>
> * open and lock relevant catalog (with possible inval)
>
> * scan catalog, find desired row, create valid syscache entry
>
> * close and unlock catalog
>
> * return success
>
> SearchSysCache1 (from pg_class_aclmask_ext):
>
> * is entry present-and-valid?
> Yes, so increment its refcount and return it
>
> There is no inval in the entry-already-present code path in syscache
> lookup. So if we are seeing this failure, ISTM it must mean that an
> inval is happening during "close and unlock catalog", which seems like
> something that we don't want. But I've not traced exactly how that
> happens.
Yes, but here we deal with -DCATCACHE_FORCE_RELEASE (added to config_env
on prion), so the cache entry that was just found in
SearchSysCacheExists() is removed immediately via
SearchSysCacheExists() -> ReleaseSysCache(tuple) -> ReleaseCatCache(tuple).
So, while the construction "if (SearchSysCacheExists()) ... SearchSysCache1()"
seems robust under normal conditions, it can be broken when catcache entries
are released forcefully. Thus, if the worst consequence of the issue is sporadic
test failures on prion, then maybe fix it in the least invasive way (at level 1).
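A toy C sketch of the force-release effect described here (hypothetical names, not the real ReleaseCatCache() path): the probe drops the entry it just found, so the next lookup must go back to the catalog, which may have changed in the meantime.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of CATCACHE_FORCE_RELEASE: the "exists" probe releases the
 * cached entry immediately after checking it, so every later lookup must
 * rebuild the entry from the (possibly changed) catalog. */
typedef struct
{
    int catalog_oid;    /* source of truth: 0 means the row was dropped */
    int cached_oid;     /* one-slot cache: 0 means empty */
} ToyCache;

static bool toy_fill_from_catalog(ToyCache *c, int oid)
{
    c->cached_oid = (c->catalog_oid == oid) ? oid : 0;
    return c->cached_oid == oid;
}

/* exists: fill the cache, check, then force-release the entry. */
static bool toy_exists_force_release(ToyCache *c, int oid)
{
    bool found = toy_fill_from_catalog(c, oid);

    c->cached_oid = 0;          /* ReleaseCatCache() under FORCE_RELEASE */
    return found;
}

/* fetch: with the cache empty, we go back to the catalog, which may have
 * changed concurrently since the exists probe. */
static bool toy_fetch(ToyCache *c, int oid)
{
    if (c->cached_oid == oid)
        return true;
    return toy_fill_from_catalog(c, oid);
}
```

Without force-release, the fetch would hit the entry left behind by the probe; with it, a concurrent catalog change between the two calls makes the fetch fail.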
Best regards,
Alexander
From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: exclusion(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-26 02:29:23
Message-ID: [email protected]
Lists: pgsql-bugs
At Tue, 25 Jul 2023 13:00:00 +0300, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote in
> Hi Tom,
>
> 21.07.2023 22:21, Tom Lane wrote:
> > Yes, we certainly want to do that during LockRelationOid. But what
> > seems to be happening here is an inval while we are closing/unlocking
> > the catalog we got the syscache entry from. That is, the expected
> > behavior here is:
> >
> > SearchSysCacheExists:
> >
> > * is entry present-and-valid?
> > No, so...
> >
> > * open and lock relevant catalog (with possible inval)
> >
> > * scan catalog, find desired row, create valid syscache entry
> >
> > * close and unlock catalog
> >
> > * return success
> >
> > SearchSysCache1 (from pg_class_aclmask_ext):
> >
> > * is entry present-and-valid?
> > Yes, so increment its refcount and return it
> >
> > There is no inval in the entry-already-present code path in syscache
> > lookup. So if we are seeing this failure, ISTM it must mean that an
> > inval is happening during "close and unlock catalog", which seems like
> > something that we don't want. But I've not traced exactly how that
> > happens.
>
> Yes, but here we deal with -DCATCACHE_FORCE_RELEASE (added to config_env
> on prion), so the cache entry, that was just found in
> SearchSysCacheExists(), is removed immediately because of
> SearchSysCacheExists() -> ReleaseSysCache(tuple) -> ReleaseCatCache(tuple).
>
> So, while the construction "if (SearchSysCacheExists()) ... SearchSysCache1()"
> seems robust for normal conditions, it might be broken when catcache
I agree about the safety of the construct.
> entries released forcefully. Thus, if the worst consequence of the issue
> is sporadic test failures on prion, then may be fix it in a least invasive
> way (on level 1).
> 1) test xmlmap fails sporadically due to the catalog changes caused by
> parallel tests activity
> 2) schema_to_xmlschemaX() can fail when parallel workers are used
> 3) has_table_privilegeX() can fail sporadically when executed within a
> parallel worker
Doesn't this imply that the function isn't parallel-safe? The issue is
gone by marking it and all variants as parallel-restricted. That seems
to be a reasonable way to address this issue.
> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
> when repeated in a parallel worker
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-26 02:41:57
Message-ID: [email protected]
Lists: pgsql-bugs
Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> writes:
> At Tue, 25 Jul 2023 13:00:00 +0300, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote in
>> 21.07.2023 22:21, Tom Lane wrote:
>>> There is no inval in the entry-already-present code path in syscache
>>> lookup. So if we are seeing this failure, ISTM it must mean that an
>>> inval is happening during "close and unlock catalog", which seems like
>>> something that we don't want. But I've not traced exactly how that
>>> happens.
>> Yes, but here we deal with -DCATCACHE_FORCE_RELEASE (added to config_env
>> on prion), so the cache entry, that was just found in
>> SearchSysCacheExists(), is removed immediately because of
>> SearchSysCacheExists() -> ReleaseSysCache(tuple) ->
>> ReleaseCatCache(tuple).
Oh! So are you saying that this case cannot happen in the wild
(that is, in a non-cache-clobbering build)? If so, I think that
there's a good case to be made that the cache-clobbering behavior
is too strict, and we should change that (not sure just how) rather
than complicating callers that are perfectly safe in reality.
> Doesn't this imply that the function isn't parallel-safe? The issue is
> gone by marking it and all variants as parallel-restricted.
As I said earlier, I think that's a purely coincidental "fix" for
this specific manifestation. Either SearchSysCacheExists followed
by a syscache lookup of the same tuple should be considered safe,
or it shouldn't. If it should be considered safe, we need to fix the
cache-clobber test scaffolding to not give a false positive. While if
it shouldn't, we need to get rid of that coding pattern, not apply
high-level band-aids that remove just one particular path to reaching
the problem. I'm not dead set on either answer at this point, but
I think those are the plausible alternatives.
regards, tom lane
From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: exclusion(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-26 02:53:40
Message-ID: [email protected]
Lists: pgsql-bugs
At Wed, 26 Jul 2023 11:29:23 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Tue, 25 Jul 2023 13:00:00 +0300, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote in
> > 3) has_table_privilegeX() can fail sporadically when executed within a
> > parallel worker
>
> Doesn't this imply that the function isn't parallel-safe? The issue is
> gone by marking it and all variants as parallel-restricted. It seems
> to be a reasonable way to address this issue.
Mmm. This may give a somewhat different impression from my intention.
I meant that the function may return different values on different
workers for the same parameter. This means the function is either not
"stable" or "stable but parallel-unsafe". I think the latter is true.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-26 02:59:56
Message-ID: [email protected]
Lists: pgsql-bugs
At Tue, 25 Jul 2023 22:41:57 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> writes:
> > Doesn't this imply that the function isn't parallel-safe? The issue is
> > gone by marking it and all variants as parallel-restricted.
>
> As I said earlier, I think that's a purely coincidental "fix" for
> this specific manifestation. Either SearchSysCacheExists followed
> by a syscache lookup of the same tuple should be considered safe,
> or it shouldn't. If it should be considered safe, we need to fix the
(So, it came to this after all..)
Yeah, as I posted at the same time, what I meant is not that the
sequence is unsafe. It is safe even in a parallel worker. What I
meant was that there seems to be a case where it returns different
results for the same parameters when called in different workers.
> cache-clobber test scaffolding to not give a false positive. While if
> it shouldn't, we need to get rid of that coding pattern, not apply
> high-level band-aids that remove just one particular path to reaching
> the problem. I'm not dead set on either answer at this point, but
> I think those are the plausible alternatives.
Right.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-07-26 03:32:38 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
On Tue, Jul 25, 2023 at 10:41:57PM -0400, Tom Lane wrote:
> As I said earlier, I think that's a purely coincidental "fix" for
> this specific manifestation. Either SearchSysCacheExists followed
> by a syscache lookup of the same tuple should be considered safe,
> or it shouldn't. If it should be considered safe, we need to fix the
> cache-clobber test scaffolding to not give a false positive. While if
> it shouldn't, we need to get rid of that coding pattern, not apply
> high-level band-aids that remove just one particular path to reaching
> the problem. I'm not dead set on either answer at this point, but
> I think those are the plausible alternatives.
FWIW, I'm having a hard time thinking of a reason that we should
not support SearchSysCacheExists()+lookup as a valid pattern, even
if the cache is clobbered. I am pretty sure that there are other code
paths in the tree, not mentioned on this thread, that do exactly that
(I haven't checked, but indexcmds.c is one that comes to mind).
--
Michael
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-07-26 03:54:33 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
Michael Paquier <michael(at)paquier(dot)xyz> writes:
> FWIW, I'm having a hard time thinking of a reason that we should
> not support SearchSysCacheExists()+lookup as a valid pattern, even
> if the cache is clobbered. I am pretty sure that there are other code
> paths in the tree, not mentioned on this thread, that do exactly that
> (I haven't checked, but indexcmds.c is one that comes to mind).
Yeah. As I see it, there are several questions here.
First, is the SearchSysCacheExists()+lookup pattern actually unsafe,
that is, is there a way to break it without any cache-clobbering debug
options enabled? If so, there's no question we have to get rid of it.
If it isn't really unsafe, then we have a choice whether to get rid of it
or adjust the cache-clobbering options to not fail.
I think the main argument for keeping it is exactly that we've probably
depended on the idea in multiple places, and finding them all might be
hard (and not re-introducing more later, even harder).
On the other hand, I have to concede that such coding patterns are
inherently fragile: you can't introduce any new opportunities for a
cache inval between the SearchSysCacheExists() and the lookup, or
there's definitely a real hazard --- which we might not find for
a while, if this example is anything to go by.
These has_foo_privilege() functions are arguably a good example of the
least safe way to go about it, in that the SearchSysCacheExists and
the subsequent lookup aren't textually close or even in the same
module. Nor do we have any comments warning against introducing more
logic into the critical code paths.
So another conclusion we could arrive at is that the pattern isn't
inherently unsafe, but we should only allow it when there's not much
code between the two calls, which would probably lead to wanting to
rewrite the has_foo_privilege() family after all.
I don't yet have an opinion about which way to go.
regards, tom lane
From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | tgl(at)sss(dot)pgh(dot)pa(dot)us |
Cc: | exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-07-26 08:36:00 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
At Wed, 26 Jul 2023 11:59:56 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> Yeah, as I posted at the same time, what I meant is not that the
> sequence is unsafe. It is safe even on a parallel worker. What I
> meant was that there seems to be a case where it returns different
> results for the same parameters when called on different workers.
Okay, I'm convinced that that won't happen, thanks to the command
counter, which I had overlooked above.
Sorry for the noise.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: | Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-08-15 23:34:02 |
Message-ID: | CAKU4AWpqGMLLSDxUZg8RojUaBGUomFp3QmYo-b7zX02q8zxmRA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
Hi:
> I think that we need to determine the level where the problem that should
> be fixed is:
>
Since this thread has not been updated in the last 20 days, I'm not
sure whether anyone is still working on this. If not, I think reducing
the level of the fix is an option, to avoid the unstable test case.
> Please look at the patch attached, that makes
> schema_get_xml_visible_tables(Oid nspid) isolated from changes in other
> namespaces.
>
I read the patch; it looks good and the change is pretty mild, so
personally +1 for this solution. Another solution I suggested is
to force the query to use an index scan for the test case only [1],
which gets a similar result but only works for this particular test case.
I think Alexander's method is better; I'm just noting my findings here
in case I missed something.
--
Best Regards
Andy Fan
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-10-12 21:01:27 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
I wrote:
> Michael Paquier <michael(at)paquier(dot)xyz> writes:
>> FWIW, I'm having a hard time thinking of a reason that we should
>> not support SearchSysCacheExists()+lookup as a valid pattern, even
>> if the cache is clobbered. I am pretty sure that there are other code
>> paths in the tree, not mentioned on this thread, that do exactly that
>> (I haven't checked, but indexcmds.c is one that comes to mind).
> Yeah. As I see it, there are several questions here.
> First, is the SearchSysCacheExists()+lookup pattern actually unsafe,
> that is, is there a way to break it without any cache-clobbering debug
> options enabled? If so, there's no question we have to get rid of it.
I came back to this question today, and after more thought I'm going
to cast my vote for "let's not trust SearchSysCacheExists()+lookup".
It just seems like that's too fragile. The case reported in this
thread of it failing in parallel workers but not main should scare
anybody, and the future course of development seems far more likely
to introduce new problems than remove them.
I spent some time looking through existing SearchSysCacheExists calls,
and I could only find two sets of routines where we seem to be
depending on SearchSysCacheExists to protect a subsequent lookup
somewhere else, and there isn't any lock on the object in question.
Those are the has_foo_privilege functions discussed here, and the
foo_is_visible functions near the bottom of namespace.c. I'm not
sure why we've not heard complaints traceable to the foo_is_visible
family. Maybe nobody has tried hard to break them, or maybe they
are just less likely to be used in ways that are at risk.
It turns out to not be that hard to get rid of the problem in the
has_foo_privilege family, as per attached patches. I've not looked
at fixing the foo_is_visible family, but that probably ought to be a
separate patch anyway.
BTW, while nosing around I found what seems like a very nasty related
bug. Suppose that a catalog tuple being loaded into syscache contains
some toasted fields. CatalogCacheCreateEntry will flatten the tuple,
involving fetches from toast tables that will certainly cause
AcceptInvalidationMessages calls. What if one of those should have
invalidated this tuple? We will not notice, because it's not in
the hashtable yet. When we do add it, we will mark it not-dead,
meaning that the stale entry looks fine and could persist for a long
while.
Anyway, as to the attached patches: I split this into two parts
just to make it easier to review. 0001 deals with the functions
that use pg_class_aclcheck and related, while 0002 deals with the
ones that go through object_aclcheck. In both cases, we're
basically extending the API convention somebody added awhile ago
that foo_aclcheck and foo_aclmask should have extended versions
foo_aclcheck_ext and foo_aclmask_ext that take an "is_missing"
argument. That wasn't terribly well documented IMO, so 0001
starts out with massaging the comments to make it clearer.
I also changed some existing places in pg_attribute_aclmask_ext and
pg_attribute_aclcheck_all to conform to the is_missing convention,
rather than hard-wiring a decision that it's okay to ignore lookup
failure.
pg_attribute_aclcheck_all also had a most curious decision that
it could call pg_attribute_aclmask, which not only opens it right
back up to potential lookup failure but is quite inefficient,
requiring three syscache lookups instead of one for each
column having a non-null attacl. It just takes a few more lines
to call aclmask() directly.
0001's changes in acl.c are straightforward, but it's worth noting
that the has_sequence_privilege functions hadn't gotten the memo
about not failing when a bogus relation OID is passed.
0002 is pretty straightforward as well, just adding an "_ext"
version of object_aclcheck/object_aclmask and using those
where appropriate.
It looks like 0001 could be back-patched as far as v14 without
too much trouble. Before that, there isn't pg_class_aclcheck_ext,
and I'm not sure it's worth the trouble to add. 0002 will only
work in HEAD and v16 because object_aclcheck wasn't there earlier.
I'm not sure whether to bother back-patching at all, given that
we've only seen this problem in test builds.
Comments?
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
v1-0001-fix-table-privilege-checks.patch | text/x-diff | 14.5 KB |
v1-0002-fix-other-privilege-checks.patch | text/x-diff | 23.5 KB |
From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-10-13 15:00:00 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
Hello Tom,
13.10.2023 00:01, Tom Lane wrote:
> I came back to this question today, and after more thought I'm going
> to cast my vote for "let's not trust SearchSysCacheExists()+lookup".
> It just seems like that's too fragile. The case reported in this
> thread of it failing in parallel workers but not main should scare
> anybody, and the future course of development seems far more likely
> to introduce new problems than remove them.
Thanks for digging into this!
> I spent some time looking through existing SearchSysCacheExists calls,
> and I could only find two sets of routines where we seem to be
> depending on SearchSysCacheExists to protect a subsequent lookup
> somewhere else, and there isn't any lock on the object in question.
> Those are the has_foo_privilege functions discussed here, and the
> foo_is_visible functions near the bottom of namespace.c. I'm not
> sure why we've not heard complaints traceable to the foo_is_visible
> family. Maybe nobody has tried hard to break them, or maybe they
> are just less likely to be used in ways that are at risk.
I'll try to research/break xxx_is_visible and share my findings tomorrow.
> It turns out to not be that hard to get rid of the problem in the
> has_foo_privilege family, as per attached patches. I've not looked
> at fixing the foo_is_visible family, but that probably ought to be a
> separate patch anyway.
Yeah, the attached patches greatly improve consistency. The only
inconsistency I've found in the patches is a missing comma here:
+ * Exported generic routine for checking a user's access privileges to an object
+ * with is_missing
You removed "this case is not a user-facing error, so elog not ereport",
and I think it's good for consistency too, as all aclmask/aclcheck
functions now use ereport().
> BTW, while nosing around I found what seems like a very nasty related
> bug. Suppose that a catalog tuple being loaded into syscache contains
> some toasted fields. CatalogCacheCreateEntry will flatten the tuple,
> involving fetches from toast tables that will certainly cause
> AcceptInvalidationMessages calls. What if one of those should have
> invalidated this tuple? We will not notice, because it's not in
> the hashtable yet. When we do add it, we will mark it not-dead,
> meaning that the stale entry looks fine and could persist for a long
> while.
Yeah, it's an interesting case that needs investigation, IMO.
I'll try to look into this and construct the test case in the background.
> 0001's changes in acl.c are straightforward, but it's worth noting
> that the has_sequence_privilege functions hadn't gotten the memo
> about not failing when a bogus relation OID is passed.
I've looked through all the has_*priv*_id functions, and they all have
"if (is_missing) PG_RETURN_NULL()" now (with the patches applied).
Best regards,
Alexander
From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-10-14 07:00:00 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
13.10.2023 18:00, Alexander Lakhin wrote:
>
>> I spent some time looking through existing SearchSysCacheExists calls,
>> and I could only find two sets of routines where we seem to be
>> depending on SearchSysCacheExists to protect a subsequent lookup
>> somewhere else, and there isn't any lock on the object in question.
>> Those are the has_foo_privilege functions discussed here, and the
>> foo_is_visible functions near the bottom of namespace.c. I'm not
>> sure why we've not heard complaints traceable to the foo_is_visible
>> family. Maybe nobody has tried hard to break them, or maybe they
>> are just less likely to be used in ways that are at risk.
>
> I'll try to research/break xxx_is_visible and share my findings tomorrow.
>
I tried the script based on the initial reproducer [1]:
for ((n=1;n<=30;n++)); do
echo "ITERATION $n"
numclients=30
for ((c=1;c<=$numclients;c++)); do
cat << EOF | psql >psql_$c.log &
CREATE SCHEMA testxmlschema_$c;
SELECT format('CREATE TABLE testxmlschema_$c.test_%s (a int);', g) FROM
generate_series(1, 30) g
\\gexec
SET parallel_setup_cost = 1;
SET min_parallel_table_scan_size = '1kB';
SELECT oid FROM pg_catalog.pg_class WHERE relnamespace = 1 AND
relkind IN ('r', 'm', 'v') AND pg_catalog.pg_table_is_visible(oid);
SELECT format('DROP TABLE testxmlschema_$c.test_%s', g) FROM
generate_series(1, 30) g
\\gexec
DROP SCHEMA testxmlschema_$c;
EOF
done
wait
grep 'ERROR:' server.log && break;
done
And I couldn't get the error, across multiple runs. (Here "SELECT oid ..." is
based on the query executed by schema_to_xmlschema().)
But I could reliably get the error with
s/pg_table_is_visible(oid)/has_table_privilege (oid, 'SELECT')/.
So there is a difference between these two functions. And the difference is
in their costs.
If I do "ALTER FUNCTION pg_table_is_visible COST 1" before the script,
I get the error as expected.
With cost 10 I see the following plan:
Index Scan using pg_class_relname_nsp_index on pg_class (cost=0.42..2922.38 rows=1 width=4)
Index Cond: (relnamespace = '1'::oid)
Filter: ((relkind = ANY ('{r,m,v}'::"char"[])) AND pg_table_is_visible(oid))
But with cost 1:
Gather (cost=1.00..257.10 rows=1 width=4)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on pg_class (cost=0.00..256.00 rows=1 width=4)
Filter: (pg_table_is_visible(oid) AND (relnamespace = '1'::oid) AND (relkind = ANY ('{r,m,v}'::"char"[])))
Rows Removed by Filter: 405
The cost of the pg_foo_is_visible functions was increased in a80889a73.
But all the has_xxx_privilege functions have cost 1, except for
has_any_column_privilege, whose cost was also increased in 7449427a1.
So to see the issue we need several ingredients:
1) The mode CATCACHE_FORCE_RELEASE enabled (maybe some other way is
possible; I don't know of one);
- Thanks to prion for that.
2) A function with the coding pattern
"SearchSysCacheExistsX(); SearchSysCacheX();" called in a parallel worker;
- Thanks to "debug_parallel_query = regress" and low cost of
has_table_privilege() called by schema_to_xmlschema().
3) The catalog cache invalidated by some concurrent activity.
- Thanks to running the test xmlmap in parallel with 16 other tests.
Best regards,
Alexander
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-10-14 16:35:57 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
> 13.10.2023 18:00, Alexander Lakhin wrote:
>> I'll try to research/break xxx_is_visible and share my findings tomorrow.
> I tried the script based on the initial reproducer [1]:
> ...
> And couldn't get the error, for multiple runs. (Here SELECT oid ... is
> based on the query executed by schema_to_xmlschema().)
> But I could reliably get the error with
> s/pg_table_is_visible(oid)/has_table_privilege (oid, 'SELECT')/.
> So there is a difference between these two functions. And the difference is
> in their costs.
Ah, thanks for poking at it. I believe the reason for the cost issue
is that your query already has a very selective indexable condition,
so it tends not to bother with a parallel scan. I removed the
relnamespace condition:
SELECT oid FROM pg_catalog.pg_class WHERE -- relnamespace = 1 AND
relkind IN ('r', 'm', 'v') AND pg_catalog.pg_table_is_visible(oid);
and then I get a parallel plan without messing with the cost, and it
falls over almost immediately (in a CATCACHE_FORCE_RELEASE build,
anyway).
ITERATION 1
ITERATION 2
ERROR: cache lookup failed for relation 208139
CONTEXT: parallel worker
ERROR: cache lookup failed for relation 208471
CONTEXT: parallel worker
2023-10-14 12:30:24.747 EDT [1762290] ERROR: cache lookup failed for relation 208139
2023-10-14 12:30:24.747 EDT [1762266] ERROR: cache lookup failed for relation 208139
2023-10-14 12:30:24.782 EDT [1762310] ERROR: cache lookup failed for relation 208471
2023-10-14 12:30:24.782 EDT [1762252] ERROR: cache lookup failed for relation 208471
So we do need to fix that.
regards, tom lane
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-10-14 20:20:09 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
I wrote:
> So we do need to fix that.
I've fixed both sets of functions as of now. We still need to look
into the question of whether detoasting syscache entries is safe.
regards, tom lane
From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used |
Date: | 2023-10-20 11:00:00 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
14.10.2023 23:20, Tom Lane wrote:
> I've fixed both sets of functions as of now. We still need to look
> into the question of whether detoasting syscache entries is safe.
Unfortunately, the answer is "no". Please look at the proof:
https://www.postgresql.org/message-id/18163-859bad19a43edcf6%40postgresql.org
Best regards,
Alexander