Consolidate 'unique array values' logic into a reusable function?

Lists: pgsql-hackers
From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Consolidate 'unique array values' logic into a reusable function?
Date: 2016-08-07 04:45:39
Message-ID: CAEepm=2vmFTNpAmwbGGD2WaryM6T3hSDVKQPfUwjdD_5XY6vAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Looking at commits f10eab73d and c50d192c, I wondered why we don't
have a reusable in-place unique function. It may be trivial, but we
seem to have a lot of copies and variations in the tree.

Here's a sketch patch that creates a function array_unique which takes
the same arguments as qsort or qsort_arg and returns the new length.
The patch replaces all the specialised unique functions and open coded
versions that I could find with simple greps, but there are probably
more.

My compiler seems to inline the comparator function and memcpy well,
so I can't measure any speed difference between array_unique(array,
size, sizeof(int), compare_int) and a hand-crafted loop using == for
comparison and = for assignment, for a billion items.

If no one objects I'll post a version of this to a commitfest, along
with some other trivial code duplication refactoring work I posted a
while back that consolidates popcount and ffs/fls implementations. I
don't like code duplication :-)

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
array-unique.patch application/octet-stream 16.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Consolidate 'unique array values' logic into a reusable function?
Date: 2016-08-07 15:52:22
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> Here's a sketch patch that creates a function array_unique which takes
> the same arguments as qsort or qsort_arg and returns the new length.

Hmm ... I'd be against using this in backend/regex/, because I still
have hopes of converting that to a standalone library someday (and
in any case it needs to stay compatible with Tcl's copy of the code).
But otherwise this seems like a reasonable proposal.

As for the function name, maybe "qunique()" to go with "qsort()"?
I'm not thrilled with "array_unique" because that sounds like it
is meant for Postgres' array data types.

regards, tom lane


From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Consolidate 'unique array values' logic into a reusable function?
Date: 2019-08-30 03:34:52
Message-ID: CA+hUKGK_kwiS+3VCeMMGAKg=27T1v17ABzt+xDa1qeW7W7wruA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

I'm reviving a thread from 2016, because I wanted this thing again today.

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> > Here's a sketch patch that creates a function array_unique which takes
> > the same arguments as qsort or qsort_arg and returns the new length.
>
> Hmm ... I'd be against using this in backend/regex/, because I still
> have hopes of converting that to a standalone library someday (and
> in any case it needs to stay compatible with Tcl's copy of the code).
> But otherwise this seems like a reasonable proposal.
>
> As for the function name, maybe "qunique()" to go with "qsort()"?
> I'm not thrilled with "array_unique" because that sounds like it
> is meant for Postgres' array data types.

OK, here it is renamed to qunique() and qunique_arg(). It's a bit odd
because it has nothing to do with the quicksort algorithm, but make
some sense because it's always used with qsort(). I suppose we could
save a few more lines if there were a qsort_unique() function that
does both, since the arguments are identical. I also moved it into a
new header lib/qunique.h. Any better ideas for where it should live?
I removed the hunk under regex.

One thing I checked is that on my system it is inlined along with the
comparator when that is visible, so no performance should be lost by
throwing away the open coded versions. This makes me think that eg
oid_cmp() should probably be defined in a header; clearly we're also
carrying a few functions that should be consolidated into a new
int32_cmp() function, somewhere, too. (It might also be interesting
to use the pg_attribute_always_inline trick to instantiate some common
qsort() specialisations for a bit of speed-up, but that's another
topic.)

Adding to CF.

--
Thomas Munro
https://enterprisedb.com

Attachment Content-Type Size
0001-Consolidate-code-that-makes-a-sorted-array-unique.patch application/x-patch 16.5 KB

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Consolidate 'unique array values' logic into a reusable function?
Date: 2019-09-09 23:43:09
Message-ID: CA+hUKGLz=6eDPewLb6ToPLcg1R3brH+5eBiWW0WVyn7+fh5y=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 30, 2019 at 3:34 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> Adding to CF.

Rebased due to bitrot. Spotted one more place to use this, in
src/backend/utils/adt/txid.c.

--
Thomas Munro
https://enterprisedb.com

Attachment Content-Type Size
0001-Consolidate-code-that-makes-a-sorted-array-unique-v2.patch application/octet-stream 17.5 KB

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Consolidate 'unique array values' logic into a reusable function?
Date: 2019-11-03 23:02:21
Message-ID: CA+hUKG+7kZ0nVrL8HeVyP+a2K64oLq921K4S4LCubGuw0JWuHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 10, 2019 at 11:43 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> Rebased due to bitrot. Spotted one more place to use this, in
> src/backend/utils/adt/txid.c.

Rebased. I'm planning to commit this soon.

Attachment Content-Type Size
0001-Consolidate-code-that-makes-a-sorted-array-unique-v3.patch application/octet-stream 17.5 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Consolidate 'unique array values' logic into a reusable function?
Date: 2019-12-29 07:02:21
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 04, 2019 at 12:02:21PM +1300, Thomas Munro wrote:
> Rebased. I'm planning to commit this soon.

In each installcheck-parallel run under valgrind-3.14.0, I now see ~1200
reports like this:

==00:00:00:28.322 1527557== Source and destination overlap in memcpy(0x1000104, 0x1000104, 4)
==00:00:00:28.322 1527557== at 0x4C2E74D: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
==00:00:00:28.322 1527557== by 0xA9A57B: qunique (qunique.h:34)
==00:00:00:28.322 1527557== by 0xA9A843: InitCatalogCache (syscache.c:1056)
==00:00:00:28.322 1527557== by 0xAB6B18: InitPostgres (postinit.c:682)
==00:00:00:28.322 1527557== by 0x91F98E: PostgresMain (postgres.c:3909)
==00:00:00:28.322 1527557== by 0x872DE9: BackendRun (postmaster.c:4498)
==00:00:00:28.322 1527557== by 0x8725B3: BackendStartup (postmaster.c:4189)
==00:00:00:28.322 1527557== by 0x86E7F4: ServerLoop (postmaster.c:1727)
==00:00:00:28.322 1527557== by 0x86E0AA: PostmasterMain (postmaster.c:1400)
==00:00:00:28.322 1527557== by 0x77CB56: main (main.c:210)
==00:00:00:28.322 1527557==
{
<insert_a_suppression_name_here>
Memcheck:Overlap
fun:memcpy@@GLIBC_2.14
fun:qunique
fun:InitCatalogCache
fun:InitPostgres
fun:PostgresMain
fun:BackendRun
fun:BackendStartup
fun:ServerLoop
fun:PostmasterMain
fun:main
}

This is like the problem fixed in 9a9473f; the precedent from there would be
to test src!=dst before calling mempcy(), e.g. as attached. I suppose the
alternative would be to add a suppression like the one 9a9473f removed.

I do wonder why the Valgrind buildfarm animals haven't noticed.

Attachment Content-Type Size
qunique-valgrind-v1.patch text/x-diff 951 bytes

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Consolidate 'unique array values' logic into a reusable function?
Date: 2020-01-08 01:49:48
Message-ID: CA+hUKGJKGpniqYVNaOf0NJnah2vG1xcZG3jRkQu1CDYegghfXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 29, 2019 at 8:02 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> ==00:00:00:28.322 1527557== Source and destination overlap in memcpy(0x1000104, 0x1000104, 4)
> ==00:00:00:28.322 1527557== at 0x4C2E74D: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
> ==00:00:00:28.322 1527557== by 0xA9A57B: qunique (qunique.h:34)
> ==00:00:00:28.322 1527557== by 0xA9A843: InitCatalogCache (syscache.c:1056)
> ==00:00:00:28.322 1527557== by 0xAB6B18: InitPostgres (postinit.c:682)
> ==00:00:00:28.322 1527557== by 0x91F98E: PostgresMain (postgres.c:3909)
> ==00:00:00:28.322 1527557== by 0x872DE9: BackendRun (postmaster.c:4498)
> ==00:00:00:28.322 1527557== by 0x8725B3: BackendStartup (postmaster.c:4189)
> ==00:00:00:28.322 1527557== by 0x86E7F4: ServerLoop (postmaster.c:1727)
> ==00:00:00:28.322 1527557== by 0x86E0AA: PostmasterMain (postmaster.c:1400)
> ==00:00:00:28.322 1527557== by 0x77CB56: main (main.c:210)
> ==00:00:00:28.322 1527557==
> {
> <insert_a_suppression_name_here>
> Memcheck:Overlap
> fun:memcpy@@GLIBC_2.14
> fun:qunique
> fun:InitCatalogCache
> fun:InitPostgres
> fun:PostgresMain
> fun:BackendRun
> fun:BackendStartup
> fun:ServerLoop
> fun:PostmasterMain
> fun:main
> }
>
> This is like the problem fixed in 9a9473f; the precedent from there would be
> to test src!=dst before calling mempcy(), e.g. as attached. I suppose the
> alternative would be to add a suppression like the one 9a9473f removed.

Thanks for fixing that.

> I do wonder why the Valgrind buildfarm animals haven't noticed.

Optimisation levels? For example, I see that skink is using -Og, at
which level my local GCC inlines qunique() and memcpy() so that in the
case you quoted there's a MOV instruction and valgrind has nothing to
complain about.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Consolidate 'unique array values' logic into a reusable function?
Date: 2020-01-12 20:59:59
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 08, 2020 at 02:49:48PM +1300, Thomas Munro wrote:
> On Sun, Dec 29, 2019 at 8:02 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > ==00:00:00:28.322 1527557== Source and destination overlap in memcpy(0x1000104, 0x1000104, 4)
> > ==00:00:00:28.322 1527557== at 0x4C2E74D: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
> > ==00:00:00:28.322 1527557== by 0xA9A57B: qunique (qunique.h:34)
> > ==00:00:00:28.322 1527557== by 0xA9A843: InitCatalogCache (syscache.c:1056)
> > ==00:00:00:28.322 1527557== by 0xAB6B18: InitPostgres (postinit.c:682)
> > ==00:00:00:28.322 1527557== by 0x91F98E: PostgresMain (postgres.c:3909)
> > ==00:00:00:28.322 1527557== by 0x872DE9: BackendRun (postmaster.c:4498)
> > ==00:00:00:28.322 1527557== by 0x8725B3: BackendStartup (postmaster.c:4189)
> > ==00:00:00:28.322 1527557== by 0x86E7F4: ServerLoop (postmaster.c:1727)
> > ==00:00:00:28.322 1527557== by 0x86E0AA: PostmasterMain (postmaster.c:1400)
> > ==00:00:00:28.322 1527557== by 0x77CB56: main (main.c:210)
> > ==00:00:00:28.322 1527557==
> > {
> > <insert_a_suppression_name_here>
> > Memcheck:Overlap
> > fun:memcpy@@GLIBC_2.14
> > fun:qunique
> > fun:InitCatalogCache
> > fun:InitPostgres
> > fun:PostgresMain
> > fun:BackendRun
> > fun:BackendStartup
> > fun:ServerLoop
> > fun:PostmasterMain
> > fun:main
> > }
> >
> > This is like the problem fixed in 9a9473f; the precedent from there would be
> > to test src!=dst before calling mempcy(), e.g. as attached. I suppose the
> > alternative would be to add a suppression like the one 9a9473f removed.
>
> Thanks for fixing that.

Glad to help.

> > I do wonder why the Valgrind buildfarm animals haven't noticed.
>
> Optimisation levels? For example, I see that skink is using -Og, at
> which level my local GCC inlines qunique() and memcpy() so that in the
> case you quoted there's a MOV instruction and valgrind has nothing to
> complain about.

That explains it. I would have been using -O0. (It would be a nice bonus to
have both an -O0 Valgrind buildfarm animal and an optimized Valgrind animal,
since neither catches everything.)