Index Skip Scan

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Index Skip Scan
Date: 2018-06-18 15:25:39
Message-ID: [email protected]
Lists: pgsql-hackers

Hi all,

I would like to start a discussion on Index Skip Scan, referred to as
Loose Index Scan in the wiki [1].

My use-case is the simplest form of Index Skip Scan (B-Tree only),
namely going from

CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 10000000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
 HashAggregate  (cost=169247.71..169247.74 rows=3 width=4) (actual time=4104.099..4104.099 rows=3 loops=1)
   Output: b
   Group Key: t1.b
   Buffers: shared hit=44248
   ->  Seq Scan on public.t1  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.059..1050.376 rows=10000000 loops=1)
         Output: a, b
         Buffers: shared hit=44248
 Planning Time: 0.157 ms
 Execution Time: 4104.155 ms
(9 rows)

to

CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 10000000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
 Index Skip Scan using idx_t1_b on public.t1  (cost=0.43..1.30 rows=3 width=4) (actual time=0.061..0.137 rows=3 loops=1)
   Output: b
   Heap Fetches: 3
   Buffers: shared hit=13
 Planning Time: 0.155 ms
 Execution Time: 0.170 ms
(6 rows)

I took Thomas Munro's previous patch [2] on the subject, added a GUC, a
test case, documentation hooks, minor code cleanups, and made the patch
pass an --enable-cassert make check-world run. So, the overall design is
the same.

However, as Robert Haas noted in that thread, there are issues with the
patch as-is, especially in relation to the amcanbackward functionality.

A couple of questions to begin with.

Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
a new node (T_IndexSkipScan) be created? If the latter, there will
likely be functionality that needs to be refactored into code shared
between the nodes.

What is the best way to deal with the amcanbackward functionality? Do
people see an alternative to Robert's idea of adding a flag to the scan?

I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but I can do so if people
think it is best. The patch is based on master/4c8156.

Any feedback, suggestions, design ideas and help with the patch in
general are greatly appreciated.

Thanks in advance!

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2]
https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com

Best regards,
Jesper

Attachment Content-Type Size
wip_indexskipscan.patch text/x-patch 30.5 KB

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-06-18 17:31:48
Message-ID: CAPpHfducUiDw7hVT=Vui8rxTmawCkVc0H--v=jdbztMUGhzuRg@mail.gmail.com
Lists: pgsql-hackers

Hi!

On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:
> I would like to start a discussion on Index Skip Scan referred to as
> Loose Index Scan in the wiki [1].

Great, I'm glad to see you working on this!

> However, as Robert Haas noted in the thread there are issues with the
> patch as is, especially in relationship to the amcanbackward functionality.
>
> A couple of questions to begin with.
>
> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
> a new node (T_IndexSkipScan) be created ? If latter, then there likely
> will be functionality that needs to be refactored into shared code
> between the nodes.

Is a skip scan only possible for an index-only scan? I guess not. We
could also make a plain index scan behave like a skip scan, and that
should be useful for accelerating DISTINCT ON clauses. Thus, we might
end up with 4 kinds of index scan: IndexScan, IndexOnlyScan,
IndexSkipScan, IndexOnlySkipScan. I don't think I like index scan nodes
multiplying this way; it would probably be better to keep the number of
nodes smaller. But I don't insist on that, and I would like to hear
other opinions too.

> Which is the best way to deal with the amcanbackward functionality ? Do
> people see another alternative to Robert's idea of adding a flag to the
> scan.

Supporting amcanbackward seems to be basically possible, but rather
complicated and not very efficient, so it seems not worth implementing,
at least in the initial version. The question then is how an index
access method should report that it supports both skip scan and
backward scan, but not the two together. What about turning
amcanbackward into a function which takes a (bool skipscan) argument?
The function would then return whether backward scan is supported
depending on whether skip scan is used.
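To illustrate what I mean, here is a rough sketch for discussion only;
none of this matches the actual IndexAmRoutine definition or the patch:

#include <stdbool.h>

/*
 * Sketch: amcanbackward turned from a plain bool into a callback, so
 * that an AM can report "backward scan works for regular scans, but
 * not while skipping".
 */
typedef bool (*amcanbackward_function) (bool skipscan);

/* What btree might report under such a scheme. */
static bool
btcanbackward(bool skipscan)
{
    return !skipscan;   /* backward scan ok, but not combined with skip */
}

/* The planner would then ask the AM instead of reading a flag: */
static bool
backward_scan_ok(amcanbackward_function amcanbackward, bool skipscan)
{
    return amcanbackward(skipscan);
}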

> I wasn't planning on making this a patch submission for the July
> CommitFest due to the reasons mentioned above, but can do so if people
> thinks it is best. The patch is based on master/4c8156.

Please register it in the commitfest. Even if there isn't enough time
for this patch in the July commitfest, it's no problem to move it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
To: jesper(dot)pedersen(at)redhat(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Index Skip Scan
Date: 2018-06-18 20:20:07
Message-ID: [email protected]
Lists: pgsql-hackers

On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
> Hi all,
>
> I would like to start a discussion on Index Skip Scan referred to as
> Loose Index Scan in the wiki [1].

awesome

>
>
>
> I wasn't planning on making this a patch submission for the July
> CommitFest due to the reasons mentioned above, but can do so if people
> thinks it is best. T

New large features are not appropriate for the July CF. September should
be your goal.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-06-18 21:06:59
Message-ID: CAPpHfduUPwxODxJv5VWSTdfZYRqOHQJ2KTcf=GfH4GZ3NkdB+A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 18, 2018 at 11:20 PM Andrew Dunstan
<andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
> On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
> > I wasn't planning on making this a patch submission for the July
> > CommitFest due to the reasons mentioned above, but can do so if people
> > thinks it is best. T
>
> New large features are not appropriate for the July CF. September should
> be your goal.

Given this, should we have the possibility to register a patch for the
September CF already now?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-06-19 01:40:02
Message-ID: [email protected]
Lists: pgsql-hackers

On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
> Assuming this, should we have possibility to register patch to
> September CF from now?

There cannot be two commit fests marked as open at the same time as
Magnus mentions here:
https://www.postgresql.org/message-id/CABUevEx1k+axZcV2t3wEYf5uLg72YbKSch_hUhFnZq+-KSoJsA@mail.gmail.com

There is no issue with admins registering new patches in future ones,
but normal users cannot, right? In this case, could you wait until the
next CF is marked as in progress and the September one is opened? You
could also add it to the July one if you are not willing to wait, and it
will get bumped by one of the CFMs, but that makes the whole process
unnecessarily noisy.
--
Michael


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-06-19 10:01:24
Message-ID: CA+q6zcVsH-iOnyzwRg_EuUCW1nfwjibhPZoo78iPqX9C8zjwJg@mail.gmail.com
Lists: pgsql-hackers

> On 18 June 2018 at 19:31, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>>
>> A couple of questions to begin with.
>>
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan? I guess, that no. We
> could also make plain index scan to behave like a skip scan. And it
> should be useful for accelerating DISTINCT ON clause. Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan. So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller. But I don't insist on that, and I would like to hear
> other opinions too.

In one of the patches I'm working on I had a similar situation: I wanted
to split one node into two similar nodes (before that I had just
extended it), and logically it made perfect sense. But it turned out to
be of little use and the advantage I got wasn't worth it - and just to
mention, those nodes had more differences than the ones in this patch.
So I agree that it would probably be better to keep using IndexOnlyScan.

> On 19 June 2018 at 03:40, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
>> Assuming this, should we have possibility to register patch to
>> September CF from now?
>
> There cannot be two commit fests marked as open at the same time as
> Magnus mentions here:
> https://www.postgresql.org/message-id/CABUevEx1k+axZcV2t3wEYf5uLg72YbKSch_hUhFnZq+-KSoJsA@mail.gmail.com
>
> In this case, could you wait that the next CF is marked as in progress and
> that the one of September is opened?

Yep, since the next CF will start shortly that's the easiest thing to do.


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-06-19 17:00:36
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Alexander,

On 06/18/2018 01:31 PM, Alexander Korotkov wrote:
> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan? I guess, that no. We
> could also make plain index scan to behave like a skip scan. And it
> should be useful for accelerating DISTINCT ON clause. Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan. So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller. But I don't insist on that, and I would like to hear
> other opinions too.
>

Yes, there are likely other use-cases for Index Skip Scan apart from the
simplest form, which sort of suggests that having dedicated nodes would
be better in the long run.

My goal is to cover the simplest form, which can be handled by extending
the T_IndexOnlyScan node, or by having common functions that both use.
We can always improve the functionality with future patches.

>> Which is the best way to deal with the amcanbackward functionality ? Do
>> people see another alternative to Robert's idea of adding a flag to the
>> scan.
>
> Supporting amcanbackward seems to be basically possible, but rather
> complicated and not very efficient. So, it seems to not worth
> implementing, at least in the initial version. And then the question
> should how index access method report that it supports both skip scan
> and backward scan, but not both together? What about turning
> amcanbackward into a function which takes (bool skipscan) argument?
> Therefore, this function will return whether backward scan is
> supported depending of whether skip scan is used.
>

The feedback from Robert Haas seems to suggest that this is a
requirement for the patch to be considered.

>> I wasn't planning on making this a patch submission for the July
>> CommitFest due to the reasons mentioned above, but can do so if people
>> thinks it is best. The patch is based on master/4c8156.
>
> Please, register it on commitfest. If even there wouldn't be enough
> of time for this patch on July commitfest, it's no problem to move it.
>

Based on the feedback from Andrew and Michael I won't register this
thread until the September CF.

Thanks for your feedback !

Best regards,
Jesper


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-06-19 17:06:08
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Dmitry,

On 06/19/2018 06:01 AM, Dmitry Dolgov wrote:
>> On 18 June 2018 at 19:31, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> Is skip scan only possible for index-only scan? I guess, that no. We
>> could also make plain index scan to behave like a skip scan. And it
>> should be useful for accelerating DISTINCT ON clause. Thus, we might
>> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
>> IndexOnlySkipScan. So, I don't think I like index scan nodes to
>> multiply this way, and it would be probably better to keep number
>> nodes smaller. But I don't insist on that, and I would like to hear
>> other opinions too.
>
> In one of patches I'm working on I had similar situation, when I wanted to
> split one node into two similar nodes (before I just extended it) and logically
> it made perfect sense. But it turned out to be quite useless and the advantage
> I've got wasn't worth it - and just to mention, those nodes had more differences
> than in this patch. So I agree that probably it would be better to keep using
> IndexOnlyScan.
>

I looked at this today, and creating a new node (T_IndexOnlySkipScan)
would make the patch more complex.

The question is whether the patch should create such a node anyway, so
that future patches wouldn't have to refactor to a new node in order to
cover additional functionality.

Thanks for your feedback !

Best regards,
Jesper


From: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 05:44:58
Message-ID: [email protected]
Lists: pgsql-hackers

Hello Jesper,

I was reviewing the index-skip patch example and have a comment on it. The example query “select distinct b from t1” is equivalent to “select b from t1 group by b”. When I tried the second form of the query it came up with a different plan; is it possible for index skip scan to address it as well?

postgres=# explain verbose select b from t1 group by b;
QUERY PLAN
----------------------------------------------------------------------------------------------------
 Group  (cost=97331.29..97332.01 rows=3 width=4)
   Output: b
   Group Key: t1.b
   ->  Gather Merge  (cost=97331.29..97331.99 rows=6 width=4)
         Output: b
         Workers Planned: 2
         ->  Sort  (cost=96331.27..96331.27 rows=3 width=4)
               Output: b
               Sort Key: t1.b
               ->  Partial HashAggregate  (cost=96331.21..96331.24 rows=3 width=4)
                     Output: b
                     Group Key: t1.b
                     ->  Parallel Seq Scan on public.t1  (cost=0.00..85914.57 rows=4166657 width=4)
                           Output: a, b
(14 rows)

Time: 1.167 ms

— And here is the original example
postgres=# explain verbose SELECT DISTINCT b FROM t1;
QUERY PLAN
-------------------------------------------------------------------------------
Index Skip Scan using idx_t1_b on public.t1 (cost=0.43..1.30 rows=3 width=4)
Output: b
(2 rows)

Time: 0.987 ms

> On Jun 18, 2018, at 10:31 AM, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
> Hi!
>
> On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen
> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>> I would like to start a discussion on Index Skip Scan referred to as
>> Loose Index Scan in the wiki [1].
>
> Great, I glad to see you working in this!
>
>> However, as Robert Haas noted in the thread there are issues with the
>> patch as is, especially in relationship to the amcanbackward functionality.
>>
>> A couple of questions to begin with.
>>
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan? I guess, that no. We
> could also make plain index scan to behave like a skip scan. And it
> should be useful for accelerating DISTINCT ON clause. Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan. So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller. But I don't insist on that, and I would like to hear
> other opinions too.
>
>> Which is the best way to deal with the amcanbackward functionality ? Do
>> people see another alternative to Robert's idea of adding a flag to the
>> scan.
>
> Supporting amcanbackward seems to be basically possible, but rather
> complicated and not very efficient. So, it seems to not worth
> implementing, at least in the initial version. And then the question
> should how index access method report that it supports both skip scan
> and backward scan, but not both together? What about turning
> amcanbackward into a function which takes (bool skipscan) argument?
> Therefore, this function will return whether backward scan is
> supported depending of whether skip scan is used.
>
>> I wasn't planning on making this a patch submission for the July
>> CommitFest due to the reasons mentioned above, but can do so if people
>> thinks it is best. The patch is based on master/4c8156.
>
> Please, register it on commitfest. If even there wouldn't be enough
> of time for this patch on July commitfest, it's no problem to move it.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 06:22:25
Message-ID: CAEepm=1ikXHVgSNruQ1HydDZm3OLnLNb5qT2c8S=Mh-VsFQXcQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 16, 2018 at 5:44 PM, Bhushan Uparkar
<bhushan(dot)uparkar(at)gmail(dot)com> wrote:
> I was reviewing index-skip patch example and have a comment on it. Example query “select distinct b from t1” is equivalent to “select b from t1 group by b”. When I tried the 2nd form of query it came up with different plan, is it possible that index skip scan can address it as well?

Yeah, there are a few tricks you can do with "index skip scans"
(Oracle name, or as IBM calls them, "index jump scans"... I was
slightly tempted to suggest we call ours "index hop scans"...). For
example:

* groups and certain aggregates (MIN() and MAX() of suffix index
columns within each group)
* index scans where the scan key doesn't include the leading columns
(but you expect there to be sufficiently few values)
* merge joins (possibly the trickiest and maybe out of range)

You're right that a very simple GROUP BY can be equivalent to a
DISTINCT query, but I'm not sure if it's worth recognising that
directly or trying to implement the more general grouping trick that
can handle MIN/MAX, and whether that should be the same executor
node... The idea of starting with DISTINCT was just that it's
comparatively easy. We should certainly try to look ahead and bear
those features in mind when figuring out the interfaces though. Would
the indexam skip(scan, direction, prefix_size) operation I proposed be
sufficient? Is there a better way?
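
For reference, the rough shape I have in mind is something like the
following (the name and exact signature here are just a sketch for
discussion, not necessarily what the current patch implements):

#include "postgres.h"
#include "access/genam.h"       /* IndexScanDesc */
#include "access/sdir.h"        /* ScanDirection */

/*
 * Position the scan on the next index entry, in the given direction,
 * whose first prefix_size key columns differ from those at the current
 * position; return false if there is no such entry.
 */
typedef bool (*amskip_function) (IndexScanDesc scan,
                                 ScanDirection direction,
                                 int prefix_size);

An index (only) scan node would call this instead of fetching the next
tuple once it has emitted a tuple for the current prefix.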

I'm glad to see this topic come back!

--
Thomas Munro
http://www.enterprisedb.com


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 18:23:10
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Bhushan,

On 08/16/2018 01:44 AM, Bhushan Uparkar wrote:
> I was reviewing index-skip patch example and have a comment on it.

Thanks for your interest, and feedback on this patch !

> Example query “select distinct b from t1” is equivalent to “select b from t1 group by b”. When I tried the 2nd form of query it came up with different plan, is it possible that index skip scan can address it as well?
>

Like Thomas commented down-thread, my goal is to keep this contribution
as simple as possible in order to get to something that can be
committed. Improvements can follow in future CommitFests, which may even
end up in the same release.

However, as stated in my original mail, I am targeting the simplest form
of Index Skip Scan (or whatever we end up calling it). I welcome any
help with the patch.

Best regards,
Jesper


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 18:28:45
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Thomas,

On 08/16/2018 02:22 AM, Thomas Munro wrote:
> The idea of starting with DISTINCT was just that it's
> comparatively easy. We should certainly try to look ahead and bear
> those features in mind when figuring out the interfaces though. Would
> the indexam skip(scan, direction, prefix_size) operation I proposed be
> sufficient? Is there a better way?
>

Yeah, I'm hoping that a Committer can provide some feedback on the
direction that this patch needs to take.

One thing to consider is the pluggable storage patch, which is a lot
more important than this patch. I don't want this patch to get in the
way of that work, so it may have to wait a bit in order to see any new
potential requirements.

> I'm glad to see this topic come back!
>

You did the work, and yes hopefully we can get closer to this subject in
12 :)

Best regards,
Jesper


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 18:36:02
Message-ID: [email protected]
Lists: pgsql-hackers

Greetings,

* Jesper Pedersen (jesper(dot)pedersen(at)redhat(dot)com) wrote:
> On 08/16/2018 02:22 AM, Thomas Munro wrote:
> >The idea of starting with DISTINCT was just that it's
> >comparatively easy. We should certainly try to look ahead and bear
> >those features in mind when figuring out the interfaces though. Would
> >the indexam skip(scan, direction, prefix_size) operation I proposed be
> >sufficient? Is there a better way?
>
> Yeah, I'm hoping that a Committer can provide some feedback on the direction
> that this patch needs to take.

Thomas is one these days. :)

At least on first glance, that indexam seems to make sense to me, but
I've not spent a lot of time thinking about it. Might be interesting to
ask Peter G about it though.

> One thing to consider is the pluggable storage patch, which is a lot more
> important than this patch. I don't want this patch to get in the way of that
> work, so it may have to wait a bit in order to see any new potential
> requirements.

Not sure where this came from, but I don't think it's particularly good
to be suggesting that one feature is more important than another or that
we need to have one wait for another as this seems to imply. I'd
certainly really like to see PG finally have skipping scans, for one
thing, and it seems like with some effort that might be able to happen
for v12. I'm not convinced that we're going to get pluggable storage to
happen in v12 and I don't really agree with recommending that people
hold off on making improvements to things because it's coming.

Thanks!

Stephen


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 18:50:34
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Stephen,

On 08/16/2018 02:36 PM, Stephen Frost wrote:
>> Yeah, I'm hoping that a Committer can provide some feedback on the direction
>> that this patch needs to take.
>
> Thomas is one these days. :)
>

I know :) However, there are some open questions from Thomas' original
submission that still need to be ironed out.

> At least on first glance, that indexam seems to make sense to me, but
> I've not spent a lot of time thinking about it. Might be interesting to
> ask Peter G about it though.
>

Yes, or Anastasia, who has also done a lot of work on nbtree/.

>> One thing to consider is the pluggable storage patch, which is a lot more
>> important than this patch. I don't want this patch to get in the way of that
>> work, so it may have to wait a bit in order to see any new potential
>> requirements.
>
> Not sure where this came from, but I don't think it's particularly good
> to be suggesting that one feature is more important than another or that
> we need to have one wait for another as this seems to imply. I'd
> certainly really like to see PG finally have skipping scans, for one
> thing, and it seems like with some effort that might be able to happen
> for v12. I'm not convinced that we're going to get pluggable storage to
> happen in v12 and I don't really agree with recommending that people
> hold off on making improvements to things because it's coming.
>

My point was that I know this patch needs work, so any feedback that
gets it closer to a solution will help.

Pluggable storage may or may not add new requirements, but it is up to
the people working on that, some of whom are Committers, to take time
"off" to provide feedback for this patch in order to steer me in the
right direction.

Work can happen in parallel, and I'm definitely not recommending that
people hold off on any patches that they want to provide feedback for,
or submit for a CommitFest.

Best regards,
Jesper


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 19:05:30
Message-ID: [email protected]
Lists: pgsql-hackers

Greetings,

* Jesper Pedersen (jesper(dot)pedersen(at)redhat(dot)com) wrote:
> On 08/16/2018 02:36 PM, Stephen Frost wrote:
> >Not sure where this came from, but I don't think it's particularly good
> >to be suggesting that one feature is more important than another or that
> >we need to have one wait for another as this seems to imply. I'd
> >certainly really like to see PG finally have skipping scans, for one
> >thing, and it seems like with some effort that might be able to happen
> >for v12. I'm not convinced that we're going to get pluggable storage to
> >happen in v12 and I don't really agree with recommending that people
> >hold off on making improvements to things because it's coming.
>
> My point was that I know this patch needs work, so any feedback that get it
> closer to a solution will help.
>
> Pluggable storage may / may not add new requirements, but it is up to the
> people working on that, some of which are Committers, to take time "off" to
> provide feedback for this patch in order steer me in right direction.

I don't think it's really necessary for this work to be suffering under
some concern that pluggable storage will make it have to change. Sure,
it might, but it also very well might not. For my 2c, anyway, this
seems likely to get into the tree before pluggable storage does and it's
pretty unlikely to be the only thing that that work will need to be
prepared to address when it happens.

> Work can happen in parallel, and I'm definitely not recommending that people
> hold off on any patches that they want to provide feedback for, or submit
> for a CommitFest.

Yes, work can happen in parallel, and I don't really think there needs
to be a concern about some other patch set when it comes to getting this
patch committed.

Thanks!

Stephen


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 19:48:34
Message-ID: CAH2-WznxAN940e=2D8Auz06d=_ZK6F5LVH0HUMbCXGAa=Tirpw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Yeah, there are a few tricks you can do with "index skip scans"
> (Oracle name, or as IBM calls them, "index jump scans"... I was
> slightly tempted to suggest we call ours "index hop scans"...).

Hopscotch scans?

> * groups and certain aggregates (MIN() and MAX() of suffix index
> columns within each group)
> * index scans where the scan key doesn't include the leading columns
> (but you expect there to be sufficiently few values)
> * merge joins (possibly the trickiest and maybe out of range)

FWIW, I suspect that we're going to have the biggest problems in the
optimizer. It's not as if ndistinct is in any way reliable. That may
matter more on average than it has with other path types.

--
Peter Geoghegan


From: Andres Freund <andres(at)anarazel(dot)de>
To: jesper(dot)pedersen(at)redhat(dot)com,Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>,Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>,pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>,Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 21:44:19
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On August 16, 2018 8:28:45 PM GMT+02:00, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>One thing to consider is the pluggable storage patch, which is a lot
>more important than this patch. I don't want this patch to get in the
>way of that work, so it may have to wait a bit in order to see any new
>potential requirements.

I don't think there would be a meaningful amount of conflict between the two, relative to the size of the patch sets. So I don't think we have to consider relative importance (which I don't think is that easy to assess in this case).

Fwiw, I've a significantly further revised version of the tableam patch that I plan to send in a few days. Ported the current zheap patch as a separate AM which helped weed out a lot of issues.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-16 23:10:56
Message-ID: CAEepm=0S1JUUwKCK4jhmtTWm9VUTsCe3+b5tSvddpMpkzas7Hw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 17, 2018 at 7:48 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> * groups and certain aggregates (MIN() and MAX() of suffix index
>> columns within each group)
>> * index scans where the scan key doesn't include the leading columns
>> (but you expect there to be sufficiently few values)
>> * merge joins (possibly the trickiest and maybe out of range)
>
> FWIW, I suspect that we're going to have the biggest problems in the
> optimizer. It's not as if ndistinct is in any way reliable. That may
> matter more on average than it has with other path types.

Can you give an example of problematic ndistinct underestimation?

I suppose you might be able to defend against that in the executor: if
you find that you've done an unexpectedly high number of skips, you
could fall back to regular next-tuple mode. Unfortunately that'd
require the parent plan node to tolerate non-unique results.

I noticed that the current patch doesn't care about restrictions on
the range (SELECT DISTINCT a FROM t WHERE a BETWEEN 500 and 600), but
that causes it to overestimate the number of btree searches, which is
a less serious problem (it might not choose a skip scan when it would
have been better).

--
Thomas Munro
http://www.enterprisedb.com


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-17 17:15:46
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Peter,

On 08/16/2018 03:48 PM, Peter Geoghegan wrote:
> On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> * groups and certain aggregates (MIN() and MAX() of suffix index
>> columns within each group)
>> * index scans where the scan key doesn't include the leading columns
>> (but you expect there to be sufficiently few values)
>> * merge joins (possibly the trickiest and maybe out of range)
>
> FWIW, I suspect that we're going to have the biggest problems in the
> optimizer. It's not as if ndistinct is in any way reliable. That may
> matter more on average than it has with other path types.
>

Thanks for sharing this; it is very useful to know.

Best regards,
Jesper


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-08-17 17:52:05
Message-ID: CAH2-Wzkk0+vSAiaRSkX84KR0SofKkY1yGHEcLoRBXT-JjM37bw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 16, 2018 at 4:10 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Can you give an example of problematic ndistinct underestimation?

Yes. See https://postgr.es/m/CAKuK5J12QokFh88tQz-oJMSiBg2QyjM7K7HLnbYi3Ze+Y5BtWQ@mail.gmail.com,
for example. That's a complaint about an underestimation specifically.

This seems to come up about once every 3 years, at least from my
perspective. I'm always surprised that ndistinct doesn't get
implicated in bad query plans more frequently.

> I suppose you might be able to defend against that in the executor: if
> you find that you've done an unexpectedly high number of skips, you
> could fall back to regular next-tuple mode. Unfortunately that's
> require the parent plan node to tolerate non-unique results.

I like the idea of dynamic fallback in certain situations, but the
details always seem complicated.

--
Peter Geoghegan


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: pg(at)bowt(dot)ie
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, bhushan(dot)uparkar(at)gmail(dot)com, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-10 21:47:06
Message-ID: CA+q6zcV-wgnSMu-YOEmyXVckncQyriiKtUZXm=Gx2s+FV=5MDw@mail.gmail.com
Lists: pgsql-hackers

> On Mon, 18 Jun 2018 at 17:26, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> I took Thomas Munro's previous patch [2] on the subject, added a GUC, a
> test case, documentation hooks, minor code cleanups, and made the patch
> pass an --enable-cassert make check-world run. So, the overall design is
> the same.

I've looked through the patch more closely, and have a few questions:

* Is there any reason why only the copy function for the IndexOnlyScan
node includes an implementation for distinctPrefix (see the sketch of
the missing pieces at the end of this message)? Without read/out
support, skip doesn't work for parallel scans, and it looks like this:

=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
QUERY PLAN

-------------------------------------------------------------------------------
Gather (cost=1000.43..1001.60 rows=3 width=4)
(actual time=11.054..17672.010 rows=10000000 loops=1)
Output: b
Workers Planned: 1
Workers Launched: 1
Single Copy: true
Buffers: shared hit=91035 read=167369
-> Index Skip Scan using idx_t1_b on public.t1
(cost=0.43..1.30 rows=3 width=4)
(actual time=1.350..16065.165 rows=10000000 loops=1)
Output: b
Heap Fetches: 10000000
Buffers: shared hit=91035 read=167369
Worker 0: actual time=1.350..16065.165 rows=10000000 loops=1
Buffers: shared hit=91035 read=167369
Planning Time: 0.394 ms
Execution Time: 6037.800 ms

and with this functionality it gets better:

=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
QUERY PLAN

-------------------------------------------------------------------------------
Gather (cost=1000.43..1001.60 rows=3 width=4)
(actual time=3.564..4.427 rows=3 loops=1)
Output: b
Workers Planned: 1
Workers Launched: 1
Single Copy: true
Buffers: shared hit=4 read=10
-> Index Skip Scan using idx_t1_b on public.t1
(cost=0.43..1.30 rows=3 width=4)
(actual time=0.065..0.133 rows=3 loops=1)
Output: b
Heap Fetches: 3
Buffers: shared hit=4 read=10
Worker 0: actual time=0.065..0.133 rows=3 loops=1
Buffers: shared hit=4 read=10
Planning Time: 1.724 ms
Execution Time: 4.522 ms

* What is the purpose of HeapFetches? I don't see any usage of this variable
except assigning 0 to it once.
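
For the first point, what I mean is the usual trio of node support
functions (a sketch, assuming the field stays an int named
distinctPrefix as in the current patch):

/* copyfuncs.c, in _copyIndexOnlyScan() */
COPY_SCALAR_FIELD(distinctPrefix);

/* outfuncs.c, in _outIndexOnlyScan() */
WRITE_INT_FIELD(distinctPrefix);

/* readfuncs.c, in _readIndexOnlyScan() */
READ_INT_FIELD(distinctPrefix);

Without the out/read parts the field is lost when the plan is serialized
for parallel workers, which matches the behaviour shown above.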


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, bhushan(dot)uparkar(at)gmail(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-11 13:21:57
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Dmitry,

On 9/10/18 5:47 PM, Dmitry Dolgov wrote:
>> On Mon, 18 Jun 2018 at 17:26, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
> I've looked through the patch more closely, and have a few questions:
>

Thanks for your review !

> * Is there any reason why only copy function for the IndexOnlyScan node
> includes an implementation for distinctPrefix?

Oversight -- thanks for catching that.

> * What is the purpose of HeapFetches? I don't see any usage of this variable
> except assigning 0 to it once.
>

That can be removed.

New version WIP v2 attached.

Best regards,
Jesper

Attachment Content-Type Size
wip_indexskipscan_v2.patch text/x-patch 31.2 KB

From: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
To: jesper(dot)pedersen(at)redhat(dot)com, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, bhushan(dot)uparkar(at)gmail(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-13 13:01:13
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Jesper,

While testing this patch I noticed that the current implementation
doesn't perform well when we have lots of small groups of equal values.
Here is the execution time of index skip scan vs unique over index scan,
in ms, depending on the size of the group. The benchmark script is
attached.

group size    skip        unique
1             2,293.85    132.55
5             464.40      106.59
10            239.61      102.02
50            56.59       98.74
100           32.56       103.04
500           6.08        97.09

So, the current implementation can lead to performance regression, and
the choice of the plan depends on the notoriously unreliable ndistinct
statistics. The regression is probably because skip scan always does
_bt_search to find the next unique tuple. I think we can improve this,
and the skip scan can be strictly faster than index scan regardless of
the data. As a first approximation, imagine that we somehow skipped
equal tuples inside _bt_next instead of sending them to the parent
Unique node. This would already be marginally faster than Unique + Index
scan. A more practical implementation would be to remember our position
in tree (that is, BTStack returned by _bt_search) and use it to skip
pages in bulk. This looks straightforward to implement for a tree that
does not change, but I'm not sure how to make it work with concurrent
modifications. Still, this looks a worthwhile direction to me, because
if we have a strictly faster skip scan, we can just use it always and
not worry about our unreliable statistics. What do you think?

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
test-skip.sql application/sql 571 bytes

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, bhushan(dot)uparkar(at)gmail(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-13 15:39:46
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Alexander.

On 9/13/18 9:01 AM, Alexander Kuzmenkov wrote:
> While testing this patch

Thanks for the review !

> I noticed that current implementation doesn't
> perform well when we have lots of small groups of equal values. Here is
> the execution time of index skip scan vs unique over index scan, in ms,
> depending on the size of group. The benchmark script is attached.
>
> group size    skip        unique
> 1             2,293.85    132.55
> 5             464.40      106.59
> 10            239.61      102.02
> 50            56.59       98.74
> 100           32.56       103.04
> 500           6.08        97.09
>

Yes, this doesn't look good. Using your test case I'm seeing that unique
is being chosen when the group size is below 34, and skip above that.
This is with the standard initdb configuration; did you change something
else? Or did you force the default plan?

> So, the current implementation can lead to performance regression, and
> the choice of the plan depends on the notoriously unreliable ndistinct
> statistics.

Yes, Peter mentioned this, which I'm still looking at.

> The regression is probably because skip scan always does
> _bt_search to find the next unique tuple.

Very likely.

> I think we can improve this,
> and the skip scan can be strictly faster than index scan regardless of
> the data. As a first approximation, imagine that we somehow skipped
> equal tuples inside _bt_next instead of sending them to the parent
> Unique node. This would already be marginally faster than Unique + Index
> scan. A more practical implementation would be to remember our position
> in tree (that is, BTStack returned by _bt_search) and use it to skip
> pages in bulk. This looks straightforward to implement for a tree that
> does not change, but I'm not sure how to make it work with concurrent
> modifications. Still, this looks a worthwhile direction to me, because
> if we have a strictly faster skip scan, we can just use it always and
> not worry about our unreliable statistics. What do you think?
>

This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
have the scan key in BTScanOpaqueData. I'll take a look after my
upcoming vacation; feel free to contribute those changes in the meantime
of course.

Thanks again !

Best regards,
Jesper


From: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
To: jesper(dot)pedersen(at)redhat(dot)com
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, bhushan(dot)uparkar(at)gmail(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-13 19:36:24
Message-ID: [email protected]
Lists: pgsql-hackers

On 13/09/18 at 18:39, Jesper Pedersen wrote:
>
> Yes, this doesn't look good. Using your test case I'm seeing that
> unique is being chosen when the group size is below 34, and skip
> above. This is with the standard initdb configuration; did you change
> something else ? Or did you force the default plan ?

Sorry, I didn't mention this: the first column is indeed a forced skip
scan, just to see how it compares to the index scan.

> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we
> just have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the
> meantime of course.
>

I probably won't be able to contribute the changes, but I'll try to
review them.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: a(dot)kuzmenkov(at)postgrespro(dot)ru
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-15 19:52:39
Message-ID: CA+q6zcUW=GXiw24iiLee3ZhA4JY4wc1hmGE8+20v-W47sOQjNA@mail.gmail.com
Lists: pgsql-hackers

> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
>
> On 13/09/18 at 18:39, Jesper Pedersen wrote:
>
>> I think we can improve this,
>> and the skip scan can be strictly faster than index scan regardless of
>> the data. As a first approximation, imagine that we somehow skipped
>> equal tuples inside _bt_next instead of sending them to the parent
>> Unique node. This would already be marginally faster than Unique + Index
>> scan. A more practical implementation would be to remember our position
>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>> pages in bulk. This looks straightforward to implement for a tree that
>> does not change, but I'm not sure how to make it work with concurrent
>> modifications. Still, this looks a worthwhile direction to me, because
>> if we have a strictly faster skip scan, we can just use it always and
>> not worry about our unreliable statistics. What do you think?
>>
>
> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
> have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the meantime
> of course.

But having this logic inside _bt_next means that it will make a non-skip index
only scan a bit slower, am I right? It would probably be easier and more
straightforward to go with the idea of dynamic fallback then. The first naive
implementation that I came up with is to wrap the index scan node in a Unique
node, and to remember the estimated number of groups in IndexOnlyScanState, so
that we can check whether we have performed many more skips than expected. With
this approach an index skip scan will work a bit slower than in the original
patch when ndistinct is correct (because the Unique node will recheck the rows
we return), and fall back to Unique + index only scan when the planner has
underestimated ndistinct.
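
The executor side of it is essentially just a check of the following
shape (a simplified sketch rather than the exact code from the attached
patch; ioss_PlanRows is the planner's estimate of the number of groups,
the other field names here are illustrative, and the factor of 10 is
arbitrary):

/* in the index-only scan node, after performing a skip */
if (node->ioss_NumSkips > 10 * node->ioss_PlanRows)
{
    /*
     * The planner underestimated ndistinct badly: stop skipping and
     * return every tuple, letting the Unique node above remove the
     * duplicates.
     */
    node->ioss_SkipScan = false;
}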

Attachment Content-Type Size
index-skip-fallback.patch application/octet-stream 31.7 KB

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-27 13:59:50
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Dmitry,

On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
>> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
>> On 13/09/18 at 18:39, Jesper Pedersen wrote:
>>> I think we can improve this,
>>> and the skip scan can be strictly faster than index scan regardless of
>>> the data. As a first approximation, imagine that we somehow skipped
>>> equal tuples inside _bt_next instead of sending them to the parent
>>> Unique node. This would already be marginally faster than Unique + Index
>>> scan. A more practical implementation would be to remember our position
>>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>>> pages in bulk. This looks straightforward to implement for a tree that
>>> does not change, but I'm not sure how to make it work with concurrent
>>> modifications. Still, this looks a worthwhile direction to me, because
>>> if we have a strictly faster skip scan, we can just use it always and
>>> not worry about our unreliable statistics. What do you think?
>>>
>>
>> This is something to look at -- maybe there is a way to use
>> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
>> have the scan key in BTScanOpaqueData. I'll take a look after my
>> upcoming vacation; feel free to contribute those changes in the meantime
>> of course.
>
> But having this logic inside _bt_next means that it will make a non-skip index
> only scan a bit slower, am I right?

Correct.

> Probably it would be easier and more
> straightforward to go with the idea of dynamic fallback then. The first naive
> implementation that I came up with is to wrap an index scan node into a unique,
> and remember estimated number of groups into IndexOnlyScanState, so that we can
> check if we performed much more skips than expected. With this approach index
> skip scan will work a bit slower than in the original patch in case if
> ndistinct is correct (because a unique node will recheck rows we returned), and
> fallback to unique + index only scan in case if planner has underestimated
> ndistinct.
>

I think we need a comment on this in the patch, as 10 *
node->ioss_PlanRows looks a bit random.

Thanks for your contribution !

Best regards,
Jesper


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-09-27 20:28:34
Message-ID: CA+q6zcXdP=PCaVRx3OgpyKK+NgNfrnnSA66rkvxRLeZrkYjEaA@mail.gmail.com
Lists: pgsql-hackers

> On Thu, 27 Sep 2018 at 15:59, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> I think we need a comment on this in the patch, as 10 *
> node->ioss_PlanRows looks a bit random.

Yeah, you're right, it's a somewhat arbitrary number - we just need to make
sure that this estimate is not too small (to avoid false positives), but also
not too big (so as not to miss the proper point for fallback). I left it
uncommented mostly because I wanted to get some feedback on it first, and
probably some suggestions about how to make this estimate better.
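
To make it a bit more concrete, here is a rough standalone C sketch of what
that check amounts to (the struct and function names are invented for
illustration; this is not the actual patch code):

/* Hypothetical illustration of the execution-time fallback heuristic:
 * count the skips performed so far and give up on skipping once the count
 * exceeds a safety margin over the planner's distinct-group estimate. */
#include <stdbool.h>

typedef struct SkipScanState
{
    double plan_rows;   /* planner's estimate of the number of distinct groups */
    long   skips_done;  /* skips actually performed during execution */
} SkipScanState;

bool
skip_scan_should_fall_back(const SkipScanState *state)
{
    /* The factor 10 mirrors the "10 * node->ioss_PlanRows" check discussed
     * above: an arbitrary margin, large enough to tolerate a mild
     * underestimate but small enough to cut losses early when ndistinct is
     * far off. */
    return state->skips_done > 10 * state->plan_rows;
}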


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Thomas Munro <munro(at)ip9(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Subject: Re: Index Skip Scan
Date: 2018-10-09 13:42:24
Message-ID: 153909254478.1479.11809244127214809171.pgcf@coridan.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi

I tested last patch and I have some notes:

1.

postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
| QUERY PLAN |
+-------------------------------------------------------------------------------------------+
| Unique (cost=0.43..4367.56 rows=9983 width=4) |
| -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4) |
+-------------------------------------------------------------------------------------------+
(2 rows)

In this case the Unique node is useless and can be removed

2. It would be nice to have COUNT(DISTINCT) support, similar to the MIN, MAX support

3. The patched postgres crashed once, but I am not able to reproduce it.

This looks like a very interesting patch, and important for some BI platforms


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Thomas Munro <munro(at)ip9(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-10-09 13:58:09
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Pavel,

On 10/9/18 9:42 AM, Pavel Stehule wrote:
> I tested last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> | QUERY PLAN |
> +-------------------------------------------------------------------------------------------+
> | Unique (cost=0.43..4367.56 rows=9983 width=4) |
> | -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4) |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case Unique node is useless and can be removed
>
> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport
>
> 3. Once time patched postgres crashed, but I am not able to reproduce it.
>

Please send that query through if you can replicate it. The patch
currently passes an assert-enabled check-world, so your query clearly
triggered something that isn't covered yet.

> Looks like very interesting patch, and important for some BI platforms
>

Thanks for your review !

Best regards,
Jesper


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, munro(at)ip9(dot)org, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Subject: Re: Index Skip Scan
Date: 2018-10-09 13:59:28
Message-ID: CA+q6zcVZDb5Ghexe4FuLD139x1c2W4e1b3at6=OkE_sab=XJ+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>
> Hi
>
> I tested last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> | QUERY PLAN |
> +-------------------------------------------------------------------------------------------+
> | Unique (cost=0.43..4367.56 rows=9983 width=4) |
> | -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4) |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case Unique node is useless and can be removed

Just to clarify, which version exactly were you testing? If it was
index-skip-fallback.patch, then the Unique node was added there to address
the situation when ndistinct is underestimated, with the idea of falling
back to the original plan (and to tolerate that I suggested using Unique,
since we don't know at planning time whether the fallback will happen or not).

> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport

Yep, as far as I understand MIN/MAX is going to be the next step after this
patch is accepted.

> 3. Once time patched postgres crashed, but I am not able to reproduce it.

Maybe you have at least some idea of what could have caused it, or of a
possible way to reproduce it, even if it doesn't work anymore?


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <munro(at)ip9(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Subject: Re: Index Skip Scan
Date: 2018-10-09 14:13:03
Message-ID: CAFj8pRCUT4ycXo2BW1JAKt82Zh7kOjH2TiHgqWS41uViPPW5SA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

út 9. 10. 2018 v 15:59 odesílatel Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
napsal:

> > On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote:
> >
> > Hi
> >
> > I tested last patch and I have some notes:
> >
> > 1.
> >
> > postgres=# explain select distinct a10000 from foo;
> >
> +-------------------------------------------------------------------------------------------+
> > | QUERY PLAN
> |
> >
> +-------------------------------------------------------------------------------------------+
> > | Unique (cost=0.43..4367.56 rows=9983 width=4)
> |
> > | -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60
> rows=9983 width=4) |
> >
> +-------------------------------------------------------------------------------------------+
> > (2 rows)
> >
> > In this case Unique node is useless and can be removed
>
> Just to clarify which exactly version were you testing? If
> index-skip-fallback.patch,
> then the Unique node was added there to address the situation when
> ndistinct is underestimated, with an idea to fallback to original plan
> (and to tolerate that I suggested to use Unique, since we don't know
> if fallback will happen or not during the planning).
>

I tested index-skip-fallback.patch.

It looks like a good idea, but then the node should be named "index scan" and
the other info can be displayed in the detail parts, similar to "sort".

The combination of unique and index skip scan looks strange :)

> > 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport
>
> Yep, as far as I understand MIN/MAX is going to be the next step after this
> patch will be accepted.
>

ok

Now that the development cycle is starting, maybe it can use the same
infrastructure as MIN/MAX, and this part can be short.

Even more so if you use a dynamic index scan.

> > 3. Once time patched postgres crashed, but I am not able to reproduce it.
>
> Maybe you have at least some ideas what could cause that or what's the
> possible
> way to reproduce that doesn't work anymore?
>

I think it was a query like

select count(*) from (select distinct x from tab) s


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <munro(at)ip9(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Subject: Re: Index Skip Scan
Date: 2018-10-09 16:12:31
Message-ID: CAFj8pRD1BgN2VsBqd2aQp852FxGrb-Vwfry4ASDXK7q-SXdQDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

út 9. 10. 2018 v 16:13 odesílatel Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
napsal:

>
>
> út 9. 10. 2018 v 15:59 odesílatel Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
> napsal:
>
>> > On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
>> wrote:
>> >
>> > Hi
>> >
>> > I tested last patch and I have some notes:
>> >
>> > 1.
>> >
>> > postgres=# explain select distinct a10000 from foo;
>> >
>> +-------------------------------------------------------------------------------------------+
>> > | QUERY PLAN
>> |
>> >
>> +-------------------------------------------------------------------------------------------+
>> > | Unique (cost=0.43..4367.56 rows=9983 width=4)
>> |
>> > | -> Index Skip Scan using foo_a10000_idx on foo
>> (cost=0.43..4342.60 rows=9983 width=4) |
>> >
>> +-------------------------------------------------------------------------------------------+
>> > (2 rows)
>> >
>> > In this case Unique node is useless and can be removed
>>
>> Just to clarify which exactly version were you testing? If
>> index-skip-fallback.patch,
>> then the Unique node was added there to address the situation when
>> ndistinct is underestimated, with an idea to fallback to original plan
>> (and to tolerate that I suggested to use Unique, since we don't know
>> if fallback will happen or not during the planning).
>>
>
> I tested index-skip-fallback.patch.
>
> It looks like good idea, but then the node should be named "index scan"
> and other info can be displayed in detail parts. It can be similar like
> "sort".
>
> The combination of unique and index skip scan looks strange :)
>

Maybe we don't need a special index skip scan node - the possibility of
returning unique values from the index scan node could be good enough,
something like a "distinct index scan" - and the implementation there can be
dynamic: skip scan or classic index scan.

"Index skip scan" is not a good name if the implementation is dynamic.

>
>> > 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport
>>
>> Yep, as far as I understand MIN/MAX is going to be the next step after
>> this
>> patch will be accepted.
>>
>
> ok
>
> Now, the development cycle is starting - maybe it can use same
> infrastructure like MIN/MAX and this part can be short.
>
> more if you use dynamic index scan
>
>
>> > 3. Once time patched postgres crashed, but I am not able to reproduce
>> it.
>>
>> Maybe you have at least some ideas what could cause that or what's the
>> possible
>> way to reproduce that doesn't work anymore?
>>
>
> I think it was query like
>
> select count(*) from (select distinct x from tab) s
>
>
>


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <munro(at)ip9(dot)org>
Subject: Re: Index Skip Scan
Date: 2018-10-10 15:34:27
Message-ID: CA+q6zcWd1EjYGpV9cqTU6S10rvu4LEVbGm_g_6yEA9LeH+afNA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>
>> It looks like good idea, but then the node should be named "index scan" and
>> other info can be displayed in detail parts. It can be similar like "sort".
>> The combination of unique and index skip scan looks strange :)
>
> maybe we don't need special index skip scan node - maybe possibility to
> return unique values from index scan node can be good enough - some like
> "distinct index scan" - and the implementation there can be dynamic -skip
> scan, classic index scan,
>
> "index skip scan" is not good name if the implementaion is dynamic.

Yeah, that's a valid point. The good part is that index skip scan is not really
a separate node, but just an enhanced index only scan node. So indeed maybe it
would be better to call it Index Only Scan, but show in the details that we
apply the skip scan strategy. Any other opinions about this?

>> I think it was query like
>> select count(*) from (select distinct x from tab) s

Thanks, I'll take a look.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, bhushan(dot)uparkar(at)gmail(dot)com, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-10-12 17:43:48
Message-ID: CA+TgmoY7QTHhzLWZupNSyyqFRBfMgYocg3R-6g=DRgT4-KBGqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:
> > I noticed that current implementation doesn't
> > perform well when we have lots of small groups of equal values. Here is
> > the execution time of index skip scan vs unique over index scan, in ms,
> > depending on the size of group. The benchmark script is attached.
> >
> > group size skip unique
> > 1 2,293.85 132.55
> > 5 464.40 106.59
> > 10 239.61 102.02
> > 50 56.59 98.74
> > 100 32.56 103.04
> > 500 6.08 97.09
> >
>
> Yes, this doesn't look good. Using your test case I'm seeing that unique
> is being chosen when the group size is below 34, and skip above.

I'm not sure exactly how the current patch is approaching the problem,
but it seems like it might be reasonable to do something like -- look
for a distinct item within the current page; if not found, then search
from the root of the tree for the next item > the current item.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-10-16 19:22:18
Message-ID: CA+q6zcV5XmtTFRVKdkK6nhd4Xf4D-SzzLHjvkkXFdnQJBV-pNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Fri, 12 Oct 2018 at 19:44, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
> > > I noticed that current implementation doesn't
> > > perform well when we have lots of small groups of equal values. Here is
> > > the execution time of index skip scan vs unique over index scan, in ms,
> > > depending on the size of group. The benchmark script is attached.
> > >
> > > group size skip unique
> > > 1 2,293.85 132.55
> > > 5 464.40 106.59
> > > 10 239.61 102.02
> > > 50 56.59 98.74
> > > 100 32.56 103.04
> > > 500 6.08 97.09
> > >
> >
> > Yes, this doesn't look good. Using your test case I'm seeing that unique
> > is being chosen when the group size is below 34, and skip above.
>
> I'm not sure exactly how the current patch is approaching the problem,
> but it seems like it might be reasonable to do something like -- look
> for a distinct item within the current page; if not found, then search
> from the root of the tree for the next item > the current item.

I'm not sure that I understand it correctly, can you please elaborate? From
what I see it's quite similar to what's already implemented - we look for a
distinct item on the page, and then search the index tree for the next
distinct item.

> On Wed, 10 Oct 2018 at 17:34, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
> >
> >> It looks like good idea, but then the node should be named "index scan" and
> >> other info can be displayed in detail parts. It can be similar like "sort".
> >> The combination of unique and index skip scan looks strange :)
> >
> > maybe we don't need special index skip scan node - maybe possibility to
> > return unique values from index scan node can be good enough - some like
> > "distinct index scan" - and the implementation there can be dynamic -skip
> > scan, classic index scan,
> >
> > "index skip scan" is not good name if the implementaion is dynamic.
>
> Yeah, that's a valid point. The good part is that index skip scan is not really
> a separate node, but just enhanced index only scan node. So indeed maybe it
> would be better to call it Index Only Scan, but show in details that we apply
> the skip scan strategy. Any other opinions about this?

To make it clearer what I mean, see the attached version of the patch.

> >> I think it was query like
> >> select count(*) from (select distinct x from tab) s
>
> Thanks, I'll take a look.

I couldn't reproduce it yet either.

Attachment Content-Type Size
index-skip-fallback-v2.patch application/octet-stream 31.7 KB

From: Sergei Kornilov <sk(at)zsrv(dot)org>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, "a(dot)kuzmenkov(at)postgrespro(dot)ru" <a(dot)kuzmenkov(at)postgrespro(dot)ru>, "pg(at)bowt(dot)ie" <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-11-12 12:28:58
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello

The last published patch, index-skip-fallback-v2, applies and builds cleanly.

I found a reproducible crash due to an assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> create table tablename (i int primary key);
> select distinct i from tablename where i = 1;
The query is obviously strange, but this is a bug.

Also I noticed two TODOs in the documentation.

regards, Sergei


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: sk(at)zsrv(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-11-12 12:55:28
Message-ID: CA+q6zcXAZTPKe4q-AhPzNjRmBNO2divKbDuuco4ROiXiUgn=sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk(at)zsrv(dot)org> wrote:
>
> I found reproducible crash due assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> > create table tablename (i int primary key);
> > select distinct i from tablename where i = 1;
> Query is obviously strange, but this is bug.

Wow, thanks a lot! I can reproduce it too, will fix it.


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: sk(at)zsrv(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-11-14 16:48:47
Message-ID: CA+q6zcWQyiFOkH9JC_tNgRXXHvRSq+a9De8VAB3fJun6VJQQOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Mon, 12 Nov 2018 at 13:55, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk(at)zsrv(dot)org> wrote:
> >
> > I found reproducible crash due assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> > > create table tablename (i int primary key);
> > > select distinct i from tablename where i = 1;
> > Query is obviously strange, but this is bug.
>
> Wow, thanks a lot! I can reproduce it too, will fix it.

Yep, we had to check the number of distinct columns too; here is the fixed
patch (with a slightly more verbose commit message).

Attachment Content-Type Size
0001-Index-skip-scan-with-fallback-v3.patch application/octet-stream 34.8 KB

From: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
To: jesper(dot)pedersen(at)redhat(dot)com, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-11-15 11:41:41
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 9/27/18 16:59, Jesper Pedersen wrote:
> Hi Dmitry,
>
> On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
>>> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov
>>> <a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
>>> El 13/09/18 a las 18:39, Jesper Pedersen escribió:
>>>> I think we can improve this,
>>>> and the skip scan can be strictly faster than index scan regardless of
>>>> the data.
>>>> <...>
>>>>
>>>
>>> This is something to look at -- maybe there is a way to use
>>> btpo_next/btpo_prev instead/too in order to speed things up. Atm we
>>> just
>>> have the scan key in BTScanOpaqueData. I'll take a look after my
>>> upcoming vacation; feel free to contribute those changes in the
>>> meantime
>>> of course.
>>
>> But having this logic inside _bt_next means that it will make a
>> non-skip index
>> only scan a bit slower, am I right?
>
> Correct.

Well, it depends on how the skip scan is implemented. We don't have to
make normal scans slower, because skip scan is just a separate thing.

My main point was that the current implementation is good as a proof of
concept, but it is inefficient for some data and needs some unreliable
planner logic to work around this inefficiency. And now we also have
execution-time fallback because planning-time fallback isn't good
enough. This looks overly complicated. Let's try to design an algorithm
that works well regardless of the particular data and doesn't need these
heuristics. It should be possible to do so for btree.

Of course, I understand the reluctance to implement an entire new type
of btree traversal. Anastasia Lubennikova suggested a tweak for the
current method that may improve the performance for small groups of
equal values. When we search for the next unique key, first check if it
is contained on the current btree page using its 'high key'. If it is,
find it on the current page. If not, search from the root of the tree
like we do now. This should improve the performance for small equal
groups, because there are going to be several such groups on the page.
And this is exactly where we have the regression compared to unique +
index scan.
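
For illustration only, here is a minimal standalone C sketch of that tweak
(the types and helper are invented, this is not the nbtree code; a real
implementation would binary search the page rather than scan it linearly):

/* Toy model of the "check the current page first" idea: the high key is an
 * upper bound for everything on the page, so if it does not exceed the
 * current value there is no point scanning the page and we re-descend from
 * the root; otherwise the next distinct value may well be on this page. */
typedef struct LeafPage
{
    const int *keys;    /* sorted keys on this leaf page */
    int        nkeys;
    int        highkey; /* upper bound for all keys on this page */
} LeafPage;

/* Returns the offset of the first key greater than curval, or -1 if the
 * caller should fall back to searching from the root of the tree. */
int
next_distinct_on_page(const LeafPage *page, int curval)
{
    if (page->highkey <= curval)
        return -1;                          /* nothing larger can be here */

    for (int i = 0; i < page->nkeys; i++)   /* binary search in reality */
        if (page->keys[i] > curval)
            return i;

    return -1;                              /* larger values start on a later page */
}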

By the way, what is the data for which we intend this feature to work?
Obviously a non-unique btree index, but how wide are the tuples, and how
big are the equal groups? It would be good to have some real-world examples.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: James Coleman <jtc331(at)gmail(dot)com>
To: a(dot)kuzmenkov(at)postgrespro(dot)ru
Cc: jesper(dot)pedersen(at)redhat(dot)com, 9erthalion6(at)gmail(dot)com, pg(at)bowt(dot)ie, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-11-15 14:28:03
Message-ID: CAAaqYe_+q=iqDzd8cJsYpcG+4JtK1_zdv01EKtLJFZ4bcpmQHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Is skip scan only possible for index-only scan?

I haven't seen much discussion of this question yet. Is there a
particular reason to lock ourselves into thinking about this only in
an index only scan?

>> I think we can improve this,
>> and the skip scan can be strictly faster than index scan regardless of
>> the data. As a first approximation, imagine that we somehow skipped
>> equal tuples inside _bt_next instead of sending them to the parent
>> Unique node. This would already be marginally faster than Unique + Index
>> scan. A more practical implementation would be to remember our position
>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>> pages in bulk. This looks straightforward to implement for a tree that
>> does not change, but I'm not sure how to make it work with concurrent
>> modifications. Still, this looks a worthwhile direction to me, because
>> if we have a strictly faster skip scan, we can just use it always and
>> not worry about our unreliable statistics. What do you think?
>>
>
> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
> have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the meantime
> of course.

It seems to me also that the logic necessary for this kind of
traversal has other useful applications. For example, it should be
possible to build on that logic to allow an index like t(owner_fk,
created_at) to be used to execute the following query:

select *
from t
where owner_fk in (1,2,3)
order by created_at
limit 25

without needing to fetch all tuples satisfying "owner_fk in (1,2,3)"
and subsequently sorting them.


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2018-11-16 15:06:09
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
>>> But having this logic inside _bt_next means that it will make a
>>> non-skip index
>>> only scan a bit slower, am I right?
>>
>> Correct.
>
> Well, it depends on how the skip scan is implemented. We don't have to
> make normal scans slower, because skip scan is just a separate thing.
>
> My main point was that current implementation is good as a proof of
> concept, but it is inefficient for some data and needs some unreliable
> planner logic to work around this inefficiency. And now we also have
> execution-time fallback because planning-time fallback isn't good
> enough. This looks overly complicated. Let's try to design an algorithm
> that works well regardless of the particular data and doesn't need these
> heuristics. It should be possible to do so for btree.
>
> Of course, I understand the reluctance to implement an entire new type
> of btree traversal. Anastasia Lubennikova suggested a tweak for the
> current method that may improve the performance for small groups of
> equal values. When we search for the next unique key, first check if it
> is contained on the current btree page using its 'high key'. If it is,
> find it on the current page. If not, search from the root of the tree
> like we do now. This should improve the performance for small equal
> groups, because there are going to be several such groups on the page.
> And this is exactly where we have the regression compared to unique +
> index scan.
>

Robert suggested something similar in [1]. I'll try and look at that
when I'm back from my holiday.

> By the way, what is the data for which we intend this feature to work?
> Obviously a non-unique btree index, but how wide are the tuples, and how
> big the equal groups? It would be good to have some real-world examples.
>

Although my primary use-case is int, I agree that we should test
different data types and tuple widths.

[1]
https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com

Best regards,
Jesper


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, a(dot)kuzmenkov(at)postgrespro(dot)ru
Cc: pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, jtc331(at)gmail(dot)com
Subject: Re: Index Skip Scan
Date: 2018-11-17 23:27:07
Message-ID: CA+q6zcW4vZrRivgX47yv+uh+ND0A09kMz4yCa-N6jY69Vs-VjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Fri, 16 Nov 2018 at 16:06, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
> >>> But having this logic inside _bt_next means that it will make a
> >>> non-skip index
> >>> only scan a bit slower, am I right?
> >>
> >> Correct.
> >
> > Well, it depends on how the skip scan is implemented. We don't have to
> > make normal scans slower, because skip scan is just a separate thing.
> >
> > My main point was that current implementation is good as a proof of
> > concept, but it is inefficient for some data and needs some unreliable
> > planner logic to work around this inefficiency. And now we also have
> > execution-time fallback because planning-time fallback isn't good
> > enough. This looks overly complicated. Let's try to design an algorithm
> > that works well regardless of the particular data and doesn't need these
> > heuristics. It should be possible to do so for btree.
> >
> > Of course, I understand the reluctance to implement an entire new type
> > of btree traversal. Anastasia Lubennikova suggested a tweak for the
> > current method that may improve the performance for small groups of
> > equal values. When we search for the next unique key, first check if it
> > is contained on the current btree page using its 'high key'. If it is,
> > find it on the current page. If not, search from the root of the tree
> > like we do now. This should improve the performance for small equal
> > groups, because there are going to be several such groups on the page.
> > And this is exactly where we have the regression compared to unique +
> > index scan.
>
> Robert suggested something similar in [1]. I'll try and look at that
> when I'm back from my holiday.

Yeah, you're probably right. Unfortunately, I had misunderstood Robert's
previous message in this thread, which suggested a similar approach. Jesper,
I hope you don't mind if I post an updated patch? _bt_skip is changed there
in the suggested way, so that it checks the current page before searching
from the root of the tree, and I've removed the fallback logic. After some
initial tests I see that with this version a skip scan over a table with 10^7
rows and 10^6 distinct values is slightly slower than a regular scan, but not
by much.

> > By the way, what is the data for which we intend this feature to work?
> > Obviously a non-unique btree index, but how wide are the tuples, and how
> > big the equal groups? It would be good to have some real-world examples.
>
> Although my primary use-case is int I agree that we should test the
> different data types, and tuple widths.

My personal motivation here is exactly that we face use-cases for skip scan
from time to time. Usually it's quite a small number of distinct values (up
to a dozen or so, which means the equal groups are quite big), but with a
variety of types and widths.

> On Thu, 15 Nov 2018 at 15:28, James Coleman <jtc331(at)gmail(dot)com> wrote:
>
> > Is skip scan only possible for index-only scan?
>
> I haven't see much discussion of this question yet. Is there a
> particular reason to lock ourselves into thinking about this only in
> an index only scan?

I guess, the only reason is to limit the scope of the patch.

Attachment Content-Type Size
0001-Index-skip-scan-v4.patch application/octet-stream 35.4 KB

From: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, jtc331(at)gmail(dot)com
Subject: Re: Index Skip Scan
Date: 2018-11-21 15:38:55
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/18/18 02:27, Dmitry Dolgov wrote:
>
> [0001-Index-skip-scan-v4.patch]

I ran a couple of tests on this, please see the cases below. As before,
I'm setting total_cost = 1 for index skip scan so that it is chosen.
Case 1 breaks because we determine the high key incorrectly; it is the
second tuple on the page or something like that, not the last tuple. Case 2
is a backwards scan, and I don't understand how it is supposed to work. We
call _bt_search(nextKey = ScanDirectionIsForward), so it seems that it
just fetches the previous tuple like the regular scan does.

case 1:
# create table t as select generate_series(1, 1000000) a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t) d;
                               QUERY PLAN
-------------------------------------------------------------------------
 Aggregate  (cost=3.50..3.51 rows=1 width=8)
   ->  Index Only Scan using ta on t  (cost=0.42..1.00 rows=200 width=4)
         Scan mode: Skip scan
(3 rows)

postgres=# select count(*) from (select distinct a from t) d;
 count
--------
 500000 -- should be 1000000
(1 row)

case 2:
# create table t as select generate_series(1, 1000000) / 2 a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t order by a desc) d;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=5980.81..5980.82 rows=1 width=8)
   ->  Index Only Scan Backward using ta on t  (cost=0.42..1.00 rows=478385 width=4)
         Scan mode: Skip scan
(3 rows)

# select count(*) from (select distinct a from t order by a desc) d;
 count
--------
 502733 -- should be 500k

(1 row)

--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: a(dot)kuzmenkov(at)postgrespro(dot)ru
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pg(at)bowt(dot)ie, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2018-11-21 20:56:34
Message-ID: CA+q6zcW+iT5xkA7meNm4H=8Dw-Qb_YJwtVDN3Zyfju=GiKS6UQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
>
> On 11/18/18 02:27, Dmitry Dolgov wrote:
> >
> > [0001-Index-skip-scan-v4.patch]
>
> I ran a couple of tests on this, please see the cases below. As before,
> I'm setting total_cost = 1 for index skip scan so that it is chosen.
> Case 1 breaks because we determine the high key incorrectly, it is the
> second tuple on page or something like that, not the last tuple.

From what I see it wasn't about the high key, just a regular off-by-one error.
But anyway, thanks for noticing - for some reason it wasn't always
reproducible for me, so I missed this issue. Please find the fixed patch
attached. Also, I think it invalidates my previous performance tests, so I
would appreciate it if you could check it out too.

> Case 2
> is backwards scan, I don't understand how it is supposed to work. We
> call _bt_search(nextKey = ScanDirectionIsForward), so it seems that it
> just fetches the previous tuple like the regular scan does.

Well, no, it's called with ScanDirectionIsForward(dir). But as far as I
remember from the previous discussions, the entire topic of backward scans is
questionable for this patch, so I'll try to invest some time in it.

Attachment Content-Type Size
0001-Index-skip-scan-v5.patch application/octet-stream 35.8 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2018-12-04 03:26:31
Message-ID: CAH2-WzksysZ7nL4UnUsH4Sfak-XNOw3npCV25_dV16Q5MMgoGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Well, no, it's callled with ScanDirectionIsForward(dir). But as far as I
> remember from the previous discussions the entire topic of backward scan is
> questionable for this patch, so I'll try to invest some time in it.

Another thing that I think is related to skip scans that you should be
aware of is dynamic prefix truncation, which I started a thread on
just now [1]. While I see one big problem with the POC patch I came up
with, I think that that optimization is going to be something that
ends up happening at some point. Repeatedly descending a B-Tree when
the leading column is very low cardinality can be made quite a lot
less expensive by dynamic prefix truncation. Actually, it's almost a
perfect case for it.
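
As a toy standalone C illustration of the general idea (nothing to do with
the actual POC patch): once the leading attributes are known to be equal
within the current search range, the comparison can simply start past them.

#include <stdio.h>

#define NATTS 3

/* Compare two composite keys, but start at skip_attrs because the leading
 * attributes are already known to be equal within the current range. */
int
compare_from(const int *probe, const int *tuple, int skip_attrs)
{
    for (int i = skip_attrs; i < NATTS; i++)
    {
        if (probe[i] != tuple[i])
            return (probe[i] < tuple[i]) ? -1 : 1;
    }
    return 0;
}

int
main(void)
{
    int probe[NATTS] = {1, 42, 7};
    int tuple[NATTS] = {1, 42, 9};

    /* Skip the low-cardinality leading column during the comparison. */
    printf("%d\n", compare_from(probe, tuple, 1));  /* prints -1 */
    return 0;
}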

I'm not asking anybody to do anything with that information. "Big
picture" thinking seems particularly valuable when working on the
B-Tree code; I don't want anybody to miss a possible future
opportunity.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
--
Peter Geoghegan


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2018-12-20 13:46:09
Message-ID: CA+q6zcVPs2vA=ym5tkj7JJo2pjKRnJgYn91+PDkpmCTDwz=oyw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Nov 21, 2018 at 9:56 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
> >
> > On 11/18/18 02:27, Dmitry Dolgov wrote:
> > >
> > > [0001-Index-skip-scan-v4.patch]
> >
> > I ran a couple of tests on this, please see the cases below. As before,
> > I'm setting total_cost = 1 for index skip scan so that it is chosen.
> > Case 1 breaks because we determine the high key incorrectly, it is the
> > second tuple on page or something like that, not the last tuple.
>
> From what I see it wasn't about the high key, just a regular off by one error.
> But anyway, thanks for noticing - for some reason it wasn't always
> reproduceable for me, so I missed this issue. Please find fixed patch attached.
> Also I think it invalidates my previous performance tests, so I would
> appreciate if you can check it out too.

I've performed some testing, and on my environment with a dataset of 10^7
records:

* everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip
scan.

* above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique
records it was about 20% slower than the regular index scan.

For me these numbers sound good, since even in the quite extreme case of
approximately 10 records per group the performance of index skip scan is close
to that of the regular index only scan.

> On Tue, Dec 4, 2018 at 4:26 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > Well, no, it's callled with ScanDirectionIsForward(dir). But as far as I
> > remember from the previous discussions the entire topic of backward scan is
> > questionable for this patch, so I'll try to invest some time in it.
>
> Another thing that I think is related to skip scans that you should be
> aware of is dynamic prefix truncation, which I started a thread on
> just now [1]. While I see one big problem with the POC patch I came up
> with, I think that that optimization is going to be something that
> ends up happening at some point. Repeatedly descending a B-Tree when
> the leading column is very low cardinality can be made quite a lot
> less expensive by dynamic prefix truncation. Actually, it's almost a
> perfect case for it.

Thanks, sounds cool. I'll try it out as soon as I have some spare time.


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-01-26 17:45:54
Message-ID: CA+q6zcVFc3+Wm_EfsEif5oy9HKxA+wm9Z2iPSzecSCBKANg5Mw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Thu, Dec 20, 2018 at 2:46 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> I've performed some testing, and on my environment with a dataset of 10^7
> records:
>
> * everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip
> scan.
>
> * above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique
> records it was about 20% slower than the regular index scan.
>
> For me these numbers sound good, since even in quite extreme case of
> approximately 10 records per group the performance of index skip scan is close
> to the same for the regular index only scan.

Rebased version after rd_amroutine was renamed.

Attachment Content-Type Size
0001-Index-skip-scan-v6.patch application/octet-stream 44.8 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-01-27 17:17:46
Message-ID: CA+q6zcWx9d+QWyM7aLLKmDAa4+PC+x-v3bKmVxXsgH_EcJ7eiA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> Rebased version after rd_amroutine was renamed.

And one more to fix the documentation. Also, I've noticed a few TODOs in the
patch about missing docs, and replaced them with the required explanation of
the feature.

Attachment Content-Type Size
0001-Index-skip-scan-v7.patch application/octet-stream 36.6 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-01-30 17:19:05
Message-ID: CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sun, Jan 27, 2019 at 6:17 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> >
> > Rebased version after rd_amroutine was renamed.
>
> And one more to fix the documentation. Also I've noticed few TODOs in the patch
> about the missing docs, and replaced them with a required explanation of the
> feature.

A bit of adjustment after nodes/relation -> nodes/pathnodes.

Attachment Content-Type Size
0001-Index-skip-scan-v8.patch application/octet-stream 36.6 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: 9erthalion6(at)gmail(dot)com
Cc: jesper(dot)pedersen(at)redhat(dot)com, a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru, jtc331(at)gmail(dot)com
Subject: Re: Index Skip Scan
Date: 2019-01-31 06:31:53
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w(at)mail(dot)gmail(dot)com>
> A bit of adjustment after nodes/relation -> nodes/pathnodes.

I had a look on this.

The name "index skip scan" is a different feature from the
feature with the name on other prodcuts, which means "index scan
with postfix key (of mainly of multi column key) that scans
ignoring the prefixing part" As Thomas suggested I'd suggest that
we call it "index hop scan". (I can accept Hopscotch, either:p)

Also as mentioned upthread by Peter Geoghegan, this could easly
give worse plan by underestimation. So I also suggest that this
has dynamic fallback function. In such perspective it is not
suitable for AM API level feature.

If all leaf pages are on the buffer and the average hopping
distance is less than (expectedly) 4 pages (the average height of
the tree), the skip scan will lose. If almost all leaf pages are
staying on disk, we could win only by 2-pages step (skipping over
one page).

=====
As I was writing the above, I came to think that it would be better to
implement this as a pure executor optimization.

Specifically, let _bt_steppage count the ratio of skipped pages so far;
if the ratio exceeds some threshold (maybe around 3/4), it gets into
hopscotching mode, where it uses an index search to find the next page
(rather than traversing). As mentioned above, I think using a skip scan
to go beyond the next page is a good bet. If the success ratio of skip
scans goes below some threshold (I'm not sure what yet), we should fall
back to traversing.

Any opinions?
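
To make that a bit more concrete, a rough standalone C sketch of the
bookkeeping (names and thresholds are invented for illustration; this is not
a patch against _bt_steppage):

#include <stdbool.h>

typedef struct StepStats
{
    long pages_read;     /* leaf pages actually visited */
    long pages_skipped;  /* leaf pages jumped over via re-descent */
    long skips_tried;    /* re-descents attempted in hopscotch mode */
    long skips_won;      /* re-descents that skipped more than one page */
} StepStats;

/* Enter hopscotch mode once most of the pages we would otherwise have read
 * turn out to be skippable; 3/4 is the threshold guessed at above. */
bool
enter_hopscotch_mode(const StepStats *s)
{
    long total = s->pages_read + s->pages_skipped;

    return total > 0 && s->pages_skipped * 4 >= total * 3;
}

/* Fall back to plain page-at-a-time traversal when re-descending rarely
 * skips beyond the next page, i.e. the extra descents are not paying off. */
bool
fall_back_to_traversal(const StepStats *s)
{
    return s->skips_tried > 0 && s->skips_won * 2 < s->skips_tried;
}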

====

Some comments on the patch below.

+ skip scan approach, which is based on the idea of
+ <ulink url="https://wiki.postgresql.org/wiki/Free_Space_Map_Problems">
+ Loose index scan</ulink>. Rather than scanning all equal values of a key,
+ as soon as a new value is found, it will search for a larger value on the

I'm not sure it is a good idea to put a pointer to such an unstable
source in the documentation.

This adds a new AM method, but it seems available only for ordered
indexes, specifically btree. And it seems that the feature could be
implemented in btgettuple, since btskip apparently does the same
thing. (I agree with Robert on that point in [1].)

[1] https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com

Related to the above, it seems better that the path generation for
skip scan be part of the index scan. Whether to skip scan or not is a
matter for the index scan itself.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, 9erthalion6(at)gmail(dot)com
Cc: a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru, jtc331(at)gmail(dot)com
Subject: Re: Index Skip Scan
Date: 2019-02-01 19:24:38
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 1/31/19 1:31 AM, Kyotaro HORIGUCHI wrote:
> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w(at)mail(dot)gmail(dot)com>
>> A bit of adjustment after nodes/relation -> nodes/pathnodes.
>
> I had a look on this.
>
> The name "index skip scan" is a different feature from the
> feature with the name on other prodcuts, which means "index scan
> with postfix key (of mainly of multi column key) that scans
> ignoring the prefixing part" As Thomas suggested I'd suggest that
> we call it "index hop scan". (I can accept Hopscotch, either:p)
>
> Also as mentioned upthread by Peter Geoghegan, this could easly
> give worse plan by underestimation. So I also suggest that this
> has dynamic fallback function. In such perspective it is not
> suitable for AM API level feature.
>
> If all leaf pages are on the buffer and the average hopping
> distance is less than (expectedly) 4 pages (the average height of
> the tree), the skip scan will lose. If almost all leaf pages are
> staying on disk, we could win only by 2-pages step (skipping over
> one page).
>
> =====
> As I'm writing the above, I came to think that it's better
> implement this as an pure executor optimization.
>
> Specifically, let _bt_steppage count the ratio of skipped pages
> so far then if the ratio exceeds some threshold (maybe around
> 3/4) it gets into hopscotching mode, where it uses index scan to
> find the next page (rather than traversing). As mentioned above,
> I think using skip scan to go beyond the next page is a good
> bet. If the success ration of skip scans goes below some
> threshold (I'm not sure for now), we should fall back to
> traversing.
>
> Any opinions?
>
> ====
>
> Some comments on the patch below.
>
> + skip scan approach, which is based on the idea of
> + <ulink url="https://wiki.postgresql.org/wiki/Free_Space_Map_Problems">
> + Loose index scan</ulink>. Rather than scanning all equal values of a key,
> + as soon as a new value is found, it will search for a larger value on the
>
> I'm not sure it is a good thing to put a pointer to rather
> unstable source in the documentation.
>
>
> This adds a new AM method but it seems avaiable only for ordered
> indexes, specifically btree. And it seems that the feature can be
> implemented in btgettuple since btskip apparently does the same
> thing. (I agree to Robert in the point in [1]).
>
> [1] https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com
>
>
> Related to the above, it seems better that the path generation of
> skip scan is a part of index scan. Whether skip scan or not is a
> matter of index scan itself.
>

Thanks for your valuable feedback ! And calling it "Loose index scan"
or something else would be better.

Dmitry and I will look at this and take it into account for the next
version.

For now, I have switched the CF entry to WoA.

Thanks again !

Best regards,
Jesper


From: James Coleman <jtc331(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, jesper(dot)pedersen(at)redhat(dot)com, a(dot)kuzmenkov(at)postgrespro(dot)ru, Peter Geoghegan <pg(at)bowt(dot)ie>, thomas(dot)munro(at)enterprisedb(dot)com, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-02-01 21:04:58
Message-ID: CAAaqYe_yfALkVzRggdWwxLLB8C=BFSLGCcJZh=ZvYCc8+r2CqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Also as mentioned upthread by Peter Geoghegan, this could easly
> give worse plan by underestimation. So I also suggest that this
> has dynamic fallback function. In such perspective it is not
> suitable for AM API level feature.
>
> If all leaf pages are on the buffer and the average hopping
> distance is less than (expectedly) 4 pages (the average height of
> the tree), the skip scan will lose. If almost all leaf pages are
> staying on disk, we could win only by 2-pages step (skipping over
> one page).
>
> =====
> As I'm writing the above, I came to think that it's better
> implement this as an pure executor optimization.
>
> Specifically, let _bt_steppage count the ratio of skipped pages
> so far then if the ratio exceeds some threshold (maybe around
> 3/4) it gets into hopscotching mode, where it uses index scan to
> find the next page (rather than traversing). As mentioned above,
> I think using skip scan to go beyond the next page is a good
> bet. If the success ration of skip scans goes below some
> threshold (I'm not sure for now), we should fall back to
> traversing.
>
> Any opinions?

Hi!

I'd like to offer a counterpoint: in cases where this is a huge win we
definitely do want this to affect cost estimation, because if it's
purely an executor optimization the index scan path may not be chosen
even when skip scanning would be a dramatic improvement.

I suppose that both requirements could be met by incorporating it into
the existing index scanning code and also modifying the costing to
(only when we have high confidence?) account for the optimization. I'm
not sure if that makes things better than the current state of the
patch or not.

James Coleman


From: Andres Freund <andres(at)anarazel(dot)de>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, jesper(dot)pedersen(at)redhat(dot)com, a(dot)kuzmenkov(at)postgrespro(dot)ru, Peter Geoghegan <pg(at)bowt(dot)ie>, thomas(dot)munro(at)enterprisedb(dot)com, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-02-01 22:05:03
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2019-02-01 16:04:58 -0500, James Coleman wrote:
> On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > Also, as mentioned upthread by Peter Geoghegan, this could easily
> > give a worse plan by underestimation. So I also suggest that this
> > should have a dynamic fallback function. From that perspective it is
> > not suitable as an AM API level feature.
> >
> > If all leaf pages are in the buffer and the average hopping
> > distance is less than (presumably) 4 pages (the average height of
> > the tree), the skip scan will lose. If almost all leaf pages are
> > staying on disk, we could win with even a 2-page step (skipping
> > over just one page).
> >
> > =====
> > As I was writing the above, I came to think that it's better to
> > implement this as a pure executor optimization.
> >
> > Specifically, let _bt_steppage count the ratio of skipped pages
> > so far; then, if the ratio exceeds some threshold (maybe around
> > 3/4), it gets into hopscotching mode, where it uses an index scan
> > to find the next page (rather than traversing). As mentioned above,
> > I think using a skip scan to go beyond the next page is a good
> > bet. If the success ratio of skip scans goes below some
> > threshold (I'm not sure which for now), we should fall back to
> > traversing.
> >
> > Any opinions?
>
> I'd like to offer a counterpoint: in cases where this is a huge win we
> definitely do want this to affect cost estimation, because if it's
> purely an executor optimization the index scan path may not be chosen
> even when skip scanning would be a dramatic improvement.

+many.

Greetings,

Andres Freund


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-02-20 16:35:08
Message-ID: CA+q6zcXAaMJBr-rtnOhUxzQz9ZYQYK9sSYnVWpjU=dJaU5e_QA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> Dmitry and I will look at this and take it into account for the next
> version.

In the meantime, just so it's not forgotten, I'm going to post another version
with a fix for cursor fetch backwards, which was crashing before. And while
we're on this topic, I wanted to ask for clarification on a few points, since
it looks like I'm missing something:

One of the not-yet-addressed points in this patch is amcanbackward. From the
historical thread mentioned in the first email:

> On 2016-11-25 at 01:33 Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> + if (ScanDirectionIsForward(dir))
> + {
> + so->currPos.moreLeft = false;
> + so->currPos.moreRight = true;
> + }
> + else
> + {
> + so->currPos.moreLeft = true;
> + so->currPos.moreRight = false;
> + }
>
>
>
> The lack of comments makes it hard for me to understand what the
> motivation for this is, but I bet it's wrong. Suppose there's a
> cursor involved here and the user tries to back up. Instead of having
> a separate amskip operation, maybe there should be a flag attached to
> a scan indicating whether it should return only distinct results.
> Otherwise, you're allowing for the possibility that the same scan
> might sometimes skip and other times not skip, but then it seems hard
> for the scan to survive cursor operations. Anyway, why would that be
> useful?

I assume that "sometimes skip and other times not skip" refers to the
situation when we do a fetch forward and jump over something, and then right
away do a fetch backwards, where we don't actually need to skip anything and
can get the result right away, right? If so, I can't find any comments about
why that should be a problem for cursor operations.

Attachment Content-Type Size
v9-0001-Index-skip-scan.patch application/octet-stream 36.7 KB

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-02-28 21:45:15
Message-ID: CAMkU=1x0984uC7PuupnnF6rVeMg6q3OwhU8k906mVZODstaGgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 20, 2019 at 11:33 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
wrote:

> > On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <
> jesper(dot)pedersen(at)redhat(dot)com> wrote:
> >
> > Dmitry and I will look at this and take it into account for the next
> > version.
>
> In the meantime, just to not forget, I'm going to post another version
> with a
> fix for cursor fetch backwards, which was crashing before.

This version of the patch can return the wrong answer.

create index on pgbench_accounts (bid, aid);
begin; declare c cursor for select distinct on (bid) bid, aid from
pgbench_accounts order by bid, aid;
fetch 2 from c;
 bid |   aid
-----+---------
   1 |       1
   2 | 100,001

fetch backward 1 from c;
 bid |   aid
-----+---------
   1 | 100,000

Without the patch, instead of getting a wrong answer, I get an error:

ERROR: cursor can only scan forward
HINT: Declare it with SCROLL option to enable backward scan.

If I add "SCROLL", then I do get the right answer with the patch.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-02-28 22:23:15
Message-ID: CAMkU=1z0TzA8sBEoNvV9YB_W-u4Mf75b7zUiZvuAmoQqVTx1gg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hello.
>
> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
> wrote in <CA+q6zcVP18wYiO=
> aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w(at)mail(dot)gmail(dot)com>
> > A bit of adjustment after nodes/relation -> nodes/pathnodes.
>
> I had a look on this.
>
> The name "index skip scan" is a different feature from the feature
> with that name in other products, which means "an index scan with a
> postfix key (mainly of a multi-column key) that scans ignoring the
> prefixing part". As Thomas suggested, I'd suggest that we call it
> "index hop scan". (I can accept Hopscotch too :p)
>

I think that what we have proposed here is just an incomplete
implementation of what other products call a skip scan, not a fundamentally
different thing. They don't ignore the prefix part, they use that part in
a way to cancel itself out to give the same answer, but faster. I think
they would also use this skip method to get distinct values if that is what
is requested. But they would go beyond that to also use it to do something
similar to the plan we get with this:

Set up:
pgbench -i -s50
create index on pgbench_accounts (bid, aid);
alter table pgbench_accounts drop constraint pgbench_accounts_pkey ;

Query:
explain analyze with t as (select distinct bid from pgbench_accounts )
select pgbench_accounts.* from pgbench_accounts join t using (bid) where
aid=5;

If we accept this patch, I hope it would be expanded in the future to give
similar performance as the above query does even when the query is written
in its more natural way of:

explain analyze select * from pgbench_accounts where aid=5;

(which currently takes 200ms, rather than the 0.9ms taken for the one
benefiting from skip scan)

I don't think we should give it a different name, just because our initial
implementation is incomplete.

Or do you think our implementation of one feature does not really get us
closer to implementing the other?

Cheers,

Jeff


From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-02-28 23:03:06
Message-ID: CA+hUKGKW4dXTP9G+WBskjT09tzD+9aMWEm=Fpeb6RS5SXfPyKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w(at)mail(dot)gmail(dot)com>
>> > A bit of adjustment after nodes/relation -> nodes/pathnodes.
>>
>> I had a look on this.
>>
>> The name "index skip scan" is a different feature from the feature
>> with that name in other products, which means "an index scan with a
>> postfix key (mainly of a multi-column key) that scans ignoring the
>> prefixing part". As Thomas suggested, I'd suggest that we call it
>> "index hop scan". (I can accept Hopscotch too :p)
>
>
> I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this:

Hi Jeff,

"Hop scan" was just a stupid joke that occurred to me when I saw that
DB2 had gone for "jump scan". I think "skip scan" is a perfectly good
name and it's pretty widely used by now (for example, by our friends
over at SQLite to blow us away at these kinds of queries).

Yes, simple distinct value scans are indeed just the easiest kind of
thing to do with this kind of scan-with-fast-forward. As discussed
already in this thread and the earlier one there is a whole family of
tricks you can do, and the thing that most people call an "index skip
scan" is indeed the try-every-prefix case where you can scan an index
on (a, b) given a WHERE clause b = x. Perhaps the simple distinct
scan could appear as "Distinct Index Skip Scan"? And perhaps the
try-every-prefix-scan could appear as just "Index Skip Scan"? Whether
these are the same executor node is a good question; at one point I
proposed a separate nest-loop like node for the try-every-prefix-scan,
but Robert shot that down pretty fast. I now suspect (as he said)
that all of this belongs in the index scan node, as different modes.
The behaviour is overlapping; for "Distinct Index Skip Scan" you skip
to each distinct prefix and emit one tuple, whereas for "Index Skip
Scan" you skip to each distinct prefix and then perform a regular scan
for the prefix + the suffix, emitting matches.
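
To make the distinction concrete, a throwaway sketch (hypothetical table t
with an index on (a, b); the constant 42 just stands in for x):

create table t (a int, b int);
create index on t (a, b);

-- "Distinct Index Skip Scan": skip to each distinct a and emit one row
select distinct a from t;

-- "Index Skip Scan": skip to each distinct a, then scan within that prefix
-- for the suffix condition b = 42, emitting all matches
select * from t where b = 42;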

> (which currently takes 200ms, rather than the 0.9ms taken for the one benefiting from skip scan)

Nice.

> I don't think we should give it a different name, just because our initial implementation is incomplete.

+1

> Or do you think our implementation of one feature does not really get us closer to implementing the other?

My question when lobbing the earlier sketch patches into the mailing
list a few years back was: is this simple index AM interface and
implementation (once debugged) powerful enough for various kinds of
interesting skip-based plans? So far I have the impression that it
does indeed work for Distinct Index Skip Scan (demonstrated), Index
Skip Scan (no one has yet tried that), and special cases of extrema
aggregate queries (foo, MIN(bar) can be performed by skip scan of
index on (foo, bar)), but may not work for the semi-related merge join
trickery mentioned in a paper posted some time back (though I don't
recall exactly why). Another question is whether it should all be
done by the index scan node, and I think the answer is yes.
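
As a sketch of that extrema case: assuming a hypothetical table with an index
on (foo, bar), each group's minimum bar is the first index entry within its
foo prefix, so a query like

select foo, min(bar) from some_table group by foo;

could in principle be answered by skipping from one distinct foo to the next
(some_table is a made-up name, of course).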

--
Thomas Munro
https://enterprisedb.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-02-28 23:10:55
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/1/19 12:03 AM, Thomas Munro wrote:
> On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w(at)mail(dot)gmail(dot)com>
>>>> A bit of adjustment after nodes/relation -> nodes/pathnodes.
>>>
>>> I had a look on this.
>>>
>>> The name "index skip scan" is a different feature from the feature
>>> with that name in other products, which means "an index scan with a
>>> postfix key (mainly of a multi-column key) that scans ignoring the
>>> prefixing part". As Thomas suggested, I'd suggest that we call it
>>> "index hop scan". (I can accept Hopscotch too :p)
>>
>>
>> I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this:
>
> Hi Jeff,
>
> "Hop scan" was just a stupid joke that occurred to me when I saw that
> DB2 had gone for "jump scan". I think "skip scan" is a perfectly good
> name and it's pretty widely used by now (for example, by our friends
> over at SQLite to blow us away at these kinds of queries).
>

+1 to "hop scan"

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-03-05 15:05:42
Message-ID: CA+q6zcUSMQtDVr-isZkm8FP4Mh6APkNyzkPYQPr1ZugeDV2XPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Thu, Feb 28, 2019 at 10:45 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> This version of the patch can return the wrong answer.

Yes, indeed. In fact it answers my previous question related to the backward
cursor scan, where, while going back, we didn't skip enough. Within the current
approach it can be fixed by proper skipping for the backward scan, something
like in the attached patch.

Although there are still some rough edges, e.g. going forth, back and forth
again leads to a situation where `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix it in the next version of
the patch.

> If we accept this patch, I hope it would be expanded in the future to give
> similar performance as the above query does even when the query is written in
> its more natural way of:

Yeah, I hope the current approach with a new index am routine can be extended
for that.

> Without the patch, instead of getting a wrong answer, I get an error:

Right, as far as I can see, without a skip scan and SCROLL a unique + index
scan is used, where amcanbackward is false by default. So it looks like it's
not really patch-related.

Attachment Content-Type Size
v10-0001-Index-skip-scan.patch application/octet-stream 39.1 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-03-14 13:32:49
Message-ID: CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> Although there are still some rough edges, e.g. going forth, back and forth
> again leads to a situation where `_bt_first` is not applied anymore and the
> first element is wrongly skipped. I'll try to fix it in the next version of
> the patch.

It turns out that `_bt_skip` was unnecessarily applied every time the scan was
restarted from the beginning. Here is the fixed version of the patch.

Attachment Content-Type Size
v11-0001-Index-skip-scan.patch application/octet-stream 39.3 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: 9erthalion6(at)gmail(dot)com
Cc: jeff(dot)janes(at)gmail(dot)com, thomas(dot)munro(at)gmail(dot)com, jesper(dot)pedersen(at)redhat(dot)com, a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru, jtc331(at)gmail(dot)com
Subject: Re: Index Skip Scan
Date: 2019-03-15 00:51:57
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg(at)mail(dot)gmail(dot)com>
> > On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> >
> > Although there are still some rough edges, e.g. going forth, back and forth
> > again leads to a situation where `_bt_first` is not applied anymore and the
> > first element is wrongly skipped. I'll try to fix it in the next version of
> > the patch.
>
> It turns out that `_bt_skip` was unnecessarily applied every time the scan was
> restarted from the beginning. Here is the fixed version of the patch.

> nbtsearch.c: In function ‘_bt_skip’:
> nbtsearch.c:1292:11: error: ‘struct IndexScanDescData’ has no member named ‘xs_ctup’; did you mean ‘xs_itup’?
> scan->xs_ctup.t_self = currItem->heapTid;

Unfortunately a recent commit c2fe139c20 hit this.

Date: Mon Mar 11 12:46:41 2019 -0700
> Index scans now store the result of a search in
> IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
> target is not generally a HeapTuple anymore that seems cleaner.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: 9erthalion6(at)gmail(dot)com
Cc: jeff(dot)janes(at)gmail(dot)com, thomas(dot)munro(at)gmail(dot)com, jesper(dot)pedersen(at)redhat(dot)com, a(dot)kuzmenkov(at)postgrespro(dot)ru, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru, jtc331(at)gmail(dot)com
Subject: Re: Index Skip Scan
Date: 2019-03-15 03:54:52
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg(at)mail(dot)gmail(dot)com>
> > On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> >
> > Although there are still some rough edges, e.g. going forth, back and forth
> > again leads to a situation where `_bt_first` is not applied anymore and the
> > first element is wrongly skipped. I'll try to fix it in the next version of
> > the patch.
>
> It turns out that `_bt_skip` was unnecessarily applied every time the scan was
> restarted from the beginning. Here is the fixed version of the patch.

I have some comments on the latest v11 patch.

L619:
> + indexstate->ioss_NumDistinctKeys = node->distinctPrefix;

The number of distinct prefix keys has various names in this
patch. They should be unified as far as possible.

L:728
> + root->distinct_pathkeys > 0)

It is not an integer, but a list.

L730:
> + Path *subpath = (Path *)
> + create_skipscan_unique_path(root,

The name "subpath" here is not a subpath, but it can be removed
by directly calling create_skipscan_unique_path in add_path.

L:758
> +create_skipscan_unique_path(PlannerInfo *root,
> + RelOptInfo *rel,
> + Path *subpath,
> + int numCols,

The "subpath" is not a subpath. How about basepath, or orgpath?
The "numCols" doesn't makes clear sense. unique_prefix_keys?

L764:
> + IndexPath *pathnode = makeNode(IndexPath);
> +
> + Assert(IsA(subpath, IndexPath));
> +
> + /* We don't want to modify subpath, so make a copy. */
> + memcpy(pathnode, subpath, sizeof(IndexPath));

Why don't you just use copyObject()?

L773:
> + Assert(numCols > 0);

Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ?

L586:
> + * Check if we need to skip to the next key prefix, because we've been
> + * asked to implement DISTINCT.
> + */
> + if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
> + {
> + if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
> + {
> + /* Reached end of index. At this point currPos is invalidated,

I thought about this bit for a while. It seems that the lower layer must
know whether it has emitted the first tuple. So I think that this
code can be reduced to the following.

> if (node->ioss_NumDistinctKeys &&
> !index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
> return ExecClearTuple(slot);

Then index_skip returns true without doing anything if the
scandesc is in the initial state. (Of course other index AMs can
do something in the first call.) ioss_FirstTupleEmitted and the
comment can then be removed.

By the way, this patch still seems to be forgetting about the
explicit rescan case, but just doing the above makes that
consideration unnecessary.

L1032:
> + Index Only Scan using tenk1_four on public.tenk1
> + Output: four
> + Scan mode: Skip scan

The "Scan mode" has only one value and it is shown only for
"Index Only Scan" case. It seems to me that "Index Skip Scan"
implies Index Only Scan. How about just "Index Skip Scan"?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-03-16 16:14:20
Message-ID: CA+q6zcXcSftAxfwuK5c186E8ckZeWmOO07GFvpoOx-wpsDGGzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I have some comments on the latest v11 patch.

Thank you!

> L619:
> > + indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
>
> The number of distinct prefix keys has various names in this
> patch. They should be unified as far as possible.

Good point, I've renamed everything to skipPrefixSize; it seems to me that
this name should be self-explanatory enough.

> L:728
> > + root->distinct_pathkeys > 0)
>
> It is not an integer, but a list.

Thanks for noticing, fixed (via a comparison with NIL, since we just need to
know whether this list is empty or not).

> L730:
> > + Path *subpath = (Path *)
> > + create_skipscan_unique_path(root,
>
> The name "subpath" here is not a subpath, but it can be removed
> by directly calling create_skipscan_unique_path in add_path.
>
>
> L:758
> > +create_skipscan_unique_path(PlannerInfo *root,
> > + RelOptInfo *rel,
> > + Path *subpath,
> > + int numCols,
>
> The "subpath" is not a subpath. How about basepath, or orgpath?
> The "numCols" doesn't makes clear sense. unique_prefix_keys?

I agree, suggested names sound good.

> L764:
> > + IndexPath *pathnode = makeNode(IndexPath);
> > +
> > + Assert(IsA(subpath, IndexPath));
> > +
> > + /* We don't want to modify subpath, so make a copy. */
> > + memcpy(pathnode, subpath, sizeof(IndexPath));
>
> Why don't you just use copyObject()?

Maybe I'm missing something, but I don't see that copyObject works with path
nodes, does it? I've tried it with subpath directly and got `unrecognized node
type`.

> L773:
> > + Assert(numCols > 0);
>
> Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ?

Yeah, makes sense.

> L586:
> > + * Check if we need to skip to the next key prefix, because we've been
> > + * asked to implement DISTINCT.
> > + */
> > + if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
> > + {
> > + if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
> > + {
> > + /* Reached end of index. At this point currPos is invalidated,
>
> I thought about this bit for a while. It seems that the lower layer must
> know whether it has emitted the first tuple. So I think that this
> code can be reduced to the following.
>
> > if (node->ioss_NumDistinctKeys &&
> > !index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
> > return ExecClearTuple(slot);
>
> Then index_skip returns true without doing anything if the
> scandesc is in the initial state. (Of course other index AMs can
> do something in the first call.) ioss_FirstTupleEmitted and the
> comment can then be removed.

I'm not sure then how to figure out that scandesc is in the initial state from
inside index_skip without passing the node as an argument, e.g. in the case
described in the comment, when we do fetch forward / fetch backward / fetch
forward again.

> L1032:
> > + Index Only Scan using tenk1_four on public.tenk1
> > + Output: four
> > + Scan mode: Skip scan
>
> The "Scan mode" has only one value and it is shown only for
> "Index Only Scan" case. It seems to me that "Index Skip Scan"
> implies Index Only Scan. How about just "Index Skip Scan"?

Do you mean showing "Index Only Scan", and then "Index Skip Scan" in the
details instead of "Scan mode", right?

Attachment Content-Type Size
v12-0001-Index-skip-scan.patch application/octet-stream 39.5 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-03-19 13:07:32
Message-ID: CA+q6zcVn10QzbSh686PaZpxgxMhxABRj7aJJpsJ=tKkmcSrBew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > I have some comments on the latest v11 patch.
>
> Thank you!

In the meantime here is a new version, rebased after tableam changes.

Attachment Content-Type Size
v13-0001-Index-skip-scan.patch application/octet-stream 39.5 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-03-28 10:01:24
Message-ID: CA+q6zcUVAnfSOaSk3s9Sbj6yC7LrDYChSyJd0t7u6Q=97k5a=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Tue, Mar 19, 2019 at 2:07 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> >
> > > On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > I have some comments on the latest v11 patch.
> >
> > Thank you!
>
> In the meantime here is a new version, rebased after tableam changes.

Rebased after the refactoring of nbtree insertion scankeys. But so far it's
purely mechanical, just to make it work - I guess I'll need to rewrite some
parts of the patch that don't look natural now. And maybe leverage dynamic
prefix truncation, per Peter's suggestion.

Attachment Content-Type Size
v14-0001-Index-skip-scan.patch application/octet-stream 39.7 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-05-11 16:35:41
Message-ID: CA+q6zcVQG0ufA_5m_7eeHLN2VRy+z8Kf9FoZ2NnF=yd0_43ndg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Thu, Mar 28, 2019 at 11:01 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> Rebased after the refactoring of nbtree insertion scankeys. But so far it's
> purely mechanical, just to make it work - I guess I'll need to rewrite some
> parts of the patch that don't look natural now.

Here is the updated version with the changes I was talking about (mostly
readability and code cleanup). I've also added a few tests for cursor behaviour.

Attachment Content-Type Size
v15-0001-Index-skip-scan.patch application/octet-stream 40.3 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-05-29 15:50:51
Message-ID: CA+q6zcXxwU1FZ=Tj0kWeTkrJ1N9=fhrUH1t5N_yHEcKsG=+qUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, May 11, 2019 at 6:35 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> Here is the updated version with the changes I was talking about (mostly
> readability and code cleanup). I've also added a few tests for cursor behaviour.

And one more cosmetic rebase after pg_indent.

Attachment Content-Type Size
v16-0001-Index-skip-scan.patch application/octet-stream 40.4 KB

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-01 04:01:38
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

After some talks with Jesper at PGCon about the Index Skip Scan, I started testing this patch, because it seems to have great potential in speeding up many of our queries (great conference by the way, really enjoyed my first time being there!). I haven't looked in depth at the code itself, but I focused on some testing with real data that we have.

Let me start by sketching our primary use case for this, as it is similar to, but slightly different from, what was discussed earlier in this thread. I think this use case is something a lot of people who handle timeseries data have. Our database has many tables with timeseries data. We don't update rows, but just insert new rows each time. One example of this would be a table with prices for instruments. Instruments are identified by a column called feedcode. Prices of an instrument update with a certain frequency. Each time a price updates we insert a new row with the new value and the timestamp at that time. So in the simplest form, you could see it as a table like this:

create table prices (feedcode text, updated_at timestamptz, value float8); -- there'll be some other columns as well, this is just an example

create unique index on prices (feedcode, updated_at desc);

This table perfectly fits the criteria for the index skip scan as there are relatively few distinct feedcodes, but each of them has quite a lot of price inserts (imagine 1000 distinct feedcodes, each of them having one price per second). We normally partition our data by a certain time interval, so let's say we're just looking at one day of prices here. We have other cases with higher update frequencies and/or more distinct values though.
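
As a rough sketch, a scaled-down test set along those lines (1000 feedcodes, one price per second, limited to a single hour here so it stays manageable) could be generated like this; the names and numbers are only illustrative:

insert into prices
select 'FC' || fc,                                    -- synthetic feedcode
       timestamptz '2019-06-01' + s * interval '1 second',
       random() * 100                                 -- arbitrary price
from generate_series(1, 1000) fc,
     generate_series(0, 3599) s;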

Typical queries on this table involve querying the price at a certain point in time, or simply querying the latest update. If we know the feedcode, this is easy:

select * from prices where feedcode='A' and updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc limit 1

Unfortunately, this gets hard if you want to know the price of everything at a certain point in time. The query then becomes:

select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc

Up until now (even with this patch) this uses a regular index scan + a unique node which scans the full index, which is terribly slow and is also not constant - as the table grows it becomes slower and slower.

Obviously there are currently already ways to speed this up using the recursive loose index scan, but I think everybody agrees that those are pretty unreadable. However, since they're also several orders of magnitude faster, we actually use them everywhere. Eg.

-- certain point in time

-- first query *all* distinct feedcodes (disregarding time), then do an index scan for every feedcode found to see if it has an update in the time window we're interested in

-- this essentially means we're doing 2*N index scans

with recursive t as (
select feedcode from prices order by feedcode, updated_at desc limit 1
union all
select n.feedcode from t
cross join lateral (select feedcode from prices where feedcode > t.feedcode order by feedcode, updated_at desc limit 1) n
) select n.* from t
cross join lateral (select * from prices where feedcode=t.feedcode and updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc limit 1) n

-- just latest value
-- if we're interested in just the latest value, it can actually be optimized to just N index scans like this.
-- to make it even more confusing - there's a tradeoff here: if you're querying a timestamp close to the latest available timestamp, it is often faster to use this method anyway and just put the filter for updated_at inside this query. This avoids the overhead of 2*N index scans, at the expense that the LIMIT 1 may have to scan several tuples before finding one that matches the timestamp criteria. With the 2*N method above we're guaranteed that the first tuple it sees is the correct tuple, but we're doing many more scans...
with recursive t as (
select * from prices order by feedcode, updated_at desc limit 1
union all
select n.* from t
cross join lateral (select * from prices where feedcode > t.feedcode order by feedcode, updated_at desc limit 1) _
) select * from t

I hope this makes our current situation clear. Please do ask if I need to elaborate on something here.

So what changes with this patch? The great thing is that the recursive CTE is not required anymore! This is a big win for readability and it helps performance as well. It makes everything much better. I am really happy with these results. If the index skip scan triggers, it is easily over 100x faster than the naive 'distinct on' query in earlier versions of Postgres. It is also quite a bit faster than the recursive CTE version of the query.

I have a few remarks though. I tested some of our queries with the patch and found that the following query would (with patch) work best now for arbitrary timestamps:

-- first query all distinct values using the index skip scan, then do an index scan for each of these

select r.* from (
select distinct feedcode from prices
) k
cross join lateral (
select *
from prices
where feedcode=k.feedcode and updated_at <= '2019-06-01 12:00'
order by feedcode, updated_at desc
limit 1
) r

While certainly a big improvement over the recursive CTE, it would be nice if the even simpler form with the 'distinct on' would work out of the box using an index skip scan.

select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc

As far as I can see, there are two main problems with that at the moment.

1) Only support for Index-Only scan at the moment, not for regular index scans. This was already mentioned upthread and I can understand that it was left out until now to constrain the scope of this. However, if we were to support 'distinct on' + selecting columns that are not part of the index we need a regular index scan instead of the index only scan.

2) The complicating part is that we're interested in the value 'at a specific point in time'. This comparison with updated_at messes up all efficiency, as the index scan first looks at the *latest* updated_at for a certain feedcode and then walks the tree until it finds a tuple that matches the updated_at criteria (which possibly never happens, in which case it will happily walk over the full index). I'm actually unsure if there is anything we can do about this (without adding a lot of complexity), aside from just rewriting the query itself in the way I did where we're doing the skip scan for all distinct items followed by a second round of index scans for the specific point in time. I'm struggling a bit to explain this part clearly - I hope it's clear though. Please let me know if I should elaborate. Perhaps it's easiest to see by the difference in speed between the following two queries:

select distinct feedcode from prices -- approx 10ms

select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms

Both use the index skip scan, but the first one is very fast, because it can skip large parts of the index. The second one scans the full index, because it never finds any row that matches the where condition so it can never skip anything.

Thanks for this great patch. It is already very useful and fills a gap that has existed for a long time. It is going to make our queries so much more readable and performant now that we won't have to resort to recursive CTEs anymore!

-Floris


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-01 04:10:23
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?

Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data.

The following query uses an index skip scan and returns just 1 row (incorrect!)

select distinct on (market, feedcode) market, feedcode
from streams.base_price
where market='TEST'

The following query still uses the regular index scan and returns many more rows (correct)
select distinct on (market, feedcode) *
from streams.base_price
where market='TEST'

It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.

-Floris


From: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-01 10:03:30
Message-ID: CA+FpmFeKh9BAhDTAf-ZWQ=pvVaRy+Lnahb_8=OOFqcVXJDBWWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 1 Jun 2019 at 06:10, Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?
>
I am willing to give it a try.
> Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data.
>
>
> The following query uses an index skip scan and returns just 1 row (incorrect!)
>
> select distinct on (market, feedcode) market, feedcode
> from streams.base_price
> where market='TEST'
>
> The following query still uses the regular index scan and returns many more rows (correct)
> select distinct on (market, feedcode) *
> from streams.base_price
> where market='TEST'
>
Aren't those two queries different?
select distinct on (market, feedcode) market, feedcode
vs
select distinct on (market, feedcode) *
Anyhow, it's just a difference in projection so it doesn't matter much.
I verified this scenario at my end and you are right, there is a bug.
Here is my repeatable test case,

create table t (market text, feedcode text, updated_at timestamptz, value float8);
create index on t (market, feedcode, updated_at desc);
insert into t values ('TEST', 'abcdef',
    (select timestamp '2019-01-10 20:00:00'
            + random() * (timestamp '2014-01-20 20:00:00' - timestamp '2019-01-20 20:00:00')),
    generate_series(1,100)*9.88);
insert into t values ('TEST', 'jsgfhdfjd',
    (select timestamp '2019-01-10 20:00:00'
            + random() * (timestamp '2014-01-20 20:00:00' - timestamp '2019-01-20 20:00:00')),
    generate_series(1,100)*9.88);

Now, without the patch,
select distinct on (market, feedcode) market, feedcode from t where
market='TEST';
 market | feedcode
--------+-----------
 TEST   | abcdef
 TEST   | jsgfhdfjd
(2 rows)
explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN
----------------------------------------------------------------
Unique (cost=12.20..13.21 rows=2 width=13)
-> Sort (cost=12.20..12.70 rows=201 width=13)
Sort Key: feedcode
-> Seq Scan on t (cost=0.00..4.51 rows=201 width=13)
Filter: (market = 'TEST'::text)
(5 rows)

And with the patch,
select distinct on (market, feedcode) market, feedcode from t where
market='TEST';
 market | feedcode
--------+----------
 TEST   | abcdef
(1 row)

explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN
------------------------------------------------------------------------------------------------
Index Only Scan using t_market_feedcode_updated_at_idx on t
(cost=0.14..0.29 rows=2 width=13)
Scan mode: Skip scan
Index Cond: (market = 'TEST'::text)
(3 rows)

Notice that in the explain output it shows the correct number of rows
to be skipped.

--
Regards,
Rafia Sabih


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-01 10:28:28
Message-ID: CA+q6zcU=8OEKw8bc2DHsf7xSFqaT7gOjc-_nbX8FO1N4eizQHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, Jun 1, 2019 at 6:10 AM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> After some talks with Jesper at PGCon about the Index Skip Scan, I started
> testing this patch, because it seems to have great potential in speeding up
> many of our queries (great conference by the way, really enjoyed my first
> time being there!). I haven't looked in depth to the code itself, but I
> focused on some testing with real data that we have.

Thanks!

> Actually I'd like to add something to this. I think I've found a bug in the
> current implementation. Would someone be able to check?
>
> The following query uses an index skip scan and returns just 1 row (incorrect!)
>
> select distinct on (market, feedcode) market, feedcode
> from streams.base_price
> where market='TEST'
>
> The following query still uses the regular index scan and returns many more
> rows (correct)
> select distinct on (market, feedcode) *
> from streams.base_price
> where market='TEST'

Yes, good catch, I'll investigate. It looks like something is not quite right
in the current implementation when we have this order of columns in the index
and WHERE clause (e.g. in the examples above everything seems fine if we create
the index over (feedcode, market) and not the other way around).

> As far as I can see, there are two main problems with that at the moment.
>
> 1) Only support for Index-Only scan at the moment, not for regular index
> scans. This was already mentioned upthread and I can understand that it
> was left out until now to constrain the scope of this. However, if we were
> to support 'distinct on' + selecting columns that are not part of the
> index we need a regular index scan instead of the index only scan.

Sure, it's something I hope we can tackle as the next step.

> select distinct feedcode from prices -- approx 10ms
>
> select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms
>
> Both use the index skip scan, but the first one is very fast, because it can
> skip large parts of the index. The second one scans the full index, because
> it never finds any row that matches the where condition so it can never skip
> anything.

Interesting, I'll take a closer look.


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Jesper Pedersen" <jesper(dot)pedersen(at)redhat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Tomas Vondra" <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-01 15:33:57
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Thanks for the helpful replies.

> Yes, good catch, I'll investigate. It looks like something is not quite right
> in the current implementation when we have this order of columns in the index
> and WHERE clause (e.g. in the examples above everything seems fine if we create
> the index over (feedcode, market) and not the other way around).

I did a little bit of investigation and it seems to occur because in pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if there is an equality condition with a constant in the corresponding WHERE clause.

* 1. If the new pathkey's equivalence class contains a constant, and isn't
* below an outer join, then we can disregard it as a sort key. An example:
* SELECT ... WHERE x = 42 ORDER BY x, y;

In planner.c it builds the list of distinct_pathkeys, which is then used by the index skip scan to skip over the first length(distinct_pathkeys) columns when it does a skip. In my query, the distinct_pathkeys list only contains 'feedcode' and not 'market', because 'market' was considered redundant due to the WHERE clause. However, the index skip scan interprets this as meaning it has to skip over just the first column.
We need to derive the number of prefix columns to skip differently while building the plan. We need the 'real' number of distinct keys without throwing away the redundant ones. However, I'm not sure if this information can still be obtained when calling create_skipscan_unique_path? But I'm sure people here will have much better ideas than me about this :-)
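
A minimal sketch of the mismatch, reusing the table and index from my earlier mail (streams.base_price with an index on (market, feedcode, updated_at desc)):

-- 'market' is pinned to a constant, so it is dropped from distinct_pathkeys
-- and only 'feedcode' remains ...
select distinct on (market, feedcode) market, feedcode
from streams.base_price
where market = 'TEST';

-- ... but the skip scan needs to use the first *two* index columns
-- (market, feedcode) as its skip prefix; with a prefix of just one column
-- it jumps to the next distinct market and misses the other feedcodes.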

-Floris


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-01 16:57:31
Message-ID: CA+q6zcVJXRnsLXDK21knNnUpebWuTA-rpj5vHToA9Tyw5N+qaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> I did a little bit of investigation and it seems to occur because in
> pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if
> there is an equality condition with a constant in the corresponding WHERE
> clause.
> ...
> However, the index skip scan interprets this as meaning it has to skip over
> just the first column.

Right, passing the correct number of columns fixes this particular problem. But
while debugging I've also discovered another related issue, where the current
implementation seems to make a few assumptions that are not correct if we have
an index condition and a distinct column that is not the first in the index.
I'll try to address these in the next version of the patch in the near future.


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Floris Van Nee <florisvannee(at)Optiver(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-03 18:13:56
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Floris,

On 6/1/19 12:10 AM, Floris Van Nee wrote:
> Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data.
>
>
> The following query uses an index skip scan and returns just 1 row (incorrect!)
>
> select distinct on (market, feedcode) market, feedcode
> from streams.base_price
> where market='TEST'
>
> The following query still uses the regular index scan and returns many more rows (correct)
> select distinct on (market, feedcode) *
> from streams.base_price
> where market='TEST'
>
>
> It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.
>
>

Thanks for taking a look at the patch, and your feedback on it.

I'll def look into this once I'm back from my travels.

Best regards,
Jesper


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-03 18:16:45
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Rafia,

On 6/1/19 6:03 AM, Rafia Sabih wrote:
> Here is my repeatable test case,
>
> create table t (market text, feedcode text, updated_at timestamptz,
> value float8) ;
> create index on t (market, feedcode, updated_at desc);
> insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10
> 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp
> '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
> insert into t values('TEST', 'jsgfhdfjd', (select timestamp
> '2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' -
> timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
>
> Now, without the patch,
> select distinct on (market, feedcode) market, feedcode from t where
> market='TEST';
> market | feedcode
> --------+-----------
> TEST | abcdef
> TEST | jsgfhdfjd
> (2 rows)
> explain select distinct on (market, feedcode) market, feedcode from t
> where market='TEST';
> QUERY PLAN
> ----------------------------------------------------------------
> Unique (cost=12.20..13.21 rows=2 width=13)
> -> Sort (cost=12.20..12.70 rows=201 width=13)
> Sort Key: feedcode
> -> Seq Scan on t (cost=0.00..4.51 rows=201 width=13)
> Filter: (market = 'TEST'::text)
> (5 rows)
>
> And with the patch,
> select distinct on (market, feedcode) market, feedcode from t where
> market='TEST';
> market | feedcode
> --------+----------
> TEST | abcdef
> (1 row)
>
> explain select distinct on (market, feedcode) market, feedcode from t
> where market='TEST';
> QUERY PLAN
> ------------------------------------------------------------------------------------------------
> Index Only Scan using t_market_feedcode_updated_at_idx on t
> (cost=0.14..0.29 rows=2 width=13)
> Scan mode: Skip scan
> Index Cond: (market = 'TEST'::text)
> (3 rows)
>
> Notice that in the explain statement it shows correct number of rows
> to be skipped.
>

Thanks for your test case; this is very helpful.

For now, I would like to highlight that

SET enable_indexskipscan = OFF

can be used for testing with the patch applied.

Dmitry and I will look at the feedback provided.

Best regards,
Jesper


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-03 20:31:33
Message-ID: CA+q6zcVnRRqktw4cg2Vddy3S+VY3Tsmit0kausYtctvsTum7tA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, Jun 1, 2019 at 6:57 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> >
> > I did a little bit of investigation and it seems to occur because in
> > pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if
> > there is an equality condition with a constant in the corresponding WHERE
> > clause.
> > ...
> > However, the index skip scan interprets this as that it has to skip over just
> > the first column.
>
> Right, passing the correct number of columns fixes this particular problem. But
> while debugging I've also discovered another related issue: the current
> implementation makes a few assumptions that are not correct if we have an index
> condition and the distinct column is not the first one in the index. I'll try
> to address these in the next version of the patch in the near future.

So, as mentioned above, there were a few problems, namely the number of
distinct_pathkeys with and without redundancy, and using _bt_search when the
order of distinct columns doesn't match the index. As far as I can see the
problem in the latter case (when we have an index condition) is that it's still
possible to find a value, but lastItem value after the search is always zero
(due to _bt_checkkeys filtering) and _bt_next stops right away.

To address this, we can probably do something like in the attached patch.
Alongside distinct_pathkeys, a uniq_distinct_pathkeys list is stored, which is
the same but without the constant elimination. It is then used to get the real
number of distinct keys, and to check the order of the columns so that an index
skip scan is not considered if the order differs. Hope it doesn't look too
hacky.

Also I've noticed, that the current implementation wouldn't work e.g. for:

select distinct a, a from table;

because in this case the IndexPath is hidden behind a ProjectionPath. For now I
guess it's fine, but it's probably possible to apply a skip scan here too.
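If we wanted to handle that case, presumably it would be enough to peek through
the projection when looking for a suitable path, something as simple as this
sketch (I haven't checked whether that's actually safe to do here):

    if (IsA(path, ProjectionPath))
        path = ((ProjectionPath *) path)->subpath;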

Attachment Content-Type Size
v17-0001-Index-skip-scan.patch application/octet-stream 50.8 KB

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Alexander Kuzmenkov" <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Tomas Vondra" <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-05 19:39:29
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> To address this, we can probably do something like in the attached patch.
> Alongside distinct_pathkeys, a uniq_distinct_pathkeys list is stored, which is
> the same but without the constant elimination. It is then used to get the real
> number of distinct keys, and to check the order of the columns so that an index
> skip scan is not considered if the order differs. Hope it doesn't look too
> hacky.
>

Thanks! I've verified that it works now.
I was wondering if we're not too strict in some cases now though. Consider the following queries:

postgres=# explain(analyze) select distinct on (m,f) m,f from t where m='M2';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_m_f_t_idx on t (cost=0.29..11.60 rows=40 width=5) (actual time=0.056..0.469 rows=10 loops=1)
Scan mode: Skip scan
Index Cond: (m = 'M2'::text)
Heap Fetches: 10
Planning Time: 0.095 ms
Execution Time: 0.490 ms
(6 rows)

postgres=# explain(analyze) select distinct on (f) m,f from t where m='M2';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.29..849.83 rows=10 width=5) (actual time=0.088..10.920 rows=10 loops=1)
-> Index Only Scan using t_m_f_t_idx on t (cost=0.29..824.70 rows=10052 width=5) (actual time=0.087..8.524 rows=10000 loops=1)
Index Cond: (m = 'M2'::text)
Heap Fetches: 10000
Planning Time: 0.078 ms
Execution Time: 10.944 ms
(6 rows)

This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency.

-Floris


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Floris Van Nee <florisvannee(at)Optiver(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-13 16:31:54
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 6/5/19 3:39 PM, Floris Van Nee wrote:
> Thanks! I've verified that it works now.

Here is a rebased version.

> I was wondering if we're not too strict in some cases now though. Consider the following queries:

[snip]

> This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency.
>

I think it would be good to get more feedback on the patch in general
before looking at further optimizations. We should of course fix any
bugs that show up.

Thanks for your testing and feedback !

Best regards,
Jesper

Attachment Content-Type Size
v18-0001-Index-skip-scan.patch text/x-patch 51.1 KB

From: James Coleman <jtc331(at)gmail(dot)com>
To: jesper(dot)pedersen(at)redhat(dot)com
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-14 03:40:54
Message-ID: CAAaqYe-kZwjA3bWWOVC_GRyo6yzrrK=YoNM36n2un4oh+TYBpA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've previously noted upthread (along with several others), that I
don't see a good reason to limit this new capability to index only
scans. In addition to the reasons upthread, this also prevents using
the new feature on physical replicas since index only scans require
visibility map (IIRC) information that isn't safe to make assumptions
about on a replica.

That being said, it strikes me that this likely indicates an existing
architecture issue. I was discussing the problem at PGCon with Andres
and Heikki with respect to an index scan variation I've been working on
myself. In short, it's not clear to me why we want index only scans
and index scans to be entirely separate nodes, rather than optional
variations within a broader index scan node. The problem becomes even
more clear as we continue to add additional variants that lie on
different axes, since we end up with an ever-multiplying number of
combinations.

In that discussion no one could remember why it'd been done that way,
but I'm planning to try to find the relevant threads in the archives
to see if there's anything in particular blocking combining them.

I generally dislike gating improvements like this on seemingly
tangentially related refactors, but I will make the observation that
adding the skip scan on top of such a refactored index scan node would
make this a much more obvious and complete win.

As I noted to Jesper at PGCon I'm happy to review the code in detail
also, but likely won't get to it until later this week or next week at
the earliest.

Jesper: Is there anything still on your list of things to change about
the patch? Or would now be a good time to look hard at the code?

James Coleman


From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-14 07:19:49
Message-ID: CAKJS1f86FgODuUnHiQ25RKeuES4qTqeNxm1QbqJWrBoZxVGLiQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 14 Jun 2019 at 04:32, Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:
> Here is a rebased version.

Hi Jesper,

I read over this thread a few weeks ago while travelling back from
PGCon. (I wish I'd read it on the outward trip instead since it would
have been good to talk about it in person.)

First off. I think this is a pretty great feature. It certainly seems
worthwhile working on it.

I've looked over the patch just to get a feel for how the planner part
works and I have a few ideas to share.

The code in create_distinct_paths() I think should work a different
way. I think it would be much better to add a new field to Path and
allow a path to know what keys it is distinct for. This sort of goes
back to an idea I thought about when developing unique joins
(9c7f5229ad) about an easier way to detect fields that a relation is
unique for. I've been calling these "UniqueKeys" in a few emails [1].
The idea was to tag these onto RelOptInfo to mention which columns or
exprs a relation is unique by so that we didn't continuously need to
look at unique indexes in all the places that call
relation_has_unique_index_for(). The idea there was that unique joins
would know when a join was unable to duplicate rows. If the outer side
of a join didn't duplicate the inner side, then the join RelOptInfo
could keep the UniqueKeys from the inner rel, and vice-versa. If both
didn't duplicate then the join rel would obtain the UniqueKeys from
both sides of the join. The idea here is that this would be a better
way to detect unique joins, and also when it came to the grouping
planner we'd also know if the distinct or group by should be a no-op.
DISTINCT could be skipped, and GROUP BY could do a group aggregate
without any sort.

I think these UniqueKeys ties into this work, perhaps not adding
UniqueKeys to RelOptInfo, but just to Path so that we create paths
that have UniqueKeys during create_index_paths() based on some
uniquekeys that are stored in PlannerInfo, similar to how we create
index paths in build_index_paths() by checking if the index
has_useful_pathkeys(). Doing it this way would open up more
opportunities to use skip scans. For example, semi-joins and
anti-joins could make use of them if the uniquekeys covered the entire
join condition. With this idea, the code you've added in
create_distinct_paths() can just search for the cheapest path that has
the correct uniquekeys, or if none exist then just do the normal
sort->unique or hash agg. I'm not entirely certain how we'd instruct
a semi/anti joined relation to build such paths, but that seems like a
problem that could be dealt with when someone does the work to allow
skip scans to be used for those.

Also, I'm not entirely sure that these UniqueKeys should make use of
PathKey since there's no real need to know about pk_opfamily,
pk_strategy, pk_nulls_first as those all just describe how the keys
are ordered. We just need to know if they're distinct or not. All
that's left after removing those fields is pk_eclass, so could
UniqueKeys just be a list of EquivalenceClass? or perhaps even a
Bitmapset with indexes into PlannerInfo->ec_classes (that might be
premature for now since we've not yet got
https://commitfest.postgresql.org/23/1984/ or
https://commitfest.postgresql.org/23/2019/ ) However, if we did use
PathKey, that does allow us to quickly check if the UniqueKeys are
contained within the PathKeys, since pathkeys are canonical which
allows us just to compare their memory address to know if two are
equal, however, if you're storing eclasses we could probably get the
same just by comparing the address of the eclass to the pathkey's
pk_eclass.
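To make that a bit more concrete, what I have in mind is roughly the following
(purely a sketch; the field names are only illustrative, and the exact
representation is what's being discussed above):

    /* in struct Path */
    List       *uniquekeys;     /* keys (e.g. EquivalenceClasses) this path is
                                 * known to be distinct over, or NIL if unknown */

    /* in struct PlannerInfo, analogous to query_pathkeys */
    List       *query_uniquekeys;   /* uniquekeys the query could make use of */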

Otherwise, I think how you're making use of paths in
create_distinct_paths() and create_skipscan_unique_path() kind of
contradicts how they're meant to be used.

I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:

# create table abc (a int, b int, c int);
CREATE TABLE
# insert into abc select a,b,1 from generate_Series(1,1000) a,
generate_Series(1,1000) b;
INSERT 0 1000000
# create index on abc(a,b);
CREATE INDEX
# explain analyze select distinct on (a) a,b from abc order by a,b; --
this is fast.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using abc_a_b_idx on abc (cost=0.42..85.00 rows=200
width=8) (actual time=0.260..20.518 rows=1000 loops=1)
Scan mode: Skip scan
Heap Fetches: 1000
Planning Time: 5.616 ms
Execution Time: 21.791 ms
(5 rows)

# explain analyze select distinct on (a) a,b,c from abc order by a,b;
-- Add one more column and it's slow.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.42..50104.43 rows=200 width=12) (actual
time=1.201..555.280 rows=1000 loops=1)
-> Index Scan using abc_a_b_idx on abc (cost=0.42..47604.43
rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000
loops=1)
Planning Time: 0.102 ms
Execution Time: 555.407 ms
(4 rows)

[1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-14 17:19:31
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi James,

On 6/13/19 11:40 PM, James Coleman wrote:
> I've previously noted upthread (along with several others), that I
> don't see a good reason to limit this new capability to index only
> scans. In addition to the reasons upthread, this also prevents using
> the new feature on physical replicas since index only scans require
> visibility map (IIRC) information that isn't safe to make assumptions
> about on a replica.
>
> That being said, it strikes me that this likely indicates an existing
> architecture issue. I was discussing the problem at PGCon with Andres
> and Heikki with respect to an index scan variation I've been working on
> myself. In short, it's not clear to me why we want index only scans
> and index scans to be entirely separate nodes, rather than optional
> variations within a broader index scan node. The problem becomes even
> more clear as we continue to add additional variants that lie on
> different axes, since we end up with an ever-multiplying number of
> combinations.
>
> In that discussion no one could remember why it'd been done that way,
> but I'm planning to try to find the relevant threads in the archives
> to see if there's anything in particular blocking combining them.
>
> I generally dislike gating improvements like this on seemingly
> tangentially related refactors, but I will make the observation that
> adding the skip scan on top of such a refactored index scan node would
> make this a much more obvious and complete win.
>

Thanks for your feedback !

> As I noted to Jesper at PGCon I'm happy to review the code in detail
> also, but likely won't get to it until later this week or next week at
> the earliest.
>
> Jesper: Is there anything still on your list of things to change about
> the patch? Or would now be a good time to look hard at the code?
>

It would be valuable to have test cases for your use-cases which work
now, or should work.

I revived Thomas' patch because it covered our use-cases and I saw it as a
much-needed feature.

Thanks again !

Best regards,
Jesper


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-14 17:24:20
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi David,

On 6/14/19 3:19 AM, David Rowley wrote:
> I read over this thread a few weeks ago while travelling back from
> PGCon. (I wish I'd read it on the outward trip instead since it would
> have been good to talk about it in person.)
>
> First off. I think this is a pretty great feature. It certainly seems
> worthwhile working on it.
>
> I've looked over the patch just to get a feel for how the planner part
> works and I have a few ideas to share.
>
> The code in create_distinct_paths() I think should work a different
> way. I think it would be much better to add a new field to Path and
> allow a path to know what keys it is distinct for. This sort of goes
> back to an idea I thought about when developing unique joins
> (9c7f5229ad) about an easier way to detect fields that a relation is
> unique for. I've been calling these "UniqueKeys" in a few emails [1].
> The idea was to tag these onto RelOptInfo to mention which columns or
> exprs a relation is unique by so that we didn't continuously need to
> look at unique indexes in all the places that call
> relation_has_unique_index_for(). The idea there was that unique joins
> would know when a join was unable to duplicate rows. If the outer side
> of a join didn't duplicate the inner side, then the join RelOptInfo
> could keep the UniqueKeys from the inner rel, and vice-versa. If both
> didn't duplicate then the join rel would obtain the UniqueKeys from
> both sides of the join. The idea here is that this would be a better
> way to detect unique joins, and also when it came to the grouping
> planner we'd also know if the distinct or group by should be a no-op.
> DISTINCT could be skipped, and GROUP BY could do a group aggregate
> without any sort.
>
> I think these UniqueKeys ties into this work, perhaps not adding
> UniqueKeys to RelOptInfo, but just to Path so that we create paths
> that have UniqueKeys during create_index_paths() based on some
> uniquekeys that are stored in PlannerInfo, similar to how we create
> index paths in build_index_paths() by checking if the index
> has_useful_pathkeys(). Doing it this way would open up more
> opportunities to use skip scans. For example, semi-joins and
> anti-joins could make use of them if the uniquekeys covered the entire
> join condition. With this idea, the code you've added in
> create_distinct_paths() can just search for the cheapest path that has
> the correct uniquekeys, or if none exist then just do the normal
> sort->unique or hash agg. I'm not entirely certain how we'd instruct
> a semi/anti joined relation to build such paths, but that seems like a
> problem that could be dealt with when someone does the work to allow
> skip scans to be used for those.
>
> Also, I'm not entirely sure that these UniqueKeys should make use of
> PathKey since there's no real need to know about pk_opfamily,
> pk_strategy, pk_nulls_first as those all just describe how the keys
> are ordered. We just need to know if they're distinct or not. All
> that's left after removing those fields is pk_eclass, so could
> UniqueKeys just be a list of EquivalenceClass? or perhaps even a
> Bitmapset with indexes into PlannerInfo->ec_classes (that might be
> premature for now since we've not yet got
> https://commitfest.postgresql.org/23/1984/ or
> https://commitfest.postgresql.org/23/2019/ ) However, if we did use
> PathKey, that does allow us to quickly check if the UniqueKeys are
> contained within the PathKeys, since pathkeys are canonical which
> allows us just to compare their memory address to know if two are
> equal, however, if you're storing eclasses we could probably get the
> same just by comparing the address of the eclass to the pathkey's
> pk_eclass.
>
> Otherwise, I think how you're making use of paths in
> create_distinct_paths() and create_skipscan_unique_path() kind of
> contradicts how they're meant to be used.
>

Thank you very much for this feedback ! Will need to revise the patch
based on this.

> I also agree with James that this should not be limited to Index Only
> Scans. From testing the patch, the following seems pretty strange to
> me:
>
> # create table abc (a int, b int, c int);
> CREATE TABLE
> # insert into abc select a,b,1 from generate_Series(1,1000) a,
> generate_Series(1,1000) b;
> INSERT 0 1000000
> # create index on abc(a,b);
> CREATE INDEX
> # explain analyze select distinct on (a) a,b from abc order by a,b; --
> this is fast.
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
> Index Only Scan using abc_a_b_idx on abc (cost=0.42..85.00 rows=200
> width=8) (actual time=0.260..20.518 rows=1000 loops=1)
> Scan mode: Skip scan
> Heap Fetches: 1000
> Planning Time: 5.616 ms
> Execution Time: 21.791 ms
> (5 rows)
>
>
> # explain analyze select distinct on (a) a,b,c from abc order by a,b;
> -- Add one more column and it's slow.
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
> Unique (cost=0.42..50104.43 rows=200 width=12) (actual
> time=1.201..555.280 rows=1000 loops=1)
> -> Index Scan using abc_a_b_idx on abc (cost=0.42..47604.43
> rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000
> loops=1)
> Planning Time: 0.102 ms
> Execution Time: 555.407 ms
> (4 rows)
>
> [1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r
>

Ok, understood.

I have put the CF entry into "Waiting on Author".

Best regards,
Jesper


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-16 15:03:36
Message-ID: CA+q6zcXoYjCtyU-jFU_=p2DL_K3BHbBEEbJHaJTxdvRX3cU-Sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Fri, Jun 14, 2019 at 9:20 AM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> The code in create_distinct_paths() I think should work a different
> way. I think it would be much better to add a new field to Path and
> allow a path to know what keys it is distinct for. This sort of goes
> back to an idea I thought about when developing unique joins
> (9c7f5229ad) about an easier way to detect fields that a relation is
> unique for. I've been calling these "UniqueKeys" in a few emails [1].
> The idea was to tag these onto RelOptInfo to mention which columns or
> exprs a relation is unique by so that we didn't continuously need to
> look at unique indexes in all the places that call
> relation_has_unique_index_for(). The idea there was that unique joins
> would know when a join was unable to duplicate rows. If the outer side
> of a join didn't duplicate the inner side, then the join RelOptInfo
> could keep the UniqueKeys from the inner rel, and vice-versa. If both
> didn't duplicate then the join rel would obtain the UniqueKeys from
> both sides of the join. The idea here is that this would be a better
> way to detect unique joins, and also when it came to the grouping
> planner we'd also know if the distinct or group by should be a no-op.
> DISTINCT could be skipped, and GROUP BY could do a group aggregate
> without any sort.
>
> I think these UniqueKeys ties into this work, perhaps not adding
> UniqueKeys to RelOptInfo, but just to Path so that we create paths
> that have UniqueKeys during create_index_paths() based on some
> uniquekeys that are stored in PlannerInfo, similar to how we create
> index paths in build_index_paths() by checking if the index
> has_useful_pathkeys(). Doing it this way would open up more
> opportunities to use skip scans. For example, semi-joins and
> anti-joins could make use of them if the uniquekeys covered the entire
> join condition. With this idea, the code you've added in
> create_distinct_paths() can just search for the cheapest path that has
> the correct uniquekeys, or if none exist then just do the normal
> sort->unique or hash agg. I'm not entirely certain how we'd instruct
> a semi/anti joined relation to build such paths, but that seems like a
> problem that could be dealt with when someone does the work to allow
> skip scans to be used for those.
>
> Also, I'm not entirely sure that these UniqueKeys should make use of
> PathKey since there's no real need to know about pk_opfamily,
> pk_strategy, pk_nulls_first as those all just describe how the keys
> are ordered. We just need to know if they're distinct or not. All
> that's left after removing those fields is pk_eclass, so could
> UniqueKeys just be a list of EquivalenceClass? or perhaps even a
> Bitmapset with indexes into PlannerInfo->ec_classes (that might be
> premature for now since we've not yet got
> https://commitfest.postgresql.org/23/1984/ or
> https://commitfest.postgresql.org/23/2019/ ) However, if we did use
> PathKey, that does allow us to quickly check if the UniqueKeys are
> contained within the PathKeys, since pathkeys are canonical which
> allows us just to compare their memory address to know if two are
> equal, however, if you're storing eclasses we could probably get the
> same just by comparing the address of the eclass to the pathkey's
> pk_eclass.

Interesting, thanks for sharing this.

> I also agree with James that this should not be limited to Index Only
> Scans. From testing the patch, the following seems pretty strange to
> me:
> ...
> explain analyze select distinct on (a) a,b from abc order by a,b;
> explain analyze select distinct on (a) a,b,c from abc order by a,b;
> ...

Yes, but I believe this limitation is not intrinsic to the idea of the patch,
and the very same approach can be used for IndexScan in the second example.
I've already prepared a new version to enable it for IndexScan with minimal
modifications, just need to rebase it on top of the latest changes and then
can post it. Although there would still be some limitations, I guess (e.g. the
first thing I've stumbled upon is that an index scan with a filter wouldn't
work well, because the checking of qual clauses for a filter happens after
ExecScanFetch).


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-19 13:57:29
Message-ID: CA+q6zcVVigO_1KRBxvzKfnR593CEE7FKR-bwP-XcH6AVHOo4+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sun, Jun 16, 2019 at 5:03 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > I also agree with James that this should not be limited to Index Only
> > Scans. From testing the patch, the following seems pretty strange to
> > me:
> > ...
> > explain analyze select distinct on (a) a,b from abc order by a,b;
> > explain analyze select distinct on (a) a,b,c from abc order by a,b;
> > ...
>
> Yes, but I believe this limitation is not intrinsic to the idea of the patch,
> and the very same approach can be used for IndexScan in the second example.
> I've already prepared a new version to enable it for IndexScan with minimal
> modifications, just need to rebase it on top of the latest changes and then
> can post it. Although there would still be some limitations, I guess (e.g. the
> first thing I've stumbled upon is that an index scan with a filter wouldn't
> work well, because the checking of qual clauses for a filter happens after
> ExecScanFetch).

Here is what I was talking about: a POC for an integration with index scan.
Regarding the use of create_skipscan_unique_path and the suggested planner
improvements, I hope that together with Jesper we can come up with something soon.

Attachment Content-Type Size
v18-0001-Index-skip-scan.patch application/octet-stream 57.5 KB

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>
Cc: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-20 13:20:39
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 6/19/19 9:57 AM, Dmitry Dolgov wrote:
> Here is what I was talking about, POC for an integration with index scan. About
> using of create_skipscan_unique_path and suggested planner improvements, I hope
> together with Jesper we can come up with something soon.
>

I made some minor changes, but I did move all the code in
create_distinct_paths() under enable_indexskipscan to limit the overhead
if skip scan isn't enabled.

Attached is v20, since the last patch should have been v19.

Best regards,
Jesper

Attachment Content-Type Size
v20-0001-Index-skip-scan.patch text/x-patch 57.6 KB

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Peter Geoghegan" <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Thomas Munro" <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-22 10:17:04
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The following sql statement seems to have incorrect results - some logic in the backwards scan is currently not entirely right.

-Floris

drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;

DROP TABLE
CREATE TABLE
INSERT 0 50000
CREATE INDEX
ANALYZE
a | b
---+-------
5 | 10000
5 | 1
4 | 1
3 | 1
2 | 1
1 | 1
(6 rows)

QUERY PLAN
---------------------------------------------------------------------------------
Index Only Scan Backward using a_a_b_idx on a (cost=0.29..1.45 rows=5 width=8)
Scan mode: Skip scan
(2 rows)


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-22 15:44:19
Message-ID: CA+q6zcXV3cPWPEuZsPdN4jSOcBZifxhuY=cA3+QPMj+622kr5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> The following sql statement seems to have incorrect results - some logic in
> the backwards scan is currently not entirely right.

Thanks for testing! You're right, looks like in the current implementation in
case of backwards scan there is one unnecessary extra step forward. It seems
this mistake was made, since I was concentrating only on the backward scans
with a cursor, and used not exactly correct approach to wrap up after a scan
was finished. Give me a moment, I'll tighten it up.


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "Peter Geoghegan" <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Thomas Munro" <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-22 22:15:17
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Thanks for testing! You're right, looks like in the current implementation in
> case of backwards scan there is one unnecessary extra step forward. It seems
> this mistake was made, since I was concentrating only on the backward scans
> with a cursor, and used not exactly correct approach to wrap up after a scan
> was finished. Give me a moment, I'll tighten it up.

Thanks. Looking forward to it. I think I found some other strange behavior. Given the same table as in my previous e-mail, the following queries also return inconsistent results. I spent some time trying to debug it, but can't easily pinpoint the cause. It looks like it also skips over one value too much, my guess is during _bt_skippage call in _bt_skip?
Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print for example the contents of an IndexTuple? I saw there are some tools out there that can pretty print plans, but viewing tuples is more complicated I guess.

-- this one is OK
postgres=# select distinct on (a) a,b from a where b>1;
a | b
---+---
1 | 2
2 | 2
3 | 2
4 | 2
5 | 2
(5 rows)

-- this one is not OK, it seems to skip too much
postgres=# select distinct on (a) a,b from a where b=2;
a | b
---+---
1 | 2
3 | 2
5 | 2
(3 rows)


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-23 01:01:57
Message-ID: CAH2-WzmRT_0YbHF05axQb2OiTUQiQAkr0Lznntj8X3KADKZ-5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print for example the contents of an IndexTuple? I saw there are some tools out there that can pretty print plans, but viewing tuples is more complicated I guess.

Try the attached patch -- it has DEBUG1 traces with the contents of
index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended, and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adapted to the patch at all,
so you'll probably need to do that.

The patch should be considered a very rough hack, for now. It leaks
memory like crazy. But I think that you'll find it helpful.

--
Peter Geoghegan

Attachment Content-Type Size
0012-Index-scan-positioning-DEBUG1-instrumentation.patch application/octet-stream 19.2 KB

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Thomas Munro" <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-06-23 11:04:30
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Try the attached patch -- it has DEBUG1 traces with the contents of
> index tuples from key points during index scans, allowing you to see
> what's going on both as a B-Tree is descended, and as a range scan is
> performed. It also shows details of the insertion scankey that is set
> up within _bt_first(). This hasn't been adapted to the patch at all,
> so you'll probably need to do that.

Thanks! That works quite nicely.

I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,10000), (2, 1), (2,2), ... (2,10000) until (5,10000).
Running the query
SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;
The flow is like this:
_bt_first is called first - it sees there are no suitable scan keys to start at a custom location in the tree, so it just starts from the beginning and searches until it finds the first tuple (1,2).
After the first tuple was yielded, _bt_skip kicks in. It constructs an insert scan key with a=1 and nextkey=true, so doing the _bt_search + _bt_binsrch on this, it finds the first tuple larger than this: (2,1). This is not the tuple that it stops at though, because after that it does this:

    if (ScanDirectionIsForward(dir))
        /* Move back for _bt_next */
        offnum = OffsetNumberPrev(offnum);
    ....
    /* Now read the data */
    if (!_bt_readpage(scan, dir, offnum))
    {
        /*
         * There's no actually-matching data on this page. Try to advance to
         * the next page. Return false if there's no matching data at all.
         */
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
        if (!_bt_steppage(scan, dir))

First, it takes the previous tuple with OffsetNumberPrev (so the tuple before (2,1), which is (1,10000)). This is done because, if this tuple were to be returned, there would be a call to _bt_next afterwards, which would then conveniently be on the tuple (2,1) that we want. However, _bt_readpage messes things up, because it only reads tuples that match all the provided keys (so where b=2). The only tuple it'll return is (2,2). This will be the tuple that is set; however, on the call to _bt_next, the tuple is first incremented, so we'll find (2,3) there, which doesn't match our keys. This leads it to skip (2,2) in our result set.

I was wondering about something else: don't we also have another problem with updating this current index tuple by skipping before calling btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples when scan->kill_prior_tuple is true. I'm not too familiar with the concept of killing dead tuples while doing index scans, but by looking at the code it seems to be possible that btgettuple returns a tuple, caller processes it and sets kill_prior_tuple to true in order to have it killed. However, then the skip scan kicks in, which sets the current tuple to a completely different tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my reasoning correct here or am I missing something?

-Floris


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-23 13:10:31
Message-ID: 20190623131031.65zmz2m4waew6paa@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I've done some initial review on v20 - just reading through the code, no
tests at this point. Here are my comments:

1) config.sgml

I'm not sure why the enable_indexskipscan section says

This parameter requires that <varname>enable_indexonlyscan</varname>
is <literal>on</literal>.

I suppose it's the same thing as for enable_indexscan, and we don't say
anything like that for that GUC.

2) indices.sgml

The new section is somewhat unclear and difficult to understand, I think
it'd deserve a rewording. Also, I wonder if we really should link to the
wiki page about FSM problems. We have a couple of wiki links in the sgml
docs, but those seem more generic while this seems like a development page
that might disappear. But more importantly, that wiki page does not say
anything about "Loose Index scans" so is it even the right wiki page?

3) nbtsearch.c

_bt_skip - comments are formatted incorrectly
_bt_update_skip_scankeys - missing comment
_bt_scankey_within_page - missing comment

4) explain.c

There are duplicate blocks of code for IndexScan and IndexOnlyScan:

    if (indexscan->skipPrefixSize > 0)
    {
        if (es->format != EXPLAIN_FORMAT_TEXT)
            ExplainPropertyInteger("Distinct Prefix", NULL,
                                   indexscan->skipPrefixSize,
                                   es);
    }

I suggest we wrap this into a function ExplainIndexSkipScanKeys() or
something like that.
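For example something along these lines, reusing only what's already in the
snippet above (just a sketch of the shape, of course):

    static void
    ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
    {
        if (skipPrefixSize > 0 && es->format != EXPLAIN_FORMAT_TEXT)
            ExplainPropertyInteger("Distinct Prefix", NULL,
                                   skipPrefixSize, es);
    }

and then both the IndexScan and IndexOnlyScan branches can simply call it.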

Also, there's this:

    if (((IndexScan *) plan)->skipPrefixSize > 0)
    {
        ExplainPropertyText("Scan mode", "Skip scan", es);
    }

That does not make much sense - there's just a single 'scan mode' value.
So I suggest we do the same thing as for unique joins, i.e.

ExplainPropertyBool("Skip Scan",
(((IndexScan *) plan)->skipPrefixSize > 0),
es);

5) nodeIndexOnlyScan.c

In ExecInitIndexOnlyScan, we should initialize the ioss_ fields a bit
later, with the existing ones. This is just a cosmetic issue, though.

6) nodeIndexScan.c

I wonder why we even add and initialize the ioss_ fields for IndexScan
nodes, when the skip scans require index-only scans?

7) pathnode.c

I wonder how much the costing was discussed. It seems to me the logic is
fairly similar to ideas discussed in the incremental sort patch, and
we've been discussing some weak points there. I'm not sure how much we
need to consider those issues here.

8) execnodes.h

The comment before IndexScanState mentions new field NumDistinctKeys,
but there's no such field added to the struct.

9) pathnodes.h

I don't understand what the uniq_distinct_pathkeys comment says :-(

10) plannodes.h

The naming of the new field (skipPrefixSize) added to IndexScan and
IndexOnlyScan is clearly inconsistent with the naming convention of the
existing fields.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-24 11:44:14
Message-ID: CA+q6zcWDqDv7rXfm2d9Ah8z51qQgLK0-q_eLg8vLsdvb4RFf0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> However, _bt_readpage messes things up, because it only reads tuples that
> match all the provided keys (so where b=2)

Right, the problem you reported first had a similar origin. I'm starting to
think that using _bt_readpage just like that is probably not exactly the right
thing to do, since the correct element is already found and there is no need to
check whether tuples match after one step back. I'll try to avoid it in the
next version of the patch.

> I was wondering about something else: don't we also have another problem with
> updating this current index tuple by skipping before calling
> btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
> when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
> killing dead tuples while doing index scans, but by looking at the code it
> seems to be possible that btgettuple returns a tuple, caller processes it and
> sets kill_prior_tuple to true in order to have it killed. However, then the
> skip scan kicks in, which sets the current tuple to a completely different
> tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my
> reasoning correct here or am I missing something?

I need to check, but we can probably avoid that by setting kill_prior_tuple to
false in the skip scan case, as index_rescan does.

> On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I've done some initial review on v20 - just reading through the code, no
> tests at this point. Here are my comments:

Thank you!

> 2) indices.sgml
>
> The new section is somewhat unclear and difficult to understand, I think
> it'd deserve a rewording. Also, I wonder if we really should link to the
> wiki page about FSM problems. We have a couple of wiki links in the sgml
> docs, but those seem more generic while this seems like a development page
> that might disappear. But more importantly, that wiki page does not say
> anything about "Loose Index scans" so is it even the right wiki page?

Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already
mentioned it too, so probably I'm going to remove it (and instead describe the
idea in a few words in the documentation itself).

> 6) nodeIndexScan.c
>
> I wonder why we even add and initialize the ioss_ fields for IndexScan
> nodes, when the skip scans require index-only scans?

Skip scans required index-only scans until recently, when the patch was updated
to incorporate the same approach for index scans too. My apologies, it looks like
the documentation and some comments are still inconsistent on this topic.

> 7) pathnode.c
>
> I wonder how much was the costing discussed. It seems to me the logic is
> fairly similar to ideas discussed in the incremental sort patch, and
> we've been discussing some weak points there. I'm not sure how much we
> need to consider those issues here.

Can you please elaborate in a few words on which issues you mean? Is it about
non-uniform distribution of distinct values? If so, I believe it's partially
addressed when we have to skip too often, by searching the next index page.
Although yeah, there is still an assumption about uniform distribution of
distinct groups at planning time.
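
To make that more concrete, here is a small skewed-distribution test case (the
table and column names are made up just for illustration) that could be used to
see how far off that assumption gets:

CREATE TABLE skewed (a int, b int);
-- one huge group for a = 1, plus many single-row groups
INSERT INTO skewed SELECT 1, g FROM generate_series(1, 1000000) g;
INSERT INTO skewed SELECT g, 1 FROM generate_series(2, 1000) g;
CREATE INDEX ON skewed (a, b);
ANALYZE skewed;

EXPLAIN (ANALYZE, BUFFERS) SELECT DISTINCT a FROM skewed;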

> 9) pathnodes.h
>
> I don't understand what the uniq_distinct_pathkeys comment says :-(

Yeah, sorry, I'll try to improve the comments in the next version, where
I'm going to address all the feedback.


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, James Coleman <jtc331(at)gmail(dot)com>
Subject: Re: Index Skip Scan
Date: 2019-06-24 13:04:10
Message-ID: 20190624130410.rhor2lcffdvrh7ka@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 24, 2019 at 01:44:14PM +0200, Dmitry Dolgov wrote:
>> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>>
>> However, _bt_readpage messes things up, because it only reads tuples that
>> match all the provided keys (so where b=2)
>
>Right, the problem you reported first had a similar origin. I'm starting to
>think that using _bt_readpage just like that is probably not exactly the right
>thing to do, since the correct element is already found and there is no need to
>check whether tuples match after one step back. I'll try to avoid it in the
>next version of the patch.
>
>> I was wondering about something else: don't we also have another problem with
>> updating this current index tuple by skipping before calling
>> btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
>> when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
>> killing dead tuples while doing index scans, but by looking at the code it
>> seems to be possible that btgettuple returns a tuple, caller processes it and
>> sets kill_prior_tuple to true in order to have it killed. However, then the
>> skip scan kicks in, which sets the current tuple to a completely different
>> tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my
>> reasoning correct here or am I missing something?
>
>I need to check, but we can probably avoid that by setting kill_prior_tuple to
>false in the skip scan case, as index_rescan does.
>
>> On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> I've done some initial review on v20 - just reading through the code, no
>> tests at this point. Here are my comments:
>
>Thank you!
>
>> 2) indices.sgml
>>
>> The new section is somewhat unclear and difficult to understand, I think
>> it'd deserve a rewording. Also, I wonder if we really should link to the
>> wiki page about FSM problems. We have a couple of wiki links in the sgml
>> docs, but those seem more generic while this seems like a development page
>> that might disappear. But more importantly, that wiki page does not say
>> anything about "Loose Index scans" so is it even the right wiki page?
>
>Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already
>mentioned it too, so probably I'm going to remove it (and instead describe the
>idea in a few words in the documentation itself).
>
>> 6) nodeIndexScan.c
>>
>> I wonder why we even add and initialize the ioss_ fields for IndexScan
>> nodes, when the skip scans require index-only scans?
>
>Skip scans required index-only scans until recently, when the patch was updated
>to incorporate the same approach for index scans too. My apologies, it looks like
>the documentation and some comments are still inconsistent on this topic.
>

Yes, if that's the case then various bits of docs and comments are rather
misleading, and the fields in IndexScanState should be named 'iss_'.

>> 7) pathnode.c
>>
>> I wonder how much was the costing discussed. It seems to me the logic is
>> fairly similar to ideas discussed in the incremental sort patch, and
>> we've been discussing some weak points there. I'm not sure how much we
>> need to consider those issues here.
>
>Can you please elaborate in a few words on which issues you mean? Is it about
>non-uniform distribution of distinct values? If so, I believe it's partially
>addressed when we have to skip too often, by searching the next index page.
>Although yeah, there is still an assumption about uniform distribution of
>distinct groups at planning time.
>

Right, it's mostly about what happens when the group sizes are not close
to the average size. The question is what happens in such cases - how much
slower will the plan be, compared to the "current" plan without a skip scan?

I don't have a very good idea of the additional overhead associated with
skip-scans - presumably it's a bit more expensive, right?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-02 09:00:06
Message-ID: CA+hUKGKo30N5VNuRWhDuMGVZ3hTcv4J5RGXe286GRJZLk_jBYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:
> Attached is v20, since the last patch should have been v19.

I took this for a quick spin today. The DISTINCT ON support is nice
and I think it will be very useful. I've signed up to review it and
will have more to say later. But today I had a couple of thoughts
after looking into how src/backend/optimizer/plan/planagg.c works and
wondering how to do some more skipping tricks with the existing
machinery.

1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or
AVG(DISTINCT ...) or any other aggregate). Right now you get a seq
scan, with the sort/unique logic inside the Aggregate node. If you
write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
a skip scan that is much faster in good cases. I suppose you could
have a process_distinct_aggregates() in planagg.c that recognises
queries of the right form and generates extra paths a bit like
build_minmax_path() does. I think it's probably better to consider
that in the grouping planner proper instead. I'm not sure.
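
For example, with a hypothetical table t that has an index on column i, the
pair of queries I'm talking about is just:

-- today: seq scan, with the sort/unique logic inside the Aggregate node
SELECT COUNT(DISTINCT i) FROM t;

-- with the patch: can use a skip scan
SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss;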

2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
BY i if you're allowed to go backwards. Those queries are equivalent
to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
(though as Floris noted, the backwards version gives the wrong answers
with v20). That does seem like a much more specific thing applicable
only to MIN and MAX, and I think preprocess_minmax_aggregates() could
be taught to handle that sort of query, building an index only scan
path with skip scan in build_minmax_path().
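
Spelled out, the pairs of queries I'm claiming are equivalent (up to row
order), using a hypothetical table t indexed on (i, j):

SELECT i, MIN(j) FROM t GROUP BY i;
SELECT DISTINCT ON (i) i, j FROM t ORDER BY i, j;

SELECT i, MAX(j) FROM t GROUP BY i;
SELECT DISTINCT ON (i) i, j FROM t ORDER BY i DESC, j DESC;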

--
Thomas Munro
https://enterprisedb.com


From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-02 12:27:09
Message-ID: CAKJS1f_v5X1Sb42ShOhsLzuMA6iX52PcBwpf4P-L9H_GDYA3-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> I took this for a quick spin today. The DISTINCT ON support is nice
> and I think it will be very useful. I've signed up to review it and
> will have more to say later. But today I had a couple of thoughts
> after looking into how src/backend/optimizer/plan/planagg.c works and
> wondering how to do some more skipping tricks with the existing
> machinery.
>
> 1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or
> AVG(DISTINCT ...) or any other aggregate). Right now you get a seq
> scan, with the sort/unique logic inside the Aggregate node. If you
> write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
> a skip scan that is much faster in good cases. I suppose you could
> have a process_distinct_aggregates() in planagg.c that recognises
> queries of the right form and generates extra paths a bit like
> build_minmax_path() does. I think it's probably better to consider
> that in the grouping planner proper instead. I'm not sure.

I think to make that happen we'd need to do a bit of an overhaul in
nodeAgg.c to allow it to make use of presorted results instead of
having the code blindly sort rows for each aggregate that has a
DISTINCT or ORDER BY. The planner would also then need to start
requesting paths with pathkeys that suit the aggregate, and probably
also dictate the order in which the AggRefs should be evaluated, so that
for each sort order all AggRefs that can use it are evaluated together. Once
that part is done then the aggregates could then also request paths
with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd
need to be pretty careful with that one since there may be other parts
of the query that require that all rows are fed in, not just 1 row per
value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't
just feed through 1 row for each "i" value, since we need only the
ones that have "z > 0". Getting the first part of this solved is much
more important than making skip scans work here, I'd say. I think we
need to be able to walk before we can run with DISTINCT / ORDER BY
aggs.

> 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
> you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
> BY i if you're allowed to go backwards. Those queries are equivalent
> to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
> (though as Floris noted, the backwards version gives the wrong answers
> with v20). That does seem like a much more specific thing applicable
> only to MIN and MAX, and I think preprocess_minmax_aggregates() could
> be taught to handle that sort of query, building an index only scan
> path with skip scan in build_minmax_path().

For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.

The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.
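
As a trivial SQL-level illustration of that (table and constraint invented for
the example): if (b, a) is unique then (a, b, c) must be too, so all three
counts below come out equal to the row count.

CREATE TABLE u (a int, b int, c int, UNIQUE (b, a));
INSERT INTO u VALUES (1, 1, 10), (1, 2, 10), (2, 1, 10);
SELECT count(*), count(DISTINCT (b, a)), count(DISTINCT (a, b, c)) FROM u;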

[1] https://www.postgresql.org/message-id/CAKJS1f86FgODuUnHiQ25RKeuES4qTqeNxm1QbqJWrBoZxVGLiQ@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-03 17:31:21
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Thomas and David,

Thanks for the feedback !

On 7/2/19 8:27 AM, David Rowley wrote:
> On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> I took this for a quick spin today. The DISTINCT ON support is nice
>> and I think it will be very useful. I've signed up to review it and
>> will have more to say later. But today I had a couple of thoughts
>> after looking into how src/backend/optimizer/plan/planagg.c works and
>> wondering how to do some more skipping tricks with the existing
>> machinery.
>>
>> 1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or
>> AVG(DISTINCT ...) or any other aggregate). Right now you get a seq
>> scan, with the sort/unique logic inside the Aggregate node. If you
>> write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
>> a skip scan that is much faster in good cases. I suppose you could
>> have a process_distinct_aggregates() in planagg.c that recognises
>> queries of the right form and generates extra paths a bit like
>> build_minmax_path() does. I think it's probably better to consider
>> that in the grouping planner proper instead. I'm not sure.
>
> I think to make that happen we'd need to do a bit of an overhaul in
> nodeAgg.c to allow it to make use of presorted results instead of
> having the code blindly sort rows for each aggregate that has a
> DISTINCT or ORDER BY. The planner would also then need to start
> requesting paths with pathkeys that suit the aggregate, and probably
> also dictate the order in which the AggRefs should be evaluated, so that
> for each sort order all AggRefs that can use it are evaluated together. Once
> that part is done then the aggregates could then also request paths
> with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd
> need to be pretty careful with that one since there may be other parts
> of the query that require that all rows are fed in, not just 1 row per
> value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't
> just feed through 1 row for each "i" value, since we need only the
> ones that have "z > 0". Getting the first part of this solved is much
> more important than making skip scans work here, I'd say. I think we
> need to be able to walk before we can run with DISTINCT / ORDER BY
> aggs.
>

I agree that the above is outside of scope for the first patch -- I
think the goal should be the simple use-cases for IndexScan and
IndexOnlyScan.

Maybe we should expand [1] with possible cases, so we don't lose track
of the ideas.

>> 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
>> you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
>> BY i if you're allowed to go backwards. Those queries are equivalent
>> to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
>> (though as Floris noted, the backwards version gives the wrong answers
>> with v20). That does seem like a much more specific thing applicable
>> only to MIN and MAX, and I think preprocess_minmax_aggregates() could
>> be taught to handle that sort of query, building an index only scan
>> path with skip scan in build_minmax_path().
>
> For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
> }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
>

Ok.

> The more I think about these UniqueKeys, the more I think they need to
> be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> should be equivalent to { y, x }, but with PathKeys, that's not the
> case, since the order of each key matters. UniqueKeys equivalent
> version of pathkeys_contained_in() would not care about the order of
> individual keys, it would say things like, { a, b, c } is contained in
> { b, a }, since if the path is unique on columns { b, a } then it must
> also be unique on { a, b, c }.
>

I'm looking at this, and will keep this in mind.

Thanks !

[1] https://wiki.postgresql.org/wiki/Loose_indexscan

Best regards,
Jesper


From: David Fetter <david(at)fetter(dot)org>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-03 19:46:28
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
> On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> The more I think about these UniqueKeys, the more I think they need to
> be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> should be equivalent to { y, x }, but with PathKeys, that's not the
> case, since the order of each key matters. UniqueKeys equivalent
> version of pathkeys_contained_in() would not care about the order of
> individual keys, it would say things like, { a, b, c } is contained in
> { b, a }, since if the path is unique on columns { b, a } then it must
> also be unique on { a, b, c }.

Is that actually true, though? I can see unique {a, b, c} => unique
{a, b}, but for example:

a | b | c
--|---|--
1 | 2 | 3
1 | 2 | 4
1 | 2 | 5

is unique on {a, b, c} but not on {a, b}, at least as I understand the
way "unique" is used here, which is 3 distinct {a, b, c}, but only one
{a, b}.

Or I could be missing something obvious, and in that case, please
ignore.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: James Coleman <jtc331(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-03 21:02:44
Message-ID: CAAaqYe_Yg6uYr-2PG9emw9Nh8ojA0wuDEOVWUBqDaHYpJ7gNgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david(at)fetter(dot)org> wrote:
>
> On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
> > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >
> > The more I think about these UniqueKeys, the more I think they need to
> > be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> > should be equivalent to { y, x }, but with PathKeys, that's not the
> > case, since the order of each key matters. UniqueKeys equivalent
> > version of pathkeys_contained_in() would not care about the order of
> > individual keys, it would say things like, { a, b, c } is contained in
> > { b, a }, since if the path is unique on columns { b, a } then it must
> > also be unique on { a, b, c }.
>
> Is that actually true, though? I can see unique {a, b, c} => unique
> {a, b}, but for example:
>
> a | b | c
> --|---|--
> 1 | 2 | 3
> 1 | 2 | 4
> 1 | 2 | 5
>
> is unique on {a, b, c} but not on {a, b}, at least as I understand the
> way "unique" is used here, which is 3 distinct {a, b, c}, but only one
> {a, b}.
>
> Or I could be missing something obvious, and in that case, please
> ignore.

I think that example is the opposite direction of what David (Rowley)
is saying. Unique on {a, b} implies unique on {a, b, c} while you're
correct that the inverse doesn't hold.

Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
different from pathkeys.

James Coleman


From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-03 22:06:11
Message-ID: CAKJS1f_0aUV6SSKTMFjm2QyT8LiDedJEAZtrAEEi-qQHxRXgng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331(at)gmail(dot)com> wrote:
>
> On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david(at)fetter(dot)org> wrote:
> >
> > On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
> > > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > >
> > > The more I think about these UniqueKeys, the more I think they need to
> > > be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> > > should be equivalent to { y, x }, but with PathKeys, that's not the
> > > case, since the order of each key matters. UniqueKeys equivalent
> > > version of pathkeys_contained_in() would not care about the order of
> > > individual keys, it would say things like, { a, b, c } is contained in
> > > { b, a }, since if the path is unique on columns { b, a } then it must
> > > also be unique on { a, b, c }.
> >
> > Is that actually true, though? I can see unique {a, b, c} => unique
> > {a, b}, but for example:
> >
> > a | b | c
> > --|---|--
> > 1 | 2 | 3
> > 1 | 2 | 4
> > 1 | 2 | 5
> >
> > is unique on {a, b, c} but not on {a, b}, at least as I understand the
> > way "unique" is used here, which is 3 distinct {a, b, c}, but only one
> > {a, b}.
> >
> > Or I could be missing something obvious, and in that case, please
> > ignore.
>
> I think that example is the opposite direction of what David (Rowley)
> is saying. Unique on {a, b} implies unique on {a, b, c} while you're
> correct that the inverse doesn't hold.
>
> Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
> and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
> different from pathkeys.

Yeah, exactly. A superset of the unique columns is still unique.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: David Fetter <david(at)fetter(dot)org>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: James Coleman <jtc331(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-04 00:41:18
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 04, 2019 at 10:06:11AM +1200, David Rowley wrote:
> On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331(at)gmail(dot)com> wrote:
> > I think that example is the opposite direction of what David (Rowley)
> > is saying. Unique on {a, b} implies unique on {a, b, c} while you're
> > correct that the inverse doesn't hold.
> >
> > Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
> > and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
> > different from pathkeys.
>
> Yeah, exactly. A superset of the unique columns is still unique.

Thanks for clarifying!

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-04 10:59:47
Message-ID: CA+hUKG+oY0u1YON4KHikmX0aO6tFBOF_LzZQOTX8bVUbe1j5mA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 3, 2019 at 12:27 AM David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
> > you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP
> > BY i if you're allowed to go backwards. Those queries are equivalent
> > to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
> > (though as Floris noted, the backwards version gives the wrong answers
> > with v20). That does seem like a much more specific thing applicable
> > only to MIN and MAX, and I think preprocess_minmax_aggregates() could
> > be taught to handle that sort of query, building an index only scan
> > path with skip scan in build_minmax_path().
>
> For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
> }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.

While updating the Loose Index Scan wiki page with links to other
products' terminology on this subject, I noticed that MySQL can
skip-scan MIN() and MAX() in the same query. Hmm. That seems quite
desirable. I think it requires a new kind of skipping: I think you
have to be able to skip to the first AND last key that has each
distinct prefix, and then stick a regular agg on top to collapse them
into one row. Such a path would not be so neatly describable by
UniqueKeys, or indeed by the amskip() interface in the current patch.
I mention all this stuff not because I want us to run before we can
walk, but because to be ready to commit the basic distinct skip scan
feature, I think we should know approximately how it'll handle the
future stuff we'll need.
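
The query shape I have in mind is something like this, with a hypothetical
table t indexed on (i, j):

SELECT i, MIN(j), MAX(j) FROM t GROUP BY i;

To satisfy that from the index alone you need, for each distinct i, both the
first and the last j, which is what the extra skip mode would have to provide.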

--
Thomas Munro
https://enterprisedb.com


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-05 10:54:40
Message-ID: CA+q6zcWC2BgJwFf31xQUZ+FbP+J0zR8fM0sGVMuCYJDbAW_GAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> > The following sql statement seems to have incorrect results - some logic in
> > the backwards scan is currently not entirely right.
>
> Thanks for testing! You're right, looks like in the current implementation in
> case of backwards scan there is one unnecessary extra step forward. It seems
> this mistake was made, since I was concentrating only on the backward scans
> with a cursor, and used not exactly correct approach to wrap up after a scan
> was finished. Give me a moment, I'll tighten it up.

Here is finally a new version of the patch, where all the mentioned issues
seem to be fixed and the corresponding new tests should keep it that way
(I've skipped all the pubs at PostgresLondon for that). Also I've addressed
most of the feedback from Tomas, except the points about planning improvements
(which are still on our todo list). It's by no means a final result (e.g. I
guess `_bt_read_closest` must be improved), but I hope making progress
step-by-step will help anyway. Also I've fixed some, as it is popular to say
here, brain fade, where I mixed up scan directions.

> On Tue, Jul 2, 2019 at 11:00 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen
> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
> > Attached is v20, since the last patch should have been v19.
>
> I took this for a quick spin today. The DISTINCT ON support is nice
> and I think it will be very useful. I've signed up to review it and
> will have more to say later. But today I had a couple of thoughts
> after looking into how src/backend/optimizer/plan/planagg.c works and
> wondering how to do some more skipping tricks with the existing
> machinery.
>
> On Thu, Jul 4, 2019 at 1:00 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> I mention all this stuff not because I want us to run before we can
> walk, but because to be ready to commit the basic distinct skip scan
> feature, I think we should know approximately how it'll handle the
> future stuff we'll need.

Great, thank you! I agree with Jesper that some parts of this are probably
outside the scope of the first patch, but we can definitely take a look at what
needs to be done to make the current implementation more flexible, so a
follow-up would be natural.

Attachment Content-Type Size
v21-0001-Index-skip-scan.patch application/octet-stream 66.7 KB

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-09 13:32:16
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 7/4/19 6:59 AM, Thomas Munro wrote:
>> For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
>> }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
>

David, are you thinking about something like the attached ?

Some questions.

* Do you see UniqueKey as a "complete" planner node ?
- I didn't update the nodes/*.c files for this yet

* Is a UniqueKey with a list of EquivalenceClass best, or a list of
UniqueKey with a single EquivalenceClass

Likely more questions around this coming -- should this be a separate
thread ?

Based on this I'll start to update the v21 patch to use UniqueKey, and
post a new version.

> While updating the Loose Index Scan wiki page with links to other
> products' terminology on this subject, I noticed that MySQL can
> skip-scan MIN() and MAX() in the same query. Hmm. That seems quite
> desirable. I think it requires a new kind of skipping: I think you
> have to be able to skip to the first AND last key that has each
> distinct prefix, and then stick a regular agg on top to collapse them
> into one row. Such a path would not be so neatly describable by
> UniqueKeys, or indeed by the amskip() interface in the current patch.
> I mention all this stuff not because I want us to run before we can
> walk, but because to be ready to commit the basic distinct skip scan
> feature, I think we should know approximately how it'll handle the
> future stuff we'll need.
>

Thomas, do you have any ideas for this ? I can see that MySQL did the
functionality in two change sets (base and function support), but like
you said we shouldn't paint ourselves into a corner.

Feedback greatly appreciated.

Best regards,
Jesper

Attachment Content-Type Size
uniquekey.txt text/plain 10.8 KB

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 02:14:06
Message-ID: CA+hUKGKmdMABJ9k1BbSDgeVsDM3_nLx091mVGA1NzyT06G6GXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 10, 2019 at 1:32 AM Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:
> > While updating the Loose Index Scan wiki page with links to other
> > products' terminology on this subject, I noticed that MySQL can
> > skip-scan MIN() and MAX() in the same query. Hmm. That seems quite
> > desirable. I think it requires a new kind of skipping: I think you
> > have to be able to skip to the first AND last key that has each
> > distinct prefix, and then stick a regular agg on top to collapse them
> > into one row. Such a path would not be so neatly describable by
> > UniqueKeys, or indeed by the amskip() interface in the current patch.
> > I mention all this stuff not because I want us to run before we can
> > walk, but because to be ready to commit the basic distinct skip scan
> > feature, I think we should know approximately how it'll handle the
> > future stuff we'll need.
>
> Thomas, do you have any ideas for this ? I can see that MySQL did the
> functionality in two change sets (base and function support), but like
> you said we shouldn't paint ourselves into a corner.

I think amskip() could be augmented by later patches to take a
parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
SKIP_FIRST | SKIP_LAST (meaning please give me both). What we have in
the current patch is SKIP_FIRST behaviour. Being able to choose
SKIP_FIRST or SKIP_LAST allows you to handle i, MAX(j) GROUP BY (i)
ORDER BY i (ie forward scan for the order, but wanting the highest key
for each distinct prefix), and being able to choose both allows for i,
MIN(j), MAX(j) GROUP BY i. Earlier I thought that this scheme that
allows you to ask for both might be incompatible with David's
suggestion of tracking UniqueKeys in paths, but now I see that it
isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
and therefore require a regular agg on top. I suspect that's OK: this
min/max transformation stuff is highly specialised and can have
whatever magic path-building logic is required in
preprocess_minmax_aggregates().

--
Thomas Munro
https://enterprisedb.com


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 13:10:36
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 7/9/19 10:14 PM, Thomas Munro wrote:
>> Thomas, do you have any ideas for this ? I can see that MySQL did the
>> functionality in two change sets (base and function support), but like
>> you said we shouldn't paint ourselves into a corner.
>
> I think amskip() could be augmented by later patches to take a
> parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
> SKIP_FIRST | SKIP_LAST (meaning please give me both). What we have in
> the current patch is SKIP_FIRST behaviour. Being able to choose
> SKIP_FIRST or SKIP_LAST allows you to handle i, MAX(j) GROUP BY (i)
> ORDER BY i (ie forward scan for the order, but wanting the highest key
> for each distinct prefix), and being able to choose both allows for i,
> MIN(j), MAX(j) GROUP BY i. Earlier I thought that this scheme that
> allows you to ask for both might be incompatible with David's
> suggestion of tracking UniqueKeys in paths, but now I see that it
> isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
> and therefore require a regular agg on top. I suspect that's OK: this
> min/max transformation stuff is highly specialised and can have
> whatever magic path-building logic is required in
> preprocess_minmax_aggregates().
>

Ok, great.

Thanks for your feedback !

Best regards,
Jesper


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "Peter Geoghegan" <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Thomas Munro" <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 14:39:55
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Here is finally a new version of the patch, where all the mentioned issues
> seem to be fixed and the corresponding new tests should keep it that way
> (I've skipped all the pubs at PostgresLondon for that).

Thanks for the new patch! Really appreciate the work you're putting into it.

I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:

- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix

- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix

- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix

- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix

These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.

Also, I think in planner.c, line 4831, we should actually be looking at whether uniq_distinct_pathkeys is NIL, rather than the current check for distinct_pathkeys. That'll make the planner pick the skip scan even with queries like "select distinct on (a) a,b where a=2". Currently, it doesn't pick the skip scan here, because distinct_pathkeys does not contain "a" anymore. This leads to it scanning every item for a=2 even though it can stop after the first one.
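
For reference, a complete version of that query, against a hypothetical table
a (a int, b int, c int) with an index on (a, b) and many rows for a = 2:

CREATE TABLE a (a int, b int, c int);
CREATE INDEX ON a (a, b);
INSERT INTO a SELECT 2, g, 0 FROM generate_series(1, 10000) g;
ANALYZE a;
-- with the current check this reads every a = 2 entry; with
-- uniq_distinct_pathkeys it could stop after the first one
SELECT DISTINCT ON (a) a, b FROM a WHERE a = 2;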

I'll do some more tests with the patch.

-Floris


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 14:48:43
Message-ID: CA+q6zcU66Gx954SjOT5nZRRbYG-gVVNJx5NSbn3Gn2aKwtaX1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> The more I think about these UniqueKeys, the more I think they need to
> be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> should be equivalent to { y, x }, but with PathKeys, that's not the
> case, since the order of each key matters. UniqueKeys equivalent
> version of pathkeys_contained_in() would not care about the order of
> individual keys, it would say things like, { a, b, c } is contained in
> { b, a }, since if the path is unique on columns { b, a } then it must
> also be unique on { a, b, c }.

> On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> David, are you thinking about something like the attached ?
>
> Some questions.
>
> * Do you see UniqueKey as a "complete" planner node ?
> - I didn't update the nodes/*.c files for this yet
>
> * Is a UniqueKey with a list of EquivalenceClass best, or a list of
> UniqueKey with a single EquivalenceClass

Just for me to clarify, the idea is to replace PathKeys with a new concept of
"UniqueKey" for skip scans, right? If I see it correctly, of course

UniqueKeys { x, y } == UniqueKeys { y, x }

from the result point of view, but the execution costs could be different due
to different value distributions. In fact there are efforts to utilize this to
produce a more optimal order [1], but with the UniqueKeys concept this
information is lost. Obviously it's not something that will be a problem for
skip scan functionality immediately (or perhaps ever), but I guess it's still
worth pointing out.

> On Wed, Jul 10, 2019 at 4:40 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> I verified that the backwards index scan is indeed functioning now. However,
> I'm afraid it's not that simple, as I think the cursor case is broken now.

Thanks for testing! Could you provide a test case to show what exactly is the
problem?

[1]: https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 14:52:21
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;

set enable_indexskipscan=1; -- setting this to 0 yields different results
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;

BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;

FETCH FROM c;
FETCH BACKWARD FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

END;


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 15:00:10
Message-ID: CA+q6zcVVaWQvD91R0tkPoHsjY-u_utE2Y3aVuYNxKFMhcd8OdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Jul 10, 2019 at 4:52 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> > Thanks for testing! Could you provide a test case to show what exactly is the
> > problem?
>
> create table a (a int, b int, c int);
> insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
> create index on a (a,b);
> analyze a;
>
> set enable_indexskipscan=1; -- setting this to 0 yields different results
> set random_page_cost=1;
> explain SELECT DISTINCT ON (a) a,b FROM a;
>
> BEGIN;
> DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;
>
> FETCH FROM c;
> FETCH BACKWARD FROM c;
>
> FETCH 6 FROM c;
> FETCH BACKWARD 6 FROM c;
>
> FETCH 6 FROM c;
> FETCH BACKWARD 6 FROM c;
>
> END;

Ok, give me a moment, I'll check.


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 15:00:23
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

Note that in the case of a regular non-skip scan, this cursor backwards works because the Unique node on top does not support backwards scanning at all. Therefore, when creating the cursor, the plan actually contains a Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards' therefore never reaches the index scan, as it just fetches stuff from the materialize node.

-Floris


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-10 15:05:49
Message-ID: CA+q6zcX-8KTkkpydVAKDryGU9MZhP-nW3U-cj2AS9r4_69wBrA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Jul 10, 2019 at 5:00 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
>
> > Thanks for testing! Could you provide a test case to show what exactly is the
> > problem?
>
> Note that in the case of a regular non-skip scan, this cursor backwards works
> because the Unique node on top does not support backwards scanning at all.
> Therefore, when creating the cursor, the plan actually contains a
> Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards'
> therefore never reaches the index scan, as it just fetches stuff from the
> materialize node.

Yeah, I'm aware. The last time I was busy with cursors I managed to
make it work as I wanted, so at that time I decided to keep it like that, even
though without skip scan it wasn't going backwards.


From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-11 02:49:21
Message-ID: CA+hUKGJ_YyLF5Kw+916D0PaVZ1vA4nw=4gFkeRzR0dRYq2qj1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:
>
> - Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
>
> - Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix
>
> These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.

Oh, right! So actually we already need the extra SKIP_FIRST/SKIP_LAST
argument to amskip() that I theorised about, to support DISTINCT ON.
Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or
SKIP_MIN/SKIP_MAX. If we don't add support for that, we'll have to
drop DISTINCT ON support, or use Materialize for some cases. My vote
is: let's move forwards and add that parameter and make DISTINCT ON
work.

--
Thomas Munro
https://enterprisedb.com


From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-11 05:16:35
Message-ID: CAKJS1f9vXaZ0=5wLfz+F4rVk8BfksPTRfgF2S8=+Xj+hK_vROw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 11 Jul 2019 at 14:50, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> > I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:
> >
> > - Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
> >
> > - Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
> >
> > - Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
> >
> > - Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix
> >
> > These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.
>
> Oh, right! So actually we already need the extra SKIP_FIRST/SKIP_LAST
> argument to amskip() that I theorised about, to support DISTINCT ON.
> Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or
> SKIP_MIN/SKIP_MAX. If we don't add support for that, we'll have to
> drop DISTINCT ON support, or use Materialize for some cases. My vote
> is: let's move forwards and add that parameter and make DISTINCT ON
> work.

Does it not just need to know the current direction of the cursor's
scroll, then also the intended scan direction?

For the general forward direction but for a backwards cursor scroll,
we'd return the lowest value for each distinct prefix, but for the
general backwards direction (DESC case) we'd return the highest value
for each distinct prefix. Looking at IndexNext() the cursor direction
seems to be estate->es_direction and the general scan direction is
indicated by the plan's indexorderdir. Can't we just pass both of
those to index_skip() to have it decide what to do? If we also pass in
indexorderdir then index_skip() should know if it's to return the
highest or lowest value, right?
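
To make that concrete, here is a toy, standalone model of the decision (plain
C, not executor code; Dir and decide() are invented names) showing how the
plan's indexorderdir and the cursor's fetch direction could be combined to
cover Floris' four cases:

#include <stdio.h>

/* mirrors the semantics of ScanDirection: backward = -1, forward = +1 */
typedef enum { BackwardDir = -1, ForwardDir = 1 } Dir;

static void
decide(Dir indexorderdir, Dir cursor_dir)
{
    /* ASC plans want the minimum of each prefix, DESC plans the maximum */
    const char *want = (indexorderdir == ForwardDir) ? "minimum" : "maximum";
    /* the physical jump direction is the product of the two directions */
    const char *jump = (indexorderdir * cursor_dir > 0) ? "next" : "previous";

    printf("order=%2d cursor=%2d -> %s of prefix, skip to %s prefix\n",
           indexorderdir, cursor_dir, want, jump);
}

int
main(void)
{
    decide(ForwardDir, ForwardDir);     /* forward scan, forward fetch */
    decide(ForwardDir, BackwardDir);    /* forward scan, backward fetch */
    decide(BackwardDir, ForwardDir);    /* backward scan, forward fetch */
    decide(BackwardDir, BackwardDir);   /* backward scan, backward fetch */
    return 0;
}

The four lines it prints match Floris' table above.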

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "Peter Geoghegan" <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Thomas Munro" <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-11 07:41:31
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> For the general forward direction but for a backwards cursor scroll,

> we'd return the lowest value for each distinct prefix, but for the
> general backwards direction (DESC case) we'd return the highest value
> for each distinct prefix. Looking at IndexNext() the cursor direction
> seems to be estate->es_direction and the general scan direction is
> indicated by the plan's indexorderdir. Can't we just pass both of
> those to index_skip() to have it decide what to do? If we also pass in
> indexorderdir then index_skip() should know if it's to return the
> highest or lowest value, right?

Correct, with these two values correct behavior can be deduced. The implementation of this is a bit cumbersome though. Consider a case like:

SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
Data (imagine every tuple here actually occurs 10.000 times in the index to see the benefit of skipping):
1,1,1
1,1,2
1,2,2
1,2,3
2,2,1
2,2,3
3,1,1
3,1,2
3,2,2
3,2,3

Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end.

If you move the cursor backwards from the end of the cursor, you should still get (3,1,2) (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one in order to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2 (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2).
Perhaps anyone can think of an easier way to implement it?
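
As a sanity check, here is a toy, array-based model of that backward walk over
the example data (plain C, not btree code; qual() and prefix_start() are
invented names, and prefix_start() stands in for the actual backward skip):

#include <stdio.h>

typedef struct { int a, b, c; } Tup;

/* index order on (a,b,c); imagine each row repeated thousands of times */
static const Tup idx[] = {
    {1,1,1},{1,1,2},{1,2,2},{1,2,3},
    {2,2,1},{2,2,3},
    {3,1,1},{3,1,2},{3,2,2},{3,2,3},
};
static const int nitems = sizeof(idx) / sizeof(idx[0]);

static int
qual(const Tup *t)
{
    return t->c == 2;               /* WHERE c = 2 */
}

/* "skip backward": find the first position of the prefix containing pos */
static int
prefix_start(int pos)
{
    int a = idx[pos].a;

    while (pos > 0 && idx[pos - 1].a == a)
        pos--;
    return pos;
}

int
main(void)
{
    /* backward cursor over SELECT DISTINCT ON (a) a,b,c ... WHERE c = 2 */
    int pos = nitems - 1;

    for (;;)
    {
        int start = prefix_start(pos);
        int a = idx[start].a;
        int i;

        /* scan forward within the prefix for the first (minimum) match */
        for (i = start; i < nitems && idx[i].a == a; i++)
        {
            if (qual(&idx[i]))
            {
                printf("(%d,%d,%d)\n", idx[i].a, idx[i].b, idx[i].c);
                break;
            }
        }

        if (start == 0)
            break;                  /* no previous prefix left */
        pos = start - 1;            /* jump into the previous prefix */
    }
    return 0;
}

It prints (3,1,2) and then (1,1,2), i.e. the forward result in reverse, and the
prefix 2 group is correctly passed over because it has no match for c = 2.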

I do think being able to use DISTINCT ON is very useful and it's worth the extra complications. In the future we can add even more useful skipping features to it, for example:
SELECT DISTINCT ON (a) * FROM a WHERE b =2
After skipping to the next prefix of column a, we can start a new search for (a,b)=(prefix,2) to avoid having to move one-by-one from the start of the prefix to the first matching element. There are many other useful optimizations possible. That won't have to be for this patch though :-)

-Floris


From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-11 10:12:37
Message-ID: CAKJS1f-82QepgeLpBDfVwKuq3A3vSS4NXtO7yYDV1BvzrtUWyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
> Data (imagine every tuple here actually occurs 10.000 times in the index to see the benefit of skipping):
> 1,1,1
> 1,1,2
> 1,2,2
> 1,2,3
> 2,2,1
> 2,2,3
> 3,1,1
> 3,1,2
> 3,2,2
> 3,2,3
>
> Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end.
>
> If you move the cursor backwards from the end of the cursor, you should still get (3,1,2) (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one in order to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2 (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2).
> Perhaps anyone can think of an easier way to implement it?

One option is just to not implement it and instead change
ExecSupportsBackwardScan() so that it returns false for index skip
scans, or perhaps implement an index AM method to allow the planner
to determine whether the index implementation supports it...
amcanskipbackward

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-11 11:38:19
Message-ID: CAKJS1f9xBUCdqj4hT3gc0Vp=BimUnRoQe2ogDEGBhuLv+txz_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 11 Jul 2019 at 02:48, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> >
> > The more I think about these UniqueKeys, the more I think they need to
> > be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> > should be equivalent to { y, x }, but with PathKeys, that's not the
> > case, since the order of each key matters. UniqueKeys equivalent
> > version of pathkeys_contained_in() would not care about the order of
> > individual keys, it would say things like, { a, b, c } is contained in
> > { b, a }, since if the path is unique on columns { b, a } then it must
> > also be unique on { a, b, c }.
>
> > On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
> >
> > David, are you thinking about something like the attached ?
> >
> > Some questions.
> >
> > * Do you see UniqueKey as a "complete" planner node ?
> > - I didn't update the nodes/*.c files for this yet
> >
> > * Is a UniqueKey with a list of EquivalenceClass best, or a list of
> > UniqueKey with a single EquivalenceClass
>
> Just for me to clarify, the idea is to replace PathKeys with a new concept of
> "UniqueKey" for skip scans, right? If I see it correctly, of course
>
> UniqueKeys { x, y } == UniqueKeys { y, x }
>
> from the result point of view, but the execution costs could be different due
> to different values distribution. In fact there are efforts to utilize this to
> produce more optimal order [1], but with UniqueKeys concept this information is
> lost. Obviously it's not something that could be immediately (or even never) a
> problem for skip scan functionality, but I guess it's still worth to point it
> out.

The UniqueKeys idea is quite separate from pathkeys. Currently, a
Path can have a List of PathKeys which define the order in which the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to declare that there will be no more than 1 row with the
same value for the Path's UniqueKey columns/exprs.

As of now, standard_qp_callback() sets root->query_pathkeys; the idea
here would be that the callback also sets a new List field named
"query_uniquekeys", based on the group_pathkeys when they're non-empty
and !root->parse->hasAggs, or on the distinct clause if it's non-empty.
Then in build_index_paths(), around the call to
match_pathkeys_to_index(), we'll probably also want to check if the
index can support UniqueKeys that would suit the query_uniquekeys that
were set during standard_qp_callback().

As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c
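
A minimal sketch of what that could look like (none of this exists in core;
UniqueKey, uk_eclass, canon_uniquekeys and the function names are only
placeholders for the proposal):

#include "postgres.h"
#include "nodes/pathnodes.h"

typedef struct UniqueKey
{
    EquivalenceClass *uk_eclass;    /* probably the only field needed */
} UniqueKey;

/* equivalent of make_canonical_pathkey(): at most one UniqueKey per EC */
static UniqueKey *
make_canonical_uniquekey(PlannerInfo *root, EquivalenceClass *eclass)
{
    ListCell   *lc;
    UniqueKey  *ukey;

    foreach(lc, root->canon_uniquekeys)     /* proposed PlannerInfo field */
    {
        ukey = (UniqueKey *) lfirst(lc);
        if (ukey->uk_eclass == eclass)
            return ukey;
    }

    ukey = (UniqueKey *) palloc(sizeof(UniqueKey));
    ukey->uk_eclass = eclass;
    root->canon_uniquekeys = lappend(root->canon_uniquekeys, ukey);
    return ukey;
}

/* in standard_qp_callback(): derive query_uniquekeys from existing pathkeys */
static List *
build_query_uniquekeys(PlannerInfo *root, List *pathkeys)
{
    List       *uniquekeys = NIL;
    ListCell   *lc;

    foreach(lc, pathkeys)       /* group_pathkeys or distinct_pathkeys */
    {
        PathKey    *pathkey = (PathKey *) lfirst(lc);

        uniquekeys = lappend(uniquekeys,
                             make_canonical_uniquekey(root, pathkey->pk_eclass));
    }
    return uniquekeys;
}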

With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path whose unique keys satisfy uniquekeys_contained_in() against
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel) to pick the cheapest path.

It would be wasted effort to create paths with UniqueKeys if there are
multiple non-dead base rels in the query, as the final rel in
create_distinct_paths() would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.

I think a UniqueKey can just be a struct similar to PathKey, e.g. be
located in pathnodes.h around where PathKey is defined. Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which returns
true if arg1 is a superset of arg2 rather than checking for one set
being a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }. That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, which would save
having a nested loop, but likely that's not going to be required for a
first cut version. This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).
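
With canonical UniqueKeys the superset test could be as simple as this (again
just a sketch, building on the placeholder UniqueKey definition above):

static bool
uniquekeys_contained_in(List *keys1, List *keys2)
{
    ListCell   *lc;

    /*
     * True if keys1 is a superset of keys2: a path that is unique on
     * keys2 is also unique on keys1.  Order is irrelevant, and since
     * UniqueKeys are canonical, comparing pointers is enough.
     */
    foreach(lc, keys2)
    {
        if (!list_member_ptr(keys1, lfirst(lc)))
            return false;
    }
    return true;
}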

I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-11 16:53:54
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:
> The UniqueKeys idea is quite separate from pathkeys. Currently, a
> Path can have a List of PathKeys which define the order that the
> tuples will be read from the Plan node that's created from that Path.
> The idea with UniqueKeys is that a Path can also have a non-empty List
> of UniqueKeys to define that there will be no more than 1 row with the
> same value for the Paths UniqueKey column/exprs.
>
> As of now, if you look at standard_qp_callback() sets
> root->query_pathkeys, the idea here would be that the callback would
> also set a new List field named "query_uniquekeys" based on the
> group_pathkeys when non-empty and !root->query->hasAggs, or by using
> the distinct clause if it's non-empty. Then in build_index_paths()
> around the call to match_pathkeys_to_index() we'll probably also want
> to check if the index can support UniqueKeys that would suit the
> query_uniquekeys that were set during standard_qp_callback().
>
> As for setting query_uniquekeys in standard_qp_callback(), this should
> be simply a matter of looping over either group_pathkeys or
> distinct_pathkeys and grabbing the pk_eclass from each key and making
> a canonical UniqueKey from that. To have these canonical you'll need
> to have a new field in PlannerInfo named canon_uniquekeys which will
> do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
> need an equivalent of make_canonical_pathkey() in uniquekey.c
>
> With this implementation, the code that the patch adds in
> create_distinct_paths() can mostly disappear. You'd only need to look
> for a path that uniquekeys_contained_in() matches the
> root->query_uniquekeys and then just leave it to
> set_cheapest(distinct_rel); to pick the cheapest path.
>
> It would be wasted effort to create paths with UniqueKeys if there's
> multiple non-dead base rels in the query as the final rel in
> create_distinct_paths would be a join rel, so it might be worth
> checking that before creating such index paths in build_index_paths().
> However, down the line, we could carry the uniquekeys forward into the
> join paths if the join does not duplicate rows from the other side of
> the join... That's future stuff though, not for this patch, I don't
> think.
>
> I think a UniqueKey can just be a struct similar to PathKey, e.g be
> located in pathnodes.h around where PathKey is defined. Likely we'll
> need a uniquekeys.c file that has the equivalent to
> pathkeys_contained_in() ... uniquekeys_contained_in(), which return
> true if arg1 is a superset of arg2 rather than check for one set being
> a prefix of another. As you mention above: UniqueKeys { x, y } ==
> UniqueKeys { y, x }. That superset check could perhaps be optimized
> by sorting UniqueKey lists in memory address order, that'll save
> having a nested loop, but likely that's not going to be required for a
> first cut version. This would work since you'd want UniqueKeys to be
> canonical the same as PathKeys are (Notice that compare_pathkeys()
> only checks memory addresses of pathkeys and not equals()).
>
> I think the UniqueKey struct would only need to contain an
> EquivalenceClass field. I think all the other stuff that's in PathKey
> is irrelevant to UniqueKey.
>

Thanks for the feedback ! I'll work on these changes for the next
uniquekey patch.

Best regards,
Jesper


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-16 16:53:53
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:
> The UniqueKeys idea is quite separate from pathkeys. Currently, a
> Path can have a List of PathKeys which define the order that the
> tuples will be read from the Plan node that's created from that Path.
> The idea with UniqueKeys is that a Path can also have a non-empty List
> of UniqueKeys to define that there will be no more than 1 row with the
> same value for the Paths UniqueKey column/exprs.
>
> As of now, if you look at standard_qp_callback() sets
> root->query_pathkeys, the idea here would be that the callback would
> also set a new List field named "query_uniquekeys" based on the
> group_pathkeys when non-empty and !root->query->hasAggs, or by using
> the distinct clause if it's non-empty. Then in build_index_paths()
> around the call to match_pathkeys_to_index() we'll probably also want
> to check if the index can support UniqueKeys that would suit the
> query_uniquekeys that were set during standard_qp_callback().
>
> As for setting query_uniquekeys in standard_qp_callback(), this should
> be simply a matter of looping over either group_pathkeys or
> distinct_pathkeys and grabbing the pk_eclass from each key and making
> a canonical UniqueKey from that. To have these canonical you'll need
> to have a new field in PlannerInfo named canon_uniquekeys which will
> do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
> need an equivalent of make_canonical_pathkey() in uniquekey.c
>
> With this implementation, the code that the patch adds in
> create_distinct_paths() can mostly disappear. You'd only need to look
> for a path that uniquekeys_contained_in() matches the
> root->query_uniquekeys and then just leave it to
> set_cheapest(distinct_rel); to pick the cheapest path.
>
> It would be wasted effort to create paths with UniqueKeys if there's
> multiple non-dead base rels in the query as the final rel in
> create_distinct_paths would be a join rel, so it might be worth
> checking that before creating such index paths in build_index_paths().
> However, down the line, we could carry the uniquekeys forward into the
> join paths if the join does not duplicate rows from the other side of
> the join... That's future stuff though, not for this patch, I don't
> think.
>
> I think a UniqueKey can just be a struct similar to PathKey, e.g be
> located in pathnodes.h around where PathKey is defined. Likely we'll
> need a uniquekeys.c file that has the equivalent to
> pathkeys_contained_in() ... uniquekeys_contained_in(), which return
> true if arg1 is a superset of arg2 rather than check for one set being
> a prefix of another. As you mention above: UniqueKeys { x, y } ==
> UniqueKeys { y, x }. That superset check could perhaps be optimized
> by sorting UniqueKey lists in memory address order, that'll save
> having a nested loop, but likely that's not going to be required for a
> first cut version. This would work since you'd want UniqueKeys to be
> canonical the same as PathKeys are (Notice that compare_pathkeys()
> only checks memory addresses of pathkeys and not equals()).
>
> I think the UniqueKey struct would only need to contain an
> EquivalenceClass field. I think all the other stuff that's in PathKey
> is irrelevant to UniqueKey.
>

Here is a patch more in that direction.

Some questions:

1) Do we really need the UniqueKey struct ? If it only contains the
EquivalenceClass field then we could just have a list of those instead.
That would make the patch simpler.

2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
the patch only uses canon_uniquekeys because we attach the list
directly on the Path node.

I'll clean the patch up based on your feedback, and then start to rebase
v21 on it.

Thanks in advance !

Best regards,
Jesper

Attachment Content-Type Size
v2_uniquekey.txt text/plain 19.8 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-20 18:30:42
Message-ID: CA+q6zcWvEr1EphXnBXuViGYeqrByyn0BoC+GcaeiTm1F-DxgXg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> I verified that the backwards index scan is indeed functioning now. However,
> I'm afraid it's not that simple, as I think the cursor case is broken now. I
> think having just the 'scan direction' in the btree code is not enough to get
> this working properly, because we need to know whether we want the minimum or
> maximum element of a certain prefix. There are basically four cases:
>
> - Forward Index Scan + Forward cursor: we want the minimum element within a
> prefix and we want to skip 'forward' to the next prefix
>
> - Forward Index Scan + Backward cursor: we want the minimum element within a
> prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Forward cursor: we want the maximum element within a
> prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Backward cursor: we want the maximum element within a
> prefix and we want to skip 'forward' to the next prefix
>
> These cases make it rather complicated unfortunately. They do somewhat tie in
> with the previous discussion on this thread about being able to skip to the
> min or max value. If we ever want to support a sort of minmax scan, we'll
> encounter the same issues.

Yes, these four cases are indeed a very good point. I've prepared a new version
of the patch that addresses them, plus an index condition and the handling of
situations where it eliminates one or more unique elements. It seems to fix the
issues and also works for those hypothetical examples you've mentioned above,
but of course it looks pretty complicated and I need to polish it a bit before
posting.

> On Thu, Jul 11, 2019 at 12:13 PM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> > SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
> > Data (imagine every tuple here actually occurs 10.000 times in the index to
> > see the benefit of skipping):
> > 1,1,1
> > 1,1,2
> > 1,2,2
> > 1,2,3
> > 2,2,1
> > 2,2,3
> > 3,1,1
> > 3,1,2
> > 3,2,2
> > 3,2,3
> >
> > Creating a cursor on this query and then moving forward, you should get
> > (1,1,2), (3,1,2). In the current implementation of the patch, after
> > bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves
> > forward one-by-one until it finds a match. This match only comes at (3,1,2)
> > however. Then it skips to the end.
> >
> > If you move the cursor backwards from the end of the cursor, you should
> > still get (3,1,2) (1,1,2). A possible implementation would start at the end
> > and do a skip to the beginning of the prefix: (3,1,1). Then it needs to
> > move forward one-by-one in order to find the first matching (minimum) item
> > (3,1,2). When it finds it, it needs to skip backwards to the beginning of
> > prefix 2 (2,2,1). It needs to move forwards to find the minimum element,
> > but should stop as soon as it detects that the prefix doesn't match anymore
> > (because there is no match for prefix 2, it will move all the way from
> > (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of
> > prefix 1: (1,1,1) and scan forward to find (1,1,2).
> > Perhaps anyone can think of an easier way to implement it?
>
> One option is just don't implement it and just change
> ExecSupportsBackwardScan() so that it returns false for skip index
> scans, or perhaps at least implement an index am method to allow the
> planner to be able to determine if the index implementation supports
> it... amcanskipbackward

Yep, it was discussed a few times in this thread, and after we discovered
(thanks to Floris) so many issues I was also one step away from implementing
this idea. But at the same time, as Thomas correctly noticed, our implementation
needs to be extensible to handle future use cases, and this particular cursor
juggling already seems like a pretty good example of such a "future use case". So
I hope that by dealing with it we can also figure out what needs to be extensible.

> On Tue, Jul 16, 2019 at 6:53 PM Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> Here is a patch more in that direction.
>
> Some questions:
>
> 1) Do we really need the UniqueKey struct ? If it only contains the
> EquivalenceClass field then we could just have a list of those instead.
> That would make the patch simpler.
>
> 2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
> the patch only uses canon_uniquekeys because the we attach the list
> directly on the Path node.
>
> I'll clean the patch up based on your feedback, and then start to rebase
> v21 on it.

Thanks! I'll also take a look as soon as I'm finished with the last updates.


From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-22 05:44:54
Message-ID: CAKJS1f8-Qk254gn8exps7qD5oDNLyOO6LCiDqsWNufmi3w-atg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 17 Jul 2019 at 04:53, Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:
> Here is a patch more in that direction.

Thanks. I've just had a look over this and it's roughly what I have in mind.

Here are the comments I noted down during the review:

cost_index:

I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.

For example, in a query such as:

SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;

you only need to perform estimate_num_groups() on "y".

I'm really not quite sure what exactly will be required from
amcostestimate() here. The cost of a skip scan is not the same as that
of a normal scan, so either that API needs to be adjusted to allow the
caller to say that we want skip scans estimated, or there needs to be
another callback.
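
A rough sketch of the kind of adjustment described above might look like this
(estimate_skip_groups() is an invented name, and UniqueKey/uk_eclass refer to
the proposed struct above, not core code):

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "utils/selfuncs.h"

/* rough sketch: number of skip-scan "groups", ignoring const-pinned ECs */
static double
estimate_skip_groups(PlannerInfo *root, List *uniquekeys, double input_rows)
{
    List       *exprs = NIL;
    ListCell   *lc;

    foreach(lc, uniquekeys)
    {
        UniqueKey  *ukey = (UniqueKey *) lfirst(lc);    /* proposed struct */
        EquivalenceClass *ec = ukey->uk_eclass;
        EquivalenceMember *em;

        if (ec->ec_has_const)
            continue;               /* e.g. "x = 1": at most one value */

        em = (EquivalenceMember *) linitial(ec->ec_members);
        exprs = lappend(exprs, em->em_expr);
    }

    if (exprs == NIL)
        return 1.0;                 /* only const keys, so a single group */

    return estimate_num_groups(root, exprs, input_rows, NULL);
}

tuples_fetched could then be capped by this estimate for skip scan paths.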

build_index_paths:

I don't quite see where you're checking if the query's unique_keys
match what unique keys can be produced by the index. This is done for
pathkeys with:

    pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
                                !found_lower_saop_clause &&
                                has_useful_pathkeys(root, rel));
    index_is_ordered = (index->sortopfamily != NULL);
    if (index_is_ordered && pathkeys_possibly_useful)
    {
        index_pathkeys = build_index_pathkeys(root, index,
                                              ForwardScanDirection);
        useful_pathkeys = truncate_useless_pathkeys(root, rel,
                                                    index_pathkeys);
        orderbyclauses = NIL;
        orderbyclausecols = NIL;
    }

Here has_useful_pathkeys() checks if the query requires some ordering.
For unique keys you'll want to do the same. You'll have set the
query's unique key requirements in standard_qp_callback().
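
In other words, something along these lines (has_useful_uniquekeys() and
query_uniquekeys are only illustrative names for the proposed pieces):

static bool
has_useful_uniquekeys(PlannerInfo *root)
{
    /* unique-key index paths only pay off when the query asked for them */
    return root->query_uniquekeys != NIL;   /* set in standard_qp_callback() */
}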

I think basically build_index_paths() should be building index paths
with unique keys, for all indexes that can support the query's unique
keys. I'm just a bit uncertain if we need to create both a normal
index path and another path for the same index with unique keys.
Perhaps we can figure that out down the line somewhere. Maybe it's
best to build both path types for now, when possible, and we can consider
later if we can skip the non-uniquekey paths. Likely that would
require a big XXX comment to explain we need to review that before the
code makes it into core.

get_uniquekeys_for_index:

I think this needs to follow the lead of build_index_pathkeys more
closely. Basically, ask the index what its pathkeys are.

standard_qp_callback:

build_group_uniquekeys & build_distinct_uniquekeys could likely be one
function that takes a list of SortGroupClause. You just either pass
the groupClause or distinctClause in. Pretty much the UniqueKey
version of make_pathkeys_for_sortclauses().

> Some questions:
>
> 1) Do we really need the UniqueKey struct ? If it only contains the
> EquivalenceClass field then we could just have a list of those instead.
> That would make the patch simpler.

I dunno about that. I understand it looks a bit pointless due to just
having one field, but perhaps we can worry about that later. If we
choose to ditch it and replace it with just an EquivalenceClass then
we can do that later.

> 2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
> the patch only uses canon_uniquekeys because the we attach the list
> directly on the Path node.

canon_uniquekeys should store at most one UniqueKey per
EquivalenceClass. The reason for this is for fast comparison. We can
compare memory addresses rather than checking individual fields are
equal. Now... yeah it's true that there is only one field so far and
we could just check the pointers are equal on the EquivalenceClasses,
but I think maybe this is in the same boat as #1. Let's do it for now
so we're sticking as close to the guidelines laid out by PathKeys and
once it's all working and plugged into skip scans then we can decide
if it needs a simplification pass over the code.

> I'll clean the patch up based on your feedback, and then start to rebase
> v21 on it.

Cool.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-22 17:10:47
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 7/22/19 1:44 AM, David Rowley wrote:
> Here are the comments I noted down during the review:
>
> cost_index:
>
> I know you've not finished here, but I think it'll need to adjust
> tuples_fetched somehow to account for estimate_num_groups() on the
> Path's unique keys. Any Eclass with an ec_has_const = true does not
> need to be part of the estimate there as there can only be at most one
> value for these.
>
> For example, in a query such as:
>
> SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
>
> you only need to perform estimate_num_groups() on "y".
>
> I'm really not quite sure on what exactly will be required from
> amcostestimate() here. The cost of the skip scan is not the same as
> the normal scan. So other that API needs adjusted to allow the caller
> to mention that we want skip scans estimated, or there needs to be
> another callback.
>

I think this part will become clearer once the index skip scan patch
is rebased, as we'll have the uniquekeys field on the Path, and the
indexskipprefix info on the IndexPath node.

> build_index_paths:
>
> I don't quite see where you're checking if the query's unique_keys
> match what unique keys can be produced by the index. This is done for
> pathkeys with:
>
> pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
> !found_lower_saop_clause &&
> has_useful_pathkeys(root, rel));
> index_is_ordered = (index->sortopfamily != NULL);
> if (index_is_ordered && pathkeys_possibly_useful)
> {
> index_pathkeys = build_index_pathkeys(root, index,
> ForwardScanDirection);
> useful_pathkeys = truncate_useless_pathkeys(root, rel,
> index_pathkeys);
> orderbyclauses = NIL;
> orderbyclausecols = NIL;
> }
>
> Here has_useful_pathkeys() checks if the query requires some ordering.
> For unique keys you'll want to do the same. You'll have set the
> query's unique key requirements in standard_qp_callback().
>
> I think basically build_index_paths() should be building index paths
> with unique keys, for all indexes that can support the query's unique
> keys. I'm just a bit uncertain if we need to create both a normal
> index path and another path for the same index with unique keys.
> Perhaps we can figure that out down the line somewhere. Maybe it's
> best to build path types for now, when possible, and we can consider
> later if we can skip the non-uniquekey paths. Likely that would
> require a big XXX comment to explain we need to review that before the
> code makes it into core.
>
> get_uniquekeys_for_index:
>
> I think this needs to follow more the lead from build_index_pathkeys.
> Basically, ask the index what it's pathkeys are.
>
> standard_qp_callback:
>
> build_group_uniquekeys & build_distinct_uniquekeys could likely be one
> function that takes a list of SortGroupClause. You just either pass
> the groupClause or distinctClause in. Pretty much the UniqueKey
> version of make_pathkeys_for_sortclauses().
>
Yeah, I'll move this part of the index skip scan patch to the unique key
patch.

Thanks for your feedback !

Best regards,
Jesper


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-24 20:49:32
Message-ID: CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>
> On 7/22/19 1:44 AM, David Rowley wrote:
> > Here are the comments I noted down during the review:
> >
> > cost_index:
> >
> > I know you've not finished here, but I think it'll need to adjust
> > tuples_fetched somehow to account for estimate_num_groups() on the
> > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > need to be part of the estimate there as there can only be at most one
> > value for these.
> >
> > For example, in a query such as:
> >
> > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> >
> > you only need to perform estimate_num_groups() on "y".
> >
> > I'm really not quite sure on what exactly will be required from
> > amcostestimate() here. The cost of the skip scan is not the same as
> > the normal scan. So other that API needs adjusted to allow the caller
> > to mention that we want skip scans estimated, or there needs to be
> > another callback.
> >
>
> I think this part will become more clear once the index skip scan patch
> is rebased, as we got the uniquekeys field on the Path, and the
> indexskipprefixy info on the IndexPath node.

Here is what I came up with to address the problems mentioned above in this
thread. It passes the tests, but I haven't tested it more thoroughly yet (e.g. it
occurred to me that `_bt_read_closest` probably wouldn't work if the next key
that passes the index condition is a few pages away - I'll try to tackle it soon).
Just another small step forward, but I hope it's enough to rebase the planner
changes on top of it.

Also I've added a few tags, mostly to credit the reviewers' contributions.

Attachment Content-Type Size
v22-0001-Index-skip-scan.patch application/octet-stream 81.6 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: 9erthalion6(at)gmail(dot)com
Cc: jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, florisvannee(at)optiver(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2019-07-25 11:17:37
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw(at)mail(dot)gmail(dot)com>
> > On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
> >
> > On 7/22/19 1:44 AM, David Rowley wrote:
> > > Here are the comments I noted down during the review:
> > >
> > > cost_index:
> > >
> > > I know you've not finished here, but I think it'll need to adjust
> > > tuples_fetched somehow to account for estimate_num_groups() on the
> > > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > > need to be part of the estimate there as there can only be at most one
> > > value for these.
> > >
> > > For example, in a query such as:
> > >
> > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> > >
> > > you only need to perform estimate_num_groups() on "y".
> > >
> > > I'm really not quite sure on what exactly will be required from
> > > amcostestimate() here. The cost of the skip scan is not the same as
> > > the normal scan. So other that API needs adjusted to allow the caller
> > > to mention that we want skip scans estimated, or there needs to be
> > > another callback.
> > >
> >
> > I think this part will become more clear once the index skip scan patch
> > is rebased, as we got the uniquekeys field on the Path, and the
> > indexskipprefixy info on the IndexPath node.
>
> Here is what I came up with to address the problems, mentioned above in this
> thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. it
> occurred to me, that `_bt_read_closest` probably wouldn't work, if a next key,
> that passes an index condition is few pages away - I'll try to tackle it soon).
> Just another small step forward, but I hope it's enough to rebase on top of it
> planner changes.
>
> Also I've added few tags, mostly to mention reviewers contribution.

I have some comments.

+ * The order of columns in the index should be the same, as for
+ * unique distincs pathkeys, otherwise we cannot use _bt_search
+ * in the skip implementation - this can lead to a missing
+ * records.

It seems that it is enough for the distinct pathkeys to be contained in
the index pathkeys. If that's right, this is almost checked by the
existing code:

> if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))

That is exactly what we need when needed_pathkeys is distinct_pathkeys,
so an additional check is required if and only if that is not the case.

> if (enable_indexskipscan &&
> IsA(path, IndexPath) &&
> ((IndexPath *) path)->indexinfo->amcanskip &&
> (path->pathtype == T_IndexOnlyScan ||
> path->pathtype == T_IndexScan) &&
> (needed_pathkeys == root->distinct_pathkeys ||
> pathkeys_contained_in(root->distinct_pathkeys,
> path->pathkeys)))

path->pathtype is always one of T_IndexOnlyScan or T_IndexScan, so
the check against them isn't needed. If you have concerns about that,
we can add it as an Assert().

I feel uncomfortable looking into indexinfo there. Couldn't we
use indexskipprefix == -1 to signal !amcanskip from
create_index_path?

+ /*
+ * XXX: In case of index scan quals evaluation happens after
+ * ExecScanFetch, which means skip results could be fitered out
+ */

Why can't we use a skip scan path when there is a filter condition? If
something bad happens, the comment should explain the reason rather than
just what we do.

+ if (path->pathtype == T_IndexScan &&
+ parse->jointree != NULL &&
+ parse->jointree->quals != NULL &&
+ ((List *)parse->jointree->quals)->length != 0)

It's better to use list_length instead of peeping inside. It
handles the NULL case as well. (The structure has recently
changed, though .length has not.)

+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.

It doesn't seem needed to me. Could you elaborate on the reason
for that?

+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+ !node->ioss_FirstTupleEmitted)

I'm confused by this. "direction" there is the physical scan
direction (fwd/bwd) of index scan, which is already compensated
by indexorderdir. Thus the condition means we do that when
logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
"index direction" exactly means...)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: 9erthalion6(at)gmail(dot)com
Cc: jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, florisvannee(at)optiver(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2019-07-25 11:29:50
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry, part of my previous message was too hard to read.

At Thu, 25 Jul 2019 20:17:37 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190725(dot)201737(dot)192223037(dot)horikyota(dot)ntt(at)gmail(dot)com>
> Hello.
>
> At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw(at)mail(dot)gmail(dot)com>
> > > On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> wrote:
> > >
> > > On 7/22/19 1:44 AM, David Rowley wrote:
> > > > Here are the comments I noted down during the review:
> > > >
> > > > cost_index:
> > > >
> > > > I know you've not finished here, but I think it'll need to adjust
> > > > tuples_fetched somehow to account for estimate_num_groups() on the
> > > > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > > > need to be part of the estimate there as there can only be at most one
> > > > value for these.
> > > >
> > > > For example, in a query such as:
> > > >
> > > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> > > >
> > > > you only need to perform estimate_num_groups() on "y".
> > > >
> > > > I'm really not quite sure on what exactly will be required from
> > > > amcostestimate() here. The cost of the skip scan is not the same as
> > > > the normal scan. So other that API needs adjusted to allow the caller
> > > > to mention that we want skip scans estimated, or there needs to be
> > > > another callback.
> > > >
> > >
> > > I think this part will become more clear once the index skip scan patch
> > > is rebased, as we got the uniquekeys field on the Path, and the
> > > indexskipprefixy info on the IndexPath node.
> >
> > Here is what I came up with to address the problems, mentioned above in this
> > thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. it
> > occurred to me, that `_bt_read_closest` probably wouldn't work, if a next key,
> > that passes an index condition is few pages away - I'll try to tackle it soon).
> > Just another small step forward, but I hope it's enough to rebase on top of it
> > planner changes.
> >
> > Also I've added few tags, mostly to mention reviewers contribution.
>
> I have some comments.
>
> + * The order of columns in the index should be the same, as for
> + * unique distincs pathkeys, otherwise we cannot use _bt_search
> + * in the skip implementation - this can lead to a missing
> + * records.
>
> It seems that it is enough that distinct pathkeys is contained in
> index pathkeys. If it's right, that is almost checked in existing
> code:
>
> > if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
>
> It is perfect when needed_pathkeys is distinct_pathkeys. So
> additional check is required if and only if it is not the case.

So I think the following change will work.

- + /* Consider index skip scan as well */
- + if (enable_indexskipscan &&
- + IsA(path, IndexPath) &&
- + ((IndexPath *) path)->indexinfo->amcanskip &&
- + (path->pathtype == T_IndexOnlyScan ||
- + path->pathtype == T_IndexScan) &&
- + root->distinct_pathkeys != NIL)

+ + if (enable_indexskipscan &&
+ + IsA(path, IndexPath) &&
+ + ((IndexPath *) path)->indexskipprefix >= 0 &&
+ + (needed_pathkeys == root->distinct_pathkeys ||
+ + pathkeys_contained_in(root->distinct_pathkeys,
+ + path->pathkeys)))

Additional comments on the condition above are:

> path->pathtype is always one of T_IndexOnlyScan or T_IndexScan so
> the check against them isn't needed. If you have concern on that,
> we can add that as Assert().
>
> I feel uncomfortable to look into indexinfo there. Couldnd't we
> use indexskipprefix == -1 to signal !amcanskip from
> create_index_path?
>
>
> + /*
> + * XXX: In case of index scan quals evaluation happens after
> + * ExecScanFetch, which means skip results could be fitered out
> + */
>
> Why can't we use skipscan path if having filter condition? If
> something bad happens, the reason must be written here instead of
> what we do.
>
>
> + if (path->pathtype == T_IndexScan &&
> + parse->jointree != NULL &&
> + parse->jointree->quals != NULL &&
> + ((List *)parse->jointree->quals)->length != 0)
>
> It's better to use list_length instead of peeping inside. It
> handles the NULL case as well. (The structure has recently
> changed but .length is not, though.)
>
>
> + * If advancing direction is different from index direction, we must
> + * skip right away, but _bt_skip requires a starting point.
>
> It doesn't seem needed to me. Could you elaborate on the reason
> for that?
>
>
> + * If advancing direction is different from index direction, we must
> + * skip right away, but _bt_skip requires a starting point.
> + */
> + if (direction * indexonlyscan->indexorderdir < 0 &&
> + !node->ioss_FirstTupleEmitted)
>
> I'm confused by this. "direction" there is the physical scan
> direction (fwd/bwd) of index scan, which is already compensated
> by indexorderdir. Thus the condition means we do that when
> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
> "index direction" exactly means...)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-28 19:17:14
Message-ID: CA+q6zcWC_TLO8fj_ad4pADLpr7UFjzRfXcMJ8gQsWtvT1Ky+9w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> I have some comments.

Thank you for the review!

> + * The order of columns in the index should be the same as for
> + * unique distinct pathkeys, otherwise we cannot use _bt_search
> + * in the skip implementation - this can lead to missing
> + * records.
>
> It seems that it is enough that the distinct pathkeys are contained in
> the index pathkeys. If that's right, this is almost checked in existing
> code:

Looks like you're right. After looking closely, there seems to be an issue in
the original implementation, where we use the wrong prefix_size in such cases.
Without this problem, this condition is indeed enough.

> > if (enable_indexskipscan &&
> > IsA(path, IndexPath) &&
> > ((IndexPath *) path)->indexinfo->amcanskip &&
> > (path->pathtype == T_IndexOnlyScan ||
> > path->pathtype == T_IndexScan) &&
> > (needed_pathkeys == root->distinct_pathkeys ||
> > pathkeys_contained_in(root->distinct_pathkeys,
> > path->pathkeys)))
>
> path->pathtype is always one of T_IndexOnlyScan or T_IndexScan so
> the check against them isn't needed. If you have concern on that,
> we can add that as Assert().
>
> + if (path->pathtype == T_IndexScan &&
> + parse->jointree != NULL &&
> + parse->jointree->quals != NULL &&
> + ((List *)parse->jointree->quals)->length != 0)
>
> It's better to use list_length instead of peeping inside. It
> handles the NULL case as well. (The List structure has recently
> changed, but .length has not.)

Yeah, will change both (hopefully soon)
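
For the latter, roughly the shape I have in mind (an untested sketch of the
suggested change; list_length() treats NIL as zero, so the explicit NULL
check on quals becomes unnecessary):

if (path->pathtype == T_IndexScan &&
    parse->jointree != NULL &&
    list_length((List *) parse->jointree->quals) != 0)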

> + /*
> + * XXX: In case of index scan quals evaluation happens after
> + * ExecScanFetch, which means skip results could be filtered out
> + */
>
> Why can't we use skipscan path if having filter condition? If
> something bad happens, the reason must be written here instead of
> what we do.

Sorry, it looks like I failed to express this clearly in the comment. The
point is that when an index scan (as opposed to an index only scan) has some
conditions, their evaluation happens after skipping, and I don't see any way
to apply the skip correctly that is not too invasive.

>
> + * If advancing direction is different from index direction, we must
> + * skip right away, but _bt_skip requires a starting point.
>
> It doesn't seem needed to me. Could you elaborate on the reason
> for that?

This is needed for e.g. scan with a cursor backward without an index condition.
E.g. if we have:

1 1 2 2 3 3
1 2 3 4 5 6

and do

DECLARE c SCROLL CURSOR FOR
SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b;

FETCH ALL FROM c;

we should get

1 2 3
1 3 5

When afterwards we do

FETCH BACKWARD ALL FROM c;

we should get

3 2 1
5 3 1

If we use _bt_next the first time without _bt_skip, the first pair would be
3 6 (the first from the end of the tuples, not from the end of the cursor).

> + * If advancing direction is different from index direction, we must
> + * skip right away, but _bt_skip requires a starting point.
> + */
> + if (direction * indexonlyscan->indexorderdir < 0 &&
> + !node->ioss_FirstTupleEmitted)
>
> I'm confused by this. "direction" there is the physical scan
> direction (fwd/bwd) of index scan, which is already compensated
> by indexorderdir. Thus the condition means we do that when
> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
> "index direction" exactly means...)

I'm not sure I follow, what do you mean by compensated? In general you're
right, as David Rowley mentioned above, indexorderdir is the general scan
direction, and direction is the flipped estate->es_direction, which is the
cursor direction. The goal of this condition is to catch when those two are
different, and we need to advance and read in different directions.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-07-29 08:31:20
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

On 2019/07/29 4:17, Dmitry Dolgov wrote:
>> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> Yeah, will change both (hopefully soon)

Thanks.

>> + /*
>> + * XXX: In case of index scan quals evaluation happens after
>> + * ExecScanFetch, which means skip results could be filtered out
>> + */
>>
>> Why can't we use skipscan path if having filter condition? If
>> something bad happens, the reason must be written here instead of
>> what we do.
>
> Sorry, it looks like I failed to express this clearly in the
> comment. The point is that when an index scan (as opposed to an
> index only scan) has some conditions, their evaluation happens
> after skipping, and I don't see any way to apply the skip
> correctly that is not too invasive.

Yeah, your explanation was perfect for me. What I failed to
understand was what is expected to be done in the case. I
reconsidered and understood that:

For example, for the following query:

select distinct on (a, b) a, b, c from t where c < 100;

skip scan returns one tuple for each distinct set of (a, b) with an
arbitrary value of c. If the chosen c doesn't match the qual while
some other c does, we miss that tuple.

If this is correct, an explanation like the above might help.

>> + * If advancing direction is different from index direction, we must
>> + * skip right away, but _bt_skip requires a starting point.
>>
>> It doesn't seem needed to me. Could you elaborate on the reason
>> for that?
>
> This is needed for e.g. scan with a cursor backward without an index condition.
> E.g. if we have:
>
> 1 1 2 2 3 3
> 1 2 3 4 5 6
>
> and do
>
> DECLARE c SCROLL CURSOR FOR
> SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b;
>
> FETCH ALL FROM c;
>
> we should get
>
> 1 2 3
> 1 3 5
>
> When afterwards we do
>
> FETCH BACKWARD ALL FROM c;
>
> we should get
>
> 3 2 1
> 5 3 1
>
>
> If we use _bt_next the first time without _bt_skip, the first pair would be
> 3 6 (the first from the end of the tuples, not from the end of the cursor).

Thanks for the explanation. Sorry, I somehow thought that was
right. You're right.

>> + * If advancing direction is different from index direction, we must
>> + * skip right away, but _bt_skip requires a starting point.
>> + */
>> + if (direction * indexonlyscan->indexorderdir < 0 &&
>> + !node->ioss_FirstTupleEmitted)
>>
>> I'm confused by this. "direction" there is the physical scan
>> direction (fwd/bwd) of index scan, which is already compensated
>> by indexorderdir. Thus the condition means we do that when
>> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
>> "index direction" exactly means...)
>
> I'm not sure I follow, what do you mean by compensated? In general you're

I meant that the "direction" is already changed to physical order
at the point.

> right, as David Rowley mentioned above, indexorderdir is the general scan
> direction, and direction is the flipped estate->es_direction, which is the
> cursor direction. The goal of this condition is to catch when those two are
> different, and we need to advance and read in different directions.

Mmm. Sorry and thank you for the explanation. I was
stupid. You're right. I perhaps mistook indexorderdir's
meaning. Maybe something like the following will work *for me*:p

| When we are fetching a cursor in backward direction, return the
| tuples that forward fetching should have returned. In other
| words, we return the last scanned tuple in a DISTINCT set. Skip
| to that tuple before returning the first tuple.

# Of course, I need someone to correct this!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-08-02 12:14:06
Message-ID: CA+q6zcX0T3jhR=phrNxrT-iyMceJ3tLJimARmWFU8s8f_iK7Ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> I feel uncomfortable looking into indexinfo there. Couldn't we
> use indexskipprefix == -1 to signal !amcanskip from
> create_index_path?

Looks like it's not that straightforward to do this only in create_index_path,
since to make this decision we need to have both parts, indexinfo and distinct
keys.

> Yeah, your explanation was perfect for me. What I failed to
> understand was what is expected to be done in the case. I
> reconsidered and understood that:
>
> For example, for the following query:
>
> select distinct on (a, b) a, b, c from t where c < 100;
>
> skip scan returns one tuple for each distinct set of (a, b) with an
> arbitrary value of c. If the chosen c doesn't match the qual while
> some other c does, we miss that tuple.
>
> If this is correct, an explanation like the above might help.

Yes, that's correct, I've added this to the comments.

> Maybe something like the following will work *for me*:p
>
> | When we are fetching a cursor in backward direction, return the
> | tuples that forward fetching should have returned. In other
> | words, we return the last scanned tuple in a DISTINCT set. Skip
> | to that tuple before returning the first tuple.

And this too (slightly rewritten). We will soon post the new version of the
patch with updates about UniqueKey from Jesper.


From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-08-02 16:03:09
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 8/2/19 8:14 AM, Dmitry Dolgov wrote:
> And this too (slightly rewritten). We will soon post the new version of the
> patch with updates about UniqueKey from Jesper.
>

Yes.

We decided to send this now, although there is still feedback from David
that needs to be considered/acted on.

The patches can be reviewed independently, but we will send them as a
set from now on. Development of UniqueKey will be kept separate though [1].

Note that while UniqueKey can form the foundation of optimizations for
GROUP BY queries, it isn't the focus of this patch series. Contributions
are very welcome, of course.

[1] https://github.com/jesperpedersen/postgres/tree/uniquekey

Best regards,
Jesper

Attachment Content-Type Size
v23_0001-Unique-key.patch text/x-patch 25.7 KB
v23_0002-Index-skip-scan.patch text/x-patch 76.1 KB

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "James Coleman" <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-08-05 10:05:39
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.

In my previous review I mentioned that queries of the form:

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c

/* Consider index skip scan as well */
if (enable_indexskipscan &&
IsA(path, IndexPath) &&
((IndexPath *) path)->indexinfo->amcanskip &&
root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will scan all entries with a=1, even though we're really only interested in the first one. In previous versions, this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL to consider skip scans also for this query?

My second point is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such an approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function which reads everything, although this is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple.

E.g. it looks like there are some mixups between index offsets and heap TIDs in _bt_read_closest:
/*
* Save the current item and the previous, even if the
* latter does not pass scan key conditions
*/
ItemPointerData tid = prevItup->t_tid;
OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

_bt_saveitem(so, itemIndex, prevOffnum, prevItup);
itemIndex++;

_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;

The 'prevOffnum' is the offset number of the heap TID, not the offset of the tuple within the index page, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently the patch just reads the closest tuples and then doesn't consider this page at all anymore if the first tuple skipped to turns out not to be visible. Consider the following SQL to illustrate:

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a,b;

a | b
---+---
1 | 1
2 | 1
3 | 1
4 | 1
5 | 1
(5 rows)

delete from a where a=2 and b=1;
DELETE 1

select distinct on (a) a,b from a order by a,b;

a | b
---+-----
1 | 1
2 | 249 ->> this should be b=2, because we deleted a=2 && b=1. however, it doesn't consider any tuples from that page anymore and gives us the first tuple from the next page.
3 | 1
4 | 1
5 | 1
(5 rows)

-Floris

[1] https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-08-05 20:38:52
Message-ID: CA+q6zcWAKuwSGXh2mGJpeZ1d2Fu+Av13UYvFScs9mMcCCcrBPg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Mon, Aug 5, 2019 at 12:05 PM Floris Van Nee <florisvannee(at)optiver(dot)com>
> wrote:
>
> The root->distinct_pathkeys is already filtered for redundant keys, so column
> 'a' is not in there anymore. Still, it'd be much faster to use the skip scan
> here, because a regular scan will scan all entries with a=1, even though
> we're really only interested in the first one. In previous versions, this
> would be fixed by changing the check in planner.c to use
> root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things
> change a bit now that the patch is rebased on the unique-keys patch. Would it
> be valid to change this check to root->query_uniquekeys != NIL to consider
> skip scans also for this query?

[including a commentary from Jesper]
On Mon, Aug 5, 2019 at 6:55 PM Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:

Yes, the check should be for that. However, the query in question
doesn't have any query_pathkeys, and hence no query_uniquekeys in
standard_qp_callback(), so it isn't supported.

Your scenario is covered by one of the test cases in case the
functionality is supported. But, I think that is outside the scope of
the current patch.

> However, currently, the patch just reads the closest and then doesn't
> consider this page at all anymore, if the first tuple skipped to turns out to
> be not visible. Consider the following sql to illustrate:

For the record, the purpose of `_bt_read_closest` is not so much to reduce the
amount of data we read from a page, but more to correctly handle those
situations we were discussing before with reading forward/backward in cursors,
since for that in some cases we need to remember previous values for stepping
to the next. I've limited the number of items fetched in this function just
because I was misled by having a check for dead tuples in `_bt_skip`. Of course
we can modify it to read a whole page and leave the visibility check to the
code after `index_getnext_tid` (although if we know that all tuples on this
page are visible I guess it's not strictly necessary, but we still get an
improvement from the skipping itself).


From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-08-06 08:37:55
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Yes, the check should be for that. However, the query in question
> doesn't have any query_pathkeys, and hence no query_uniquekeys in
> standard_qp_callback(), so it isn't supported

> Your scenario is covered by one of the test cases in case the
> functionality is supported. But, I think that is outside the scope of
> the current patch.

Ah alright, thanks. That makes it clear why it doesn't work.
From a user point of view I think it's rather strange that
SELECT DISTINCT ON (a) a,b FROM a WHERE a BETWEEN 2 AND 2
would give a fast skip scan, even though the more likely query that someone would write
SELECT DISTINCT ON (a) a,b FROM a WHERE a=2
would not.
It is something we could leave up to the next patch though.

Something else I noticed, which I'm just writing here for awareness; I don't think it's that pressing at the moment and can be left to another patch. When there are multiple indexes on a table, the planner gets confused and doesn't select an index-only skip scan even though it could. I'm guessing it just takes the first available index based on the DISTINCT clause and then doesn't look further, e.g.
With an index on (a,b) and (a,c,b):
postgres=# explain select distinct on (a) a,c,b FROM a;
QUERY PLAN
--------------------------------------------------------------------
Index Scan using a_a_b_idx on a (cost=0.29..1.45 rows=5 width=12)
Skip scan mode: true
(2 rows)
-> This could be an index only scan with the (a,c,b) index.

> For the record, the purpose of `_bt_read_closest` is not so much to reduce the
> amount of data we read from a page, but more to correctly handle those
> situations we were discussing before with reading forward/backward in cursors,
> since for that in some cases we need to remember previous values for stepping
> to the next. I've limited the number of items fetched in this function just
> because I was misled by having a check for dead tuples in `_bt_skip`. Of course
> we can modify it to read a whole page and leave the visibility check to the
> code after `index_getnext_tid` (although if we know that all tuples on this
> page are visible I guess it's not strictly necessary, but we still get an
> improvement from the skipping itself).

I understand and I agree - the primary purpose of choosing this function was to make it work correctly. I don't think using the optimization of partially reading a page would be something for this patch. My point was, however, that if this optimization were allowed in a future patch, it would have great performance benefits.
To fix the current patch, we'd indeed need to read the full page. It'd be good to take a close look at the implementation of this function then, because messing around with the previous/next is also not trivial. I think the current implementation also has a problem when the item that is skipped to is the first item on the page. E.g. (this depends on page size):

postgres=# drop table if exists b; create table b as select a,b from generate_series(1,5) a, generate_series(1,366) b; create index on b (a,b); analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# select distinct on(a) a,b from b;
a | b
---+---
1 | 1
2 | 2 <-- (2,1) is the first item on the page and doesn't get selected by read_closest function. it returns the second item on page which is (2,2)
3 | 2
4 | 2
5 | 2
(5 rows)

-Floris


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-08-28 15:53:59
Message-ID: CA+q6zcWhxVMv6aGOv+f+3b=4qQ=0Mhqk6pojT6p_BEsCtG9P-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Mon, Aug 5, 2019 at 10:38 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> Of course we can modify it to read a whole page and leave visibility check
> for the code after `index_getnext_tid` (although in case if we know that all
> tuples on this page are visilbe I guess it's not strictly necessary, but we
> still get improvement from skipping itself).

Sorry for the long delay. Here is more or less what I had in mind. After
changing read_closest to read the whole page I couldn't resist just merging it
into readpage itself, since it's basically the same. It could raise questions
about performance and how intrusive this patch is, but I hope it's not that
much of a problem (in the worst case we can split it back out). I've also added
a few tests for the issue you've mentioned. Thanks again, I appreciate how much
effort you put into reviewing!

Attachment Content-Type Size
v24-0001-Index-skip-scan.patch application/octet-stream 80.4 KB

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-08-28 19:31:51
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Sorry for the long delay. Here is more or less what I had in mind. After
> changing read_closest to read the whole page I couldn't resist just merging it
> into readpage itself, since it's basically the same. It could raise questions
> about performance and how intrusive this patch is, but I hope it's not that
> much of a problem (in the worst case we can split it back out). I've also added
> a few tests for the issue you've mentioned. Thanks again, I appreciate how much
> effort you put into reviewing!

Putting it into one function makes sense I think. Looking at the patch, I think in general there are some good improvements in there.

I'm afraid I did manage to find another incorrect query result though, having to do with the keepPrev part and skipping to the first tuple on an index page:

postgres=# drop table if exists b; create table b as select a,b::int2 b,(b%2)::int2 c from generate_series(1,5) a, generate_series(1,366) b; create index on b (a,b,c); analyze b;
DROP TABLE
SELECT 1830
CREATE INDEX
ANALYZE
postgres=# set enable_indexskipscan=1;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
a | b | c
---+---+---
1 | 2 | 0
2 | 4 | 0
3 | 4 | 0
4 | 4 | 0
5 | 4 | 0
(5 rows)

postgres=# set enable_indexskipscan=0;
SET
postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;
a | b | c
---+---+---
1 | 2 | 0
2 | 2 | 0
3 | 2 | 0
4 | 2 | 0
5 | 2 | 0
(5 rows)

-Floris


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-09-02 13:28:08
Message-ID: CA+q6zcWUs2eoYwfRj==2zjrUqoaqo57R=fQHAxAhLqu-z6P4vQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Aug 28, 2019 at 9:32 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
>
> I'm afraid I did manage to find another incorrect query result though

Yes, it's an example of what I was mentioning before, that the current modified
implementation of `_bt_readpage` wouldn't work well when going between pages.
So far it seems that the only problem we can have is when the previous and next
items are located on different pages. I've checked how this issue can be
avoided; I hope to post a new version relatively soon.


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-09-04 20:45:38
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Surely it isn't right to add members prefixed with "ioss_" to
struct IndexScanState.

I'm surprised about this "FirstTupleEmitted" business. Wouldn't it make
more sense to implement index_skip() to return the first tuple if the
scan is just starting? (I know little about executor, apologies if this
is a stupid question.)

It would be good to get more knowledgeable people to review this patch.
It's clearly something we want, yet it's been there for a very long
time.

Thanks

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-09-05 19:20:06
Message-ID: CA+q6zcXO-XzM2Be7ZX8SZf7Xr-Hw5gMRXrtiMkUoivpF8J-9DA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Mon, Sep 2, 2019 at 3:28 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Wed, Aug 28, 2019 at 9:32 PM Floris Van Nee <florisvannee(at)optiver(dot)com> wrote:
> >
> > I'm afraid I did manage to find another incorrect query result though
>
> Yes, it's an example of what I was mentioning before, that the current modified
> implementation of `_bt_readpage` wouldn't work well when going between pages.
> So far it seems that the only problem we can have is when the previous and next
> items are located on different pages. I've checked how this issue can be
> avoided; I hope to post a new version relatively soon.

Here is the version in which stepping between the pages works better. It seems
sufficient to fix the case you've mentioned before, but for that we need to
propagate the keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
can't say I like this solution. I have an idea that maybe it would be simpler
to teach the code after index_skip not to do `_bt_next` right after a skip has
happened. It should immediately eliminate several hacks from the index skip
itself, so I'll try to pursue this idea.

> On Wed, Sep 4, 2019 at 10:45 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:

Thank you for checking it out!

> Surely it isn't right to add members prefixed with "ioss_" to
> struct IndexScanState.

Yeah, sorry. I originally incorporated IndexScan support only to show that
it's possible (with some limitations), but then forgot to clean it up. Now
those fields are renamed.

> I'm surprised about this "FirstTupleEmitted" business. Wouldn't it make
> more sense to implement index_skip() to return the first tuple if the
> scan is just starting? (I know little about executor, apologies if this
> is a stupid question.)

I'm not entirely sure which part exactly you mean. Currently the first tuple is
returned by `_bt_first`; how would it help if index_skip returned it?

> It would be good to get more knowledgeable people to review this patch.
> It's clearly something we want, yet it's been there for a very long
> time.

Sure, that would be nice.

Attachment Content-Type Size
v25-0001-Index-skip-scan.patch application/octet-stream 85.1 KB

From: Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-09-05 19:41:28
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2019-Sep-05, Dmitry Dolgov wrote:

> Here is the version in which stepping between the pages works better. It seems
> sufficient to fix the case you've mentioned before, but for that we need to
> propagate the keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
> can't say I like this solution. I have an idea that maybe it would be simpler
> to teach the code after index_skip not to do `_bt_next` right after a skip has
> happened. It should immediately eliminate several hacks from the index skip
> itself, so I'll try to pursue this idea.

Cool.

I think multiplying two ScanDirections to watch for a negative result is
pretty ugly:

/*
* If advancing direction is different from index direction, we must
* skip right away, but _bt_skip requires a starting point.
*/
if (direction * indexscan->indexorderdir < 0 &&
!node->iss_FirstTupleEmitted)

Surely there's a better way to code that?

I think "scanstart" needs more documentation, both in the SGML docs as
well as the code comments surrounding it.

Please disregard my earlier comment about FirstTupleEmitted. I was
thinking that index_skip would itself emit a tuple (ie. call some
"getnext" internally) rather than just repositioning. There might still
be some more convenient way to represent this, but I have no immediate
advice.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-09-22 13:45:01
Message-ID: CA+q6zcV2Ti99RSOHsauppY2_JZ8zvWhUmN9NWo0R4xf6HL=Szg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Thu, Sep 5, 2019 at 9:41 PM Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2019-Sep-05, Dmitry Dolgov wrote:
>
> > Here is the version in which stepping between the pages works better. It seems
> > sufficient to fix the case you've mentioned before, but for that we need to
> > propagate the keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
> > can't say I like this solution. I have an idea that maybe it would be simpler
> > to teach the code after index_skip not to do `_bt_next` right after a skip has
> > happened. It should immediately eliminate several hacks from the index skip
> > itself, so I'll try to pursue this idea.
>
> Cool.

Here it is. Since the code after index_skip now knows whether to do
index_getnext or not, it's possible to use unmodified `_bt_readpage` /
`_bt_steppage`. To achieve that there is a flag that indicates whether or not
we were skipping to the current item (I guess it's possible to implement it
without such a flag, but the end result looked uglier to me). Along the way
I've simplified a few things, and all the tests we accumulated before are still
passing. I'm almost sure it's possible to implement some parts of the code more
elegantly, but I don't see how yet.

> I think multiplying two ScanDirections to watch for a negative result is
> pretty ugly:

Probably, but the only alternative I see to check if directions are opposite is
to check that directions come in pairs (back, forth), (forth, back). Is there
an easier way?

> I think "scanstart" needs more documentation, both in the SGML docs as
> well as the code comments surrounding it.

I was able to remove it after another round of simplification.

Attachment Content-Type Size
v26-0001-Index-skip-scan.patch application/octet-stream 77.2 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "jesper(dot)pedersen(at)redhat(dot)com" <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-09-23 02:02:04
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2019-Sep-22, Dmitry Dolgov wrote:

> > I think multiplying two ScanDirections to watch for a negative result is
> > pretty ugly:
>
> Probably, but the only alternative I see to check if directions are opposite is
> to check that directions come in pairs (back, forth), (forth, back). Is there
> an easier way?

Maybe use the ^ operator?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: 9erthalion6(at)gmail(dot)com, florisvannee(at)optiver(dot)com, jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2019-09-24 08:35:47
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in <20190923020204(dot)GA2781(at)alvherre(dot)pgsql>
> On 2019-Sep-22, Dmitry Dolgov wrote:
>
> > > I think multiplying two ScanDirections to watch for a negative result is
> > > pretty ugly:
> >
> > Probably, but the only alternative I see to check if directions are opposite is
> > to check that directions come in pairs (back, forth), (forth, back). Is there
> > an easier way?
>
> Maybe use the ^ operator?

It's not a logical operator but a bitwise arithmetic operator,
which cannot be used unless the operands are guaranteed to be 0 or 1
(as integers). In a kind-of-standard but hacky way, "(!a != !b)"
works as desired, since ! is a logical operator.

Wouldn't we use (a && !b) || (!a && b)? The compiler will optimize it
in some good way.
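
Just to illustrate the difference with a throwaway snippet (nothing from
the patch; a and b are arbitrary nonzero ints here):

#include <stdio.h>

int
main(void)
{
    int a = 1;
    int b = 2;    /* both are "true" when treated as booleans */

    printf("a ^ b: %d\n", a ^ b);            /* 3, i.e. still "true" */
    printf("!a != !b: %d\n", !a != !b);      /* 0, the logical XOR we want */
    printf("(a && !b) || (!a && b): %d\n",
           (a && !b) || (!a && b));          /* 0 as well */
    return 0;
}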

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: 9erthalion6(at)gmail(dot)com, florisvannee(at)optiver(dot)com, jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2019-09-24 08:41:46
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Tue, 24 Sep 2019 17:35:47 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190924(dot)173547(dot)226622711(dot)horikyota(dot)ntt(at)gmail(dot)com>
> At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in <20190923020204(dot)GA2781(at)alvherre(dot)pgsql>
> > On 2019-Sep-22, Dmitry Dolgov wrote:
> >
> > > > I think multiplying two ScanDirections to watch for a negative result is
> > > > pretty ugly:
> > >
> > > Probably, but the only alternative I see to check if directions are opposite is
> > > to check that directions come in pairs (back, forth), (forth, back). Is there
> > > an easier way?
> >
> > Maybe use the ^ operator?
>
> It's not a logical operator but a bitwise arithmetic operator,
> which cannot be used if the operands is guaranteed to be 0 or 1
> (in integer). In a-kind-of-standard, but hacky way, "(!a != !b)"
> works as desired since ! is a logical operator.
>
> Wouldn't we use (a && !b) || (!a && b)? Compiler will optimize it
> some good way.

Sorry, it's not a boolean. A tristate value. From the definition
(Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
NoMovement did not exist. If it is not guaranteed,

(dir1 != 0 && dir1 == -dir2) ?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: 9erthalion6(at)gmail(dot)com, florisvannee(at)optiver(dot)com, jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2019-09-24 12:06:27
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2019-Sep-24, Kyotaro Horiguchi wrote:

> Sorry, it's not a boolean. A tristate value. From the definition
> (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
> NoMovement did not exist. If it is not guaranteed,
>
> (dir1 != 0 && dir1 == -dir2) ?

Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
definition? :-)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: 9erthalion6(at)gmail(dot)com, florisvannee(at)optiver(dot)com, jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, pg(at)bowt(dot)ie, tomas(dot)vondra(at)2ndquadrant(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2019-09-25 01:03:01
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in <20190924120627(dot)GA12454(at)alvherre(dot)pgsql>
> On 2019-Sep-24, Kyotaro Horiguchi wrote:
>
> > Sorry, it's not a boolean. A tristate value. From the definition
> > (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
> > NoMovement did not exist. If it is not guaranteed,
> >
> > (dir1 != 0 && dir1 == -dir2) ?
>
> Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
> definition? :-)

Yeah, sounds good to establish it as a part of ScanDirection's
definition.
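
Something along these lines, as a quick standalone sketch (the enum values
mirror access/sdir.h; the exact name and placement are of course up for
discussion):

#include <stdio.h>

typedef enum ScanDirection
{
    BackwardScanDirection = -1,
    NoMovementScanDirection = 0,
    ForwardScanDirection = 1
} ScanDirection;

/* true only for the (forward, backward) and (backward, forward) pairs */
#define ScanDirectionIsOpposite(dir1, dir2) \
    ((dir1) != NoMovementScanDirection && (dir1) == -(dir2))

int
main(void)
{
    printf("%d\n", ScanDirectionIsOpposite(ForwardScanDirection,
                                           BackwardScanDirection));    /* 1 */
    printf("%d\n", ScanDirectionIsOpposite(ForwardScanDirection,
                                           ForwardScanDirection));     /* 0 */
    printf("%d\n", ScanDirectionIsOpposite(NoMovementScanDirection,
                                           NoMovementScanDirection));  /* 0 */
    return 0;
}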

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-09-25 09:34:22
Message-ID: CA+q6zcWUqeX18UJxMBP4m=4Sz2upS_FikooE07JT5zZryzBNpA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Sep 25, 2019 at 3:03 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in <20190924120627(dot)GA12454(at)alvherre(dot)pgsql>
> > On 2019-Sep-24, Kyotaro Horiguchi wrote:
> >
> > > Sorry, it's not a boolean. A tristate value. From the definition
> > > (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
> > > NoMovement did not exist. If it is not guaranteed,
> > >
> > > (dir1 != 0 && dir1 == -dir2) ?
> >
> > Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
> > definition? :-)
>
> Yeah, sounds good to establish it as a part of ScanDirection's
> definition.

Yep, this way looks better.

Attachment Content-Type Size
v27-0001-Index-skip-scan.patch application/octet-stream 77.8 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-11-02 18:56:08
Message-ID: CAH2-WzkG0KRHjN_atViTcAC0-Yze5Mv0rfj+mPYhL8oLm0cnfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> v27-0001-Index-skip-scan.patch

Some random thoughts on this:

* Is _bt_scankey_within_page() really doing the right thing within empty pages?

It looks like you're accidentally using the high key when the leaf
page is empty with forward scans (assuming that the leaf page isn't
rightmost). You'll need to think about empty pages for both forward
and backward direction scans there.

Actually, using the high key in some cases may be desirable, once the
details are worked out -- the high key is actually very helpful with
low cardinality indexes. If you populate an index with retail
insertions (i.e. don't just do a CREATE INDEX after the table is
populated), and use low cardinality data in the indexed columns then
you'll see this effect. You can have a few hundred index entries for
each distinct value, and the page split logic added to Postgres 12 (by
commit fab25024) will still manage to "trap" each set of duplicates on
their own distinct leaf page. Leaf pages will have a high key that
looks like the values that appear on the page to the right. The goal
is for forward index scans to access the minimum number of leaf pages,
especially with low cardinality data and with multi-column indexes.
(See also: commit 29b64d1d)

A good way to see this for yourself is to get the Wisconsin Benchmark
tables (the tenk1 table and related tables from the regression tests)
populated using retail insertions. "CREATE TABLE __tenk1(like tenk1
including indexes); INSERT INTO __tenk1 SELECT * FROM tenk1;" is how I
like to set this up. Then you can see that we only access one leaf
page easily by forcing bitmap scans (i.e. "set enable* ..."), and
using "EXPLAIN (analyze, buffers) SELECT ... FROM __tenk1 WHERE ...",
where the SELECT query is a simple point lookup query (bitmap scans
happen to instrument the index buffer accesses in a way that makes it
absolutely clear how many index page buffers were touched). IIRC the
existing tenk1 indexes have no more than a few hundred duplicates for
each distinct value in all cases, so only one leaf page needs to be
accessed by simple "key = val" queries in all cases.

(I imagine that the "four" index you add in the regression test would
generally need to visit more than one leaf page for simple point
lookup queries, but in any case the high key is a useful way of
detecting a "break" in the values when indexing low cardinality data
-- these breaks are generally "aligned" to leaf page boundaries.)

I also like to visualize the keyspace of indexes when poking around at
that stuff, generally by using some of the queries that you can find
on the Wiki [1].

* The extra scankeys that you are using in most of the new nbtsearch.c
code are insertion scankeys -- not search style scankeys. I think
that you should try to be a bit clearer on that distinction in
comments. It is already confusing now, but at least there is only zero
or one insertion scankey per scan (for the initial positioning).

* There are two _bt_skip() prototypes in nbtree.h (actually, there is
a btskip() and a _bt_skip()). I understand that the former is a public
wrapper of the latter, but I find the naming a little bit confusing.
Maybe rename _bt_skip() to something that is a little bit more
suggestive of its purpose.

* Suggest running pgindent on the patch.

[1] https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index
--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-11-02 20:32:08
Message-ID: CAH2-Wzmg-4ScB8kpDeQK44FitnNf=vzg97qHvNa0skO3S5Yj2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 2, 2019 at 11:56 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > v27-0001-Index-skip-scan.patch
>
> Some random thoughts on this:

And now some more:

* I'm confused about this code in _bt_skip():

> /*
> * Advance backward but read forward. At this moment we are at the next
> * distinct key at the beginning of the series. In case if scan just
> * started, we can read forward without doing anything else. Otherwise find
> * previous distinct key and the beginning of its series and read forward
> * from there. To do so, go back one step, perform binary search to find
> * the first item in the series and let _bt_readpage do everything else.
> */
> else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
> {
> if (!scanstart)
> {
> _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
> offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
>
> /* One step back to find a previous value */
> _bt_readpage(scan, dir, offnum);

Why is it okay to call _bt_drop_lock_and_maybe_pin() like this? It
looks like that will drop the lock (but not the pin) on the same
buffer that you binary search with _bt_binsrch() (since the local
variable "buf" is also the buf in "so->currPos").

* It also seems a bit odd that you assume that the scan is
"scan->xs_want_itup", but then check that condition many times. Why
bother?

* Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
rather than just unlocking the buffer directly? We'll only drop the
pin for a scan that is "!scan->xs_want_itup", which is never the case
within _bt_skip().

I think that the macros and stuff that manage pins and buffer locks in
nbtsearch.c are kind of a disaster anyway [1]. Maybe there is some
value in trying to be consistent with existing nbtsearch.c code in
ways that aren't strictly necessary.

* Not sure why you need this code after throwing an error:

> else
> {
> elog(ERROR, "Could not read closest index tuples: %d", offnum);
> pfree(so->skipScanKey);
> so->skipScanKey = NULL;
> return false;
> }

[1] https://www.postgresql.org/message-id/flat/CAH2-Wz=m674-RKQdCG+jCD9QGzN1Kcg-FOdYw4-j+5_PfcHbpQ(at)mail(dot)gmail(dot)com
--
Peter Geoghegan


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-11-03 16:45:47
Message-ID: 20191103164547.oqcvno6467eikizg@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> v27-0001-Index-skip-scan.patch
>
> Some random thoughts on this:

Thanks a lot for the commentaries!

> * Is _bt_scankey_within_page() really doing the right thing within empty pages?
>
> It looks like you're accidentally using the high key when the leaf
> page is empty with forward scans (assuming that the leaf page isn't
> rightmost). You'll need to think about empty pages for both forward
> and backward direction scans there.

Yes, you're right, that's an issue I need to fix.

> Actually, using the high key in some cases may be desirable, once the
> details are worked out -- the high key is actually very helpful with
> low cardinality indexes. If you populate an index with retail
> insertions (i.e. don't just do a CREATE INDEX after the table is
> populated), and use low cardinality data in the indexed columns then
> you'll see this effect.

Can you please elaborate a bit more? I see that using the high key will
help forward index scans access the minimum number of leaf pages, but
I'm not following how it is connected to _bt_scankey_within_page. Or is
this comment related to the whole implementation in general?
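
(To make sure I'm reading this right: presumably the setup you mean is
something like the following -- the table and values here are made up --
where the index exists before the rows arrive one at a time, rather than
being built by CREATE INDEX afterwards?

CREATE TABLE lowcard (id serial PRIMARY KEY, val integer);
CREATE INDEX lowcard_val_idx ON lowcard (val);
-- low cardinality values inserted retail, i.e. tuple by tuple into the index
INSERT INTO lowcard (val)
SELECT i % 5 FROM generate_series(1, 1000000) AS i;

If I follow, with an index populated this way the page splits tend to
land near the boundaries between runs of duplicates, so a leaf page's
high key often marks the start of the next distinct value.)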

> * The extra scankeys that you are using in most of the new nbtsearch.c
> code are insertion scankeys -- not search-style scankeys. I think
> that you should try to be a bit clearer on that distinction in
> comments. It is already confusing now, but at least there is only zero
> or one insertion scankey per scan (for the initial positioning).
>
> * There are two _bt_skip() prototypes in nbtree.h (actually, there is
> a btskip() and a _bt_skip()). I understand that the former is a public
> wrapper of the latter, but I find the naming a little bit confusing.
> Maybe rename _bt_skip() to something that is a little bit more
> suggestive of its purpose.
>
> * Suggest running pgindent on the patch.

Sure, I'll incorporate mentioned improvements into the next patch
version (hopefully soon).

> And now some more:
>
> * I'm confused about this code in _bt_skip():
>
Yeah, it shouldn't be there, but rather before _bt_next, which expects
an unlocked buffer. Will fix.

> * It also seems a bit odd that you assume that the scan is
> "scan->xs_want_itup", but then check that condition many times. Why
> bother?
>
> * Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
> rather than just unlocking the buffer directly? We'll only drop the
> pin for a scan that is "!scan->xs_want_itup", which is never the case
> within _bt_skip().
>
> I think that the macros and stuff that manage pins and buffer locks in
> nbtsearch.c are kind of a disaster anyway [1]. Maybe there is some
> value in trying to be consistent with existing nbtsearch.c code in
> ways that aren't strictly necessary.

Yep, I've seen this thread, but I tried to be consistent with the
surrounding core style. It probably doesn't make sense, indeed.

> * Not sure why you need this code after throwing an error:
>
> > else
> > {
> > elog(ERROR, "Could not read closest index tuples: %d", offnum);
> > pfree(so->skipScanKey);
> > so->skipScanKey = NULL;
> > return false;
> > }

Unfortunately this is just a leftover from a previous version. Sorry for
that, will get rid of it.


From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-11-07 15:43:58
Message-ID: 20191107154358.bvwq3ul2ol7cv3lj@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sun, Nov 03, 2019 at 05:45:47PM +0100, Dmitry Dolgov wrote:
> > * The extra scankeys that you are using in most of the new nbtsearch.c
> > code are insertion scankeys -- not search-style scankeys. I think
> > that you should try to be a bit clearer on that distinction in
> > comments. It is already confusing now, but at least there is only zero
> > or one insertion scankey per scan (for the initial positioning).
> >
> > * There are two _bt_skip() prototypes in nbtree.h (actually, there is
> > a btskip() and a _bt_skip()). I understand that the former is a public
> > wrapper of the latter, but I find the naming a little bit confusing.
> > Maybe rename _bt_skip() to something that is a little bit more
> > suggestive of its purpose.
> >
> > * Suggest running pgindent on the patch.
>
> Sure, I'll incorporate mentioned improvements into the next patch
> version (hopefully soon).

Here is the new version, which addresses the mentioned issues.

> > * Is _bt_scankey_within_page() really doing the right thing within empty pages?
> >
> > It looks like you're accidentally using the high key when the leaf
> > page is empty with forward scans (assuming that the leaf page isn't
> > rightmost). You'll need to think about empty pages for both forward
> > and backward direction scans there.
>
> Yes, you're right, that's an issue I need to fix.

If I haven't misunderstood something, for the purposes of this function
it makes sense to return false in the case of an empty page. That's what
I've added to the patch.

> > Actually, using the high key in some cases may be desirable, once the
> > details are worked out -- the high key is actually very helpful with
> > low cardinality indexes. If you populate an index with retail
> > insertions (i.e. don't just do a CREATE INDEX after the table is
> > populated), and use low cardinality data in the indexed columns then
> > you'll see this effect.
>
> Can you please elaborate a bit more? I see that using the high key will
> help forward index scans access the minimum number of leaf pages, but
> I'm not following how it is connected to _bt_scankey_within_page. Or is
> this comment related to the whole implementation in general?

This question is still open.

Attachment Content-Type Size
v28-0001-Index-skip-scan.patch text/x-diff 77.8 KB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-11-10 21:18:32
Message-ID: 20191110211832.vv4jbscszz74jete@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I've looked at the patch again - in general it seems in pretty good
shape; the issues I found are mostly minor.

Firstly, I'd like to point out that not all of the things I complained
about in my 2019/06/23 review got addressed. Those were mostly related
to formatting and code style, and the attached patch fixes some (but
maybe not all) of them.

The patch also tweaks wording of some bits in the docs and comments that
I found unclear. Would be nice if a native speaker could take a look.

A couple more comments:

1) pathkey_is_unique

The one additional issue I found is pathkey_is_unique - it's not really
explained what "unique" means and how it's different from "redundant"
(which has quite a long explanation before pathkey_is_redundant).

My understanding is that a pathkey is "unique" when its EC does not match
an EC of another pathkey in the list. But if that's the case, then the
function name is wrong - it does exactly the opposite (i.e. it returns
'true' when the pathkey is *not* unique).

2) explain

I wonder if we should print the "Skip scan" info always, or similarly to
"Inner Unique" which does this:

    /* try not to be too chatty about this in text mode */
    if (es->format != EXPLAIN_FORMAT_TEXT ||
        (es->verbose && ((Join *) plan)->inner_unique))
        ExplainPropertyBool("Inner Unique",
                            ((Join *) plan)->inner_unique,
                            es);
    break;

I'd do the same thing for skip scan - print it only in verbose mode, or
when using non-text output format.
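
From the user's side, that rule would look roughly like this (the query
itself is just an illustration):

EXPLAIN (ANALYZE, VERBOSE) SELECT DISTINCT a FROM t;     -- text + verbose: "Skip scan" shown
EXPLAIN (ANALYZE) SELECT DISTINCT a FROM t;              -- plain text mode: line omitted
EXPLAIN (ANALYZE, FORMAT JSON) SELECT DISTINCT a FROM t; -- non-text format: always included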

3) There's an annoying limitation that for this to kick in, the order of
expressions in the DISTINCT clause has to match the index, i.e. with
index on (a,b,c) the skip scan will only kick in for queries using

DISTINCT a
DISTINCT a,b
DISTINCT a,b,c

but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
to sort the result of DISTINCT in any particular way, except that we
don't consider the other orderings "interesting", so we don't really
consider using the index (and hence no chance of using the skip scan).
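
The plans below can be reproduced on a setup roughly like the following
(an approximate reconstruction, since the exact script isn't included
here): about 1M rows with 1331 distinct (a,b,c) combinations and a
composite index matching the column order.

CREATE TABLE t (a int, b int, c int);
INSERT INTO t
SELECT i % 11, (i / 11) % 11, (i / 121) % 11
FROM generate_series(1, 1000000) AS i;
CREATE INDEX t_a_b_c_idx ON t (a, b, c);
ANALYZE t;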

That leads to pretty annoying speedups/slowdowns due to seemingly
irrelevant changes:

-- everything great, a,b,c matches an index
test=# explain (analyze, verbose) select distinct a,b,c from t;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_a_b_c_idx on public.t (cost=0.42..565.25 rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
Output: a, b, c
Skip scan: true
Heap Fetches: 1331
Planning Time: 0.106 ms
Execution Time: 10.843 ms
(6 rows)

-- slow, mismatch with index
test=# explain (analyze, verbose) select distinct a,c,b from t;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=22906.00..22919.30 rows=1330 width=12) (actual time=802.067..802.612 rows=1331 loops=1)
Output: a, c, b
Group Key: t.a, t.c, t.b
-> Seq Scan on public.t (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.010..355.361 rows=1000000 loops=1)
Output: a, b, c
Planning Time: 0.076 ms
Execution Time: 803.078 ms
(7 rows)

-- fast again, the extra ordering allows using the index again
test=# explain (analyze, verbose) select distinct a,c,b from t order by a,b,c;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_a_b_c_idx on public.t (cost=0.42..565.25 rows=1330 width=12) (actual time=0.035..12.120 rows=1331 loops=1)
Output: a, c, b
Skip scan: true
Heap Fetches: 1331
Planning Time: 0.053 ms
Execution Time: 12.632 ms
(6 rows)

This is a more generic issue, not specific to this patch, of course. I
think we saw it with the incremental sort patch, IIRC. I wonder how
difficult it would be to fix this here (not necessarily in v1).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
skipscan-review.patch text/plain 7.7 KB

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-11-11 18:24:51
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Tomas,

On 11/10/19 4:18 PM, Tomas Vondra wrote:
> I've looked at the patch again - in general it seems in pretty good
> shape; the issues I found are mostly minor.
>
> Firstly, I'd like to point out that not all of the things I complained
> about in my 2019/06/23 review got addressed. Those were mostly related
> to formatting and code style, and the attached patch fixes some (but
> maybe not all) of them.
>

Sorry about that !

> The patch also tweaks wording of some bits in the docs and comments that
> I found unclear. Would be nice if a native speaker could take a look.
>
> A couple more comments:
>
>
> 1) pathkey_is_unique
>
> The one additional issue I found is pathkey_is_unique - it's not really
> explained what "unique" means and how it's different from "redundant"
> (which has quite a long explanation before pathkey_is_redundant).
>
> My understanding is that a pathkey is "unique" when its EC does not match
> an EC of another pathkey in the list. But if that's the case, then the
> function name is wrong - it does exactly the opposite (i.e. it returns
> 'true' when the pathkey is *not* unique).
>

Yeah, you are correct - I forgot to move that part from the _uniquekey
version of the patch.

>
> 2) explain
>
> I wonder if we should print the "Skip scan" info always, or similarly to
> "Inner Unique" which does this:
>
> /* try not to be too chatty about this in text mode */
>     if (es->format != EXPLAIN_FORMAT_TEXT ||
>         (es->verbose && ((Join *) plan)->inner_unique))
>         ExplainPropertyBool("Inner Unique",
>                             ((Join *) plan)->inner_unique,
>                             es);
>     break;
>
> I'd do the same thing for skip scan - print it only in verbose mode, or
> when using non-text output format.
>

I think it is beneficial to see whether skip scan kicks in, but I used
your "Skip scan" suggestion.

>
> 3) There's an annoying limitation that for this to kick in, the order of
> expressions in the DISTINCT clause has to match the index, i.e. with
> index on (a,b,c) the skip scan will only kick in for queries using
>
>    DISTINCT a
>    DISTINCT a,b
>    DISTINCT a,b,c
>
> but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
> to sort the result of DISTINCT in any particular way, except that we
> don't consider the other orderings "interesting", so we don't really
> consider using the index (and hence no chance of using the skip scan).
>
> That leads to pretty annoying speedups/slowdowns due to seemingly
> irrelevant changes:
>
> -- everything great, a,b,c matches an index
> test=# explain (analyze, verbose) select distinct a,b,c from t;
>                                                              QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------
>  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
>    Output: a, b, c
>    Skip scan: true
>    Heap Fetches: 1331
>  Planning Time: 0.106 ms
>  Execution Time: 10.843 ms
> (6 rows)
>
> -- slow, mismatch with index
> test=# explain (analyze, verbose) select distinct a,c,b from t;
>                                                         QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual time=802.067..802.612 rows=1331 loops=1)
>    Output: a, c, b
>    Group Key: t.a, t.c, t.b
>    ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.010..355.361 rows=1000000 loops=1)
>          Output: a, b, c
>  Planning Time: 0.076 ms
>  Execution Time: 803.078 ms
> (7 rows)
>
> -- fast again, the extra ordering allows using the index again
> test=# explain (analyze, verbose) select distinct a,c,b from t order by a,b,c;
>                                                              QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------
>  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.035..12.120 rows=1331 loops=1)
>    Output: a, c, b
>    Skip scan: true
>    Heap Fetches: 1331
>  Planning Time: 0.053 ms
>  Execution Time: 12.632 ms
> (6 rows)
>
>
> This is a more generic issue, not specific to this patch, of course. I
> think we saw it with the incremental sort patch, IIRC. I wonder how
> difficult would it be to fix this here (not necessarily in v1).
>

Yeah, I see it as separate from this patch as well, but it is definitely
something that should be revisited.

Thanks for your patch ! v29 using UniqueKey attached.

Best regards,
Jesper

Attachment Content-Type Size
v29_0001-Unique-key.patch text/x-patch 25.8 KB
v29_0002-Index-skip-scan.patch text/x-patch 69.3 KB

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2019-11-15 15:05:51
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 11/11/19 1:24 PM, Jesper Pedersen wrote:
> v29 using UniqueKey attached.
>

Just a small update to the UniqueKey patch to hopefully keep CFbot happy.

Feedback, especially on the planner changes, would be greatly appreciated.

Best regards,
 Jesper

Attachment Content-Type Size
v30_0001-Unique-key.patch text/x-patch 25.3 KB
v30_0002-Index-skip-scan.patch text/x-patch 69.3 KB