Commit 92fe23d

Add nbtree skip scan optimization.

Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys.

A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit 5bf748b.

Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over.

The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to).

A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics.

B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys).

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Masahiro Ikeda <[email protected]>
Reviewed-By: Heikki Linnakangas <[email protected]>
Reviewed-By: Matthias van de Meent <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Reviewed-By: Aleksander Alekseev <[email protected]>
Reviewed-By: Alena Rybakina <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com

1 parent 3ba2cda commit 92fe23d
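
The repositioning idea from the commit message can be illustrated with a toy model. The sketch below is not nbtree code: it assumes a sorted in-memory array stands in for an index on (a, b), and every name in it (Entry, ScanResult, demo_index, first_at_or_after, first_past_group, skip_scan) is hypothetical. It answers "WHERE b = 5" by visiting each distinct "a" value in turn and binary-searching for (a, 5), rather than reading every entry. Real scans work on pages and scan keys, and can often keep reading adjacent leaf pages instead of descending again.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of a toy two-column index, kept sorted by (a, b). */
typedef struct Entry { int a; int b; } Entry;

/* Matches found, plus the number of binary-search descents performed. */
typedef struct ScanResult { int matches; int descents; } ScanResult;

/* Toy index contents (made-up data), already in (a, b) order. */
static const Entry demo_index[] = {
    {1, 2}, {1, 5}, {1, 9}, {2, 1}, {2, 3}, {3, 5}, {3, 5}, {3, 7}
};

/* First position whose entry sorts at or after (a, b). */
static size_t
first_at_or_after(const Entry *idx, size_t n, int a, int b)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* First position whose "a" is greater than the given value. */
static size_t
first_past_group(const Entry *idx, size_t n, int a)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].a <= a)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Emulate "WHERE b = target_b" as "a = ANY(<every stored a>) AND b = target_b". */
static ScanResult
skip_scan(const Entry *idx, size_t n, int target_b)
{
    ScanResult res = {0, 0};
    size_t pos = 0;

    assert(idx != NULL || n == 0);
    while (pos < n)
    {
        int cur_a = idx[pos].a;     /* current skip array element */

        /* Reposition using both keys: a = cur_a AND b = target_b. */
        pos = first_at_or_after(idx, n, cur_a, target_b);
        res.descents++;
        while (pos < n && idx[pos].a == cur_a && idx[pos].b == target_b)
        {
            res.matches++;
            pos++;
        }
        /* Skip the rest of this group and land on the next stored "a". */
        pos = first_past_group(idx, n, cur_a);
        res.descents++;
    }
    return res;
}
```

With the made-up demo_index data and a target of 5, skip_scan finds 3 matches using 6 descents (two per distinct "a" value). The descent count grows with the number of distinct "a" values, which mirrors the commit message's point that the skipped column's cardinality limits how much can be skipped.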

35 files changed, +3024 -381 lines changed

doc/src/sgml/btree.sgml

+30-1
@@ -207,7 +207,7 @@

 <para>
 As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
-one required and four optional support functions. The five
+one required and five optional support functions. The six
 user-defined methods are:
 </para>
 <variablelist>
@@ -583,6 +583,35 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
 </para>
 </listitem>
 </varlistentry>
+<varlistentry>
+<term><function>skipsupport</function></term>
+<listitem>
+<para>
+Optionally, a btree operator family may provide a <firstterm>skip
+support</firstterm> function, registered under support function number 6.
+These functions give the B-tree code a way to iterate through every
+possible value that can be represented by an operator class's underlying
+input type, in key space order. This is used by the core code when it
+applies the skip scan optimization. The APIs involved in this are
+defined in <filename>src/include/utils/skipsupport.h</filename>.
+</para>
+<para>
+Operator classes that do not provide a skip support function are still
+eligible to use skip scan. The core code can still use its fallback
+strategy, though that might be suboptimal for some discrete types. It
+usually doesn't make sense (and may not even be feasible) for operator
+classes on continuous types to provide a skip support function.
+</para>
+<para>
+It is not sensible for an operator family to register a cross-type
+<function>skipsupport</function> function, and attempting to do so will
+result in an error. This is because determining the next indexable value
+must happen by incrementing a value copied from an index tuple. The
+values generated must all be of the same underlying data type (the
+<quote>skipped</quote> index column's opclass input type).
+</para>
+</listitem>
+</varlistentry>
 </variablelist>

 </sect2>

doc/src/sgml/indexam.sgml

+2-1
@@ -835,7 +835,8 @@ amrestrpos (IndexScanDesc scan);
 <para>
 <programlisting>
 Size
-amestimateparallelscan (int nkeys,
+amestimateparallelscan (Relation indexRelation,
+int nkeys,
 int norderbys);
 </programlisting>
 Estimate and return the number of bytes of dynamic shared memory which

doc/src/sgml/indices.sgml

+54-14
@@ -460,20 +460,56 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 efficient when there are constraints on the leading (leftmost) columns.
 The exact rule is that equality constraints on leading columns, plus
 any inequality constraints on the first column that does not have an
-equality constraint, will be used to limit the portion of the index
+equality constraint, will always be used to limit the portion of the index
 that is scanned. Constraints on columns to the right of these columns
-are checked in the index, so they save visits to the table proper, but
-they do not reduce the portion of the index that has to be scanned.
+are checked in the index, so they'll always save visits to the table
+proper, but they do not necessarily reduce the portion of the index that
+has to be scanned. If a B-tree index scan can apply the skip scan
+optimization effectively, it will apply every column constraint when
+navigating through the index via repeated index searches. This can reduce
+the portion of the index that has to be read, even though one or more
+columns (prior to the least significant index column from the query
+predicate) lacks a conventional equality constraint. Skip scan works by
+generating a dynamic equality constraint internally, that matches every
+possible value in an index column (though only given a column that lacks
+an equality constraint that comes from the query predicate, and only when
+the generated constraint can be used in conjunction with a later column
+constraint from the query predicate).
+</para>
+
+<para>
+For example, given an index on <literal>(x, y)</literal>, and a query
+condition <literal>WHERE y = 7700</literal>, a B-tree index scan might be
+able to apply the skip scan optimization. This generally happens when the
+query planner expects that repeated <literal>WHERE x = N AND y = 7700</literal>
+searches for every possible value of <literal>N</literal> (or for every
+<literal>x</literal> value that is actually stored in the index) is the
+fastest possible approach, given the available indexes on the table. This
+approach is generally only taken when there are so few distinct
+<literal>x</literal> values that the planner expects the scan to skip over
+most of the index (because most of its leaf pages cannot possibly contain
+relevant tuples). If there are many distinct <literal>x</literal> values,
+then the entire index will have to be scanned, so in most cases the planner
+will prefer a sequential table scan over using the index.
+</para>
+
+<para>
+The skip scan optimization can also be applied selectively, during B-tree
+scans that have at least some useful constraints from the query predicate.
 For example, given an index on <literal>(a, b, c)</literal> and a
 query condition <literal>WHERE a = 5 AND b &gt;= 42 AND c &lt; 77</literal>,
-the index would have to be scanned from the first entry with
-<literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with
-<literal>a</literal> = 5. Index entries with <literal>c</literal> &gt;= 77 would be
-skipped, but they'd still have to be scanned through.
-This index could in principle be used for queries that have constraints
-on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal>
-&mdash; but the entire index would have to be scanned, so in most cases
-the planner would prefer a sequential table scan over using the index.
+the index might have to be scanned from the first entry with
+<literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last
+entry with <literal>a</literal> = 5. Index entries with
+<literal>c</literal> &gt;= 77 will never need to be filtered at the table
+level, but it may or may not be profitable to skip over them within the
+index. When skipping takes place, the scan starts a new index search to
+reposition itself from the end of the current <literal>a</literal> = 5 and
+<literal>b</literal> = N grouping (i.e. from the position in the index
+where the first tuple <literal>a = 5 AND b = N AND c &gt;= 77</literal>
+appears), to the start of the next such grouping (i.e. the position in the
+index where the first tuple <literal>a = 5 AND b = N + 1</literal>
+appears).
 </para>

 <para>
@@ -669,9 +705,13 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
 multicolumn index on <literal>(x, y)</literal>. This index would typically be
 more efficient than index combination for queries involving both
 columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it
-would be almost useless for queries involving only <literal>y</literal>, so it
-should not be the only index. A combination of the multicolumn index
-and a separate index on <literal>y</literal> would serve reasonably well. For
+would be less useful for queries involving only <literal>y</literal>. Just
+how useful will depend on how effective the B-tree index skip scan
+optimization is; if <literal>x</literal> has no more than several hundred
+distinct values, skip scan will make searches for specific
+<literal>y</literal> values execute reasonably efficiently. A combination
+of a multicolumn index on <literal>(x, y)</literal> and a separate index on
+<literal>y</literal> might also serve reasonably well. For
 queries involving only <literal>x</literal>, the multicolumn index could be
 used, though it would be larger and hence slower than an index on
 <literal>x</literal> alone. The last alternative is to create all three

doc/src/sgml/monitoring.sgml

+4-1
@@ -4263,7 +4263,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 <replaceable>column_name</replaceable> =
 <replaceable>value2</replaceable> ...</literal> construct, though only
 when the optimizer transforms the construct into an equivalent
-multi-valued array representation.
+multi-valued array representation. Similarly, when B-tree index scans use
+the skip scan optimization, an index search is performed each time the
+scan is repositioned to the next index leaf page that might have matching
+tuples (see <xref linkend="indexes-multicolumn"/>).
 </para>
 </note>
 <tip>

doc/src/sgml/perform.sgml

+32
@@ -860,6 +860,38 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE thousand IN (1, 2, 3, 4);
 <structname>tenk1_thous_tenthous</structname> index leaf page.
 </para>

+<para>
+The <quote>Index Searches</quote> line is also useful with B-tree index
+scans that apply the <firstterm>skip scan</firstterm> optimization to
+more efficiently traverse through an index:
+<screen>
+EXPLAIN ANALYZE SELECT four, unique1 FROM tenk1 WHERE four BETWEEN 1 AND 3 AND unique1 = 42;
+QUERY PLAN
+-------------------------------------------------------------------&zwsp;---------------------------------------------------------------
+Index Only Scan using tenk1_four_unique1_idx on tenk1 (cost=0.29..6.90 rows=1 width=8) (actual time=0.006..0.007 rows=1.00 loops=1)
+Index Cond: ((four &gt;= 1) AND (four &lt;= 3) AND (unique1 = 42))
+Heap Fetches: 0
+Index Searches: 3
+Buffers: shared hit=7
+Planning Time: 0.029 ms
+Execution Time: 0.012 ms
+</screen>
+
+Here we see an Index-Only Scan node using
+<structname>tenk1_four_unique1_idx</structname>, a multi-column index on the
+<structname>tenk1</structname> table's <structfield>four</structfield> and
+<structfield>unique1</structfield> columns. The scan performs 3 searches
+that each read a single index leaf page:
+<quote><literal>four = 1 AND unique1 = 42</literal></quote>,
+<quote><literal>four = 2 AND unique1 = 42</literal></quote>, and
+<quote><literal>four = 3 AND unique1 = 42</literal></quote>. This index
+is generally a good target for skip scan, since, as discussed in
+<xref linkend="indexes-multicolumn"/>, its leading column (the
+<structfield>four</structfield> column) contains only 4 distinct values,
+while its second/final column (the <structfield>unique1</structfield>
+column) contains many distinct values.
+</para>
+
 <para>
 Another type of extra information is the number of rows removed by a
 filter condition:
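
The "Index Searches: 3" figure in the added documentation can be reproduced with a toy model. The sketch below is an illustration, not the executor's logic: it assumes a small synthetic table of NROWS rows in which four equals unique1 % 4 (an assumption made for this example), keeps the rows in (four, unique1) order, and performs one binary search per value of four in the requested range. Row, build_index, first_at_or_after, and range_skip_scan are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NROWS 100   /* size of the synthetic table (assumption) */

typedef struct Row { int four; int unique1; } Row;

/* Fill idx with NROWS rows in (four, unique1) order; four = unique1 % 4. */
static void
build_index(Row idx[NROWS])
{
    size_t n = 0;

    for (int f = 0; f < 4; f++)
        for (int u = f; u < NROWS; u += 4)
        {
            idx[n].four = f;
            idx[n].unique1 = u;
            n++;
        }
}

/* First position whose row sorts at or after (four, unique1). */
static size_t
first_at_or_after(const Row *idx, size_t n, int four, int unique1)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].four < four ||
            (idx[mid].four == four && idx[mid].unique1 < unique1))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/*
 * Emulate "four BETWEEN lo AND hi AND unique1 = target" with a
 * range-constrained skip array on "four". Returns the number of matches;
 * *searches receives the number of repositionings performed.
 */
static int
range_skip_scan(const Row *idx, size_t n, int lo, int hi, int target,
                int *searches)
{
    int matches = 0;

    assert(searches != NULL && hi < INT_MAX);
    *searches = 0;
    for (int f = lo; f <= hi; f++)      /* one skip array element per pass */
    {
        size_t pos = first_at_or_after(idx, n, f, target);

        (*searches)++;
        while (pos < n && idx[pos].four == f && idx[pos].unique1 == target)
        {
            matches++;
            pos++;
        }
    }
    return matches;
}
```

On the synthetic data, a scan for four between 1 and 3 with unique1 = 42 performs 3 searches and finds 1 match (the row with four = 2), which lines up with the search count and row count shown in the EXPLAIN output above. The bounds on four are what keep the skip array from also probing four = 0.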

doc/src/sgml/xindex.sgml

+13-3
@@ -461,6 +461,13 @@
 </entry>
 <entry>5</entry>
 </row>
+<row>
+<entry>
+Return the addresses of C-callable skip support function(s)
+(optional)
+</entry>
+<entry>6</entry>
+</row>
 </tbody>
 </tgroup>
 </table>
@@ -1062,7 +1069,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
 FUNCTION 1 btint8cmp(int8, int8) ,
 FUNCTION 2 btint8sortsupport(internal) ,
 FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
-FUNCTION 4 btequalimage(oid) ;
+FUNCTION 4 btequalimage(oid) ,
+FUNCTION 6 btint8skipsupport(internal) ;

 CREATE OPERATOR CLASS int4_ops
 DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1075,7 +1083,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
 FUNCTION 1 btint4cmp(int4, int4) ,
 FUNCTION 2 btint4sortsupport(internal) ,
 FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
-FUNCTION 4 btequalimage(oid) ;
+FUNCTION 4 btequalimage(oid) ,
+FUNCTION 6 btint4skipsupport(internal) ;

 CREATE OPERATOR CLASS int2_ops
 DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1088,7 +1097,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
 FUNCTION 1 btint2cmp(int2, int2) ,
 FUNCTION 2 btint2sortsupport(internal) ,
 FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
-FUNCTION 4 btequalimage(oid) ;
+FUNCTION 4 btequalimage(oid) ,
+FUNCTION 6 btint2skipsupport(internal) ;

 ALTER OPERATOR FAMILY integer_ops USING btree ADD
 -- cross-type comparisons int8 vs int2

src/backend/access/index/indexam.c

+2-1
@@ -489,7 +489,8 @@ index_parallelscan_estimate(Relation indexRelation, int nkeys, int norderbys,
     if (parallel_aware &&
         indexRelation->rd_indam->amestimateparallelscan != NULL)
         nbytes = add_size(nbytes,
-                          indexRelation->rd_indam->amestimateparallelscan(nkeys,
+                          indexRelation->rd_indam->amestimateparallelscan(indexRelation,
+                                                                          nkeys,
                                                                           norderbys));

     return nbytes;
