|
Teach nbtree multi-column index scans to opportunistically skip over
irrelevant sections of the index given a query with no "=" conditions on
one or more prefix index columns. When nbtree is passed input scan keys
derived from a predicate "WHERE b = 5", new nbtree preprocessing steps
output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
That is, preprocessing generates a "skip array" (and an output scan key)
for the omitted prefix column "a", which makes it safe to mark the scan
key on "b" as required to continue the scan. The scan is therefore able
to repeatedly reposition itself by applying both the "a" and "b" keys.
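For illustration (hypothetical table and column names, not taken from
the patch):

    -- composite index whose leading column has no "=" condition in the query
    CREATE INDEX sales_region_date_idx ON sales (region, sale_date);

    -- preprocessing behaves as if the qual had been written as
    -- "WHERE region = ANY(<every possible region>) AND sale_date = '2025-01-01'",
    -- so the scan key on sale_date can be marked required
    SELECT * FROM sales WHERE sale_date = '2025-01-01';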
A skip array has "elements" that are generated procedurally and on
demand, but otherwise works just like a regular ScalarArrayOp array.
Preprocessing can freely add a skip array before or after any input
ScalarArrayOp arrays. Index scans with a skip array decide when and
where to reposition the scan using the same approach as any other scan
with array keys. This design builds on the design for array advancement
and primitive scan scheduling added to Postgres 17 by commit 5bf748b8.
Testing has shown that skip scans of an index with a low cardinality
skipped prefix column can be multiple orders of magnitude faster than an
equivalent full index scan (or sequential scan). In general, the
cardinality of the scan's skipped column(s) limits the number of leaf
pages that can be skipped over.
The core B-Tree operator classes on most discrete types generate their
array elements with the help of their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the next
possible indexable value frequently turns out to be the next value
stored in the index. Opclasses that lack a skip support routine fall
back on having nbtree "increment" (or "decrement") a skip array's
current element by setting the NEXT (or PRIOR) scan key flag, without
directly changing the scan key's sk_argument. These sentinel values
behave just like any other value from an array -- though they can never
locate equal index tuples (they can only locate the next group of index
tuples containing the next set of non-sentinel values that the scan's
arrays need to advance to).
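A rough sketch of the distinction (assuming, as with the core opclasses,
that the integer opclasses provide skip support while the text opclasses
do not):

    -- skipped prefix column is an integer: skip support can step directly
    -- from one candidate value to the next (e.g. from 10 to 11)
    CREATE INDEX t_int_prefix_idx ON t (i, b);

    -- skipped prefix column is text: nbtree falls back on NEXT/PRIOR
    -- sentinel keys to locate the next real value stored in the index
    CREATE INDEX t_text_prefix_idx ON t (txt, b);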
A skip array's range is constrained by "contradictory" inequality keys.
For example, a skip array on "x" will only generate the values 1 and 2
given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip
array qual usually has near-identical performance characteristics to a
comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However,
improved performance isn't guaranteed. Much depends on physical index
characteristics.
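For example (hypothetical table and index; the two quals are expected to
perform near-identically):

    CREATE INDEX t_x_y_idx ON t (x, y);

    SELECT * FROM t WHERE x BETWEEN 1 AND 2 AND y = 66;   -- skip array on "x", limited to 1 and 2
    SELECT * FROM t WHERE x = ANY('{1, 2}') AND y = 66;   -- ordinary SAOP array on "x"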
B-Tree preprocessing is optimistic that skipping will work out: it
applies static, generic rules when determining where to generate skip
arrays, assuming that the runtime overhead of maintaining skip arrays
will pay for itself -- or lead to only a modest performance loss.
As things stand, these assumptions are much too optimistic: skip array
maintenance will lead to unacceptable regressions with unsympathetic
queries (queries whose scan can't skip over many irrelevant leaf pages).
An upcoming commit will address the problems in this area by enhancing
_bt_readpage's approach to saving cycles on scan key evaluation, making
it work in a way that directly considers the needs of = array keys
(particularly = skip array keys).
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Masahiro Ikeda <[email protected]>
Reviewed-By: Heikki Linnakangas <[email protected]>
Reviewed-By: Matthias van de Meent <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Reviewed-By: Aleksander Alekseev <[email protected]>
Reviewed-By: Alena Rybakina <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
|
|
The documentation index is getting very long, which makes it hard
to find things. Since these chapters are all very similar in structure
and content, merging them is a natural way of reducing the size of
the toplevel index.
Rather than actually combining all of the SGML into a single file,
keep one file per <sect1>, and add a glue file that includes all
of them.
Discussion: http://postgr.es/m/CA+Tgmob7_uoYuS2=rVwpVXaRwP-UXz+++saYTC-BCZ42QzSNKQ@mail.gmail.com
|
|
AM was used throughout the documentation to denote Access Method, but
the acronym itself was never explained. This adds an acronym entry as
well as a glossary term which the acronym links to. Each page that
describes AMs now has its first occurrence marked up with <acronym>.
Reported-by: [email protected]
Reviewed-by: Alvaro Herrera <[email protected]>
Discussion: https://postgr.es/m/[email protected]
|
|
Reported-by: Jonathan S. Katz
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 11
|
|
Author: Ekaterina Kiryanova
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14
|
|
|
|
Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions. This is "bottom-up
deletion". Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page. This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).
The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates. It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_. Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index. The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.
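A sketch of the kind of workload this targets (hypothetical schema, not
taken from the patch):

    CREATE TABLE accounts (id int PRIMARY KEY, balance numeric, note text);
    CREATE INDEX accounts_balance_idx ON accounts (balance);
    CREATE INDEX accounts_note_idx ON accounts (note);

    -- The UPDATE modifies an indexed column (balance), so it cannot use HOT
    -- and every index receives a new tuple version.  accounts_note_idx is
    -- logically unchanged, so its inserts carry the executor hint; a flood of
    -- such versions on one of its leaf pages can trigger a bottom-up deletion
    -- pass there.
    UPDATE accounts SET balance = balance + 1 WHERE id = 42;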
The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs. That is left as work for a
later commit.
Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing. This increases the number of index tuples deleted
significantly in many cases. The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples. We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).
Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages. That is, it almost never fails
to delete at least a few extra index tuples. It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples. It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).
Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples. It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly). Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.
Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.
No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes. Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade. The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.
Author: Peter Geoghegan <[email protected]>
Reviewed-By: Heikki Linnakangas <[email protected]>
Reviewed-By: Victor Yegorov <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
|
|
99% of this is docs, but also a couple of comments. No code changes.
Justin Pryzby
Discussion: https://postgr.es/m/[email protected]
|
|
Alexander Lakhin
Discussion: https://postgr.es/m/[email protected]
|
|
All the documentation of index AMs has been using <replaceable> for
local_relopts. This is a structure, so <structname> is a much better
choice.
Alexander found the inconsistency for btree, while I spotted the rest
when checking the docs for consistency.
Author: Alexander Lakhin, Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/[email protected]
|
|
Discussion: https://postgr.es/m/20200620232145.GB17995%40telsasoft.com
Author: Justin Pryzby
Backpatch-through: 13
|
|
Defining duplicates as "close by" to each other was unclear. Simplify
the definition.
Backpatch: 13-, where deduplication was introduced (by commit 0d861bbb)
|
|
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-WzmwhYbxuoL0WjTLaiCxW3gj6qadeNpBhWAo_KZsE5-FGw%40mail.gmail.com
|
|
Commit 911e7020770 added opclass options and adjusted the documentation
for each affected opclass. However, the documentation on extensibility
was not adjusted. This commit adjusts the documentation for the
interfaces of index AMs and opclasses.
Discussion: https://postgr.es/m/CAH2-WzmQnW6%2Bz5F9AW%2BSz%2BzEcEvXofTwh_A9J3%3D_WA-FBP0wYg%40mail.gmail.com
Author: Alexander Korotkov
Reported-by: Peter Geoghegan
Reviewed-by: Peter Geoghegan
|
|
The B-Tree index deduplication strategy used during CREATE INDEX and
REINDEX differs from the lazy strategy used by retail inserts. Make
that clear by adding a new paragraph to the B-Tree implementation
section of the documentation.
In passing, do some copy-editing of nearby deduplication documentation.
|
|
This fixes some comments and documentation new as of Postgres 13.
Author: Justin Pryzby
Discussion: https://postgr.es/m/[email protected]
|
|
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method. The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs. Deduplication is only applied at
the point where a leaf page split would otherwise be required. New
posting list tuples are formed by merging together existing duplicate
tuples. The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed. Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.
The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach. Most individual inserts of index tuples have
exactly the same overhead as before. The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits. The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).
Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key). The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective. This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.
A new index storage parameter (deduplicate_items) controls the use of
deduplication. The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible. This decision will be
reviewed at the end of the Postgres 13 beta period.
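For example (hypothetical index; the parameter can also be changed after
the fact):

    -- deduplication is on by default; the storage parameter makes it explicit
    CREATE INDEX orders_customer_idx ON orders (customer_id)
        WITH (deduplicate_items = on);

    -- disable deduplication for future page splits on an existing index
    ALTER INDEX orders_customer_idx SET (deduplicate_items = off);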
There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values. The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely. Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").
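A rough sketch of such a synthetic workload (hypothetical names; every
non-unique index has essentially no repeated values):

    CREATE TABLE events (id bigserial PRIMARY KEY, a uuid, b timestamptz, c bigint);
    CREATE INDEX ON events (a);
    CREATE INDEX ON events (b);
    CREATE INDEX ON events (c);

    -- append-only inserts: deduplication passes are attempted at would-be
    -- page splits but never pay off, costing roughly 2% of throughput
    INSERT INTO events (a, b, c)
        SELECT gen_random_uuid(), clock_timestamp(), g
        FROM generate_series(1, 100000) AS g;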
Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.
No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12). However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from. This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.
Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
https://postgr.es/m/[email protected]
https://postgr.es/m/[email protected]
|
|
Invent the concept of a B-Tree equalimage ("equality implies image
equality") support function, registered as support function 4. This
indicates whether it is safe (or not safe) to apply optimizations that
assume that any two datums considered equal by an operator class's order
method must be interchangeable without any loss of semantic information.
This is static information about an operator class and a collation.
Register an equalimage routine for almost all of the existing B-Tree
opclasses. We only need two trivial routines for all of the opclasses
that are included with the core distribution. There is one routine for
opclasses that index non-collatable types (which returns 'true'
unconditionally), plus another routine for collatable types (which
returns 'true' when the collation is a deterministic collation).
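As a sketch (an illustrative catalog query, not part of the patch), the
registered routines can be inspected like so:

    -- list B-Tree opfamilies that register an equalimage routine
    -- (support function 4)
    SELECT opf.opfname, amp.amproclefttype::regtype, amp.amproc
    FROM pg_amproc AS amp
    JOIN pg_opfamily AS opf ON opf.oid = amp.amprocfamily
    JOIN pg_am AS am ON am.oid = opf.opfmethod
    WHERE am.amname = 'btree' AND amp.amprocnum = 4;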
This patch is infrastructure for an upcoming patch that adds B-Tree
deduplication.
Author: Peter Geoghegan, Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
|
|
Use a top-level "variablelist", with one item per B-Tree support
function. This structure matches the structure used by various
"Extensibility" sections in other documentation chapters for other index
access methods.
An explicit list makes it much clearer where each item begins and ends.
This wasn't really a problem before now, but an upcoming patch that adds
deduplication to nbtree will need to have its own new B-Tree support
function. Ease the burden of translators by tidying up btree.sgml ahead
of committing the deduplication patch.
|
|
|
|
The acronym "btree" means "multi-way balanced tree" rather than
"multi-way binary tree".
Discussion: https://postgr.es/m/20190105.183532.1686260542006440682.t-ishii%40sraoss.co.jp
|
|
Historically we forbade datatype-specific comparison functions from
returning INT_MIN, so that it would be safe to invert the sort order
just by negating the comparison result. However, this was never
really safe for comparison functions that directly return the result
of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction
on those library functions. Buildfarm results show that at least on
recent Linux on s390x, memcmp() actually does return INT_MIN sometimes,
causing sort failures.
The agreed-on answer is to remove this restriction and fix relevant
call sites to not make such an assumption; code such as "res = -res"
should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed
in a few places that just directly negated the result of memcmp or
strcmp.
To help find places having this problem, I've also added a compile option
to nbtcompare.c that causes some of the commonly used comparators to
return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be
a good idea to have at least one buildfarm member running with
"-DSTRESS_SORT_INT_MIN". That's far from a complete test of course,
but it should help to prevent fresh introductions of such bugs.
This is a longstanding portability hazard, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/[email protected]
|
|
Discussing covering indexes in a chapter that is mostly about the
behavior of B-Tree operator classes is unnecessary. The CREATE INDEX
documentation's handling of covering indexes seems sufficient.
Discussion: https://postgr.es/m/CAH2-WzmpU=L_6VjhhOAMfoyHLr-pZd1kDc+jpa3c3a8EOmtcXA@mail.gmail.com
|
|
Discussion: https://www.postgresql.org/message-id/dff3d555-bea4-ac24-29b2-29521b9d08e8%402ndquadrant.com
|
|
This patch introduces the INCLUDE clause to index definitions. This
clause specifies a list of columns which will be included as a non-key
part of the index. The INCLUDE columns exist solely to allow more
queries to benefit from index-only scans. Also, such columns don't need
to have appropriate operator classes. Expressions are not supported as
INCLUDE columns since they cannot be used in index-only scans.
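For example (hypothetical table; order_total is carried in the index
purely for index-only scans):

    CREATE INDEX orders_customer_idx ON orders (customer_id) INCLUDE (order_total);

    -- customer_id is the only search key; order_total needs no operator class
    -- because it is a non-key (INCLUDE) column
    SELECT customer_id, order_total FROM orders WHERE customer_id = 42;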
Index access methods supporting INCLUDE are indicated by the
amcaninclude flag in IndexAmRoutine. For now, only B-tree indexes
support the INCLUDE clause.
In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys). Therefore, B-tree
indexes might now have a variable number of attributes. This patch also
provides a generic facility to support that: pivot tuples store the
number of their attributes in t_tid.ip_posid. The free 13th bit of
t_info is used to indicate that. This facility will simplify further
support of index suffix truncation.
The changes above are backward-compatible; pg_upgrade doesn't need
special handling of B-tree indexes for them.
Bump catalog version
Author: Anastasia Lubennikova, with contributions by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/[email protected]
|
|
This patch adds the ability to use "RANGE offset PRECEDING/FOLLOWING"
frame boundaries in window functions. We'd punted on that back in the
original patch to add window functions, because it was not clear how to
do it in a reasonably data-type-extensible fashion. That problem is
resolved here by adding the ability for btree operator classes to provide
an "in_range" support function that defines how to add or subtract the
RANGE offset value. Factoring it this way also allows the operator class
to avoid overflow problems near the ends of the datatype's range, if it
wishes to expend effort on that. (In the committed patch, the integer
opclasses handle that issue, but it did not seem worth the trouble to
avoid overflow failures for datetime types.)
The patch includes in_range support for the integer_ops opfamily
(int2/int4/int8) as well as the standard datetime types. Support for
other numeric types has been requested, but that seems like suitable
material for a follow-on patch.
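A sketch of the new frame syntax (hypothetical table; the ORDER BY
column's opclass must supply an in_range support function, as the
datetime opclasses now do):

    SELECT ts, val,
           avg(val) OVER (ORDER BY ts
                          RANGE BETWEEN INTERVAL '1 hour' PRECEDING
                                    AND INTERVAL '1 hour' FOLLOWING) AS smoothed
    FROM readings;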
In addition, the patch adds GROUPS mode which counts the offset in
ORDER-BY peer groups rather than rows, and it adds the frame_exclusion
options specified by SQL:2011. As far as I can see, we are now fully
up to spec on window framing options.
Existing behaviors remain unchanged, except that I changed the errcode
for a couple of existing error reports to meet the SQL spec's expectation
that negative "offset" values should be reported as SQLSTATE 22013.
Internally and in relevant parts of the documentation, we now consistently
use the terminology "offset PRECEDING/FOLLOWING" rather than "value
PRECEDING/FOLLOWING", since the term "value" is confusingly vague.
Oliver Ford, reviewed and whacked around some by me
Discussion: https://postgr.es/m/CAGMVOdu9sivPAxbNN0X+q19Sfv9edEPv=HibOJhB14TJv_RCQg@mail.gmail.com
|
|
Up to now, useful info for writing a new btree opclass has been buried
in the backend's nbtree/README file. Let's move it into the SGML docs,
in preparation for extending it with info about "in_range" functions
in the upcoming window RANGE patch.
To do this, I chose to create a new chapter for btree indexes in Part VII
(Internals), parallel to the chapters that exist for the newer index AMs.
This is a pretty short chapter as-is. At some point somebody might care
to flesh it out with more detail about btree internals, but that is
beyond the scope of my ambition for today.
Discussion: https://postgr.es/m/[email protected]
|