Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that use the standard nbtree index access method. The deduplication process is applied lazily, after the point where opportunistic deletion of LP_DEAD-marked index tuples occurs. Deduplication is only applied at the point where a leaf page split would otherwise be required. New posting list tuples are formed by merging together existing duplicate tuples. The physical representation of the items on an nbtree leaf page is made more space efficient by deduplication, but the logical contents of the page are not changed. Even unique indexes make use of deduplication as a way of controlling bloat from duplicates whose TIDs point to different versions of the same logical table row. The lazy approach taken by nbtree has significant advantages over a GIN style eager approach. Most individual inserts of index tuples have exactly the same overhead as before. The extra overhead of deduplication is amortized across insertions, just like the overhead of page splits. The key space of indexes works in the same way as it has since commit dd299df8 (the commit that made heap TID a tiebreaker column). Testing has shown that nbtree deduplication can generally make indexes with about 10 or 15 tuples for each distinct key value about 2.5X - 4X smaller, even with single column integer indexes (e.g., an index on a referencing column that accompanies a foreign key). The final size of single column nbtree indexes comes close to the final size of a similar contrib/btree_gin index, at least in cases where GIN's posting list compression isn't very effective. This can significantly improve transaction throughput, and significantly reduce the cost of vacuuming indexes. A new index storage parameter (deduplicate_items) controls the use of deduplication. The default setting is 'on', so all new B-Tree indexes automatically use deduplication where possible. This decision will be reviewed at the end of the Postgres 13 beta period. There is a regression of approximately 2% of transaction throughput with synthetic workloads that consist of append-only inserts into a table with several non-unique indexes, where all indexes have few or no repeated values. The underlying issue is that cycles are wasted on unsuccessful attempts at deduplicating items in non-unique indexes. There doesn't seem to be a way around it short of disabling deduplication entirely. Note that deduplication of items in unique indexes is fairly well targeted in general, which avoids the problem there (we can use a special heuristic to trigger deduplication passes in unique indexes, since we're specifically targeting "version bloat"). Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed. No bump in BTREE_VERSION, since the representation of posting list tuples works in a way that's backwards compatible with version 4 indexes (i.e. indexes built on PostgreSQL 12). However, users must still REINDEX a pg_upgrade'd index to use deduplication, regardless of the Postgres version they've upgraded from. This is the only way to set the new nbtree metapage flag indicating that deduplication is generally safe. Author: Anastasia Lubennikova, Peter Geoghegan Reviewed-By: Peter Geoghegan, Heikki Linnakangas Discussion: https://postgr.es/m/[email protected] https://postgr.es/m/[email protected]
author: Peter Geoghegan 2020-02-26 21:05:30 +0000
committer: Peter Geoghegan 2020-02-26 21:05:30 +0000
commit: 0d861bbb702f8aa05c2a4e3f1650e7e8df8c8c27 (patch)
tree: 63c5bfea98211202f7d654a93f7e4f24a8ecaf45 /doc/src/sgml/btree.sgml
parent: 612a1ab76724aa1514b6509269342649f8cab375 (diff)
1 files changed, 199 insertions, 2 deletions
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index fcf771c857f..f02e02b0acc 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -557,11 +557,208 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  This section covers B-Tree index implementation details that may be
+  of use to advanced users.  See
+  <filename>src/backend/access/nbtree/README</filename> in the source
+  distribution for a much more detailed, internals-focused description
+  of the B-Tree implementation.
+ </para>
+ <sect2 id="btree-structure">
+  <title>B-Tree Structure</title>
+  <para>
+   <productname>PostgreSQL</productname> B-Tree indexes are
+   multi-level tree structures, where each level of the tree can be
+   used as a doubly-linked list of pages.  A single metapage is stored
+   in a fixed position at the start of the first segment file of the
+   index.  All other pages are either leaf pages or internal pages.
+   Leaf pages are the pages on the lowest level of the tree.  All
+   other levels consist of internal pages.  Each leaf page contains
+   tuples that point to table rows.  Each internal page contains
+   tuples that point to the next level down in the tree.  Typically,
+   over 99% of all pages are leaf pages.  Both internal pages and leaf
+   pages use the standard page format described in <xref
+    linkend="storage-page-layout"/>.
+  </para>
+  <para>
+   New leaf pages are added to a B-Tree index when an existing leaf
+   page cannot fit an incoming tuple.  A <firstterm>page
+    split</firstterm> operation makes room for items that originally
+   belonged on the overflowing page by moving a portion of the items
+   to a new page.  Page splits must also insert a new
+   <firstterm>downlink</firstterm> to the new page in the parent page,
+   which may cause the parent to split in turn.  Page splits
+   <quote>cascade upwards</quote> in a recursive fashion.  When the
+   root page finally cannot fit a new downlink, a <firstterm>root page
+    split</firstterm> operation takes place.  This adds a new level to
+   the tree structure by creating a new root page that is one level
+   above the original root page.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
+  <para>
+   A duplicate is a leaf page tuple (a tuple that points to a table
+   row) where <emphasis>all</emphasis> indexed key columns have values
+   that match corresponding column values from at least one other leaf
+   page tuple that's close by in the same index.  Duplicate tuples are
+   quite common in practice.  B-Tree indexes can use a special,
+   space-efficient representation for duplicates when an optional
+   technique is enabled: <firstterm>deduplication</firstterm>.
+  </para>
+  <para>
+   Deduplication works by periodically merging groups of duplicate
+   tuples together, forming a single posting list tuple for each
+   group.  The column key value(s) only appear once in this
+   representation.  This is followed by a sorted array of
+   <acronym>TID</acronym>s that point to rows in the table.  This
+   significantly reduces the storage size of indexes where each value
+   (or each distinct combination of column values) appears several
+   times on average.  The latency of queries can be reduced
+   significantly.  Overall query throughput may increase
+   significantly.  The overhead of routine index vacuuming may also be
+   reduced significantly.
+  </para>
+  <note>
+   <para>
+    While NULL is generally not considered to be equal to any other
+    value, including NULL, NULL is nevertheless treated as just
+    another value from the domain of indexed values by the B-Tree
+    implementation (except when enforcing uniqueness in a unique
+    index).  B-Tree deduplication is therefore just as effective with
+    <quote>duplicates</quote> that contain a NULL value.
+   </para>
+  </note>
+  <para>
+   The deduplication process occurs lazily, when a new item is
+   inserted that cannot fit on an existing leaf page.  This prevents
+   (or at least delays) leaf page splits.  Unlike GIN posting list
+   tuples, B-Tree posting list tuples do not need to expand every time
+   a new duplicate is inserted; they are merely an alternative
+   physical representation of the original logical contents of the
+   leaf page.  This design prioritizes consistent performance with
+   mixed read-write workloads.  Most client applications will at least
+   see a moderate performance benefit from using deduplication.
+   Deduplication is enabled by default.
+  </para>
+  <para>
+   Write-heavy workloads that don't benefit from deduplication due to
+   having few or no duplicate values in indexes will incur a small,
+   fixed performance penalty (unless deduplication is explicitly
+   disabled).  The <literal>deduplicate_items</literal> storage
+   parameter can be used to disable deduplication within individual
+   indexes.  There is never any performance penalty with read-only
+   workloads, since reading posting list tuples is at least as
+   efficient as reading the standard tuple representation.  Disabling
+   deduplication isn't usually helpful.
+  </para>
+  <para>
+   B-Tree indexes are not directly aware that under MVCC, there might
+   be multiple extant versions of the same logical table row; to an
+   index, each tuple is an independent object that needs its own index
+   entry.  Thus, an update of a row always creates all-new index
+   entries for the row, even if the key values did not change.  Some
+   workloads suffer from index bloat caused by these
+   implementation-level version duplicates (this is typically a
+   problem for <command>UPDATE</command>-heavy workloads that cannot
+   apply the <acronym>HOT</acronym> optimization due to modifying at
+   least one indexed column).  B-Tree deduplication does not
+   distinguish between these implementation-level version duplicates
+   and conventional duplicates.  Deduplication can nevertheless help
+   with controlling index bloat caused by implementation-level version
+   churn.
+  </para>
+  <tip>
+   <para>
+    A special heuristic is applied to determine whether a
+    deduplication pass in a unique index should take place.  It can
+    often skip straight to splitting a leaf page, avoiding a
+    performance penalty from wasting cycles on unhelpful deduplication
+    passes.  If you're concerned about the overhead of deduplication,
+    consider setting <literal>deduplicate_items = off</literal>
+    selectively.  Leaving deduplication enabled in unique indexes has
+    little downside.
+   </para>
+  </tip>
+  <para>
+   Deduplication cannot be used in all cases due to
+   implementation-level restrictions.  Deduplication safety is
+   determined when <command>CREATE INDEX</command> or
+   <command>REINDEX</command> run.
+  </para>
+  <para>
+   Note that deduplication is deemed unsafe and cannot be used in the
+   following cases involving semantically significant differences
+   among equal datums:
+  </para>
+  <para>
+   <itemizedlist>
+    <listitem>
+     <para>
+      <type>text</type>, <type>varchar</type>, and <type>char</type>
+      cannot use deduplication when a
+      <emphasis>nondeterministic</emphasis> collation is used.  Case
+      and accent differences must be preserved among equal datums.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>numeric</type> cannot use deduplication.  Numeric display
+      scale must be preserved among equal datums.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>jsonb</type> cannot use deduplication, since the
+      <type>jsonb</type> B-Tree operator class uses
+      <type>numeric</type> internally.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>float4</type> and <type>float8</type> cannot use
+      deduplication.  These types have distinct representations for
+      <literal>-0</literal> and <literal>0</literal>, which are
+      nevertheless considered equal.  This difference must be
+      preserved.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+  <para>
+   There is one further implementation-level restriction that may be
+   lifted in a future version of
+   <productname>PostgreSQL</productname>:
+  </para>
+  <para>
+   <itemizedlist>
+    <listitem>
+     <para>
+      Container types (such as composite types, arrays, or range
+      types) cannot use deduplication.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+  <para>
+   There is one further implementation-level restriction that applies
+   regardless of the operator class or collation used:
+  </para>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   <itemizedlist>
+    <listitem>
+     <para>
+      <literal>INCLUDE</literal> indexes can never use deduplication.
+     </para>
+    </listitem>
+   </itemizedlist>
   </para>
 
+ </sect2>
 </sect1>
 
 </chapter>
author	Peter Geoghegan	2020-02-26 21:05:30 +0000
committer	Peter Geoghegan	2020-02-26 21:05:30 +0000
commit	0d861bbb702f8aa05c2a4e3f1650e7e8df8c8c27 (patch)
tree	63c5bfea98211202f7d654a93f7e4f24a8ecaf45 /doc/src/sgml/btree.sgml
parent	612a1ab76724aa1514b6509269342649f8cab375 (diff)