src/backend/access/heap/README.HOT

   1 src/backend/access/heap/README.HOT
   2
   3 Heap Only Tuples (HOT)
   4 ======================
   5
   6 The Heap Only Tuple (HOT) feature eliminates redundant index entries and
   7 allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples
   8 without performing a table-wide vacuum.  It does this by allowing
   9 single-page vacuuming, also called "defragmentation" or "pruning".
  10
  11 Note: there is a Glossary at the end of this document that may be helpful
  12 for first-time readers.
  13
  14
  15 Technical Challenges
  16 --------------------
  17
  18 Page-at-a-time vacuuming is normally impractical because of the costs of
  19 finding and removing the index entries that link to the tuples to be
  20 reclaimed.  Standard vacuuming scans the indexes to ensure all such index
  21 entries are removed, amortizing the index scan cost across as many dead
  22 tuples as possible; this approach does not scale down well to the case of
  23 reclaiming just a few tuples.  In principle one could recompute the index
  24 keys and do standard index searches to find the index entries, but this is
  25 risky in the presence of possibly-buggy user-defined functions in
  26 functional indexes.  An allegedly immutable function that in fact is not
  27 immutable might prevent us from re-finding an index entry (and we cannot
  28 throw an error for not finding it, in view of the fact that dead index
  29 entries are sometimes reclaimed early).  That would lead to a seriously
  30 corrupt index, in the form of entries pointing to tuple slots that by now
  31 contain some unrelated content.  In any case we would prefer to be able
  32 to do vacuuming without invoking any user-written code.
  33
  34 HOT solves this problem for two restricted but useful special cases:
  35
  36 First, where a tuple is repeatedly updated in ways that do not change
  37 its indexed columns.  (Here, "indexed column" means any column referenced
  38 at all in an index definition, including for example columns that are
  39 tested in a partial-index predicate but are not stored in the index.)
  40
  41 Second, where the modified columns are only used in indexes that do not
  42 contain tuple IDs, but maintain summaries of the indexed data by block.
  43 As these indexes don't contain references to individual tuples, they
  44 can't remove tuple references in VACUUM, and thus don't need to get a new
  45 and unique reference to a tuple.  These indexes still need to be notified
  46 of the new column data, but don't need a new HOT chain to be established.
  47
  48 An additional property of HOT is that it reduces index size by avoiding
  49 the creation of identically-keyed index entries.  This improves search
  50 speeds.
  51
  52
  53 Update Chains With a Single Index Entry
  54 ---------------------------------------
  55
  56 Without HOT, every version of a row in an update chain has its own index
  57 entries, even if all indexed columns are the same.  With HOT, a new tuple
  58 placed on the same page and with all indexed columns the same as its
  59 parent row version does not get new index entries.  This means there is
  60 only one index entry for the entire update chain on the heap page.
  61 An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag.
  62 The prior row version is marked HEAP_HOT_UPDATED, and (as always in an
  63 update chain) its t_ctid field links forward to the newer version.
  64
  65 For example:
  66
  67     Index points to 1
  68     lp [1]  [2]
  69
  70     [111111111]->[2222222222]
  71
  72 In the above diagram, the index points to line pointer 1, and tuple 1 is
  73 marked as HEAP_HOT_UPDATED.  Tuple 2 is a HOT tuple, meaning it has
  74 no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE.
  75 Although tuple 2 is not directly referenced by the index, it can still be
  76 found by an index search: after traversing from the index to tuple 1,
  77 the index search proceeds forward to child tuples as long as it sees the
  78 HEAP_HOT_UPDATED flag set.  Since we restrict the HOT chain to lie within
  79 a single page, this requires no additional page fetches and doesn't
  80 introduce much performance penalty.
  81
  82 Eventually, tuple 1 will no longer be visible to any transaction.
  83 At that point its space could be reclaimed, but its line pointer cannot,
  84 since the index still links to that line pointer and we still need to
  85 be able to find tuple 2 in an index search.  HOT handles this by turning
  86 line pointer 1 into a "redirecting line pointer", which links to tuple 2
  87 but has no actual tuple attached.  This state of affairs looks like
  88
  89     Index points to 1
  90     lp [1]->[2]
  91
  92     [2222222222]
  93
  94 If now the row is updated again, to version 3, the page looks like this:
  95
  96     Index points to 1
  97     lp [1]->[2]  [3]
  98
  99     [2222222222]->[3333333333]
 100
 101 At some later time when no transaction can see tuple 2 in its snapshot,
 102 tuple 2 and its line pointer can be pruned entirely:
 103
 104     Index points to 1
 105     lp [1]------>[3]
 106
 107     [3333333333]
 108
 109 This is safe because no index entry points to line pointer 2.  Subsequent
 110 insertions into the page can now recycle both line pointer 2 and the
 111 space formerly used by tuple 2.
 112
 113 If an update changes any column indexed by a non-summarizing indexes, or
 114 if there is not room on the same page for the new tuple, then the HOT
 115 chain ends: the last member has a regular t_ctid link to the next version
 116 and is not marked HEAP_HOT_UPDATED.  (In principle we could continue a
 117 HOT chain across pages, but this would destroy the desired property of
 118 being able to reclaim space with just page-local manipulations.  Anyway,
 119 we don't want to have to chase through multiple heap pages to get from an
 120 index entry to the desired tuple, so it seems better to create a new
 121 index entry for the new tuple.)  If further updates occur, the next
 122 version could become the root of a new HOT chain.
 123
 124 Line pointer 1 has to remain as long as there is any non-dead member of
 125 the chain on the page.  When there is not, it is marked "dead".
 126 This lets us reclaim the last child line pointer and associated tuple
 127 immediately.  The next regular VACUUM pass can reclaim the index entries
 128 pointing at the line pointer and then the line pointer itself.  Since a
 129 line pointer is small compared to a tuple, this does not represent an
 130 undue space cost.
 131
 132 Note: we can use a "dead" line pointer for any DELETEd tuple,
 133 whether it was part of a HOT chain or not.  This allows space reclamation
 134 in advance of running VACUUM for plain DELETEs as well as HOT updates.
 135
 136 The requirement for doing a HOT update is that indexes which point to
 137 the root line pointer (and thus need to be cleaned up by VACUUM when the
 138 tuple is dead) do not reference columns which are updated in that HOT
 139 chain.  Summarizing indexes (such as BRIN) are assumed to have no
 140 references to individual tuples and thus are ignored when checking HOT
 141 applicability.  The updated columns are checked at execution time by
 142 comparing the binary representation of the old and new values.  We insist
 143 on bitwise equality rather than using datatype-specific equality routines.
 144 The main reason to avoid the latter is that there might be multiple
 145 notions of equality for a datatype, and we don't know exactly which one
 146 is relevant for the indexes at hand.  We assume that bitwise equality
 147 guarantees equality for all purposes.
 148
 149 If any columns that are included by non-summarizing indexes are updated,
 150 the HOT optimization is not applied, and the new tuple is inserted into
 151 all indexes of the table.  If none of the updated columns are included in
 152 the table's indexes, the HOT optimization is applied and no indexes are
 153 updated.  If instead the updated columns are only indexed by summarizing
 154 indexes, the HOT optimization is applied, but the update is propagated to
 155 all summarizing indexes.  (Realistically, we only need to propagate the
 156 update to the indexes that contain the updated values, but that is yet to
 157 be implemented.)
 158
 159 Abort Cases
 160 -----------
 161
 162 If a heap-only tuple's xmin is aborted, then it can be removed immediately:
 163 it was never visible to any other transaction, and all descendant row
 164 versions must be aborted as well.  Therefore we need not consider it part
 165 of a HOT chain.  By the same token, if a HOT-updated tuple's xmax is
 166 aborted, there is no need to follow the chain link.  However, there is a
 167 race condition here: the transaction that did the HOT update might abort
 168 between the time we inspect the HOT-updated tuple and the time we reach
 169 the descendant heap-only tuple.  It is conceivable that someone prunes
 170 the heap-only tuple before that, and even conceivable that the line pointer
 171 is re-used for another purpose.  Therefore, when following a HOT chain,
 172 it is always necessary to be prepared for the possibility that the
 173 linked-to line pointer is unused, dead, or redirected; and if it is a
 174 normal line pointer, we still have to check that XMIN of the tuple matches
 175 the XMAX of the tuple we left.  Otherwise we should assume that we have
 176 come to the end of the HOT chain.  Note that this sort of XMIN/XMAX
 177 matching is required when following ordinary update chains anyway.
 178
 179 (Early versions of the HOT code assumed that holding pin on the page
 180 buffer while following a HOT link would prevent this type of problem,
 181 but checking XMIN/XMAX matching is a much more robust solution.)
 182
 183
 184 Index/Sequential Scans
 185 ----------------------
 186
 187 When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose
 188 xmax is not aborted, we need to follow its t_ctid link and check that
 189 entry as well; possibly repeatedly until we reach the end of the HOT
 190 chain.  (When using an MVCC snapshot it is possible to optimize this a
 191 bit: there can be at most one visible tuple in the chain, so we can stop
 192 when we find it.  This rule does not work for non-MVCC snapshots, though.)
 193
 194 Sequential scans do not need to pay attention to the HOT links because
 195 they scan every line pointer on the page anyway.  The same goes for a
 196 bitmap heap scan with a lossy bitmap.
 197
 198
 199 Pruning
 200 -------
 201
 202 HOT pruning means updating line pointers so that HOT chains are
 203 reduced in length, by collapsing out line pointers for intermediate dead
 204 tuples.  Although this makes those line pointers available for re-use,
 205 it does not immediately make the space occupied by their tuples available.
 206
 207
 208 Defragmentation
 209 ---------------
 210
 211 Defragmentation centralizes unused space.  After we have converted root
 212 line pointers to redirected line pointers and pruned away any dead
 213 intermediate line pointers, the tuples they linked to are free space.
 214 But unless that space is adjacent to the central "hole" on the page
 215 (the pd_lower-to-pd_upper area) it cannot be used by tuple insertion.
 216 Defragmentation moves the surviving tuples to coalesce all the free
 217 space into one "hole".  This is done with the same PageRepairFragmentation
 218 function that regular VACUUM uses.
 219
 220
 221 When can/should we prune or defragment?
 222 ---------------------------------------
 223
 224 This is the most interesting question in HOT implementation, since there
 225 is no simple right answer: we must use heuristics to determine when it's
 226 most efficient to perform pruning and/or defragmenting.
 227
 228 We cannot prune or defragment unless we can get a "buffer cleanup lock"
 229 on the target page; otherwise, pruning might destroy line pointers that
 230 other backends have live references to, and defragmenting might move
 231 tuples that other backends have live pointers to.  Thus the general
 232 approach must be to heuristically decide if we should try to prune
 233 or defragment, and if so try to acquire the buffer cleanup lock without
 234 blocking.  If we succeed we can proceed with our housekeeping work.
 235 If we cannot get the lock (which should not happen often, except under
 236 very heavy contention) then the housekeeping has to be postponed till
 237 some other time.  The worst-case consequence of this is only that an
 238 UPDATE cannot be made HOT but has to link to a new tuple version placed on
 239 some other page, for lack of centralized space on the original page.
 240
 241 Ideally we would do defragmenting only when we are about to attempt
 242 heap_update on a HOT-safe tuple.  The difficulty with this approach
 243 is that the update query has certainly got a pin on the old tuple, and
 244 therefore our attempt to acquire a buffer cleanup lock will always fail.
 245 (This corresponds to the idea that we don't want to move the old tuple
 246 out from under where the query's HeapTuple pointer points.  It might
 247 be possible to finesse that, but it seems fragile.)
 248
 249 Pruning, however, is potentially useful even when we are not about to
 250 insert a new tuple, since shortening a HOT chain reduces the cost of
 251 subsequent index searches.  However it is unclear that this gain is
 252 large enough to accept any extra maintenance burden for.
 253
 254 The currently planned heuristic is to prune and defrag when first accessing
 255 a page that potentially has prunable tuples (as flagged by the pd_prune_xid
 256 page hint field) and that either has free space less than MAX(fillfactor
 257 target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to
 258 find enough free space to store an updated tuple version.  (These rules
 259 are subject to change.)
 260
 261 We have effectively implemented the "truncate dead tuples to just line
 262 pointer" idea that has been proposed and rejected before because of fear
 263 of line pointer bloat: we might end up with huge numbers of line pointers
 264 and just a few actual tuples on a page.  To limit the damage in the worst
 265 case, and to keep various work arrays as well as the bitmaps in bitmap
 266 scans reasonably sized, the maximum number of line pointers per page
 267 is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that
 268 could fit without HOT pruning).
 269
 270 Effectively, space reclamation happens during tuple retrieval when the
 271 page is nearly full (<10% free) and a buffer cleanup lock can be
 272 acquired.  This means that UPDATE, DELETE, and SELECT can trigger space
 273 reclamation, but often not during INSERT ... VALUES because it does
 274 not retrieve a row.
 275
 276
 277 VACUUM
 278 ------
 279
 280 There is little change to regular vacuum.  It performs pruning to remove
 281 dead heap-only tuples, and cleans up any dead line pointers as if they were
 282 regular dead tuples.
 283
 284
 285 Statistics
 286 ----------
 287
 288 Currently, we count HOT updates the same as cold updates for statistics
 289 purposes, though there is an additional per-table counter that counts
 290 only HOT updates.  When a page pruning operation is able to remove a
 291 physical tuple by eliminating an intermediate heap-only tuple or
 292 replacing a physical root tuple by a redirect pointer, a decrement in
 293 the table's number of dead tuples is reported to pgstats, which may
 294 postpone autovacuuming.  Note that we do not count replacing a root tuple
 295 by a DEAD line pointer as decrementing dead_tuples; we still want
 296 autovacuum to run to clean up the index entries and DEAD item.
 297
 298 This area probably needs further work ...
 299
 300
 301 CREATE INDEX
 302 ------------
 303
 304 CREATE INDEX presents a problem for HOT updates.  While the existing HOT
 305 chains all have the same index values for existing indexes, the columns
 306 in the new index might change within a pre-existing HOT chain, creating
 307 a "broken" chain that can't be indexed properly.
 308
 309 To address this issue, regular (non-concurrent) CREATE INDEX makes the
 310 new index usable only by new transactions and transactions that don't
 311 have snapshots older than the CREATE INDEX command.  This prevents
 312 queries that can see the inconsistent HOT chains from trying to use the
 313 new index and getting incorrect results.  Queries that can see the index
 314 can only see the rows that were visible after the index was created,
 315 hence the HOT chains are consistent for them.
 316
 317 Entries in the new index point to root tuples (tuples with current index
 318 pointers) so that our index uses the same index pointers as all other
 319 indexes on the table.  However the row we want to index is actually at
 320 the *end* of the chain, ie, the most recent live tuple on the HOT chain.
 321 That is the one we compute the index entry values for, but the TID
 322 we put into the index is that of the root tuple.  Since queries that
 323 will be allowed to use the new index cannot see any of the older tuple
 324 versions in the chain, the fact that they might not match the index entry
 325 isn't a problem.  (Such queries will check the tuple visibility
 326 information of the older versions and ignore them, without ever looking at
 327 their contents, so the content inconsistency is OK.)  Subsequent updates
 328 to the live tuple will be allowed to extend the HOT chain only if they are
 329 HOT-safe for all the indexes.
 330
 331 Because we have ShareLock on the table, any DELETE_IN_PROGRESS or
 332 INSERT_IN_PROGRESS tuples should have come from our own transaction.
 333 Therefore we can consider them committed since if the CREATE INDEX
 334 commits, they will be committed, and if it aborts the index is discarded.
 335 An exception to this is that early lock release is customary for system
 336 catalog updates, and so we might find such tuples when reindexing a system
 337 catalog.  In that case we deal with it by waiting for the source
 338 transaction to commit or roll back.  (We could do that for user tables
 339 too, but since the case is unexpected we prefer to throw an error.)
 340
 341 Practically, we prevent certain transactions from using the new index by
 342 setting pg_index.indcheckxmin to TRUE.  Transactions are allowed to use
 343 such an index only after pg_index.xmin is below their TransactionXmin
 344 horizon, thereby ensuring that any incompatible rows in HOT chains are
 345 dead to them. (pg_index.xmin will be the XID of the CREATE INDEX
 346 transaction.  The reason for using xmin rather than a normal column is
 347 that the regular vacuum freezing mechanism will take care of converting
 348 xmin to FrozenTransactionId before it can wrap around.)
 349
 350 This means in particular that the transaction creating the index will be
 351 unable to use the index if the transaction has old snapshots.  We
 352 alleviate that problem somewhat by not setting indcheckxmin unless the
 353 table actually contains HOT chains with RECENTLY_DEAD members.
 354
 355 Another unpleasant consequence is that it is now risky to use SnapshotAny
 356 in an index scan: if the index was created more recently than the last
 357 vacuum, it's possible that some of the visited tuples do not match the
 358 index entry they are linked to.  This does not seem to be a fatal
 359 objection, since there are few users of SnapshotAny and most use seqscans.
 360 The only exception at this writing is CLUSTER, which is okay because it
 361 does not require perfect ordering of the indexscan readout (and especially
 362 so because CLUSTER tends to write recently-dead tuples out of order anyway).
 363
 364
 365 CREATE INDEX CONCURRENTLY
 366 -------------------------
 367
 368 In the concurrent case we must take a different approach.  We create the
 369 pg_index entry immediately, before we scan the table.  The pg_index entry
 370 is marked as "not ready for inserts".  Then we commit and wait for any
 371 transactions which have the table open to finish.  This ensures that no
 372 new HOT updates will change the key value for our new index, because all
 373 transactions will see the existence of the index and will respect its
 374 constraint on which updates can be HOT.  Other transactions must include
 375 such an index when determining HOT-safety of updates, even though they
 376 must ignore it for both insertion and searching purposes.
 377
 378 We must do this to avoid making incorrect index entries.  For example,
 379 suppose we are building an index on column X and we make an index entry for
 380 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
 381 indexed column, HOT-updates the row to have X=2, and commits.  We now have
 382 an index entry for X=1 pointing at a HOT chain whose live row has X=2.
 383 We could make an index entry with X=2 during the validation pass, but
 384 there is no nice way to get rid of the wrong entry with X=1.  So we must
 385 have the HOT-safety property enforced before we start to build the new
 386 index.
 387
 388 After waiting for transactions which had the table open, we build the index
 389 for all rows that are valid in a fresh snapshot.  Any tuples visible in the
 390 snapshot will have only valid forward-growing HOT chains.  (They might have
 391 older HOT updates behind them which are broken, but this is OK for the same
 392 reason it's OK in a regular index build.)  As above, we point the index
 393 entry at the root of the HOT-update chain but we use the key value from the
 394 live tuple.
 395
 396 We mark the index open for inserts (but still not ready for reads) then
 397 we again wait for transactions which have the table open.  Then we take
 398 a second reference snapshot and validate the index.  This searches for
 399 tuples missing from the index, and inserts any missing ones.  Again,
 400 the index entries have to have TIDs equal to HOT-chain root TIDs, but
 401 the value to be inserted is the one from the live tuple.
 402
 403 Then we wait until every transaction that could have a snapshot older than
 404 the second reference snapshot is finished.  This ensures that nobody is
 405 alive any longer who could need to see any tuples that might be missing
 406 from the index, as well as ensuring that no one can see any inconsistent
 407 rows in a broken HOT chain (the first condition is stronger than the
 408 second).  Finally, we can mark the index valid for searches.
 409
 410 Note that we do not need to set pg_index.indcheckxmin in this code path,
 411 because we have outwaited any transactions that would need to avoid using
 412 the index.  (indcheckxmin is only needed because non-concurrent CREATE
 413 INDEX doesn't want to wait; its stronger lock would create too much risk of
 414 deadlock if it did.)
 415
 416
 417 DROP INDEX CONCURRENTLY
 418 -----------------------
 419
 420 DROP INDEX CONCURRENTLY is sort of the reverse sequence of CREATE INDEX
 421 CONCURRENTLY.  We first mark the index as not indisvalid, and then wait for
 422 any transactions that could be using it in queries to end.  (During this
 423 time, index updates must still be performed as normal, since such
 424 transactions might expect freshly inserted tuples to be findable.)
 425 Then, we clear indisready and indislive, and again wait for transactions
 426 that could be updating the index to end.  Finally we can drop the index
 427 normally (though taking only ShareUpdateExclusiveLock on its parent table).
 428
 429 The reason we need the pg_index.indislive flag is that after the second
 430 wait step begins, we don't want transactions to be touching the index at
 431 all; otherwise they might suffer errors if the DROP finally commits while
 432 they are reading catalog entries for the index.  If we had only indisvalid
 433 and indisready, this state would be indistinguishable from the first stage
 434 of CREATE INDEX CONCURRENTLY --- but in that state, we *do* want
 435 transactions to examine the index, since they must consider it in
 436 HOT-safety checks.
 437
 438
 439 Limitations and Restrictions
 440 ----------------------------
 441
 442 It is worth noting that HOT forever forecloses alternative approaches
 443 to vacuuming, specifically the recompute-the-index-keys approach alluded
 444 to in Technical Challenges above.  It'll be tough to recompute the index
 445 keys for a root line pointer you don't have data for anymore ...
 446
 447
 448 Glossary
 449 --------
 450
 451 Broken HOT Chain
 452
 453     A HOT chain in which the key value for an index has changed.
 454
 455     This is not allowed to occur normally but if a new index is created
 456     it can happen.  In that case various strategies are used to ensure
 457     that no transaction for which the older tuples are visible can
 458     use the index.
 459
 460 Cold update
 461
 462     A normal, non-HOT update, in which index entries are made for
 463     the new version of the tuple.
 464
 465 Dead line pointer
 466
 467     A stub line pointer, that does not point to anything, but cannot
 468     be removed or reused yet because there are index pointers to it.
 469     Semantically same as a dead tuple.  It has state LP_DEAD.
 470
 471 Heap-only tuple
 472
 473     A heap tuple with no index pointers, which can only be reached
 474     from indexes indirectly through its ancestral root tuple.
 475     Marked with HEAP_ONLY_TUPLE flag.
 476
 477 HOT-safe
 478
 479     A proposed tuple update is said to be HOT-safe if it changes
 480     none of the tuple's indexed columns.  It will only become an
 481     actual HOT update if we can find room on the same page for
 482     the new tuple version.
 483
 484 HOT update
 485
 486     An UPDATE where the new tuple becomes a heap-only tuple, and no
 487     new index entries are made.
 488
 489 HOT-updated tuple
 490
 491     An updated tuple, for which the next tuple in the chain is a
 492     heap-only tuple.  Marked with HEAP_HOT_UPDATED flag.
 493
 494 Indexed column
 495
 496     A column used in an index definition.  The column might not
 497     actually be stored in the index --- it could be used in a
 498     functional index's expression, or used in a partial index
 499     predicate.  HOT treats all these cases alike.
 500
 501 Redirecting line pointer
 502
 503     A line pointer that points to another line pointer and has no
 504     associated tuple.  It has the special lp_flags state LP_REDIRECT,
 505     and lp_off is the OffsetNumber of the line pointer it links to.
 506     This is used when a root tuple becomes dead but we cannot prune
 507     the line pointer because there are non-dead heap-only tuples
 508     further down the chain.
 509
 510 Root tuple
 511
 512     The first tuple in a HOT update chain; the one that indexes point to.
 513
 514 Update chain
 515
 516     A chain of updated tuples, in which each tuple's ctid points to
 517     the next tuple in the chain. A HOT update chain is an update chain
 518     (or portion of an update chain) that consists of a root tuple and
 519     one or more heap-only tuples.  A complete update chain can contain
 520     both HOT and non-HOT (cold) updated tuples.