author     Peter Geoghegan  2020-08-07 22:27:56 +0000
committer  Peter Geoghegan  2020-08-07 22:27:56 +0000
commit     0a7d771f0f63eb120e7f0a60aecd543ab25ba197 (patch)
tree       d57a68cf049f1fea53f5adfb19c25e5c9205a0e0 /src/backend/access/nbtree/README
parent     cea3d55898655582e3a3835a7bed2c3a1b002fef (diff)
Make nbtree split REDO locking match original execution.
Make the nbtree page split REDO routine consistent with original
execution in its approach to acquiring and releasing buffer locks (at
least for pages on the tree level of the page being split). This brings
btree_xlog_split() in line with btree_xlog_unlink_page(), which was
taught to couple buffer locks by commit 9a9db08a.
Note that the precise order in which we both acquire and release sibling
buffer locks in btree_xlog_split() now matches original execution
exactly (the precise order in which the locks are released probably
doesn't matter much, but we might as well be consistent about it).
The rule for nbtree REDO routines from here on is that same-level locks
should be acquired in an order that's consistent with original
execution. It's not practical to have a similar rule for cross-level
page locks, since for the most part original execution holds those locks
for a period that spans multiple atomic actions/WAL records. It's also
not necessary, because clearly the cross-level lock coupling is only
truly needed during original execution because of the presence of
concurrent inserters.
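The same-level ordering rule above can be modeled with a small self-contained sketch (Python used purely for illustration; the page names and the helper function are invented, not PostgreSQL source):

```python
# Hypothetical model of the same-level locking rule established by this
# commit; nothing below is PostgreSQL source code.

def split_lock_order(mode: str) -> list:
    """Order in which same-level buffer locks are acquired for a page
    split, for original execution vs. WAL replay (REDO)."""
    if mode == "original":
        # Original execution (_bt_split): the page being split is
        # already write-locked, then the new right sibling is locked,
        # then the old right sibling (whose left link must be
        # redirected to point at the new page).
        return ["origpage", "new_rightpage", "old_rightpage"]
    if mode == "replay":
        # After this commit, btree_xlog_split() couples the same-level
        # locks in the identical order, rather than locking and
        # updating the old right sibling only after the locks on the
        # other two pages were released.
        return ["origpage", "new_rightpage", "old_rightpage"]
    raise ValueError(mode)

# The invariant this commit establishes:
assert split_lock_order("replay") == split_lock_order("original")
```

No analogous invariant is asserted for cross-level locks, since (as noted above) original execution holds those across multiple atomic actions/WAL records, so replay cannot mirror them record by record.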
This is not a bug fix (unlike the similar aforementioned commit, commit
9a9db08a). The immediate reason to tighten things up in this area is to
enable an upcoming enhancement to contrib/amcheck that allows it to
verify that sibling links are in agreement with only an AccessShareLock
(this check produced false positives when run on a replica server on
account of the inconsistency fixed by this commit). But that's not the
only reason to be stricter here.
It is generally useful to make locking on replicas be as close to what
happens during original execution as practically possible. It makes it
less likely that hard-to-catch bugs will slip in in the future. The
previous state of affairs seems to be a holdover from before the
introduction of Hot Standby, when buffer lock acquisitions during
recovery were totally unnecessary. See also: commit 3bbf668d, which
tightened things up in this area a few years after the introduction of
Hot Standby.
Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com
Diffstat (limited to 'src/backend/access/nbtree/README')
-rw-r--r--  src/backend/access/nbtree/README  23
1 file changed, 6 insertions, 17 deletions
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 9d5fc424a57..abce31a5a96 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -572,23 +572,12 @@ replay of page deletion records does not hold a write lock on the target
 leaf page throughout; only the primary needs to block out concurrent
 writers that insert on to the page being deleted.)
 
-There are also locking differences between the primary and WAL replay
-for the first stage of a page split (i.e. same-level differences in
-locking). Replay of the first phase of a page split can get away with
-locking and updating the original right sibling page (which is also the
-new right sibling page's right sibling) after locks on the original page
-and its new right sibling have been released. Again, this is okay
-because there are no writers. Page deletion WAL replay cannot get away
-with being lax about same-level locking during replay, though -- doing
-so risks confusing concurrent backwards scans.
-
-Page deletion's second phase locks the left sibling page, target page,
-and right page in order on the standby, just like on the primary. This
-allows backwards scans running on a standby to reason about page
-deletion on the leaf level; a page cannot appear deleted without that
-being reflected in the sibling pages. It's probably possible to be more
-lax about how locks are acquired on the standby during the second phase
-of page deletion, but that hardly seems worth it.
+WAL replay holds same-level locks in a way that matches the approach
+taken during original execution, though. This prevents readers from
+observing same-level inconsistencies. It's probably possible to be more
+lax about how same-level locks are acquired during recovery (most kinds
+of readers could still move right to recover if we didn't couple
+same-level locks), but we prefer to be conservative here.
 
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple. We do this because the oldest xmin