summaryrefslogtreecommitdiff
path: root/src/backend/storage
AgeCommit message (Collapse)Author
2023-07-26Fix crash with RemoveFromWaitQueue() when detecting a deadlock.Masahiko Sawada
Commit 5764f611e used dclist_delete_from() to remove the proc from the wait queue. However, since it doesn't clear dist_node's next/prev to NULL, it could call RemoveFromWaitQueue() twice: when the process detects a deadlock and then when cleaning up locks on aborting the transaction. The waiting lock information is cleared in the first call, so it led to a crash in the second call. Backpatch to v16, where the change was introduced. Bug: #18031 Reported-by: Justin Pryzby, Alexander Lakhin Reviewed-by: Andres Freund Discussion: https://postgr.es/m/ZKy4AdrLEfbqrxGJ%40telsasoft.com Discussion: https://postgr.es/m/[email protected] Backpatch-through: 16
2023-07-26Document more assumptions of LWLock variable changes with WAL insertsMichael Paquier
This commit adds a few comments about what LWLockWaitForVar() relies on when a backend waits for a variable update on its LWLocks for WAL insertions up to an expected LSN. First, LWLockWaitForVar() does not include a memory barrier, relying on a spinlock taken at the beginning of WaitXLogInsertionsToFinish(). This was hidden behind two layers of routines in lwlock.c. This assumption is now documented at the top of LWLockWaitForVar(), and detailed at bit more within LWLockConflictsWithVar(). Second, document why WaitXLogInsertionsToFinish() does not include memory barriers, relying on a spinlock at its top, which is, per Andres' input, fine for two different reasons, both depending on the fact that the caller of WaitXLogInsertionsToFinish() is waiting for a LSN up to a certain value. This area's documentation and assumptions could be improved more in the future, but at least that's a beginning. Author: Bharath Rupireddy, Andres Freund Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
2023-07-25Optimize WAL insertion lock acquisition and release with some atomicsMichael Paquier
The WAL insertion lock variable insertingAt is currently being read and written with the help of the LWLock wait list lock to avoid any read of torn values. This wait list lock can become a point of contention on a highly concurrent write workloads. This commit switches insertingAt to a 64b atomic variable that provides torn-free reads/writes. On platforms without 64b atomic support, the fallback implementation uses spinlocks to provide the same guarantees for the values read. LWLockWaitForVar(), through LWLockConflictsWithVar(), reads the new value to check if it still needs to wait with a u64 atomic operation. LWLockUpdateVar() updates the variable before waking up the waiters with an exchange_u64 (full memory barrier). LWLockReleaseClearVar() now uses also an exchange_u64 to reset the variable. Before this commit, all these steps relied on LWLockWaitListLock() and LWLockWaitListUnlock(). This reduces contention on LWLock wait list lock and improves performance of highly-concurrent write workloads. Here are some numbers using pg_logical_emit_message() (HEAD at d6677b93) with various arbitrary record lengths and clients up to 1k on a rather-large machine (64 vCPUs, 512GB of RAM, 16 cores per sockets, 2 sockets), in terms of TPS numbers coming from pgbench: message_size_b | 16 | 64 | 256 | 1024 --------------------+--------+--------+--------+------- patch_4_clients | 83830 | 82929 | 80478 | 73131 patch_16_clients | 267655 | 264973 | 250566 | 213985 patch_64_clients | 380423 | 378318 | 356907 | 294248 patch_256_clients | 360915 | 354436 | 326209 | 263664 patch_512_clients | 332654 | 321199 | 287521 | 240128 patch_1024_clients | 288263 | 276614 | 258220 | 217063 patch_2048_clients | 252280 | 243558 | 230062 | 192429 patch_4096_clients | 212566 | 213654 | 205951 | 166955 head_4_clients | 83686 | 83766 | 81233 | 73749 head_16_clients | 266503 | 265546 | 249261 | 213645 head_64_clients | 366122 | 363462 | 341078 | 261707 head_256_clients | 132600 | 132573 | 134392 | 165799 head_512_clients | 118937 | 114332 | 116860 | 150672 head_1024_clients | 133546 | 115256 | 125236 | 151390 head_2048_clients | 137877 | 117802 | 120909 | 138165 head_4096_clients | 113440 | 115611 | 120635 | 114361 Bharath has been measuring similar improvements, where the limit of the WAL insertion lock begins to be felt when more than 256 concurrent clients are involved in this specific workload. An extra patch has been discussed to introduce a fast-exit path in LWLockUpdateVar() when there are no waiters, still this does not influence the write-heavy workload cases discussed as there are always waiters. This will be considered separately. Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
2023-07-25Fix off-by-one in LimitAdditionalPins()Andres Freund
Due to the bug LimitAdditionalPins() could return 0, violating LimitAdditionalPins()'s API ("One additional pin is always allowed"). This could be hit when setting shared_buffers very low and using a fair amount of concurrency. This bug was introduced in 31966b151e6a. Author: "Anton A. Melnikov" <[email protected]> Reported-by: "Anton A. Melnikov" <[email protected]> Reported-by: Victoria Shepard Discussion: https://postgr.es/m/[email protected] Backpatch: 16-
2023-07-14Remove wal_sync_method=fsync_writethrough on Windows.Thomas Munro
The "fsync" level already flushes drive write caches on Windows (as does "fdatasync"), so it only confuses matters to have an apparently higher level that isn't actually different at all. That leaves "fsync_writethrough" only for macOS, where it actually does something different. Reviewed-by: Magnus Hagander <[email protected]> Discussion: https://postgr.es/m/CA%2BhUKGJ2CG2SouPv2mca2WCTOJxYumvBARRcKPraFMB6GSEMcA%40mail.gmail.com
2023-07-10Message wording improvementsPeter Eisentraut
2023-07-06Add GUC parameter "huge_pages_status"Michael Paquier
This is useful to show the allocation state of huge pages when setting up a server with "huge_pages = try", where allocating huge pages would be attempted but the server would continue its startup sequence even if the allocation fails. The effective status of huge pages is not easily visible without OS-level tools (or for instance, a lookup at /proc/N/smaps), and the environments where Postgres runs may not authorize that. Like the other GUCs related to huge pages, this works for Linux and Windows. This GUC can report as values: - "on", if huge pages were allocated. - "off", if huge pages were not allocated. - "unknown", a special state that could only be seen when using for example postgres -C because it is only possible to know if the shared memory allocation worked after we can check for the GUC values, even if checking a runtime-computed GUC. This value should never be seen when querying for the GUC on a running server. An assertion is added to check that. The discussion has also turned around having a new function to grab this status, but this would have required more tricks for -DEXEC_BACKEND, something that GUCs already handle. Noriyoshi Shinoda has initiated the thread that has led to the result of this commit. Author: Justin Pryzby Reviewed-by: Nathan Bossart, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/TU4PR8401MB1152EBB0D271F827E2E37A01EECC9@TU4PR8401MB1152.NAMPRD84.PROD.OUTLOOK.COM
2023-07-06Revert the commits related to allowing page lock to conflict among parallel ↵Amit Kapila
group members. This commit reverts the work done by commits 3ba59ccc89 and 72e78d831a. Those commits were incorrect in asserting that we never acquire any other heavy-weight lock after acquring page lock other than relation extension lock. We can acquire a lock on catalogs while doing catalog look up after acquring page lock. This won't impact any existing feature but we need to think some other way to achieve this before parallelizing other write operations or even improving the parallelism in vacuum (like allowing multiple workers for an index). Reported-by: Jaime Casanova Author: Amit Kapila Backpatch-through: 13 Discussion: https://postgr.es/m/CAJKUy5jffnRKNvRHKQ0LynRb0RJC-o4P8Ku3x9vGAVLwDBWumQ@mail.gmail.com
2023-07-05Generate automatically code and documentation related to wait eventsMichael Paquier
The documentation and the code is generated automatically from a new file called wait_event_names.txt, formatted in sections dedicated to each wait event class (Timeout, Lock, IO, etc.) with three tab-separated fields: - C symbol in enums - Format in the system views - Description in the docs Using this approach has several advantages, as we have proved to be rather bad in maintaining this area of the tree across the years: - The order of each item in the documentation and the code, which should be alphabetical, has become incorrect multiple times, and the script generating the code and documentation has a few rules to enforce that, making the maintenance a no-brainer. - Some wait events were added to the code, but not documented, so this cannot be missed now. - The order of the tables for each wait event class is enforced in the documentation (the input .txt file does so as well for clarity, though this is not mandatory). - Less code, shaving 1.2k lines from the tree, with 1/3 of the savings coming from the code, the rest from the documentation. The wait event types "Lock" and "LWLock" still have their own code path for their code, hence only the documentation is created for them. These classes are listed with a special marker called WAIT_EVENT_DOCONLY in the input file. Adding a new wait event now requires only an update of wait_event_names.txt, with "Lock" and "LWLock" treated as exceptions. This commit has been tested with configure/Makefile, the CI and VPATH build. clean, distclean and maintainer-clean were working fine. Author: Bertrand Drouvot, Michael Paquier Discussion: https://postgr.es/m/[email protected]
2023-07-04Ensure that creation of an empty relfile is fsync'd at checkpoint.Heikki Linnakangas
If you create a table and don't insert any data into it, the relation file is never fsync'd. You don't lose data, because an empty table doesn't have any data to begin with, but if you crash and lose the file, subsequent operations on the table will fail with "could not open file" error. To fix, register an fsync request in mdcreate(), like we do for mdwrite(). Per discussion, we probably should also fsync the containing directory after creating a new file. But that's a separate and much wider issue. Backpatch to all supported versions. Reviewed-by: Andres Freund, Thomas Munro Discussion: https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi
2023-07-03Refactor some code related to wait events "BufferPin" and "Extension"Michael Paquier
The following changes are done: - Addition of WaitEventBufferPin and WaitEventExtension, that hold a list of wait events related to each category. - Addition of two functions that encapsulate the list of wait events for each category. - Rename BUFFER_PIN to BUFFERPIN (only this wait event class used an underscore, requiring a specific rule in the automation script). These changes make a bit easier the automatic generation of all the code and documentation related to wait events, as all the wait event categories are now controlled by consistent structures and functions. Author: Bertrand Drouvot Discussion: https://postgr.es/m/[email protected] Discussion: https://postgr.es/m/[email protected]
2023-07-02Trust signalfd on illumos, again.Thomas Munro
Commit 3ab4fc5d avoided choosing signalfd by default on illumos, because it triggered kernel panics. That was fixed, so we can remove a kludge from our code. Users/packagers can still override the default choice at compile time if desired, and we'll leave the back-branches unchanged so they keep choosing self-pipe by default, but we'll default to signalfd (like we do for Linux) in 17. Fixed kernels should be everywhere by the time 17 ships. The illumos issues were: * https://www.illumos.org/issues/13700 * https://www.illumos.org/issues/14892 Discussion: https://postgr.es/m/CA+hUKG+NK-K_G_i1H3OpDTwYPEsiwQi_jw58PGcW2H+-N2eVCA@mail.gmail.com
2023-06-29Error message wording improvementsPeter Eisentraut
2023-06-23Error message refactoringPeter Eisentraut
Take some untranslatable things out of the message and replace by format placeholders, to reduce translatable strings and reduce translation mistakes.
2023-06-20Pre-beta2 mechanical code beautification.Tom Lane
Run pgindent and pgperltidy. It seems we're still some ways away from all committers doing this automatically. Now that we have a buildfarm animal that will whine about poorly-indented code, we'll try to keep the tree more tidy. Discussion: https://postgr.es/m/[email protected]
2023-06-19fd.c: Retry after EINTR in more placesAndres Freund
Starting with 4d330a61bb1 we can use posix_fallocate() to extend files. Unfortunately in some situation, e.g. on tmpfs filesystems, EINTR may be returned. See also 4518c798b2b. To fix, add a retry path to FileFallocate(). In contrast to 4518c798b2b the amount we extend by is limited and the extending may happen at a high frequency, so disabling signals does not appear to be the correct path here. Also add retry paths to other file operations currently lacking them (around fdatasync(), fsync(), ftruncate(), posix_fadvise(), sync_file_range(), truncate()) - they are all documented or have been observed to return EINTR. Even though most of these functions used in the back branches, it does not seem worth the risk to backpatch - outside of the new-to-16 case of posix_fallocate() I am not aware of problem reports due to the lack of retries. Reported-by: Christoph Berg <[email protected]> Discussion: https://postgr.es/m/[email protected] Backpatch: -
2023-06-14Fix typo in comment.Masahiko Sawada
Introduced in 4d330a61bb1. Author: Masahiko Sawada Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/CAD21AoDg8rTWJkrNJg9UTP89vS8smfib2c55DVqKrCn8zR-GYA@mail.gmail.com
2023-06-12Report stats when replaying XLOG_RUNNING_XACTSAndres Freund
Previously stats in the startup process would only get reported during shutdown of the startup process. It has been that way for a long time, but became a lot more noticeable with the new pg_stat_io view, which separates out IO done by different backend types... While replaying after every XLOG_RUNNING_XACTS isn't the prettiest approach, it has the advantage of being quite easy. Given that we're well past feature freeze... It's not a problem that we don't report stats more frequently with wal_level=minimal, in that case stats can't be read before the stats process has shut down. Besides the above, this commit also changes pgstat_report_stat() to acquire the timestamp with GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp(). Thanks to Melih Mutlu, Kyotaro Horiguchi for prototypes of other approaches to solving this issue. Reported-by: Fujii Masao <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-06-12Remove a few unused global variables and declarations.Heikki Linnakangas
- Commit 3eb77eba5a, which moved the pending ops queue from md.c to sync.c, introduced a duplicate, unused 'pendingOpsCxt' variable. (I'm surprised none of the compilers or static analysis tools have complained about that.) - Commit c2fe139c20 moved the 'synchronize_seqscans' variable and introduced an extern declaration in tableam.h, making the one in guc_tables.c unnecessary. - Commit 6f0cf87872 removed the 'pgstat_temp_directory' GUC, but forgot to remove the corresponding global variable. - Commit 1b4e729eaa removed the 'pg_krb_realm' GUC, and its global variable, but forgot the declaration in auth.h. Spotted all these by reading the code.
2023-05-21Remove over-eager assertion in ExtendBufferedRelTo()Andres Freund
The assertion checked that the size of the relation is not "too large" - but the code is explicitly dealing with the possibility of another backend extending the relation concurrently. In that case the new relation size could be bigger than what the current backend needs, wrongly triggering an assertion failure. Unfortunately it is hard to write a reliable and affordable regression tests for this, as a lot of concurrency is needed to encounter the bug. Introduced in 31966b151e6a. Reported-by: Melanie Plageman <[email protected]>
2023-05-19Pre-beta mechanical code beautification.Tom Lane
Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/[email protected]
2023-05-19Remove stray mid-sentence tabs in commentsPeter Eisentraut
2023-05-19Move mdwriteback() to better placePeter Eisentraut
The previous order in the file didn't make sense and matched neither the header file nor the smgr API. Discussion: https://www.postgresql.org/message-id/flat/22fed8ba-01c3-2008-a256-4ea912d68fab%40enterprisedb.com
2023-05-19Reindent some commentsPeter Eisentraut
Most (older) comments in md.c and smgr.c are indented with a leading tab on all lines, which isn't the current style and makes updating the comments a bit annoying. This reindents all these lines with a single space, as is the normal style. This issue exists in various shapes throughout the code but it's pretty consistent here, and since there is a patch pending to refresh some of the comments in these files, it seems sensible to clean this up here separately. Discussion: https://www.postgresql.org/message-id/flat/22fed8ba-01c3-2008-a256-4ea912d68fab%40enterprisedb.com
2023-05-17Add writeback to pg_stat_ioAndres Freund
28e626bde00 added the concept of IOOps but neglected to include writeback operations. ac8d53dae5 added time spent doing these I/O operations. Without counting writeback, checkpointer write time in the log often differed substantially from that in pg_stat_io. To fix this, add IOOp IOOP_WRITEBACK and track writeback in pg_stat_io. Bumps catversion. Author: Melanie Plageman <[email protected]> Reviewed-by: Kyotaro Horiguchi <[email protected]> Reported-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
2023-05-17Update parameter name context to wb_contextAndres Freund
For clarity of review, renaming the function parameter "context" in ScheduleBufferTagForWriteback() and IssuePendingWritebacks() to "wb_context" is a separate commit. The next commit adds an "io_context" parameter and "wb_context" makes it more clear which is which. Author: Melanie Plageman <[email protected]> Discussion: https://postgr.es/m/CAAKRu_acc6iL4M3hvOTeztf_ZPpsB3Pqio5aVHgZ5q=Pi3BZKg@mail.gmail.com
2023-05-14Rename io_direct to debug_io_direct.Thomas Munro
Give the new GUC introduced by d4e71df6 a name that is clearly not intended for mainstream use quite yet. Future proposals would drop the prefix only after adding infrastructure to make it efficient. Having the switch in the tree sooner is good because it might lead to new discoveries about the hazards awaiting us on a wide range of systems, but that name was too enticing and could lead to cross-version confusion in future, per complaints from Noah and Justin. Suggested-by: Noah Misch <[email protected]> Reviewed-by: Noah Misch <[email protected]> Reviewed-by: Justin Pryzby <[email protected]> (the idea, not the patch) Reviewed-by: Tom Lane <[email protected]> (ditto) Discussion: https://postgr.es/m/20230430041106.GA2268796%40rfd.leadboat.com
2023-05-05Fix typo with wait event for SLRU buffer of commit timestampsMichael Paquier
This wait event was documented as "CommitTsBuffer" since its introduction, but the code named it "CommitTSBuffer". This commit fixes the code to follow the term documented, which is also more consistent with the naming of the other wait events used for commit timestamps. Introduced by 5da1493. Author: Alexander Lakhin Discussion: https://postgr.es/m/[email protected] Backpatch-through: 13
2023-05-02Fix typos in commentsMichael Paquier
The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names. Author: Alexander Lakhin Discussion: https://postgr.es/m/[email protected]
2023-04-24Remove vacuum_defer_cleanup_ageAndres Freund
vacuum_defer_cleanup_age was introduced before hot_standby_feedback and replication slots existed. It is hard to use reasonably - commonly it will either be set too low (not preventing recovery conflicts, while still causing some bloat), or too high (causing a lot of bloat). The alternatives do not have that issue. That on its own might not be sufficient reason to remove vacuum_defer_cleanup_age, but it also complicates computation of xid horizons. See e.g. the bug fixed in be504a3e974. It also is untested. This commit removes TransactionIdRetreatSafely(), as there are no users anymore. There might be potential future users, hence noting that here. Reviewed-by: Daniel Gustafsson <[email protected]> Reviewed-by: Justin Pryzby <[email protected]> Reviewed-by: Alvaro Herrera <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-04-19Fix various typos and incorrect/outdated name referencesDavid Rowley
Author: Alexander Lakhin Discussion: https://postgr.es/m/[email protected]
2023-04-18Fix various typosDavid Rowley
This fixes many spelling mistakes in comments, but a few references to invalid parameter names, function names and option names too in comments and also some in string constants Also, fix an #undef that was undefining the incorrect definition Author: Alexander Lakhin Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/[email protected]
2023-04-14Support RBM_ZERO_AND_CLEANUP_LOCK in ExtendBufferedRelTo(), add testsAndres Freund
For some reason I had not implemented RBM_ZERO_AND_CLEANUP_LOCK support in ExtendBufferedRelTo(), likely thinking it not being reachable. But it is reachable, e.g. when replaying a WAL record for a page in a relation that subsequently is truncated (likely only reachable when doing crash recovery or PITR, not during ongoing streaming replication). As now all of the RBM_* modes are supported, remove assertions checking mode. As we had no test coverage for this scenario, add a new TAP test. There's plenty more that ought to be tested in this area... Reported-by: Tom Lane <[email protected]> Reported-by: Alexander Lakhin <[email protected]> Reviewed-by: Alexander Lakhin <[email protected]> Discussion: https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us Discussion: https://postgr.es/m/[email protected]
2023-04-13Harmonize some more function parameter names.Peter Geoghegan
Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. These inconsistencies were all introduced relatively recently, after the code base had parameter name mismatches fixed in bulk (see commits starting with commits 4274dc22 and 035ce1fe). pg_bsd_indent still has a couple of similar inconsistencies, which I (pgeoghegan) have left untouched for now. Like all earlier commits that cleaned up function parameter names, this commit was written with help from clang-tidy.
2023-04-08Handle logical slot conflicts on standbyAndres Freund
During WAL replay on the standby, when a conflict with a logical slot is identified, invalidate such slots. There are two sources of conflicts: 1) Using the information added in 6af1793954e, logical slots are invalidated if required rows are removed 2) wal_level on the primary server is reduced to below logical Uses the infrastructure introduced in the prior commit. FIXME: add commit reference. Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to interrupt use of a slot, if called in the startup process. The new recovery conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion for the addition of the pg_stat_database_conflicts column. Bumps PGSTAT_FILE_FORMAT_ID for the same reason. Author: "Drouvot, Bertrand" <[email protected]> Author: Andres Freund <[email protected]> Author: Amit Khandekar <[email protected]> (in an older version) Reviewed-by: "Drouvot, Bertrand" <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Robert Haas <[email protected]> Reviewed-by: Fabrízio de Royes Mello <[email protected]> Reviewed-by: Bharath Rupireddy <[email protected]> Reviewed-by: Amit Kapila <[email protected]> Reviewed-by: Alvaro Herrera <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-04-08Add io_direct setting (developer-only).Thomas Munro
Provide a way to ask the kernel to use O_DIRECT (or local equivalent) where available for data and WAL files, to avoid or minimize kernel caching. This hurts performance currently and is not intended for end users yet. Later proposed work would introduce our own I/O clustering, read-ahead, etc to replace the facilities the kernel disables with this option. The only user-visible change, if the developer-only GUC is not used, is that this commit also removes the obscure logic that would activate O_DIRECT for the WAL when wal_sync_method=open_[data]sync and wal_level=minimal (which also requires max_wal_senders=0). Those are non-default and unlikely settings, and this behavior wasn't (correctly) documented. The same effect can be achieved with io_direct=wal. Author: Thomas Munro <[email protected]> Author: Andres Freund <[email protected]> Author: Bharath Rupireddy <[email protected]> Reviewed-by: Justin Pryzby <[email protected]> Reviewed-by: Bharath Rupireddy <[email protected]> Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
2023-04-08Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.Thomas Munro
In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems. Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance. Three classes of I/O buffers for regular data pages are adjusted: (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175. (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer pool is also aligned in shared memory. WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit. BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment). If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system. Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned. Author: Thomas Munro <[email protected]> Author: Andres Freund <[email protected]> Reviewed-by: Justin Pryzby <[email protected]> Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
2023-04-08Track IO times in pg_stat_ioAndres Freund
a9c70b46dbe and 8aaa04b32S added counting of IO operations to a new view, pg_stat_io. Now, add IO timing for reads, writes, extends, and fsyncs to pg_stat_io as well. This combines the tracking for pgBufferUsage with the tracking for pg_stat_io into a new function pgstat_count_io_op_time(). This should make it a bit easier to avoid the somewhat costly instr_time conversion done for pgBufferUsage. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Bertrand Drouvot <[email protected]> Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
2023-04-07Improve IO accounting for temp relation writesAndres Freund
Both pgstat_database and pgBufferUsage count IO timing for reads of temporary relation blocks into local buffers. However, both failed to count write IO timing for flushes of dirty local buffers. Fix. Additionally, FlushRelationBuffers() seems to have omitted counting write IO (both count and timing) stats for both pgstat_database and pgBufferUsage. Fix. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de
2023-04-07Fix copy-paste bug in 12f3867f553 triggering an assert after a write errorAndres Freund
The same condition accidentally was copied to both branches. Manual testing confirms that otherwise the error recovery path works fine. Found while reviewing the logical-decoding-on-standby patch.
2023-04-06Add VACUUM/ANALYZE BUFFER_USAGE_LIMIT optionDavid Rowley
Add new options to the VACUUM and ANALYZE commands called BUFFER_USAGE_LIMIT to allow users more control over how large to make the buffer access strategy that is used to limit the usage of buffers in shared buffers. Larger rings can allow VACUUM to run more quickly but have the drawback of VACUUM possibly evicting more buffers from shared buffers that might be useful for other queries running on the database. Here we also add a new GUC named vacuum_buffer_usage_limit which controls how large to make the access strategy when it's not specified in the VACUUM/ANALYZE command. This defaults to 256KB, which is the same size as the access strategy was prior to this change. This setting also controls how large to make the buffer access strategy for autovacuum. Per idea by Andres Freund. Author: Melanie Plageman Reviewed-by: David Rowley Reviewed-by: Andres Freund Reviewed-by: Justin Pryzby Reviewed-by: Bharath Rupireddy Discussion: https://postgr.es/m/[email protected]
2023-04-06Use ExtendBufferedRelTo() in {vm,fsm}_extend()Andres Freund
This uses ExtendBufferedRelTo(), introduced in 31966b151e6, to extend the visibilitymap and freespacemap to the size needed. It also happens to fix a warning introduced in 3d6a98457d8, reported by Tom Lane. Discussion: https://postgr.es/m/[email protected] Discussion: https://postgr.es/m/[email protected]
2023-04-05bufmgr: Introduce infrastructure for faster relation extensionAndres Freund
The primary bottlenecks for relation extension are: 1) The extension lock is held while acquiring a victim buffer for the new page. Acquiring a victim buffer can require writing out the old page contents including possibly needing to flush WAL. 2) When extending via ReadBuffer() et al, we write a zero page during the extension, and then later write out the actual page contents. This can nearly double the write rate. 3) The existing bulk relation extension infrastructure in hio.c just amortized the cost of acquiring the relation extension lock, but none of the other costs. Unfortunately 1) cannot currently be addressed in a central manner as the callers to ReadBuffer() need to acquire the extension lock. To address that, this this commit moves the responsibility for acquiring the extension lock into bufmgr.c functions. That allows to acquire the relation extension lock for just the required time. This will also allow us to improve relation extension further, without changing callers. The reason we write all-zeroes pages during relation extension is that we hope to get ENOSPC errors earlier that way (largely works, except for CoW filesystems). It is easier to handle out-of-space errors gracefully if the page doesn't yet contain actual tuples. This commit addresses 2), by using the recently introduced smgrzeroextend(), which extends the relation, without dirtying the kernel page cache for all the extended pages. To address 3), this commit introduces a function to extend a relation by multiple blocks at a time. There are three new exposed functions: ExtendBufferedRel() for extending the relation by a single block, ExtendBufferedRelBy() to extend a relation by multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up to a certain size. To avoid duplicating code between ReadBuffer(P_NEW) and the new functions, ReadBuffer(P_NEW) now implements relation extension with ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the relation lock is already held. Note that this commit does not yet lead to a meaningful performance or scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be converted to ExtendBuffered*(), which will be done in subsequent commits. Reviewed-by: Heikki Linnakangas <[email protected]> Reviewed-by: Melanie Plageman <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-04-05bufmgr: Support multiple in-progress IOs by using resownerAndres Freund
A future patch will add support for extending relations by multiple blocks at once. To be concurrency safe, the buffers for those blocks need to be marked as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from an IO error for a single buffer. This commit extends that infrastructure to multiple buffers by using the resource owner infrastructure. This commit increases the size of the ResourceOwnerData struct, which appears to have a just about measurable overhead in very extreme workloads. Medium term we are planning to substantially shrink the size of ResourceOwnerData. Short term the increase is small enough to not worry about it for now. Reviewed-by: Melanie Plageman <[email protected]> Discussion: https://postgr.es/m/[email protected] Discussion: https://postgr.es/m/[email protected]
2023-04-05bufmgr: Acquire and clean victim buffer separatelyAndres Freund
Previously we held buffer locks for two buffer mapping partitions at the same time to change the identity of buffers. Particularly for extending relations needing to hold the extension lock while acquiring a victim buffer is painful.But it also creates a bottleneck for workloads that just involve reads. Now we instead first acquire a victim buffer and write it out, if necessary. Then we remove that buffer from the old partition with just the old partition's partition lock held and insert it into the new partition with just that partition's lock held. By separating out the victim buffer acquisition, future commits will be able to change relation extensions to scale better. On my workstation, a micro-benchmark exercising buffered reads strenuously and under a lot of concurrency, sees a >2x improvement. Reviewed-by: Heikki Linnakangas <[email protected]> Reviewed-by: Melanie Plageman <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-04-05bufmgr: Add Pin/UnpinLocalBuffer()Andres Freund
So far these were open-coded in quite a few places, without a good reason. Reviewed-by: Melanie Plageman <[email protected]> Reviewed-by: David Rowley <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-04-05bufmgr: Add some more error checking [infrastructure] around pinningAndres Freund
This adds a few more assertions against a buffer being local in places we don't expect, and extracts the check for a buffer being pinned exactly once from LockBufferForCleanup() into its own function. Later commits will use this function. Reviewed-by: Heikki Linnakangas <[email protected]> Reviewed-by: Melanie Plageman <[email protected]> Discussion: http://postgr.es/m/[email protected]
2023-04-05Add smgrzeroextend(), FileZero(), FileFallocate()Andres Freund
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple blocks. When extending by a small number of blocks, use FileZero() instead, as using posix_fallocate() for small numbers of blocks is inefficient for some file systems / operating systems. FileZero() is also used as the fallback for FileFallocate() on platforms / filesystems that don't support fallocate. A big advantage of using posix_fallocate() is that it typically won't cause dirty buffers in the kernel pagecache. So far the most common pattern in our code is that we smgrextend() a page full of zeroes and put the corresponding page into shared buffers, from where we later write out the actual contents of the page. If the kernel, e.g. due to memory pressure or elapsed time, already wrote back the all-zeroes page, this can lead to doubling the amount of writes reaching storage. There are no users of smgrzeroextend() as of this commit. That will follow in future commits. Reviewed-by: Melanie Plageman <[email protected]> Reviewed-by: Heikki Linnakangas <[email protected]> Reviewed-by: Kyotaro Horiguchi <[email protected]> Reviewed-by: David Rowley <[email protected]> Reviewed-by: John Naylor <[email protected]> Discussion: https://postgr.es/m/[email protected]
2023-04-05Don't initialize page in {vm,fsm}_extend(), not neededAndres Freund
The read path needs to be able to initialize pages anyway, as relation extensions are not durable. By avoiding initializing pages, we can, in a future patch, extend the relation by multiple blocks at once. Using smgrextend() for {vm,fsm}_extend() is not a good idea in general, as at least one page of the VM/FSM will be read immediately after, always causing a cache miss, requiring us to read content we just wrote. Discussion: https://postgr.es/m/[email protected]
2023-04-04bufmgr: Remove buffer-write-dirty tracepointsAndres Freund
The trace point was using the relfileno / fork / block for the to-be-read-in buffer. Some upcoming work would make that more expensive to provide. We still have buffer-flush-start/done, which can serve most tracing needs that buffer-write-dirty could serve. Discussion: https://postgr.es/m/[email protected]