| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-11 15:29:56 |
| Message-ID: | CAMT0RQT_0qVxcTT6ycM20QUN-pEQ6iMLbz6gLWgLpeF0NmNOUA@mail.gmail.com |
| Lists: | pgsql-hackers |
Attached is a patch that adds the ability to dump table data in multiple chunks.
Looking for feedback at this point:
1) what have I missed
2) should I implement something to avoid single-page chunks
The new flag --huge-table-chunk-pages tells the directory-format dump to
dump tables whose main fork has more pages than this in multiple chunks
of the given number of pages.
The main use case is speeding up parallel dumps in case of one or a
small number of HUGE tables so parts of these can be dumped in
parallel.
It will also help in case the target file system has some limitations
on file sizes (4GB for FAT, 5TB for GCS).
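For illustration, a minimal sketch of what the per-chunk data query boils
down to, assuming the ctid-range filtering discussed later in this thread
(the table name and page numbers here are made up): with 131072-page (1GB)
chunks, the second chunk of a table would be fetched roughly as

COPY (
    SELECT * FROM big_table
    WHERE ctid >= '(131072,0)' AND ctid < '(262144,0)'
) TO STDOUT;

so each worker reads a disjoint block range of the table's main fork.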
Currently no tests are included in the patch, and there is no extra
documentation beyond what is printed by pg_dump --help. Also, the
pg_log_warning lines containing "CHUNKING" are there for debugging and
need to be removed before committing.
As implemented, no changes are needed for pg_restore, as all chunks are
already associated with the table in the .toc and thus are restored into
that table.
The attached README shows how I verified that it works, as well as the
textual file created from the directory-format dump in the last step there.
--
Hannu
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-adds-ability-to-dump-data-for-tables-in-multiple-chu.patch | application/x-patch | 11.5 KB |
| README.pg_dump.md | text/markdown | 3.7 KB |
| dump.sql | application/sql | 56.2 KB |
| From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-12 12:59:33 |
| Message-ID: | CAExHW5t54GPKFbW3KLzintJ6jMMRYwb-t2Fjm4JTxEcZbGDomA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Hannu,
On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Attached is a patch that adds the ability to dump table data in multiple chunks.
>
> Looking for feedback at this point:
> 1) what have I missed
> 2) should I implement something to avoid single-page chunks
>
> The flag --huge-table-chunk-pages which tells the directory format
> dump to dump tables where the main fork has more pages than this in
> multiple chunks of given number of pages,
>
> The main use case is speeding up parallel dumps in case of one or a
> small number of HUGE tables so parts of these can be dumped in
> parallel.
Have you measured speed up? Can you please share the numbers?
--
Best Wishes,
Ashutosh Bapat
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-13 18:02:43 |
| Message-ID: | CAMT0RQTHoL8S7OonFWC_aDSC-2oX7BGBBLAQ+OOBhRPcxV2eiw@mail.gmail.com |
| Lists: | pgsql-hackers |
I just ran a test by generating a 408GB table and then dumping it both ways
$ time pg_dump --format=directory -h 10.58.80.2 -U postgres -f
/tmp/plain.dump largedb
real 39m54.968s
user 37m21.557s
sys 2m32.422s
$ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=131072 -j 8 -f /tmp/parallel8.dump largedb
real 5m52.965s
user 40m27.284s
sys 3m53.339s
So parallel dump with 8 workers using 1GB (128k pages) chunks runs
almost 7 times faster than the sequential dump.
This was a table that had no TOAST part. I will run some more tests
with TOASTed tables next and expect similar or better improvements.
On Wed, Nov 12, 2025 at 1:59 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> Hi Hannu,
>
> On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > Attached is a patch that adds the ability to dump table data in multiple chunks.
> >
> > Looking for feedback at this point:
> > 1) what have I missed
> > 2) should I implement something to avoid single-page chunks
> >
> > The flag --huge-table-chunk-pages which tells the directory format
> > dump to dump tables where the main fork has more pages than this in
> > multiple chunks of given number of pages,
> >
> > The main use case is speeding up parallel dumps in case of one or a
> > small number of HUGE tables so parts of these can be dumped in
> > parallel.
>
> Have you measured speed up? Can you please share the numbers?
>
> --
> Best Wishes,
> Ashutosh Bapat
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-13 18:39:26 |
| Message-ID: | CAMT0RQQAH1a8kY-mx7B07Uzn3T_zeaU9detqFFtW36_k67Su+A@mail.gmail.com |
| Lists: | pgsql-hackers |
Going up to 16 workers did not improve performance, but this is
expected, as the disk behind the database can only do 4TB/hour of
reads, which is now the bottleneck (408/352*3600 = 4172 GB/h).
$ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=131072 -j 16 -f /tmp/parallel16.dump largedb
real 5m44.900s
user 53m50.491s
sys 5m47.602s
And 4 workers showed near-linear speedup from a single worker:
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=131072 -j 4 -f /tmp/parallel4.dump largedb
real 10m32.074s
user 38m54.436s
sys 2m58.216s
The database runs on a 64 vCPU VM with 128GB RAM, so most of the table
will be read in from disk.
On Thu, Nov 13, 2025 at 7:02 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> I just ran a test by generating a 408GB table and then dumping it both ways
>
> $ time pg_dump --format=directory -h 10.58.80.2 -U postgres -f
> /tmp/plain.dump largedb
>
> real 39m54.968s
> user 37m21.557s
> sys 2m32.422s
>
> $ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=131072 -j 8 -f /tmp/parallel8.dump largedb
>
> real 5m52.965s
> user 40m27.284s
> sys 3m53.339s
>
> So parallel dump with 8 workers using 1GB (128k pages) chunks runs
> almost 7 times faster than the sequential dump.
>
> this was a table that had no TOAST part. I will run some more tests
> with TOASTed tables next and expect similar or better improvements.
>
>
>
> On Wed, Nov 12, 2025 at 1:59 PM Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> >
> > Hi Hannu,
> >
> > On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > >
> > > Attached is a patch that adds the ability to dump table data in multiple chunks.
> > >
> > > Looking for feedback at this point:
> > > 1) what have I missed
> > > 2) should I implement something to avoid single-page chunks
> > >
> > > The flag --huge-table-chunk-pages which tells the directory format
> > > dump to dump tables where the main fork has more pages than this in
> > > multiple chunks of given number of pages,
> > >
> > > The main use case is speeding up parallel dumps in case of one or a
> > > small number of HUGE tables so parts of these can be dumped in
> > > parallel.
> >
> > Have you measured speed up? Can you please share the numbers?
> >
> > --
> > Best Wishes,
> > Ashutosh Bapat
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-13 20:24:33 |
| Message-ID: | CAMT0RQQr7KtPAY903+F42csiHc1EPHo70Xji-znkxEhwdoKa6w@mail.gmail.com |
| Lists: | pgsql-hackers |
Ran another test with a 53GB database where most of the data is in TOAST
CREATE TABLE just_toasted(
id serial primary key,
toasted1 char(2200) STORAGE EXTERNAL,
toasted2 char(2200) STORAGE EXTERNAL,
toasted3 char(2200) STORAGE EXTERNAL,
toasted4 char(2200) STORAGE EXTERNAL
);
and the toast fields were added in somewhat randomised order.
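(The load script itself is not included; purely as an illustration, a data
set of roughly this shape could be generated as below. The row count is
taken from the size query later in the thread, and using ORDER BY random()
to approximate the "somewhat randomised order" is an assumption.)

INSERT INTO just_toasted (toasted1, toasted2, toasted3, toasted4)
SELECT repeat(md5(i::text), 68),         -- ~2.2 kB per value; STORAGE EXTERNAL
       repeat(md5((i + 1)::text), 68),   -- keeps them uncompressed, so every
       repeat(md5((i + 2)::text), 68),   -- value goes out-of-line to TOAST
       repeat(md5((i + 3)::text), 68)
FROM generate_series(1, 5625234) AS g(i)
ORDER BY random();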
Here the results are as follows
| Parallelism | Chunk size (pages) | Time (sec) |
|---|---|---|
| 1 | - | 240 |
| 2 | 1000 | 129 |
| 4 | 1000 | 64 |
| 8 | 1000 | 36 |
| 16 | 1000 | 30 |
| 4 | 9095 | 78 |
| 8 | 9095 | 42 |
| 16 | 9095 | 42 |
The reason larger chunk sizes performed worse was that they often had
one or two stragglers left at the end, which kept a single worker busy
while the rest were already idle.
Detailed run results below:
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres -f
/tmp/ltoastdb-1-plain.dump largetoastdb
real 3m59.465s
user 3m43.304s
sys 0m15.844s
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=9095 -j 4 -f /tmp/ltoastdb-4.dump
largetoastdb
real 1m18.320s
user 3m49.236s
sys 0m19.422s
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=9095 -j 8 -f /tmp/ltoastdb-8.dump
largetoastdb
real 0m42.028s
user 3m55.299s
sys 0m24.657s
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=9095 -j 16 -f /tmp/ltoastdb-16.dump
largetoastdb
real 0m42.575s
user 4m11.011s
sys 0m26.110s
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 16 -f /tmp/ltoastdb-16-1kpages.dump
largetoastdb
real 0m29.641s
user 6m16.321s
sys 0m49.345s
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 8 -f /tmp/ltoastdb-8-1kpages.dump
largetoastdb
real 0m35.685s
user 3m58.528s
sys 0m26.729s
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 4 -f /tmp/ltoastdb-4-1kpages.dump
largetoastdb
real 1m3.737s
user 3m50.251s
sys 0m18.507s
hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 2 -f /tmp/ltoastdb-2-1kpages.dump
largetoastdb
real 2m8.708s
user 3m57.018s
sys 0m18.499s
On Thu, Nov 13, 2025 at 7:39 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Going up to 16 workers did not improve performance , but this is
> expected, as the disk behind the database can only do 4TB/hour of
> reads, which is now the bottleneck. (408/352/*3600 = 4172 GB/h)
>
> $ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=131072 -j 16 -f /tmp/parallel16.dump largedb
> real 5m44.900s
> user 53m50.491s
> sys 5m47.602s
>
> And 4 workers showed near-linear speedup from single worker
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=131072 -j 4 -f /tmp/parallel4.dump largedb
> real 10m32.074s
> user 38m54.436s
> sys 2m58.216s
>
> The database runs on a 64vCPU VM with 128GB RAM, so most of the table
> will be read in from the disk
>
>
>
>
>
>
> On Thu, Nov 13, 2025 at 7:02 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > I just ran a test by generating a 408GB table and then dumping it both ways
> >
> > $ time pg_dump --format=directory -h 10.58.80.2 -U postgres -f
> > /tmp/plain.dump largedb
> >
> > real 39m54.968s
> > user 37m21.557s
> > sys 2m32.422s
> >
> > $ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=131072 -j 8 -f /tmp/parallel8.dump largedb
> >
> > real 5m52.965s
> > user 40m27.284s
> > sys 3m53.339s
> >
> > So parallel dump with 8 workers using 1GB (128k pages) chunks runs
> > almost 7 times faster than the sequential dump.
> >
> > this was a table that had no TOAST part. I will run some more tests
> > with TOASTed tables next and expect similar or better improvements.
> >
> >
> >
> > On Wed, Nov 12, 2025 at 1:59 PM Ashutosh Bapat
> > <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > >
> > > Hi Hannu,
> > >
> > > On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > > >
> > > > Attached is a patch that adds the ability to dump table data in multiple chunks.
> > > >
> > > > Looking for feedback at this point:
> > > > 1) what have I missed
> > > > 2) should I implement something to avoid single-page chunks
> > > >
> > > > The flag --huge-table-chunk-pages which tells the directory format
> > > > dump to dump tables where the main fork has more pages than this in
> > > > multiple chunks of given number of pages,
> > > >
> > > > The main use case is speeding up parallel dumps in case of one or a
> > > > small number of HUGE tables so parts of these can be dumped in
> > > > parallel.
> > >
> > > Have you measured speed up? Can you please share the numbers?
> > >
> > > --
> > > Best Wishes,
> > > Ashutosh Bapat
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-13 20:26:46 |
| Message-ID: | CAMT0RQSNHFffbCmDNxQogVBD8H5gTDJNwhUR2btCVE+Lq1sGGw@mail.gmail.com |
| Lists: | pgsql-hackers |
The reason for the small chunk sizes is that they are determined by the
main heap table, and that was just over 1GB:
largetoastdb=> SELECT format('%I.%I', t.schemaname, t.relname) as table_name,
pg_table_size(t.relid) AS table_size,
sum(pg_relation_size(i.indexrelid)) AS total_index_size,
pg_relation_size(t.relid) AS main_table_size,
pg_relation_size(c.reltoastrelid) AS toast_table_size,
pg_relation_size(oi.indexrelid) AS toast_index_size,
t.n_live_tup AS row_count,
count(*) AS index_count,
array_to_json(array_agg(json_build_object(i.indexrelid::regclass,
pg_relation_size(i.indexrelid))), true) AS index_info
FROM pg_stat_user_tables t
JOIN pg_stat_user_indexes i ON i.relid = t.relid
JOIN pg_class c ON c.oid = t.relid
LEFT JOIN pg_stat_sys_indexes AS oi ON oi.relid = c.reltoastrelid
GROUP BY 1, 2, 4, 5, 6, 7
ORDER BY 2 DESC, 7 DESC
LIMIT 25;
┌─[ RECORD 1 ]─────┬─────────────────────────────────────┐
│ table_name │ public.just_toasted │
│ table_size │ 56718835712 │
│ total_index_size │ 230064128 │
│ main_table_size │ 1191559168 │
│ toast_table_size │ 54613336064 │
│ toast_index_size │ 898465792 │
│ row_count │ 5625234 │
│ index_count │ 1 │
│ index_info │ [{"just_toasted_pkey" : 230064128}] │
└──────────────────┴─────────────────────────────────────┘
On Thu, Nov 13, 2025 at 9:24 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Ran another test with a 53GB database where most of the data is in TOAST
>
> CREATE TABLE just_toasted(
> id serial primary key,
> toasted1 char(2200) STORAGE EXTERNAL,
> toasted2 char(2200) STORAGE EXTERNAL,
> toasted3 char(2200) STORAGE EXTERNAL,
> toasted4 char(2200) STORAGE EXTERNAL
> );
>
> and the toast fields were added in somewhat randomised order.
>
> Here the results are as follows
>
> Parallelism | chunk size (pages) | time (sec)
> 1 | - | 240
> 2 | 1000 | 129
> 4 | 1000 | 64
> 8 | 1000 | 36
> 16 | 1000 | 30
>
> 4 | 9095 | 78
> 8 | 9095 | 42
> 16 | 9095 | 42
>
> The reason larger chunk sizes performed worse was that they often had
> one or two stragglers left behind which
>
> Detailed run results below:
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres -f
> /tmp/ltoastdb-1-plain.dump largetoastdb
> real 3m59.465s
> user 3m43.304s
> sys 0m15.844s
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=9095 -j 4 -f /tmp/ltoastdb-4.dump
> largetoastdb
> real 1m18.320s
> user 3m49.236s
> sys 0m19.422s
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=9095 -j 8 -f /tmp/ltoastdb-8.dump
> largetoastdb
> real 0m42.028s
> user 3m55.299s
> sys 0m24.657s
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=9095 -j 16 -f /tmp/ltoastdb-16.dump
> largetoastdb
> real 0m42.575s
> user 4m11.011s
> sys 0m26.110s
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=1000 -j 16 -f /tmp/ltoastdb-16-1kpages.dump
> largetoastdb
> real 0m29.641s
> user 6m16.321s
> sys 0m49.345s
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=1000 -j 8 -f /tmp/ltoastdb-8-1kpages.dump
> largetoastdb
> real 0m35.685s
> user 3m58.528s
> sys 0m26.729s
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=1000 -j 4 -f /tmp/ltoastdb-4-1kpages.dump
> largetoastdb
> real 1m3.737s
> user 3m50.251s
> sys 0m18.507s
>
> hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> --format=directory -h 10.58.80.2 -U postgres
> --huge-table-chunk-pages=1000 -j 2 -f /tmp/ltoastdb-2-1kpages.dump
> largetoastdb
> real 2m8.708s
> user 3m57.018s
> sys 0m18.499s
>
> On Thu, Nov 13, 2025 at 7:39 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > Going up to 16 workers did not improve performance , but this is
> > expected, as the disk behind the database can only do 4TB/hour of
> > reads, which is now the bottleneck. (408/352/*3600 = 4172 GB/h)
> >
> > $ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=131072 -j 16 -f /tmp/parallel16.dump largedb
> > real 5m44.900s
> > user 53m50.491s
> > sys 5m47.602s
> >
> > And 4 workers showed near-linear speedup from single worker
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=131072 -j 4 -f /tmp/parallel4.dump largedb
> > real 10m32.074s
> > user 38m54.436s
> > sys 2m58.216s
> >
> > The database runs on a 64vCPU VM with 128GB RAM, so most of the table
> > will be read in from the disk
> >
> >
> >
> >
> >
> >
> > On Thu, Nov 13, 2025 at 7:02 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > >
> > > I just ran a test by generating a 408GB table and then dumping it both ways
> > >
> > > $ time pg_dump --format=directory -h 10.58.80.2 -U postgres -f
> > > /tmp/plain.dump largedb
> > >
> > > real 39m54.968s
> > > user 37m21.557s
> > > sys 2m32.422s
> > >
> > > $ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
> > > --huge-table-chunk-pages=131072 -j 8 -f /tmp/parallel8.dump largedb
> > >
> > > real 5m52.965s
> > > user 40m27.284s
> > > sys 3m53.339s
> > >
> > > So parallel dump with 8 workers using 1GB (128k pages) chunks runs
> > > almost 7 times faster than the sequential dump.
> > >
> > > this was a table that had no TOAST part. I will run some more tests
> > > with TOASTed tables next and expect similar or better improvements.
> > >
> > >
> > >
> > > On Wed, Nov 12, 2025 at 1:59 PM Ashutosh Bapat
> > > <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > > >
> > > > Hi Hannu,
> > > >
> > > > On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > > > >
> > > > > Attached is a patch that adds the ability to dump table data in multiple chunks.
> > > > >
> > > > > Looking for feedback at this point:
> > > > > 1) what have I missed
> > > > > 2) should I implement something to avoid single-page chunks
> > > > >
> > > > > The flag --huge-table-chunk-pages which tells the directory format
> > > > > dump to dump tables where the main fork has more pages than this in
> > > > > multiple chunks of given number of pages,
> > > > >
> > > > > The main use case is speeding up parallel dumps in case of one or a
> > > > > small number of HUGE tables so parts of these can be dumped in
> > > > > parallel.
> > > >
> > > > Have you measured speed up? Can you please share the numbers?
> > > >
> > > > --
> > > > Best Wishes,
> > > > Ashutosh Bapat
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-13 20:34:23 |
| Message-ID: | CAMT0RQTEFGctCfgVx3u2XgVRCAj_QURV2tfdzL0HOQi=u0sV2A@mail.gmail.com |
| Lists: | pgsql-hackers |
Added to https://commitfest.postgresql.org/patch/6219/
On Thu, Nov 13, 2025 at 9:26 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> The reason for small chunk sizes is that they are determined by main
> heap table, and that was just over 1GB
>
> largetoastdb=> SELECT format('%I.%I', t.schemaname, t.relname) as table_name,
> pg_table_size(t.relid) AS table_size,
> sum(pg_relation_size(i.indexrelid)) AS total_index_size,
> pg_relation_size(t.relid) AS main_table_size,
> pg_relation_size(c.reltoastrelid) AS toast_table_size,
> pg_relation_size(oi.indexrelid) AS toast_index_size,
> t.n_live_tup AS row_count,
> count(*) AS index_count,
> array_to_json(array_agg(json_build_object(i.indexrelid::regclass,
> pg_relation_size(i.indexrelid))), true) AS index_info
> FROM pg_stat_user_tables t
> JOIN pg_stat_user_indexes i ON i.relid = t.relid
> JOIN pg_class c ON c.oid = t.relid
> LEFT JOIN pg_stat_sys_indexes AS oi ON oi.relid = c.reltoastrelid
> GROUP BY 1, 2, 4, 5, 6, 7
> ORDER BY 2 DESC, 7 DESC
> LIMIT 25;
> ┌─[ RECORD 1 ]─────┬─────────────────────────────────────┐
> │ table_name │ public.just_toasted │
> │ table_size │ 56718835712 │
> │ total_index_size │ 230064128 │
> │ main_table_size │ 1191559168 │
> │ toast_table_size │ 54613336064 │
> │ toast_index_size │ 898465792 │
> │ row_count │ 5625234 │
> │ index_count │ 1 │
> │ index_info │ [{"just_toasted_pkey" : 230064128}] │
> └──────────────────┴─────────────────────────────────────┘
>
> On Thu, Nov 13, 2025 at 9:24 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > Ran another test with a 53GB database where most of the data is in TOAST
> >
> > CREATE TABLE just_toasted(
> > id serial primary key,
> > toasted1 char(2200) STORAGE EXTERNAL,
> > toasted2 char(2200) STORAGE EXTERNAL,
> > toasted3 char(2200) STORAGE EXTERNAL,
> > toasted4 char(2200) STORAGE EXTERNAL
> > );
> >
> > and the toast fields were added in somewhat randomised order.
> >
> > Here the results are as follows
> >
> > Parallelism | chunk size (pages) | time (sec)
> > 1 | - | 240
> > 2 | 1000 | 129
> > 4 | 1000 | 64
> > 8 | 1000 | 36
> > 16 | 1000 | 30
> >
> > 4 | 9095 | 78
> > 8 | 9095 | 42
> > 16 | 9095 | 42
> >
> > The reason larger chunk sizes performed worse was that they often had
> > one or two stragglers left behind which
> >
> > Detailed run results below:
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres -f
> > /tmp/ltoastdb-1-plain.dump largetoastdb
> > real 3m59.465s
> > user 3m43.304s
> > sys 0m15.844s
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=9095 -j 4 -f /tmp/ltoastdb-4.dump
> > largetoastdb
> > real 1m18.320s
> > user 3m49.236s
> > sys 0m19.422s
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=9095 -j 8 -f /tmp/ltoastdb-8.dump
> > largetoastdb
> > real 0m42.028s
> > user 3m55.299s
> > sys 0m24.657s
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=9095 -j 16 -f /tmp/ltoastdb-16.dump
> > largetoastdb
> > real 0m42.575s
> > user 4m11.011s
> > sys 0m26.110s
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=1000 -j 16 -f /tmp/ltoastdb-16-1kpages.dump
> > largetoastdb
> > real 0m29.641s
> > user 6m16.321s
> > sys 0m49.345s
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=1000 -j 8 -f /tmp/ltoastdb-8-1kpages.dump
> > largetoastdb
> > real 0m35.685s
> > user 3m58.528s
> > sys 0m26.729s
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=1000 -j 4 -f /tmp/ltoastdb-4-1kpages.dump
> > largetoastdb
> > real 1m3.737s
> > user 3m50.251s
> > sys 0m18.507s
> >
> > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > --format=directory -h 10.58.80.2 -U postgres
> > --huge-table-chunk-pages=1000 -j 2 -f /tmp/ltoastdb-2-1kpages.dump
> > largetoastdb
> > real 2m8.708s
> > user 3m57.018s
> > sys 0m18.499s
> >
> > On Thu, Nov 13, 2025 at 7:39 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > >
> > > Going up to 16 workers did not improve performance , but this is
> > > expected, as the disk behind the database can only do 4TB/hour of
> > > reads, which is now the bottleneck. (408/352/*3600 = 4172 GB/h)
> > >
> > > $ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
> > > --huge-table-chunk-pages=131072 -j 16 -f /tmp/parallel16.dump largedb
> > > real 5m44.900s
> > > user 53m50.491s
> > > sys 5m47.602s
> > >
> > > And 4 workers showed near-linear speedup from single worker
> > >
> > > hannuk(at)pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
> > > --format=directory -h 10.58.80.2 -U postgres
> > > --huge-table-chunk-pages=131072 -j 4 -f /tmp/parallel4.dump largedb
> > > real 10m32.074s
> > > user 38m54.436s
> > > sys 2m58.216s
> > >
> > > The database runs on a 64vCPU VM with 128GB RAM, so most of the table
> > > will be read in from the disk
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Nov 13, 2025 at 7:02 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > > >
> > > > I just ran a test by generating a 408GB table and then dumping it both ways
> > > >
> > > > $ time pg_dump --format=directory -h 10.58.80.2 -U postgres -f
> > > > /tmp/plain.dump largedb
> > > >
> > > > real 39m54.968s
> > > > user 37m21.557s
> > > > sys 2m32.422s
> > > >
> > > > $ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
> > > > --huge-table-chunk-pages=131072 -j 8 -f /tmp/parallel8.dump largedb
> > > >
> > > > real 5m52.965s
> > > > user 40m27.284s
> > > > sys 3m53.339s
> > > >
> > > > So parallel dump with 8 workers using 1GB (128k pages) chunks runs
> > > > almost 7 times faster than the sequential dump.
> > > >
> > > > this was a table that had no TOAST part. I will run some more tests
> > > > with TOASTed tables next and expect similar or better improvements.
> > > >
> > > >
> > > >
> > > > On Wed, Nov 12, 2025 at 1:59 PM Ashutosh Bapat
> > > > <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > > > >
> > > > > Hi Hannu,
> > > > >
> > > > > On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > > > > >
> > > > > > Attached is a patch that adds the ability to dump table data in multiple chunks.
> > > > > >
> > > > > > Looking for feedback at this point:
> > > > > > 1) what have I missed
> > > > > > 2) should I implement something to avoid single-page chunks
> > > > > >
> > > > > > The flag --huge-table-chunk-pages which tells the directory format
> > > > > > dump to dump tables where the main fork has more pages than this in
> > > > > > multiple chunks of given number of pages,
> > > > > >
> > > > > > The main use case is speeding up parallel dumps in case of one or a
> > > > > > small number of HUGE tables so parts of these can be dumped in
> > > > > > parallel.
> > > > >
> > > > > Have you measured speed up? Can you please share the numbers?
> > > > >
> > > > > --
> > > > > Best Wishes,
> > > > > Ashutosh Bapat
| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-17 04:15:17 |
| Message-ID: | CAFiTN-tV4jWKN75E5YLB-jSqb8j0E1PctiDjztv=ccfbe3YPmg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Attached is a patch that adds the ability to dump table data in multiple chunks.
>
> Looking for feedback at this point:
> 1) what have I missed
> 2) should I implement something to avoid single-page chunks
>
> The flag --huge-table-chunk-pages which tells the directory format
> dump to dump tables where the main fork has more pages than this in
> multiple chunks of given number of pages,
>
> The main use case is speeding up parallel dumps in case of one or a
> small number of HUGE tables so parts of these can be dumped in
> parallel.
>
+1 for the idea. I haven't done a detailed review, but while going
through the patch I noticed that we use pg_class->relpages to decide
whether to chunk a table. That should be fine, but don't you think that
using a direct size-calculation function like pg_relation_size() would
give a better idea and not depend on whether the stats are up to date?
That would make the chunking behavior more deterministic.
--
Regards,
Dilip Kumar
Google
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-24 21:02:15 |
| Message-ID: | CAMT0RQQPMj3=EZ-4z6qRs_TmBHoyv2VHAdMrfDuwa5ZUY6XtHQ@mail.gmail.com |
| Lists: | pgsql-hackers |
The expectation was that, as chunking is useful mainly for really huge
tables, ANALYZE should have been run "recently enough".
Maybe we should use pg_relation_size() once we have already determined
that the table is large enough to warrant chunking? Maybe at least 1/2
of the requested chunk size?
My reasoning was to not put too much extra load on pg_dump in case
chunking is not required. But of course we can use the presence of a
chunking request to decide to run pg_relation_size(), assuming the
overhead won't be too large in this case.
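For comparison, the two sources of the page count look like this (the
table name is just an example); the first is only as fresh as the last
ANALYZE/VACUUM, the second is exact but costs an extra call per table:

SELECT relpages
FROM pg_class
WHERE oid = 'just_toasted'::regclass;

SELECT pg_relation_size('just_toasted') / current_setting('block_size')::bigint;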
On Mon, Nov 17, 2025 at 5:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > Attached is a patch that adds the ability to dump table data in multiple chunks.
> >
> > Looking for feedback at this point:
> > 1) what have I missed
> > 2) should I implement something to avoid single-page chunks
> >
> > The flag --huge-table-chunk-pages which tells the directory format
> > dump to dump tables where the main fork has more pages than this in
> > multiple chunks of given number of pages,
> >
> > The main use case is speeding up parallel dumps in case of one or a
> > small number of HUGE tables so parts of these can be dumped in
> > parallel.
> >
>
> +1 for the idea, I haven't done the detailed review but I was just
> going through the patch, I noticed that we use pg_class->relpages to
> identify whether to chunk the table or not, which should be fine but
> don't you think if we use direct size calculation function like
> pg_relation_size() we might get better idea and not dependent upon
> whether the stats are updated or not? This will make chunking
> behavior more deterministic.
>
> --
> Regards,
> Dilip Kumar
> Google
| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2025-11-25 04:50:01 |
| Message-ID: | CAFiTN-scTeRAH0q2Ga3CLgkbcfcTi31cSw73ZVZntDQG7-fE+g@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Nov 25, 2025 at 2:32 AM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> The expectation was that as chunking is useful mainly in case of
> really huge tables the analyze should have been run "recently enough".
>
> Maybe we should use pg_relation_size() in case we have already
> determined that the table is large enough to warrant chunking? Maybe
> at least 1/2 of the requested chunk size?
>
> My reasoning was to not put too much extra load on pg_dump in case
> chunking is not required. But of course we can use the presence of a
> chunking request to decide to run pg_relation_size(), assuming the
> overhead won't be too large in this case.
>
>
> On Mon, Nov 17, 2025 at 5:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > >
> > > Attached is a patch that adds the ability to dump table data in multiple chunks.
> > >
> > > Looking for feedback at this point:
> > > 1) what have I missed
> > > 2) should I implement something to avoid single-page chunks
> > >
> > > The flag --huge-table-chunk-pages which tells the directory format
> > > dump to dump tables where the main fork has more pages than this in
> > > multiple chunks of given number of pages,
> > >
> > > The main use case is speeding up parallel dumps in case of one or a
> > > small number of HUGE tables so parts of these can be dumped in
> > > parallel.
> > >
> >
> > +1 for the idea, I haven't done the detailed review but I was just
> > going through the patch, I noticed that we use pg_class->relpages to
> > identify whether to chunk the table or not, which should be fine but
> > don't you think if we use direct size calculation function like
> > pg_relation_size() we might get better idea and not dependent upon
> > whether the stats are updated or not? This will make chunking
> > behavior more deterministic.
Yeah, that makes sense; we can use relpages for the initial
identification and then use pg_relation_size() if relpages says the
table is large enough.
--
Regards,
Dilip Kumar
Google
| From: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-01 07:01:04 |
| Message-ID: | CAFj8pRAp1Feh3O6NJkXuJQnG62YK-RJfgMfGQkXpe2PiPJS-Bg@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi
On Fri, Dec 12, 2025 at 9:02 AM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> Attached is a patch that adds the ability to dump table data in multiple
> chunks.
>
> Looking for feedback at this point:
> 1) what have I missed
> 2) should I implement something to avoid single-page chunks
>
> The flag --huge-table-chunk-pages which tells the directory format
> dump to dump tables where the main fork has more pages than this in
> multiple chunks of given number of pages,
>
> The main use case is speeding up parallel dumps in case of one or a
> small number of HUGE tables so parts of these can be dumped in
> parallel.
>
> It will also help in case the target file system has some limitations
> on file sizes (4GB for FAT, 5TB for GCS).
>
> Currently no tests are included in the patch and also no extra
> documentation outside what is printed out by pg_dump --help . Also any
> pg_log_warning lines with "CHUNKING" is there for debugging and needs
> to be removed before committing.
>
> As implemented no changes are needed for pg_restore as all chunks are
> already associated with the table in .toc and thus are restored into
> this table
>
> the attached README shows how I verified it works and the textual
> file created from the directory format dump in the last step there
>
I did a first look at this patch, and there are some whitespace issues.
Regards
Pavel
>
> --
> Hannu
>
| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-13 02:27:25 |
| Message-ID: | CAApHDvr8ay+31Wd0TptDGp8cAg2-NOnWddx8csnUE3R03EbvZw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> Added to https://commitfest.postgresql.org/patch/6219/
I think this could be useful, but I think you'll need to find a way to
not do this for non-heap tables. Per the comments in TableAmRoutine,
both scan_set_tidrange and scan_getnextslot_tidrange are optional
callback functions, and the planner won't produce TIDRangePaths if
either of them doesn't exist. Maybe that means you need to consult
pg_class.relam to ensure the amname is 'heap', or at least that
relam = 2. On testing Citus's columnar AM, I get:
postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
ERROR: UPDATE and CTID scans not supported for ColumnarScan
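(For reference, a check along the lines suggested above could be as
simple as the following sketch, with the table name being hypothetical:)

SELECT a.amname = 'heap' AS is_heap
FROM pg_class c
JOIN pg_am a ON a.oid = c.relam
WHERE c.oid = 'big_table'::regclass;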
1. For the patch, I think you should tighten the new option up to mean
the maximum segment size that a table will be dumped in. I see you
have comments like:
/* TODO: add hysteresis here, maybe < 1.1 * huge_table_chunk_pages */
You *have* to put the cutoff *somewhere*, so I think it very much
should be exactly the specified threshold. If anyone is unhappy that
some segments consist of a single page, then that's on them to adjust
the parameter accordingly. Otherwise, someone complains that they got
a 1-page segment when the table was 10.0001% bigger than the cutoff
and then we're tempted to add a new setting to control the 1.1 factor,
which is just silly. If there's a 1-page segment, so what? It's not a
big deal.
Perhaps --max-table-segment-pages is a better name than
--huge-table-chunk-pages, as it's quite subjective what the minimum
number of pages required to make a table "huge" is.
2. I'm not sure if you're going to get away with using relpages for
this. Is it really that bad to query pg_relation_size() when this
option is set? If it really is a problem, then maybe let the user
choose with another option. I understand we're using relpages for
sorting table sizes so we prefer dumping larger tables first, but that
just seems way less important if it's not perfectly accurate.
3. You should be able to simplify the code in dumpTableData() so
you're not adding any extra cases. You could use InvalidBlockNumber to
indicate an unbounded ctid range and only add ctid qual to the WHERE
clause when you have a bounded range (i.e not InvalidBlockNumber).
That way the first segment will need WHERE ctid <= '...' and the final
one will need WHERE ctid >= '...'. Everything in between will have an
upper and lower bound. That results in no ctid quals being added when
both ranges are set to InvalidBlockNumber, which you should use for
all tables not large enough to be segmented, thus no special case.
TID Range scans are perfectly capable of working when only bounded at one side.
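A rough sketch of the resulting quals (page numbers made up): only the
outer segments carry a one-sided qual.

-- first segment: upper bound only, covers blocks 0 .. 131071
SELECT * FROM big_table WHERE ctid < '(131072,0)';
-- last segment: lower bound only
SELECT * FROM big_table WHERE ctid >= '(393216,0)';
-- tables below the threshold: no ctid qual at all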
4. I think using "int" here is a future complaint waiting to happen.
+ if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
+ &dopt.huge_table_chunk_pages))
I bet we'll eventually see a complaint that someone can't make the
segment size larger than 16TB. I think option_parse_uint32() might be
called for.
David
| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-13 02:32:21 |
| Message-ID: | CAApHDvozJq5PH+M6UiGYsE2Vq5dwSQu7xvDyDAyYDkd-VJbigw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> Added to https://commitfest.postgresql.org/patch/6219/
I see you added the "Backport" tag in the CF. This isn't the sort of
thing that we'd do that for. Was that a mistake?
David
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 10:52:04 |
| Message-ID: | CAMT0RQT_yk32+ZyA-cBS9uMtAUOM-b+hqj0zT0Ge-ikFMrmw+A@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Jan 13, 2026 at 3:27 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > Added to https://commitfest.postgresql.org/patch/6219/
>
> I think this could be useful, but I think you'll need to find a way to
> not do this for non-heap tables. Per the comments in TableAmRoutine,
> both scan_set_tidrange and scan_getnextslot_tidrange are optional
> callback functions and the planner won't produce TIDRangePaths if
> either of those don't exist. Maybe that means you need to consult
> pg_class.relam to ensure the amname is 'heap' or at least the relam =
> 2.
Makes sense, will add.
> On testing Citus's columnar AM, I get:
> postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
> ERROR: UPDATE and CTID scans not supported for ColumnarScan
Should we just silently not chunk tables whose storage architecture
does not have TIDs, or should pg_dump just error out in this case?
I imagine the Citus columnar is often used with huge tables where
chunking would be most useful.
Later it likely makes sense to have another option for chunking other
types of tables, or maybe even add something to the TableAM for
chunking support.
> 1. For the patch, I think you should tighten the new option up to mean
> the maximum segment size that a table will be dumped in. I see you
> have comments like:
>
> /* TODO: add hysteresis here, maybe < 1.1 * huge_table_chunk_pages */
>
> You *have* to put the cutoff *somewhere*, so I think it very much
> should be exactly the specified threshold. If anyone is unhappy that
> some segments consist of a single page, then that's on them to adjust
> the parameter accordingly. Otherwise, someone complaints that they got
> a 1-page segment when the table was 10.0001% bigger than the cutoff
> and then we're tempted to add a new setting to control the 1.1 factor,
> which is just silly. If there's a 1-page segment, so what? It's not a
> big deal.
Agreed, will drop the TODO
> Perhaps --max-table-segment-pages is a better name than
> --huge-table-chunk-pages as it's quite subjective what the minimum
> number of pages required to make a table "huge".
I agree. My initial thinking was that it is mainly useful for huge
tables, but indeed that does not need to be reflected in the flag name.
> 2. I'm not sure if you're going to get away with using relpages for
> this. Is it really that bad to query pg_relation_size() when this
> option is set? If it really is a problem, then maybe let the user
> choose with another option. I understand we're using relpages for
> sorting table sizes so we prefer dumping larger tables first, but that
> just seems way less important if it's not perfectly accurate.
Yeah, I had thought of pg_relation_size() myself.
Another option would be something more complex which tries to estimate
the dump file sizes by figuring out the TOAST for each chunk. The thing
that makes this really complex is the possibly uneven distribution of
TOAST and the need to take into account both the compression of TOAST
AND the compression of the resulting dump file.
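As a rough first cut, the per-table TOAST share is at least cheap to
look at, e.g. something like the following (illustrative only, and it
says nothing about how the TOAST data is distributed across chunks):

SELECT pg_relation_size(c.oid)           AS main_fork_bytes,
       pg_relation_size(c.reltoastrelid) AS toast_bytes
FROM pg_class c
WHERE c.oid = 'just_toasted'::regclass;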
> 3. You should be able to simplify the code in dumpTableData() so
> you're not adding any extra cases. You could use InvalidBlockNumber to
> indicate an unbounded ctid range and only add ctid qual to the WHERE
> clause when you have a bounded range (i.e not InvalidBlockNumber).
> That way the first segment will need WHERE ctid <= '...' and the final
> one will need WHERE ctid >= '...'. Everything in between will have an
> upper and lower bound. That results in no ctid quals being added when
> both ranges are set to InvalidBlockNumber, which you should use for
> all tables not large enough to be segmented, thus no special case.
Makes sense, will look into it.
> TID Range scans are perfectly capable of working when only bounded at one side.
>
> 4. I think using "int" here is a future complaint waiting to happen.
>
> + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> + &dopt.huge_table_chunk_pages))
>
> I bet we'll eventually see a complaint that someone can't make the
> segment size larger than 16TB. I think option_parse_uint32() might be
> called for.
There can be no more than 2 * INT32_MAX pages anyway.
I thought half of the maximum possible size should be enough.
Do you really think that somebody would want that?
> David
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 10:52:54 |
| Message-ID: | CAMT0RQR1C8YbrAA4kXLt7QSPkFKkMc29Uh1aMPSmEiv97gkmvg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Jan 13, 2026 at 3:32 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > Added to https://commitfest.postgresql.org/patch/6219/
>
> I see you added the "Backport" tag in the CF. This isn't the sort of
> thing that we'd do that for. Was that a mistake?
Just wanted to mark it as something that might be backported later, as
one of the important cases for pg_dump is dumping older databases.
Will remove it if the tag means something else.
| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 21:10:55 |
| Message-ID: | CAApHDvr1nRk=ZDszZOUrh-UXbE-6Fi6cDN5pEF+ZEcqJRnZbfw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, 14 Jan 2026 at 23:52, Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> On Tue, Jan 13, 2026 at 3:27 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > On testing Citus's columnar AM, I get:
> > postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
> > ERROR: UPDATE and CTID scans not supported for ColumnarScan
>
> Should we just silently not chunk tables that have some storage
> architecture that does not have tids, or should pg_dump just error out
> in thiscase ?
I think you should just document that it only applies to heap tables.
I don't think erroring out is useful to anyone, especially if the
error only arrives after pg_dump has been running for several hours or
even days.
> > 4. I think using "int" here is a future complaint waiting to happen.
> >
> > + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> > + &dopt.huge_table_chunk_pages))
> >
> > I bet we'll eventually see a complaint that someone can't make the
> > segment size larger than 16TB. I think option_parse_uint32() might be
> > called for.
>
> There can be no more than 2 * INT2_MAX pages anyway.
> I thought half of the max possible size should be enough.
> Do you really think that somebody would want that ?
IMO, if the option can't represent the full range of BlockNumber, then
that's a bug.
I see pg_resetwal has recently invented strtouint32_strict for this.
It might be a good idea to refactor that and put it into
option_utils.c rather than having each client app have to invent their
own method.
David
| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 21:40:21 |
| Message-ID: | iqhlowrl7c7cmu3e35vzg65qf3rccilvpo7eeymh7pakp3xdqw@gh322zepbst6 |
| Lists: | pgsql-hackers |
On 2026-01-14 11:52:54 +0100, Hannu Krosing wrote:
> On Tue, Jan 13, 2026 at 3:32 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> >
> > On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > > Added to https://commitfest.postgresql.org/patch/6219/
> >
> > I see you added the "Backport" tag in the CF. This isn't the sort of
> > thing that we'd do that for. Was that a mistake?
>
> Just wanted to mark it as something that might be backported later as
> one of the important cases for pg_dump is dumping older databases .
I think it's obvious that nothing that's being discussed here has any business
being considered for backporting.
Nor can I follow the argument that dumping of old databases would be an
argument, as the proposed change is in pg_dump, not the server, so a new
pg_dump will suffice to get the benefit, no?
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 21:46:07 |
| Message-ID: | CAMT0RQQVSr6oQB0N2s0M_tENQzp-KLV-ze-sij7mHky7LTzHbg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, Jan 14, 2026 at 10:40 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2026-01-14 11:52:54 +0100, Hannu Krosing wrote:
> > On Tue, Jan 13, 2026 at 3:32 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > >
> > > On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > > > Added to https://commitfest.postgresql.org/patch/6219/
> > >
> > > I see you added the "Backport" tag in the CF. This isn't the sort of
> > > thing that we'd do that for. Was that a mistake?
> >
> > Just wanted to mark it as something that might be backported later as
> > one of the important cases for pg_dump is dumping older databases .
>
> I think it's obvious that nothing that's being discussed here has any business
> being considered for backporting.
Do we have clear written guidelines about what can and can not be backported ?
And do we distinguish between the core database, extensions and tools for this?
> Nor can I follow the argument that dumping of old databases would be an
> argument, as the proposed change is in pg_dump, not the server, so a new
> pg_dump will suffice to get the benefit, no?
I was not sure if pg_restore is backwards compatible enough to be able
to restore from newer pg_dump versions.
----
Hannu
| From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 21:53:28 |
| Message-ID: | aWgQWHCfKCWx6aZ6@nathan |
| Lists: | pgsql-hackers |
On Wed, Jan 14, 2026 at 10:46:07PM +0100, Hannu Krosing wrote:
> Do we have clear written guidelines about what can and can not be backported ?
> And do we distinguish between the core database, extensions and tools for this?
From https://www.postgresql.org/support/versioning:
Minor releases only contain fixes for frequently-encountered bugs,
low-risk fixes, security issues, and data corruption problems.
--
nathan
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 22:10:38 |
| Message-ID: | CAMT0RQSQG0gnLQbKnB00CupwuE4pjyXR91-VdpHQz4=m1MULVQ@mail.gmail.com |
| Lists: | pgsql-hackers |
Thanks Nathan,
Are these the only guidelines we have for this?
I did not recognise it at once, but I assume this also implies things
about backporting, as the only way to backport something is to put it
in a minor release.
I was hoping that some exceptions were possible for things affecting
interactions between different versions that also need support from
older versions, but if this page is all we have then likely not.
BTW, this is also why I am not entirely happy about logical
replication being part of core, as this implies that no bigger
improvements are possible for the cases where it is used for version
upgrades. Or at least the improvements would be of no use for
upgrading existing databases.
Do we have the same strict no-improvements-in-minor-versions policy
for contrib/ extensions and tools?
On Wed, Jan 14, 2026 at 10:53 PM Nathan Bossart
<nathandbossart(at)gmail(dot)com> wrote:
>
> On Wed, Jan 14, 2026 at 10:46:07PM +0100, Hannu Krosing wrote:
> > Do we have clear written guidelines about what can and can not be backported ?
> > And do we distinguish between the core database, extensions and tools for this?
>
> From https://www.postgresql.org/support/versioning:
>
> Minor releases only contain fixes for frequently-encountered bugs,
> low-risk fixes, security issues, and data corruption problems.
>
> --
> nathan
| From: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 22:17:11 |
| Message-ID: | CAKFQuwbQFs06FgR3q7V0fd1PibCYZTpSDGBc3oFYn9OuO0xD6Q@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
On Wednesday, January 14, 2026, Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Do we have the same strict no improvements in minor versions policy
> for contrib/ extensions and tools?
>
Yes, that policy applies to the entire repository/project.
David J.
| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-14 22:17:27 |
| Message-ID: | xjgdk3zpejnpm6buctc2z6f5hjyfncml4gbcdwx2mm7p6hfrn6@a63co5x57nll |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hi,
On 2026-01-14 23:10:38 +0100, Hannu Krosing wrote:
> Is this the whole guidelines we have for this ?
I don't think we have more written down in a central place.
> Did not recognise it at once, but I assume that this implies also
> things about backporting, as only way to backport something is to put
> it in a minor release.
Correct.
> I was hoping that there were some exceptions possible for things
> affecting interactions between different versions that need also
> support from older versions, but if this page is all we have then
> likely not.
I think we have made maybe a handful of exceptions over the years, not more.
> Do we have the same strict no improvements in minor versions policy
> for contrib/ extensions and tools?
Yes.
Greetings,
Andres Freund
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de>, Devrim Gündüz <devrim(at)gunduz(dot)org> |
| Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-15 09:46:36 |
| Message-ID: | CAMT0RQSudopnkyrk+k=5naHo2GiOcJsSGQBOfF1ovsbTs7N6Ew@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
On Wed, Jan 14, 2026 at 11:17 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-01-14 23:10:38 +0100, Hannu Krosing wrote:
> > Is this the whole guidelines we have for this ?
>
> I don't think we have more written down in a central place.
>
>
> > Did not recognise it at once, but I assume that this implies also
> > things about backporting, as only way to backport something is to put
> > it in a minor release.
>
> Correct.
Thanks for confirming!
> > I was hoping that there were some exceptions possible for things
> > affecting interactions between different versions that need also
> > support from older versions, but if this page is all we have then
> > likely not.
>
> I think we have made maybe a handful of exceptions over the years, not more.
>
> > Do we have the same strict no improvements in minor versions policy
> > for contrib/ extensions and tools?
>
> Yes.
I have confirmed that one cannot load a -Fd dump from a newer version
of pg_dump using an older version of pg_restore.
It plainly refuses without even trying.
~/tmp$ /usr/lib/postgresql/17/bin/pg_dump -p 5433 -Fd testload -f dump17
~/tmp$ /usr/lib/postgresql/16/bin/pg_restore -p 5432 -C -d postgres dump17
pg_restore: error: unsupported version (1.16) in file header
Have we thought of ways other than backporting to support clients
who may need improvements for older versions ?
I know that 2ndQuadrant maintained its own distribution for a while
for enterprise customers who absolutely needed some improvements.
Could PGDG repos, for example, have a separate "unofficial" backports
section for this?
----
Hannu
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-19 19:01:29 |
| Message-ID: | CAMT0RQShjXPPdXQS-5uzDC3bXt+QEZR5tO02o1NHdXWNu2quvw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Here is a new patch which has
* changed flag name to max-table-segment-pages
* added check for amname = "heap"
* made the table info query use pg_relation_size() to get relpages if
--max-table-segment-pages is set
* added simple chunked dump and restore test
Currently there is no check for actual restore integrity; this is
what t/002_pg_dump.pl says:
# TODO: Have pg_restore actually restore to an independent
# database and then pg_dump *that* database (or something along
# those lines) to validate that part of the process.
As my perl-fu is weak I did not build a new facility for full
restored-data checking, but I did add simple count + table hash
warnings for the original and restored data so I could manually verify
the restore.
I added this for the original and the chunked-restore database:
DO \$\$
DECLARE
thash_rec RECORD;
BEGIN
SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash
INTO thash_rec
FROM tplain AS t;
RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
END;
\$\$;
And this is the verification I did after running `make check` in
src/bin/pg_dump/
hannu(at)HK395:~/work/pggit/src/bin/pg_dump$ grep "WARNING.*thash"
tmp_check/log/004_pg_dump_parallel_main.log
RAISE WARNING 'thash: %', thash_rec;
2026-01-19 19:27:57.444 CET client backend[678937]
004_pg_dump_parallel.pl WARNING: thash: (tplain,1000,38441792160)
RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
2026-01-19 19:27:57.605 CET client backend[678985]
004_pg_dump_parallel.pl WARNING: thash after parallel chunked
restore: (tplain,1000,38441792160)
As you see both have 1000 rows with sum of full row hashes == 38441792160
Other rows in the same log file show that it was dumped as 3 chunks,
as I still have the warnings in the code which show the query used.
Anyone with a better understanding of our Perl tests is welcome to
turn this into proper tests or advise me where to find info on how to
do it.
On Tue, Jan 13, 2026 at 3:27 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
...
> 3. You should be able to simplify the code in dumpTableData() so
> you're not adding any extra cases. You could use InvalidBlockNumber to
> indicate an unbounded ctid range and only add ctid qual to the WHERE
> clause when you have a bounded range (i.e not InvalidBlockNumber).
> That way the first segment will need WHERE ctid <= '...' and the final
> one will need WHERE ctid >= '...'. Everything in between will have an
> upper and lower bound. That results in no ctid quals being added when
> both ranges are set to InvalidBlockNumber, which you should use for
> all tables not large enough to be segmented, thus no special case.
>
> TID Range scans are perfectly capable of working when only bounded at one side.
I changed the last open-ended chunk to use ctid >= (N,1) for clarity
but did not change anything else.
To me it looked like having a loop around the whole thing when there
is no chunking would complicate things for anyone reading the code.
> 4. I think using "int" here is a future complaint waiting to happen.
>
> + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> + &dopt.huge_table_chunk_pages))
>
> I bet we'll eventually see a complaint that someone can't make the
> segment size larger than 16TB. I think option_parse_uint32() might be
> called for.
I have not done anything with this yet, so the maximum chunk size
for now is half of the maximum relpages.
| Attachment | Content-Type | Size |
|---|---|---|
| v7-0001-changed-flag-name-to-max-table-segment-pages.patch | application/x-patch | 14.5 KB |
| From: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-19 21:15:33 |
| Message-ID: | CAN4CZFOTOX1nFkPrmM7fQ9qnrccvjwywXPf4Eo7C+DF-h-x96g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hello
pgdump.c:7174
+ appendPQExpBufferStr(query, "pg_relation_size(c.tableoid)/8192 AS
relpages, ");
Shouldn't this be something like
+ appendPQExpBufferStr(query,
"pg_relation_size(c.oid)/current_setting('block_size')::int AS
relpages, ");
instead?
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-19 23:07:56 |
| Message-ID: | CAMT0RQTbQxjdN5nv6M_HhFWiLqdT84=NYBM1ZYQKaAcf8Ufyaw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Thanks Zsolt
I changed it to use the configured BLCKSZ in the attached patch.
But you may be right that current_setting('block_size')::int is the
better way to go, as we are interested in the page size of the target
database, not what pg_dump was compiled with.
I'll wait for other feedback as well and then send the next patch with changes.
On Mon, Jan 19, 2026 at 10:15 PM Zsolt Parragi
<zsolt(dot)parragi(at)percona(dot)com> wrote:
>
> Hello
>
> pgdump.c:7174
>
> + appendPQExpBufferStr(query, "pg_relation_size(c.tableoid)/8192 AS
> relpages, ");
>
> Shouldn't this be something like
>
> + appendPQExpBufferStr(query,
> "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> relpages, ");
>
> instead?
| Attachment | Content-Type | Size |
|---|---|---|
| v8-0001-changed-flag-name-to-max-table-segment-pages.patch | application/x-patch | 14.6 KB |
| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-20 02:20:13 |
| Message-ID: | CAApHDvruGg8u9t2OQbVU+Sxe31H0CuJi5S69FjZXds1Y3Fx3HA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
On Tue, 20 Jan 2026 at 08:01, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> I changed the last open-ended chunk to use ctid >= (N,1) for clarity
> but did not change anything else.
You have:
int max_table_segment_pages; /* chunk when relpages is above this */
and:
opts->max_table_segment_pages = UINT32_MAX; /* == InvalidBlockNumber,
disable chunking by default */
It's not valid to assign UINT32_MAX to a signed int.
> To me it looked like having a loop around the whole thing when there
> is no chunking would complicate things for anyone reading the code.
The problem I have with it is the duplicate code. If you don't want to
loop around the standard code, then make a function and call that
instead of copying and pasting the code.
I'd also get rid of the "chunking" boolean and make use of
InvalidBlockNumber to determine if the range is constrained. It also
seems very strange that you opted to do that just for endPage and not
for startPage.
> > 4. I think using "int" here is a future complaint waiting to happen.
> >
> > + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> > + &dopt.huge_table_chunk_pages))
> >
> > I bet we'll eventually see a complaint that someone can't make the
> > segment size larger than 16TB. I think option_parse_uint32() might be
> > called for.
>
> I have not yet done anything with this yet, so the maximum chunk size
> for now is half of the maximum relpages.
OK. I can look again once all that's done.
David
| From: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-20 06:13:56 |
| Message-ID: | CAN4CZFOjG2kUciKeVBpxrHJZNkZuzY_Q5ij_qpDZcAS3Ak2GxA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hello
I changed two things on that line in my previous email. I think
c.tableoid is also wrong: pg_relation_size(c.tableoid) will return the
size of pg_class, not the size of the relation in question; that
should be pg_relation_size(c.oid).
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-20 12:48:44 |
| Message-ID: | CAMT0RQStjytRrGTe0X03ErC7anwxNRHAULYBsSmdWZV3fr4-Dg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
On Tue, Jan 20, 2026 at 7:14 AM Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> wrote:
>
> Hello
>
> I changed two things on the line in my previous email, I think
> c.tableoid is also wrong, pg_relation_size(c.tableoid) will return the
> size of pg_class, not the size of the relation in question, that
> should be pg_relation_size(c.oid)
Thanks, a good catch :)
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com>, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-21 13:05:46 |
| Message-ID: | CAMT0RQROPMSPwfxAUCm1gZs9cUr7FmvwX+eO6Wzq_wWdd6eEAQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Please find the latest patch attached which incorporates the feedback received.
* changed flag name to --max-table-segment-pages
* added check for amname = "heap"
* switched to using
pg_relation_size(c.oid)/current_setting('block_size')::int when
--max-table-segment-pages is set
* added option_parse_uint32(...) to be used for the full range of page numbers
* The COPY SELECTs now use <= , BETWEEN or >= depending on the segment position
* added documentation
* TESTS:
* added simple chunked dump and restore test
* added a WARNING with count and table data hash to source and
chunked restore database
I left in the boolean to indicate if this is a full table or a chunk
(it was named chunking, now is_segment).
An alternative would be to use an expression like (td->startPage != 0
|| td->endPage != InvalidBlockNumber) whenever td->is_segment is
needed.
If you insist on not having a separate structure member we could turn
this into something like this
#define is_segment(td) ((td->startPage != 0 || td->endPage !=
InvalidBlockNumber))
and then use is_segment(td) instead of td->is_segment where needed.
| Attachment | Content-Type | Size |
|---|---|---|
| v9-0001-changed-flag-name-to-max-table-segment-pages.patch | application/x-patch | 19.3 KB |
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com>, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-22 17:05:04 |
| Message-ID: | CAMT0RQQ8DX+K7OTw3Lg+Yp2ew8TsZduiqtPszfiBixcpxKbz-A@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Fixing all the warnings
On Wed, Jan 21, 2026 at 2:05 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Please find the latest patch attached which incorporates the feedback received.
>
> * changed flag name to --max-table-segment-pages
> * added check for amname = "heap"
> * switched to using of
> pg_relation_size(c.oid)/current_setting('block_size')::int when
> --max-table-segment-pages is set
> * added option_parse_uint32(...) to be used for full range of pages numbers
> * The COPY SELECTs now use <= , BETWEEN or >= depending on the segment position
>
> * added documentation
>
> * TESTS:
> * added simple chunked dump and restore test
> * added a WARNING with count and table data hash to source and
> chunked restore database
>
> I left in the boolean to indicate if this is a full table or chunk
> (was named chunking, nor is_segment)
>
> An a lternative would be to use an expression like (td->startPage != 0
> || td->endPage != InvalidBlockNumber) whenever td->is_segment is
> needed
>
> If you insist on not having a separate structure member we could turn
> this into something like this
>
> #define is_segment(td) ((td->startPage != 0 || td->endPage !=
> InvalidBlockNumber))
>
> and then use is_segment(td) instead of td->is_segment where needed.
| Attachment | Content-Type | Size |
|---|---|---|
| v10-0001-changed-flag-name-to-max-table-segment-pages.patch | application/x-patch | 19.4 KB |
| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-23 02:15:37 |
| Message-ID: | CAApHDvo29-vQz=xV6+x5hU--NZ9qGPXsCNBuOAf88pAHjTpvvQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Fixing all the warnings
I think overall this needs significantly more care and precision than
what you've given it so far. For example, you have:
+ if(dopt->max_table_segment_pages != InvalidBlockNumber)
+ appendPQExpBufferStr(query,
"pg_relation_size(c.oid)/current_setting('block_size')::int AS
relpages, ");
+ else
+ appendPQExpBufferStr(query, "c.relpages, ");
Note that pg_class.relpages is "int". Later the code in master does:
tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
num_pages;" that the value stored in relpages will be negative when
the table is >= 16TB (assuming 8k pages). Your pg_relation_size
expression is not going to produce an INT. It'll produce a BIGINT, per
"select pg_typeof(pg_relation_size('pg_class') /
current_setting('block_size')::int);". So the atoi() can receive a
string of digits representing an integer larger than INT_MAX in this
case. Looking at [1], I see:
"7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
and atoll need not affect the value of the integer expression errno on
an error. If the value of the result cannot be represented, *the
behavior is undefined.*"
And testing locally, I see that my Microsoft compiler will just return
INT_MAX on overflow, whereas I see gcc does nothing to prevent
overflows and just continues to multiply by 10 regardless of what
overflows occur, which I think would just make the code work by
accident.
Aside from that, nothing in the documentation mentions that this is
for "heap" tables only. That should be mentioned as it'll just result
in people posting questions about why it's not working for some other
table access method. There's also not much care for white space.
You've introduced a bunch of whitespace changes unrelated to code
changes you've made, plus there's not much regard for following
project standard. For example, you commonly do "if(" and don't
consistently follow the bracing rules, e.g:
+ for(chkptr = optarg; *chkptr != '\0'; chkptr++)
+ if(*chkptr == '-')
Things like the following help convey the level of care that's gone into this:
+/*
+ * option_parse_int
+ *
+ * Parse integer value for an option. If the parsing is successful, returns
+ * true and stores the result in *result if that's given; if parsing fails,
+ * returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
i.e zero effort gone in to modify the comments after pasting them from
option_parse_int().
Another example:
+ pg_log_error("%s musst be in range %lu..%lu",
Also, I have no comprehension of why you'd use uint64 for the valid
range when the function is for processing uint32 types in:
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+ uint64 min_range, uint64 max_range,
+ uint32 *result)
In its current state, it's quite hard to take this patch seriously.
Please spend longer self-reviewing it before posting. You could
temporarily hard-code something for testing which makes at least 1
table appear to be larger than 16TB and ensure your code works. What
you have is visually broken and depends on whatever the atoi
implementation opts to do in the overflow case. These are all things
diligent committers will be testing and it's sad to see how little
effort you're putting into this. How do you expect this community to
scale with this quality level of patch submissions? You've been around
long enough and should know and do better. Are you just expecting the
committer to fix these things for you? That work does not get done via
magic wand. Being on v10 already, I'd have expected the patch to be
far beyond proof of concept grade. If you're withholding investing
time on this until you see more community buy-in, then I'd suggest you
write that and withhold further revisions until you're happy with the
level of buy-in.
I'm also still not liking your de-normalised TableInfo representation
for "is_segment". IMO, InvalidBlockNumber should be used to represent
open bounded ranges, and if there's no chunking, then startPage and
endPage will both be InvalidBlockNumber. IMO, what you have now
needlessly allows invalid states where is_segment == true and
startPage, endPage are not set correctly. If you want to keep the code
simple, hide the complexity in a macro or an inline function. There's
just no performance reason to materialise the more complex condition
into a dedicated boolean flag.
If the quality level of this has not improved significantly by v11,
count me out.
David
[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-27 22:43:44 |
| Message-ID: | CAMT0RQQT1dQTPj4etRTc0877mirCPVKzjbF_U5KRAyPwhHMr0Q@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hi David
Thanks for reviewing.
Please hold back reviewing this v12 patch until I have verified that
it passes CfBot and until I have finished my testing with a 17TB table.
I would appreciate pointers or help on adding the data correctness
tests to the tap tests as real tests.
For now I put them as DO$$ ... $$ blocks in the parallel test .pl and
I check manually that the table data checksums match
If we add them we should add them to other places in pg_dump tests as
well. Currently we just test that dump and restore do not fail, not
that the restored data is correct.
On Fri, Jan 23, 2026 at 3:15 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > Fixing all the warnings
>
> I think overall this needs significantly more care and precision than
> what you've given it so far. For example, you have:
>
> + if(dopt->max_table_segment_pages != InvalidBlockNumber)
> + appendPQExpBufferStr(query,
> "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> relpages, ");
> + else
> + appendPQExpBufferStr(query, "c.relpages, ");
>
> Note that pg_class.relpages is "int". Later the code in master does:
>
> tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
I have now fixed the base issue by changing the data type of
TableInfo.relpages to BlockNumber, and also changed the way we get it
by
1. converting it to unsigned int ( c.relpages::oid ) in the query
2. reading it from the result using strtoul()
(technically it should have been enough to just use strtoul() as it
already wraps signed ints to unsigned ones, but having it converted in
the query seems cleaner)
This allowed removing casts to (BlockNumber) everywhere where
.relpages was used.
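Roughly along these lines (a simplified sketch for illustration, not the
exact patch code; treating unparsable input as 0 is just illustrative):

unsigned long val;
char       *endptr;

errno = 0;
val = strtoul(PQgetvalue(res, i, i_relpages), &endptr, 10);
if (errno != 0 || *endptr != '\0')
    val = 0;                    /* sketch: fall back to "no size known" */
tblinfo[i].relpages = (BlockNumber) val;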
Functionally the value was only ever used for ordering, and even that
loosely, which explains why patch v10 did not break anything.
I also changed the data type of TocEntry.dataLength from pgoff_t to
uint64. The current code clearly had an overflow in the case when off_t
was 32-bit and the sum of relpages from heap and toast was larger than
it allows.
> If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> num_pages;" that the value stored in relpages will be negative when
> the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> expression is not going to produce an INT. It'll produce a BIGINT, per
> "select pg_typeof(pg_relation_size('pg_class') /
> current_setting('block_size')::int);". So the atoi() can receive a
> string of digits representing an integer larger than INT_MAX in this
> case. Looking at [1], I see:
As said above, this should be fixed now by using the correct type in the
struct and strtoul().
To be sure, I have now created a 17TB table and am running some tests on
it as well.
Will let you know here when done.
> "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> and atoll need not affect the value of the integer expression errno on
> an error. If the value of the result cannot be represented, *the
> behavior is undefined.*"
>
> And testing locally, I see that my Microsoft compiler will just return
> INT_MAX on overflow, whereas I see gcc does nothing to prevent
> overflows and just continues to multiply by 10 regardless of what
> overflows occur, which I think would just make the code work by
> accident.
As .relpages was only ever used for ordering parallel copies, it does
work, just not optimally.
The old code has a similar overflow/wraparound for the case when off_t is
a 32-bit int and the sum of relpages from the heap and toast table is
above INT_MAX.
I have removed the whole part where this was partially fixed for the
case when one of them was > 0x7fffffff (i.e. negative) by pinning the
dataLength to INT_MAX in that case.
> Aside from that, nothing in the documentation mentions that this is
> for "heap" tables only. That should be mentioned as it'll just result
> in people posting questions about why it's not working for some other
> table access method. There's also not much care for white space.
> You've introduced a bunch of whitespace changes unrelated to code
> changes you've made, plus there's not much regard for following
> project standard. For example, you commonly do "if(" and don't
> consistently follow the bracing rules, e.g:
>
> + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> + if(*chkptr == '-')
I assumed that it is the classical "single statement -- no braces" rule.
Do we have a writeup of our coding standards somewhere ?
Now this specific case is rewritten using while() so it should be ok.
> Things like the following help convey the level of care that's gone into this:
>
> +/*
> + * option_parse_int
> + *
> + * Parse integer value for an option. If the parsing is successful, returns
> + * true and stores the result in *result if that's given; if parsing fails,
> + * returns false.
> + */
> +bool
> +option_parse_uint32(const char *optarg, const char *optname,
>
> i.e zero effort gone in to modify the comments after pasting them from
> option_parse_int().
>
> Another example:
>
> + pg_log_error("%s musst be in range %lu..%lu",
>
> Also, I have no comprehension of why you'd use uint64 for the valid
> range when the function is for processing uint32 types in:
The uint64 there I picked up from the long unsigned usage referenced in
pg_resetwal, after I managed to get pg_log_warning to print out -1 for
format %u and did not want to go debugging why that happens.
I have now made all the arguments uint32.
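So the function now looks roughly like this (a sketch mirroring
option_parse_int(); the strtoul() checks and the message wording are
illustrative, not the exact patch code):

bool
option_parse_uint32(const char *optarg, const char *optname,
                    uint32 min_range, uint32 max_range,
                    uint32 *result)
{
    char       *endptr;
    unsigned long val;

    errno = 0;
    val = strtoul(optarg, &endptr, 10);

    /* reject empty input, a leading '-', trailing garbage and overflow */
    if (optarg[0] == '\0' || optarg[0] == '-' || *endptr != '\0' ||
        errno == ERANGE || val < min_range || val > max_range)
    {
        pg_log_error("%s must be in range %u..%u",
                     optname, min_range, max_range);
        return false;
    }
    if (result)
        *result = (uint32) val;
    return true;
}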
> +bool
> +option_parse_uint32(const char *optarg, const char *optname,
> + uint64 min_range, uint64 max_range,
> + uint32 *result)
>
> In its current state, it's quite hard to take this patch seriously.
> Please spend longer self-reviewing it before posting. You could
> temporarily hard-code something for testing which makes at least 1
> table appear to be larger than 16TB and ensure your code works. What
> you have is visually broken and depends on whatever the atoi
> implementation opts to do in the overflow case. These are all things
> diligent commiters will be testing and it's sad to see how little
> effort you're putting into this. How do you expect this community to
> scale with this quality level of patch submissions? You've been around
> long enough and should know and do better. Are you just expecting the
> committer to fix these things for you? That work does not get done via
> magic wand. Being on v10 already, I'd have expected the patch to be
> far beyond proof of concept grade. If you're withholding investing
> time on this until you see more community buy-in, then I'd suggest you
> write that and withhold further revisions until you're happy with the
> level of buy-in.
> I'm also still not liking your de-normalised TableInfo representation
> for "is_segment".
> IMO, InvalidBlockNumber should be used to represent
> open bounded ranges, and if there's no chunking, then startPage and
> endPage will both be InvalidBlockNumber.
That's what I ended up doing
I switched to using startPage = InvalidBlockNumber to indicate that no
chunking is in effect.
This is safe because when chunking is in use I always set both
chunk end pages where possible, and I can always set the lower bound.
Only for the last chunk is endPage left as InvalidBlockNumber.
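So the check effectively becomes (a sketch; the field names are as in the
earlier snippets):

/*
 * startPage == InvalidBlockNumber  -> whole table, no chunking
 * endPage   == InvalidBlockNumber  -> open-ended last chunk
 */
#define is_segment(tdinfo) ((tdinfo)->startPage != InvalidBlockNumber)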
> IMO, what you have now
> needlessly allows invalid states where is_segment == true and
> startPage, endPage are not set correctly. If you want to keep the code
> simple, hide the complexity in a macro or an inline function. There's
> just no performance reason to materialise the more complex condition
> into a dedicated boolean flag.
| Attachment | Content-Type | Size |
|---|---|---|
| v12-0001-changed-flag-name-to-max-table-segment-pages.patch | text/x-patch | 21.9 KB |
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-28 17:29:58 |
| Message-ID: | CAMT0RQQKWWrYYYQ8QNurs7hXBC5DwBAV6b0JmHKvJk7wZnup0g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
v13 has added a proper test comparing original and restored table data
On Tue, Jan 27, 2026 at 11:43 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Hi David
>
> Thanks for reviewing.
>
> Please hold back reviewing this v12 patch until I have verified that
> it passes CfBot and until I have finished my testing with 17TB table.
>
> I would appreciate pointers or help on adding the data correctness
> tests to the tap tests as real tests.
> For now I put them as DO$$ ... $$ blocks in the parallel test .pl and
> I check manually that the table data checksums match
> If we add them we should add them to other places in pg_dump tests as
> well. Currently we just test that dump and restore do not fail, not
> that the restored data is correct.
>
> On Fri, Jan 23, 2026 at 3:15 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> >
> > On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <hannuk(at)google(dot)com> wrote:
> > >
> > > Fixing all the warnings
> >
> > I think overall this needs significantly more care and precision than
> > what you've given it so far. For example, you have:
> >
> > + if(dopt->max_table_segment_pages != InvalidBlockNumber)
> > + appendPQExpBufferStr(query,
> > "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> > relpages, ");
> > + else
> > + appendPQExpBufferStr(query, "c.relpages, ");
> >
> > Note that pg_class.relpages is "int". Later the code in master does:
> >
> > tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
>
> I have now fixed the base issue by changing the data type of
> TableInfo.relpages to BlockNumber, and also changed the way we get it
> by
>
> 1. converting it to unsigned int ( c.relpages::oid ) in the query
> 2. reading it from the result using strtoul()
>
> (technically it should have been enough to just use strtoul() as it
> already wraps signed ints to unsigned ones, but having it converted in
> the query seems cleaner)
>
> This allowed removing casts to (BlockNumber) everywhere where
> .relpages was used.
>
> Functionally value was ever only used for ordering and even this
> loosley, which explains why patch v10 did not break anything.
>
> I also changed the data type of TocEntry.dataLength from pgoff_t to
> uint64. The current clearly had an overflow in case when off_t was 32
> bit and sum of relpages from heap and toast was larger than allowed
> for it.
>
> > If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> > num_pages;" that the value stored in relpages will be negative when
> > the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> > expression is not going to produce an INT. It'll produce a BIGINT, per
> > "select pg_typeof(pg_relation_size('pg_class') /
> > current_setting('block_size')::int);". So the atoi() can receive a
> > string of digits representing an integer larger than INT_MAX in this
> > case. Looking at [1], I see:
>
> As said above this should be fixed now by using correct type in struch
> and strtoul().
> To be sure I have now created a 17TB table and running some tests on
> this as well.
> Will let you know here when done.
>
> > "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> > and atoll need not affect the value of the integer expression errno on
> > an error. If the value of the result cannot be represented, *the
> > behavior is undefined.*"
> >
> > And testing locally, I see that my Microsoft compiler will just return
> > INT_MAX on overflow, whereas I see gcc does nothing to prevent
> > overflows and just continues to multiply by 10 regardless of what
> > overflows occur, which I think would just make the code work by
> > accident.
>
> As .relpages was only ever used for ordering parallel copies it does
> work just not optimally.
>
> The old code has similar overflow/wraparound for case when off_t is 32
> bit int and the sum of relpages from heap and toast table is above
> INT_MAX
>
> I have removed the whole part where this was partially fixed for the
> case when one of them was > 0x7fffffff (i.e. negative) by pinning the
> dataLength to INT_MAX in that case
>
> > Aside from that, nothing in the documentation mentions that this is
> > for "heap" tables only. That should be mentioned as it'll just result
> > in people posting questions about why it's not working for some other
> > table access method. There's also not much care for white space.
> > You've introduced a bunch of whitespace changes unrelated to code
> > changes you've made, plus there's not much regard for following
> > project standard. For example, you commonly do "if(" and don't
> > consistently follow the bracing rules, e.g:
> >
> > + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> > + if(*chkptr == '-')
>
> I assumed that it is the classical "single statemet -- no braces.
>
> Do we have a writeup of our coding standards somewhere ?
>
> Now this specific case is rewritten using while() so shoud be ok.
>
> > Things like the following help convey the level of care that's gone into this:
> >
> > +/*
> > + * option_parse_int
> > + *
> > + * Parse integer value for an option. If the parsing is successful, returns
> > + * true and stores the result in *result if that's given; if parsing fails,
> > + * returns false.
> > + */
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> >
> > i.e zero effort gone in to modify the comments after pasting them from
> > option_parse_int().
> >
> > Another example:
> >
> > + pg_log_error("%s musst be in range %lu..%lu",
> >
> > Also, I have no comprehension of why you'd use uint64 for the valid
> > range when the function is for processing uint32 types in:
>
> The uint64 there I picked up from the referenced long unsigned usage
> in pg_resetval after I managed to get pg_log_warning to print out -1
> for format %u and did not want to go to debug why that happens.
>
> I have now made all the arguments uint32
>
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> > + uint64 min_range, uint64 max_range,
> > + uint32 *result)
> >
> > In its current state, it's quite hard to take this patch seriously.
> > Please spend longer self-reviewing it before posting. You could
> > temporarily hard-code something for testing which makes at least 1
> > table appear to be larger than 16TB and ensure your code works. What
> > you have is visually broken and depends on whatever the atoi
> > implementation opts to do in the overflow case. These are all things
> > diligent commiters will be testing and it's sad to see how little
> > effort you're putting into this. How do you expect this community to
> > scale with this quality level of patch submissions? You've been around
> > long enough and should know and do better. Are you just expecting the
> > committer to fix these things for you? That work does not get done via
> > magic wand. Being on v10 already, I'd have expected the patch to be
> > far beyond proof of concept grade. If you're withholding investing
> > time on this until you see more community buy-in, then I'd suggest you
> > write that and withhold further revisions until you're happy with the
> > level of buy-in.
>
> > I'm also still not liking your de-normalised TableInfo representation
> > for "is_segment".
> > IMO, InvalidBlockNumber should be used to represent
> > open bounded ranges, and if there's no chunking, then startPage and
> > endPage will both be InvalidBlockNumber.
>
> That's what I ended up doing
>
> I switched to using startPage = InvalidBlockNumber to indicate that no
> chunking is in effect.
>
> This is safe because when chunking is in use I always try to set both
> chunk end pages, and lower bound I can always set the lower bound.
>
> Only for the last page is the endPage left to InvalidBlockNumber.
>
> > IMO, what you have now
> > needlessly allows invalid states where is_segment == true and
> > startPage, endPage are not set correctly. If you want to keep the code
> > simple, hide the complexity in a macro or an inline function. There's
> > just no performance reason to materialise the more complex condition
> > into a dedicated boolean flag.
| Attachment | Content-Type | Size |
|---|---|---|
| v13-0001-changed-flag-name-to-max-table-segment-pages.patch | application/x-patch | 21.2 KB |
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-28 21:27:30 |
| Message-ID: | CAMT0RQSXhbLPe3Tru-CU+tB4yrzYWx4P6g9R6gya13ckhqVMyA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hi David
About documentation :
On Fri, Jan 23, 2026 at 3:15 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> Aside from that, nothing in the documentation mentions that this is
> for "heap" tables only.
The <note> part did mention it and even gave an example
<varlistentry>
<term><option>--max-table-segment-pages=<replaceable
class="parameter">npages</replaceable></option></term>
<listitem>
<para>
Dump data in segments based on number of pages in the main relation.
If the number of data pages in the relation is more than
<replaceable class="parameter">npages</replaceable>
the data is split into segments based on that number of pages.
Individual segments can be dumped in parallel.
</para>
<note>
<para>
The option <option>--max-table-segment-pages</option> applies
only to pages in the main heap, so if the table has a large
TOASTed part this has to be taken into account when deciding on
the number of pages to use. In the extreme case a single 8kB
heap page can have ~200 TOAST pointers, each corresponding to
1GB of data. If this data is also non-compressible then a
single-page segment can dump as a 200GB file.
</para>
</note>
Would it be a good idea to add a 2nd paragraph like this to make it
even more clear ?
<para>
It is also possible that segments end up with very different
sizes even when no TOAST is involved, as there is no guarantee
that pages are evenly filled in a real production database;
some pages may be bloated to the point of having no live tuples
in them at all.
</para>
---
Hannu
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-01-28 21:33:48 |
| Message-ID: | CAMT0RQQuMNkgJBQhWwWKFX5oqPPCXYFYf=_iTKAFgWJHNOrJig@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
On Wed, Jan 28, 2026 at 10:27 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Hi David
>
> About documentation :
>
> On Fri, Jan 23, 2026 at 3:15 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> >
> > Aside from that, nothing in the documentation mentions that this is
> > for "heap" tables only.
On re-reading I finally understood what you meant - that the
chunking applies only to the standard PostgreSQL TAM called "heap".
Will add that as well. Would something like this work?
"This flag applies only to tables that use the standard PostgreSQL
Table Access Method (TAM) called "heap".
Tables that use any custom TAM are dumped as if
--max-table-segment-pages was not set."
---
Hannu
| From: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-02-03 21:10:12 |
| Message-ID: | CAN4CZFMHr0gFx-DpKBy4SkVzuJC5f3S=pnkrMDC54OswBSuJNQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hello!
I did some testing with this patch, and I think there are some issues
during restoration:
1. Isn't there a possible race / scheduling mistake during restore
because of missing dependencies? The code now prints out "TABLE DATA
(pages %u:%u)", while the restore code checks for the explicit "TABLE
DATA" string for dependency tracking (pg_backup_archiver.c:2013 and a
few other places). This causes POST DATA to have no dependency on the
table data, and can be scheduled before we load all table data.
I was able to verify the scheduling issue with an index: the INDEX
part is scheduled too early, before all TABLE DATA completes, but then
locking prevents it from progressing, so everything completed fine in
the end. Even if that's guaranteed, which I'm not 100% sure of, it's
still based on luck and not proper logic, and takes up a slot (or
multiple), reducing parallelism.
2. Fixing the TABLE DATA strcmp checks solves the scheduling issue,
but it's not that simple, because then it causes truncation issues
during restore, which needs additional changes in the restore code. I
did a quick fix for that by adding an additional condition to the
created flag, and with that it seems to restore everything properly,
and with proper ordering, only starting index/constraint/etc after all
table data is completed. However this was definitely just a quick test
fix, this needs a proper better solution.
Other issues I see are more minor, but numerous:
3. The patch still has lots of debug output (pg_log_WARNING("CHUNKING
...")); Is this intended? Shouldn't these be behind some verbose
check, and maybe use info instead of warning?
4. The is_segment macro should have () around the use of tdiptr
5. There's still a 32000 magic constant, shouldn't that have some
descriptive name / explanatory comment?
6. formatting issues at multiple places, mostly missing spaces after
if/while/for statements
7. inconsistent error messages (not in range vs must be in range)
8. There's a remaining TODO that seems stale, current_chunk_start is
already uint64
9. typo: "the computed from pog_relation_size" -> "then computed from
pg_relation_size"
| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-02-12 06:13:22 |
| Message-ID: | CAFiTN-s50GtFf650TT9Fko+q5rc+xwm+x106ugB3BE7_xgGjPQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
On Wed, Jan 28, 2026 at 11:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> v13 has added a proper test comparing original and restored table data
>
I was reviewing v13 and here are some initial comments I have
1. IMHO the commit message details the work progress instead of giving
a high-level idea of what it actually does and how.
Suggestion:
SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
table dumping.
This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.
The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(start,1)'
AND ctid <= '(end,32000)') to extract specific chunks of the relation.
<more architecture details and limitation if any>
2.
+ pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",
dopt.max_table_segment_pages);
+ break;
IMHO we don't need to place warning here while processing the input parameters
3.
+ printf(_(" --max-table-segment-pages=NUMPAGES\n"
+ " Number of main table pages
above which data is \n"
+ " copied out in chunks, also
determines the chunk size\n"));
Check the comment formatting: all the other parameter descriptions start
with lower case, so better to start with "number" rather than "Number".
4.
+ if (is_segment(tdinfo))
+ {
+ appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+ if(tdinfo->startPage == 0)
+ appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
+ else if(tdinfo->endPage != InvalidBlockNumber)
+ appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
+ tdinfo->startPage, tdinfo->endPage);
+ else
+ appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+ pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+ }
IMHO we should explain this chunking logic in the comment above this code block?
--
Regards,
Dilip Kumar
Google
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-03-28 10:59:18 |
| Message-ID: | CAMT0RQSwbGpf4WSCb41dcyvHUEYAnVTYxdJ6SqLYf7HiUag+TA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hi Zsolt and Dilip,
Thanks for review and useful comments!
On Tue, Feb 3, 2026 at 10:10 PM Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> wrote:
>
> Hello!
>
> I did some testing with this patch, and I think there are some issues
> during restoration:
>
> 1. Isn't there a possible race / scheduling mistake during restore
> because of missing dependencies? The code now prints out "TABLE DATA
> (pages %u:%u)", while the restore code checks for the explicit "TABLE
> DATA" string for dependency tracking (pg_backup_archiver.c:2013 and a
> few other places). This causes POST DATA to have no dependency on the
> table data, and can be scheduled before we load all table data.
I have resolved this by adding a second array to the reverse-dependencies
mechanism in buildTocEntryArrays(): for a chunked dump I collect the arrays
of chunk ids in AH->tableDataChunkIds.
For this I extracted the list management part from DumpableObject
typedef struct _DependencyList
{
DumpId *dependencies; /* dumpIds of objects this one depends on */
int nDeps; /* number of valid dependencies */
int allocDeps; /* allocated size of dependencies[] */
} DependencyList;
And added addStandaloneDependency(), based on addObjectDependency().
I simplified it to always use realloc, as it can handle the NULL case.
void
addStandaloneDependency(DependencyList *dobj, DumpId refId)
{
if (dobj->nDeps >= dobj->allocDeps)
{
dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
dobj->dependencies = pg_realloc_array(dobj->dependencies,
DumpId, dobj->allocDeps);
}
dobj->dependencies[dobj->nDeps++] = refId;
}
And then I use AH->tableDataChunkIds in repoint_table_dependencies() to
- replace the dependency on the table def with a dependency on the first chunk
- add the remaining chunks at the end of the dependency list.
> I was able to verify the scheduling issue with an index: the INDEX
> part is scheduled too early, before all TABLE DATA completes, but then
> locking prevents it from progressing, so everything completed fine in
> the end. Even if that's guaranteed, which I'm not 100% sure of, it's
> still based on luck and not proper logic, and takes up a slot (or
> multiple), reducing parallelism.
>
> 2. Fixing the TABLE DATA strcmp checks solves the scheduling issue,
> but it's not that simple, because then it causes truncation issues
> during restore, which needs additional changes in the restore code. I
> did a quick fix for that by adding an additional condition to the
> created flag, and with that it seems to restore everything properly,
> and with proper ordering, only starting index/constraint/etc after all
> table data is completed. However this was definitely just a quick test
> fix, this needs a proper better solution.
>
> Other issues I see are more minor, but numerous:
I collect the chunk dependencies in a separate array, which
should solve the truncation issue.
Can you advise a good check to add to the TAP tests to verify this?
> 3. The patch still has lots of debug output (pg_log_WARNING("CHUNKING
> ...")); Is this intended? Shouldn't these be behind some verbose
> check, and maybe use info instead of warning?
These were left in to ease initial reviewing. I have now either removed
them or turned them into pg_log_debug().
> 4. The is_segment macro should have () around the use of tdiptr
Thanks, fixed.
> 5. There's still a 32000 magic constant, shouldn't that have some
> descriptive name / explanatory comment?
I turned this into "ctid < '(pagenr+1,0)'" for clarity and
future-proofing, as it is not entirely impossible that we could have
more than 32000 items per page at some point.
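With that change the qual construction could end up looking something like
this (a sketch based on the v13 snippet Dilip quoted, not the actual v14
code):

if (is_segment(tdinfo))
{
    appendPQExpBufferStr(q, tdinfo->filtercond ? " AND " : " WHERE ");
    if (tdinfo->startPage == 0)
        appendPQExpBuffer(q, "ctid < '(%u,0)'", tdinfo->endPage + 1);
    else if (tdinfo->endPage != InvalidBlockNumber)
        appendPQExpBuffer(q, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
                          tdinfo->startPage, tdinfo->endPage + 1);
    else
        appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
}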
> 6. formatting issues at multiple places, mostly missing spaces after
> if/while/for statements
My hope was that the pre-release automatic formatting run takes care of
this.
I will eyeball the patch to see if I find them, but I don't think I have
a good way to detect them all.
Suggestions very much welcome!
> 7. inconsistent error messages (not in range vs must be in range)
> 8. There's a remaining TODO that seems stale, current_chunk_start is
> already uint64
Removed.
> 9. typo: "the computed from pog_relation_size" -> "then computed from
> pg_relation_size"
Fixed.
On Thu, Feb 12, 2026 at 7:13 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Jan 28, 2026 at 11:00 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
> >
> > v13 has added a proper test comparing original and restored table data
> >
> I was reviewing v13 and here are some initial comments I have
>
> 1. IMHO the commit message details about the work progress instead of
> a high level idea about what it actually does and how.
> Suggestion:
>
> SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
> table dumping.
>
> This patch introduces the ability to split large heap tables into segments
> based on a specified number of pages. These segments can then be dumped in
> parallel using the existing jobs infrastructure, significantly reducing
> the time required to dump very large tables.
>
> The implementation uses ctid-based range queries (e.g., WHERE ctid >=
> '(start,1)'
> AND ctid <= '(end,32000)') to extract specific chunks of the relation.
>
> <more architecture details and limitation if any>
SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
table dumping.
This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.
This --max-table-segment-pages number specifically applies to main table
pages which does not guarantee anything about output size.
The output could be empty if there are no live tuples in the page range.
Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid <= '(endPage+1,0)') to extract specific chunks of
the relation.
This is only effectively supported for PostgreSQL versions 14 and above,
though it does work, inefficiently, on earlier versions.
The patch only supports the "heap" access method, as others may not even
have the ctid column.
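To make the chunking itself concrete for reviewers: chunk boundaries are
derived from the main fork's page count (pg_relation_size() divided by the
block size) roughly as follows. This is a simplified sketch rather than the
exact patch code, and make_table_data_chunk() is a hypothetical stand-in for
the real helper:

    BlockNumber relpages = (BlockNumber) (relation_size / BLCKSZ);

    if (relpages > max_table_segment_pages)
    {
        BlockNumber startPage = 0;

        while (startPage < relpages)
        {
            BlockNumber endPage = startPage + max_table_segment_pages - 1;

            /* the last chunk is left open-ended (endPage = InvalidBlockNumber) */
            if (endPage >= relpages - 1)
                endPage = InvalidBlockNumber;

            make_table_data_chunk(tbinfo, startPage, endPage);

            if (endPage == InvalidBlockNumber)
                break;
            startPage = endPage + 1;
        }
    }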
> 2.
> + pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",
> dopt.max_table_segment_pages);
> + break;
>
> IMHO we don't need to place warning here while processing the input parameters
Either removed or turned into pg_log_debug().
> 3.
> + printf(_(" --max-table-segment-pages=NUMPAGES\n"
> +          " Number of main table pages above which data is \n"
> +          " copied out in chunks, also determines the chunk size\n"));
>
> Check the comment formatting, all the parameter description starts
> with lower case, so better we start with "number" rather than "Number"
Fixed
> 4.
> + if (is_segment(tdinfo))
> + {
> + appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
> + if(tdinfo->startPage == 0)
> + appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
> + else if(tdinfo->endPage != InvalidBlockNumber)
> + appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
> + tdinfo->startPage, tdinfo->endPage);
> + else
> + appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
> + pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
> + }
>
> IMHO we should explain this chunking logic in the comment above this code block?
I added the comment.
I also changed the chunk end logic to "ctid < '(LastPage+1,0)'" for clarity and
future-proofing.
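For reference, the clause construction now looks roughly like this (a sketch
of the v14 logic, not verbatim from the patch):

    if (is_segment(tdinfo))
    {
        appendPQExpBufferStr(q, tdinfo->filtercond ? " AND " : " WHERE ");
        if (tdinfo->startPage == 0)
            appendPQExpBuffer(q, "ctid < '(%u,0)'", tdinfo->endPage + 1);
        else if (tdinfo->endPage != InvalidBlockNumber)
            appendPQExpBuffer(q, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
                              tdinfo->startPage, tdinfo->endPage + 1);
        else
            appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
    }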
----
Best Regards
Hannu
| Attachment | Content-Type | Size |
|---|---|---|
| v14-0001-SUBJECT-Add-max-table-segment-pages-option-to-pg.patch | application/x-patch | 27.9 KB |
| From: | Michael Banck <mbanck(at)gmx(dot)net> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Hannu Krosing <hannuk(at)google(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-03-28 11:26:07 |
| Message-ID: | 20260328112607.GA3202@p46.dedyn.io;lightning.p46.dedyn.io |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hi,
On Tue, Jan 13, 2026 at 03:27:25PM +1300, David Rowley wrote:
> Perhaps --max-table-segment-pages is a better name than
> --huge-table-chunk-pages as it's quite subjective what the minimum
> number of pages required to make a table "huge".
I'm not sure that's better - without looking at the documentation,
people might confuse segment here with the 1GB split of tables into
segments. As pg_dump is a very common and basic user tool, I don't think
implementation details like pages/page sizes and blocks should be part
of its UX.
Can't we just make it a storage size, like '10GB' and then rename it to
--table-parallel-threshold or something? I agree it's bikeshedding, but
I personally don't like either --max-table-segment-pages or
--huge-table-chunk-pages.
Michael
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Michael Banck <mbanck(at)gmx(dot)net> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-03-28 15:32:23 |
| Message-ID: | CAMT0RQRtLwi_CrOcD7KxYL0Gm1nGXb-HWmerVg=ajEs6JP7m+w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
The issue is that currently the value is given in "main table pages"
and it would be somewhat deceptive, or at least confusing, to try to
express this in any other unit.
As I explained in the commit message:
---------8<-------------------8<-------------------8<----------------
This --max-table-segment-pages number specifically applies to main table
pages which does not guarantee anything about output size.
The output could be empty if there are no live tuples in the page range.
Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
---------8<-------------------8<-------------------8<----------------
And I can think of no cheap and reliable way to change that equation.
I'll be very happy if you have any good ideas for either improving the
flag name, or even propose a way to better estimate the resulting dump
file size so we could give the chunk size in better units
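(To be clear, accepting a size and converting it to pages would be the easy
part; with the default 8 kB block size it is a simple division, e.g.
(illustrative, --table-chunk-size is a hypothetical spelling):

    /* --table-chunk-size=10GB */
    uint64 chunk_bytes = UINT64CONST(10) * 1024 * 1024 * 1024;
    uint32 chunk_pages = chunk_bytes / BLCKSZ;   /* 1310720 pages at BLCKSZ = 8192 */

The problem is that the number of main-fork pages this buys you still says
very little about the size of the resulting dump file.)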
---
Hannu
On Sat, Mar 28, 2026 at 12:26 PM Michael Banck <mbanck(at)gmx(dot)net> wrote:
>
> Hi,
>
> On Tue, Jan 13, 2026 at 03:27:25PM +1300, David Rowley wrote:
> > Perhaps --max-table-segment-pages is a better name than
> > --huge-table-chunk-pages as it's quite subjective what the minimum
> > number of pages required to make a table "huge".
>
> I'm not sure that's better - without looking at the documentation,
> people might confuse segment here with the 1GB split of tables into
> segments. As pg_dump is a very common and basic user tool, I don't think
> implementation details like pages/page sizes and blocks should be part
> of its UX.
>
> Can't we just make it a storage size, like '10GB' and then rename it to
> --table-parallel-threshold or something? I agree it's bikeshedding, but
> I personally don't like either --max-table-segment-pages or
> --huge-table-chunk-pages.
>
>
> Michael
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Michael Banck <mbanck(at)gmx(dot)net> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-03-28 15:33:59 |
| Message-ID: | CAMT0RQTe4Zr=rdcKMJj-=c7CH0PJh=ZPk=xOU98+M7p9-D+Yew@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
The above
"Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items."
should read
"Or it can be almost 200 GB *for a single page* if the page has just
pointers to 1GB TOAST items."
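(Back-of-the-envelope, for anyone wondering where that number comes from: a
heap row whose only user column is an out-of-line TOAST pointer takes
roughly 45-50 bytes on the page including its line pointer, so one 8 kB page
holds on the order of 170-180 such rows, and each of them can reference up
to 1 GB of toasted data.)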
On Sat, Mar 28, 2026 at 4:32 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> The issue is that currently the value is given in "main table pages"
> and it would be somewhat deceptive, or at least confusing, to try to
> express this in any other unit.
>
> As I explained in the commit message:
>
> ---------8<-------------------8<-------------------8<----------------
> This --max-table-segment-pages number specifically applies to main table
> pages which does not guarantee anything about output size.
> The output could be empty if there are no live tuples in the page range.
> Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
> ---------8<-------------------8<-------------------8<----------------
>
> And I can think of no cheap and reliable way to change that equation.
>
> I'll be very happy if you have any good ideas for either improving the
> flag name, or even propose a way to better estimate the resulting dump
> file size so we could give the chunk size in better units
>
> ---
> Hannu
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Michael Banck <mbanck(at)gmx(dot)net> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-03-29 21:49:38 |
| Message-ID: | CAMT0RQSfEvhZB_3_UZamm+Oi9O7+4+XWi-OOL0NnctyvLZRS0g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Fixing an off-by-one error in copying over dependencies.
| Attachment | Content-Type | Size |
|---|---|---|
| v15-0001-Add-max-table-segment-pages-option-to-pg.patch | application/x-patch | 27.9 KB |
| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, Michael Banck <mbanck(at)gmx(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-03-30 17:32:34 |
| Message-ID: | CAMT0RQT9v8sKiMK+BwMLNENrrBTpxnhTUOwmsvG9-YohAHQTjA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Now the dependencies on chunks should also behave correctly.
| Attachment | Content-Type | Size |
|---|---|---|
| v16-0001-Add-max-table-segment-pages-option-to-pg_dump-fo.patch | application/x-patch | 29.5 KB |
| From: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>, Michael Banck <mbanck(at)gmx(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: Patch: dumping tables data in multiple chunks in pg_dump |
| Date: | 2026-03-30 21:32:22 |
| Message-ID: | CAN4CZFN=iKYAhaki0eC-ADutk-aB3B5jOV26FrTewWouQiD6jQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Lists: | pgsql-hackers |
Hello!
A simple test causes an assertion failure in my testing; dependency
counting still doesn't seem to work correctly:
pg_restore: >...>/pg_backup_archiver.c:5207: reduce_dependencies:
Assertion `otherte->depCount > 0' failed.
Without assertions it results in data loss.
004_pg_dump_parallel also showcases the issue in my testing.
But simple manual testing also confirms it:
1. create some data
CREATE TABLE tplain (id int UNIQUE);
INSERT INTO tplain SELECT x FROM generate_series(1,1000) x;
2. create a dump
dump with --max-table-segment-pages=2
3. try to restore
restore with --jobs=3