Re: Optional skipping of unchanged relations during ANALYZE?

Lists: pgsql-hackers
From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-20 10:24:48
Message-ID: CAE2r8H7+eACA+QxisuPY1jveA=YLuH7CE8F3TedmZfsKdYxOeA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi Hackers,

I’m planning to work on a small improvement around ANALYZE behavior and
wanted to ask the community for guidance before proceeding.

Currently, when ANALYZE is run over many relations, it analyzes all
eligible tables even if some of them have not changed since their last
ANALYZE. In environments with many mostly-static tables, this can lead
to repeated work with little benefit.

I’m considering working on an optional mode where ANALYZE would skip
relations that have not been modified since their last analyze, based
on existing pg_stat counters (for example, mod_since_analyze = 0).

Before moving forward, I’d like to understand:

--whether this aligns with PostgreSQL’s statistics and planner design,

--if there are reasons ANALYZE should always re-run even for unchanged
relations,

--and whether such behavior would be acceptable if it were strictly
opt-in.

Any feedback, concerns, or pointers would be very helpful.

Thanks,
Vasuki M
C-DAC,Chennai


From: Christoph Berg <myon(at)debian(dot)org>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-20 10:46:25
Message-ID: aW9dAW0GiLbMs6L9@msg.df7cb.de
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Re: VASUKI M
> I’m considering working on an optional mode where ANALYZE would skip
> relations that have not been modified since their last analyze, based
> on existing pg_stat counters (for example, mod_since_analyze = 0).

Make sure that doesn't skip tables that were never analyzed before.

Christoph


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Christoph Berg <myon(at)debian(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-20 11:02:46
Message-ID: CAE2r8H7Fs-KyAEXH0j9MKW_5axFAkuhWC0htM0k03-Zw0TV9fQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for pointing that out.

On Tue, Jan 20, 2026 at 4:16 PM Christoph Berg <myon(at)debian(dot)org> wrote:

> Re: VASUKI M
> > I’m considering working on an optional mode where ANALYZE would skip
> > relations that have not been modified since their last analyze, based
> > on existing pg_stat counters (for example, mod_since_analyze = 0).
>
> Make sure that doesn't skip tables that were never analyzed before.
>
>
Yes, the intention is that SMART ANALYZE would not skip relations that have
never been analyzed before.
The skip decision is based on pg_stat entries, so relations without
existing statistics will still be analyzed normally.

I’ll make sure this behavior is clear and covered when I post the patch.

Thanks,
Vasuki M
C-DAC,Chennai


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, myon(at)debian(dot)org
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-20 20:08:23
Message-ID: cc3fde9a-6989-4e9f-8f74-92891c864757@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi everyone,

On 20.01.2026 13:24, VASUKI M wrote:
> I’m planning to work on a small improvement around ANALYZE behavior and
> wanted to ask the community for guidance before proceeding.

Thanks for working on this — it indeed looks like it could reduce the
time spent executing ANALYZE.

>
> Currently, when ANALYZE is run over many relations, it analyzes all
> eligible tables even if some of them have not changed since their last
> ANALYZE. In environments with many mostly-static tables, this can lead
> to repeated work with little benefit.
>
> I’m considering working on an optional mode where ANALYZE would skip
> relations that have not been modified since their last analyze, based
> on existing pg_stat counters (for example, mod_since_analyze = 0).

We should consider n_mod_since_analyze as well.

>
> Before moving forward, I’d like to understand:
>
> --whether this aligns with PostgreSQL’s statistics and planner design,
>
> --if there are reasons ANALYZE should always re-run even for unchanged
> relations,
>
> --and whether such behavior would be acceptable if it were strictly
> opt-in.
>
> Any feedback, concerns, or pointers would be very helpful.

One concern that comes to mind is changes in statistics targets. For
example, statistics may have been collected with
default_statistics_target = 100, and later either
default_statistics_target or a per-column statistics target is increased
(e.g., to 200).

As far as I know, we currently do not track which statistics target was
used when the existing statistics were collected. If someone knows a
reliable way to determine this, please correct me.

If we cannot determine that, we would need to decide whether such
relations should still be skipped

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: Robert Treat <rob(at)xzilla(dot)net>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-20 20:26:52
Message-ID: CABV9wwNYQ6V4-kKzPbJ0fGtgrq_+0z1kLiWQ8yqqDL-hEmphdQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 20, 2026 at 5:24 AM VASUKI M <vasukianand0119(at)gmail(dot)com> wrote:
>
> Hi Hackers,
>
> I’m planning to work on a small improvement around ANALYZE behavior and
> wanted to ask the community for guidance before proceeding.
>
> Currently, when ANALYZE is run over many relations, it analyzes all
> eligible tables even if some of them have not changed since their last
> ANALYZE. In environments with many mostly-static tables, this can lead
> to repeated work with little benefit.
>
> I’m considering working on an optional mode where ANALYZE would skip
> relations that have not been modified since their last analyze, based
> on existing pg_stat counters (for example, mod_since_analyze = 0).
>
> Before moving forward, I’d like to understand:
>
> --whether this aligns with PostgreSQL’s statistics and planner design,
>

I think it makes sense generally, and one could maybe argue that it
should be the default behavior; have you done any research into why it
doesn't behave that way already?

> --if there are reasons ANALYZE should always re-run even for unchanged
> relations,
>

Given ANALYZE does a random sample, on rare occasions it can be
valuable to re-run analyze to get a better sample than whatever
statistics were obtained previously, even in the case the data itself
does not change. I suppose more likely scenarios would be modification
of default_statistics_target either at server or table level
(adding/removing), but the point is there are scenarios where you
might want to rerun it, so we do need to support both behaviors.

> --and whether such behavior would be acceptable if it were strictly
> opt-in.
>

Given my above, it does have to be something that can be turned on or
off, so even if we don't know which is the best default behavior, it
makes sense to start by doing it in a way that is optional.

Robert Treat
https://xzilla.net


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Robert Treat <rob(at)xzilla(dot)net>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, ilya(dot)evdokimov(at)tantorlabs(dot)com, Christoph Berg <myon(at)debian(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 06:29:24
Message-ID: CAE2r8H7jVmsLaWi19ZuxmwR-8UjtN6Ejv7YmtDrPinCYtE8cUA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Thanks a lot for the thoughtful feedback.

The points about statistics targets and random sampling make sense. In
particular, I agree that even without data changes, re-running ANALYZE
can still be beneficial (for example after increasing
default_statistics_target or per-column targets, or simply to obtain a
different sample).

Given that, my intention is to keep this strictly as an opt-in
behavior, so that existing semantics are unchanged unless the user
explicitly requests it. In the current prototype, tables that have
never been analyzed before are not skipped, and SMART only considers
relations that already have statistics.

Regarding statistics targets, since PostgreSQL does not currently track
which target was used to collect existing statistics, SMART ANALYZE
would not attempt to account for target changes. I plan to document
this limitation clearly so users understand the trade-off when opting
into this mode [for now later will look into it].

I’ll take this feedback into account while cleaning up the patch and
documentation, and will follow up with a v1 proposal once ready.

Thanks again for the guidance.

Regards,
Vasuki M
C-DAC,Chennai


From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Christoph Berg <myon(at)debian(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 06:36:03
Message-ID: CAApHDvqULEFpcaaRBnQt-bsCa5-+mAzDPgK3Qd6C_KXVEAFgBw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 21 Jan 2026 at 00:02, VASUKI M <vasukianand0119(at)gmail(dot)com> wrote:
> On Tue, Jan 20, 2026 at 4:16 PM Christoph Berg <myon(at)debian(dot)org> wrote:
>> Make sure that doesn't skip tables that were never analyzed before.
>
> Yes, the intention is that SMART ANALYZE would not skip relations that have never been analyzed before.
> The skip decision is based on pg_stat entries, so relations without existing statistics will still be analyzed normally.

If doing this, you would also need to make special consideration for
partitioned tables, as n_mod_since_analyze won't change for those
directly, but it might have changed for their partitions.

David


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Christoph Berg <myon(at)debian(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 06:43:55
Message-ID: CAE2r8H6b5UTn_0Fkd3fRpN3QwVogxEJKXrDy3uBm_YfQPCiTsw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi David,

Thanks for calling this out — yes, I agree this is an important case.

Partitioned tables are already something I’m considering separately, since
the parent’s n_mod_since_analyze does not reflect changes made in the
partitions. The intention is not to skip analysis of partitions just because
the partitioned parent itself shows no modifications.

For now, my approach is deliberately limited to using the statistics that
are
already available via pg_stat and making skip decisions only where those
statistics are meaningful and reliable.

That also means that for the initial version, I’m not trying to introduce
special handling for cases like foreign tables or system catalogs beyond
what
existing statistics already provide. Where statistics are missing, unclear,
or potentially misleading, the conservative behavior would be to fall back
to running ANALYZE as usual.

Thanks again for the feedback.

Regards,
Vasuki M
C-DAC,Chennai.

On Wed, Jan 21, 2026 at 12:06 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Wed, 21 Jan 2026 at 00:02, VASUKI M <vasukianand0119(at)gmail(dot)com> wrote:
> > On Tue, Jan 20, 2026 at 4:16 PM Christoph Berg <myon(at)debian(dot)org> wrote:
> >> Make sure that doesn't skip tables that were never analyzed before.
> >
> > Yes, the intention is that SMART ANALYZE would not skip relations that
> have never been analyzed before.
> > The skip decision is based on pg_stat entries, so relations without
> existing statistics will still be analyzed normally.
>
> If doing this, you would also need to make special consideration for
> partitioned tables, as n_mod_since_analyze won't change for those
> directly, but it might have changed for their partitions.
>
> David
>


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Christoph Berg <myon(at)debian(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 07:47:33
Message-ID: c57f96fc-e3de-4690-a4c7-bda6af16aa58@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Another concern with skipping ANALYZE on unchanged tables is extended
statistics.

If CREATE/ALTER STATISTICS is executed, it would still be desirable for
ANALYZE to collect the newly statistics and extended ones, even if the
table data itself has not changed.

What do you think about this?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Christoph Berg <myon(at)debian(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 09:37:01
Message-ID: CAE2r8H5Gsfs4REdo=Phj2GUv18Q5fsy-CBq0jbXyjrK1V-P5ag@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi Hackers,

As discussed in the recent thread, I’m sharing an initial v1 patch
introducing an opt-in SMART mode for ANALYZE.

When ANALYZE (SMART) is specified, relations that have not been modified
since their last analyze are skipped, based on existing pg_stat counters
(n_mod_since_analyze = 0). Relations without existing statistics are
still analyzed normally. The default ANALYZE behavior remains unchanged.

The primary goal of this patch is to reduce unnecessary work when
running ANALYZE over many mostly-static tables, while keeping the
behavior strictly opt-in.

Scope of this v1 patch:
- Uses existing pg_stat statistics only
- Does not skip relations that were never analyzed before
- Includes regression tests demonstrating that only modified tables are
re-analyzed
- Partitioned tables, inheritance, foreign tables,extended statistics and
other edge cases are intentionally not handled yet; I plan to look into
those in follow-up work based on feedback

Example usage / how to observe behavior:

SET client_min_messages = debug1;

ANALYZE (SMART);
ANALYZE (SMART, VERBOSE);

ANALYZE (SMART) table1;
ANALYZE (SMART) table1, table2;
VACUUM(SMART);

Thanks for your time and review.

Best regards,
Vasuki M
C-DAC,Chennai

Attachment Content-Type Size
v1-0001-ANALYZE-add-optional-smart-mode-to-skip-unchanged-relations.patch text/x-patch 8.3 KB

From: Christoph Berg <myon(at)debian(dot)org>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 09:44:45
Message-ID: aXCgDaz9y2ssA84N@msg.df7cb.de
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Re: VASUKI M
> VACUUM(SMART);

IMHO it was a historical mistake to combine VACUUM and ANALYZE into a
single command. We should not add any more options on that
combination. If people want to pass options to ANALYZE, they should
call ANALYZE and not VACUUM.

Christoph


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Christoph Berg <myon(at)debian(dot)org>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 09:47:44
Message-ID: CAE2r8H5SpHmfHQXmv9QMfKAMdw=F3fq0MLbLvGxH670GhwQt0w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Sorry,forgot to mention that it will produce error as 'SMART requires
ANALYZE'

It's just a checking command :)

-Vasu

On Wed, Jan 21, 2026 at 3:14 PM Christoph Berg <myon(at)debian(dot)org> wrote:

> Re: VASUKI M
> > VACUUM(SMART);
>
> IMHO it was a historical mistake to combine VACUUM and ANALYZE into a
> single command. We should not add any more options on that
> combination. If people want to pass options to ANALYZE, they should
> call ANALYZE and not VACUUM.
>
> Christoph
>


From: Christoph Berg <myon(at)debian(dot)org>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 09:50:59
Message-ID: aXChg6ODQngoz9-j@msg.df7cb.de
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Re: VASUKI M
> Sorry,forgot to mention that it will produce error as 'SMART requires
> ANALYZE'

SMART is also a terribly non-descriptive name. How about CHANGED_ONLY?

Christoph


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Christoph Berg <myon(at)debian(dot)org>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 09:56:31
Message-ID: CAE2r8H5TZ7txx95PcutDn_LEP2zcim8P6Wctpzdc+ZgO6JSqdw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 21, 2026 at 3:21 PM Christoph Berg <myon(at)debian(dot)org> wrote:

> SMART is also a terribly non-descriptive name. How about CHANGED_ONLY?
>

Yeah i agree,as of now i am focusing on concept workflow will change name
in next versions of patch.

Regards,
Vasuki M
C-DAC,Chennai.


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-21 10:49:18
Message-ID: 0ab28b49-29ca-4b71-82b0-dd45c9408f5f@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 21.01.2026 12:56, VASUKI M wrote:

> On Wed, Jan 21, 2026 at 3:21 PM Christoph Berg <myon(at)debian(dot)org> wrote:
>
> SMART is also a terribly non-descriptive name. How about CHANGED_ONLY?
>
>
>  Yeah i agree,as of now i am focusing on concept workflow will change
> name in next versions of patch.
>
> Regards,
> Vasuki M
> C-DAC,Chennai.

So do I

It seems to me that the condition for relations that have never had
statistics collected might be incorrect. If I'm reading this correctly,
shouldn't this be checking 'tabstat->mod_since_analyze > 0' instead of
'tabstat->mod_since_analyze == 0'? I tested it on simple query:

CREATE TABLE t (i INT, j INT);
INSERT INTO t SELECT i/10, i/100 FROM generate_series(1, 1000000) i;
ANALYZE (SMART) t;
SELECT COUNT(*) FROM pg_stats WHERE tablename = 't';
 count
-------
     0
(1 row)

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-22 09:19:00
Message-ID: CAE2r8H7BwgYX61eiz4XrkWThxNhKKbp9N3Vty9HCrT3JTaVmzA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi llia,

On Wed, Jan 21, 2026 at 4:19 PM Ilia Evdokimov <
ilya(dot)evdokimov(at)tantorlabs(dot)com> wrote:

> On 21.01.2026 12:56, VASUKI M wrote:
>
> On Wed, Jan 21, 2026 at 3:21 PM Christoph Berg <myon(at)debian(dot)org> wrote:
>
>> SMART is also a terribly non-descriptive name. How about CHANGED_ONLY?
>>
>
> Yeah i agree,as of now i am focusing on concept workflow will change name
> in next versions of patch.
>
> Regards,
> Vasuki M
> C-DAC,Chennai.
>
> So do I
>
>
> It seems to me that the condition for relations that have never had
> statistics collected might be incorrect. If I'm reading this correctly,
> shouldn't this be checking 'tabstat->mod_since_analyze > 0' instead of
> 'tabstat->mod_since_analyze == 0'? I tested it on simple query:
>
> CREATE TABLE t (i INT, j INT);
> INSERT INTO t SELECT i/10, i/100 FROM generate_series(1, 1000000) i;
> ANALYZE (SMART) t;
> SELECT COUNT(*) FROM pg_stats WHERE tablename = 't';
> count
> -------
> 0
> (1 row)
>

This passes now :)

As discussed in the recent thread, I am sharing a revised v2 patch that
introduces an optional SMART mode for ANALYZE.

When ANALYZE (SMART) is specified, relations are skipped if:
- they have been analyzed before (either manually or via autovacuum),
and
- they have not been modified since their last analyze
(n_mod_since_analyze = 0, based on pg_stat statistics).

Relations that have never been analyzed before are always analyzed
normally. The default ANALYZE behavior remains unchanged unless SMART
is explicitly requested.

The motivation is to reduce unnecessary ANALYZE work in databases with
a large number of mostly-static tables, while keeping the behavior
strictly opt-in.

Changes and clarifications in v2:
- Tables that have never been analyzed are never skipped
(checked via last_analyze_time / last_autoanalyze_time)
- Skip decisions rely only on pg_stat_user_tables counters
- The skip condition is n_mod_since_analyze == 0
- Regression tests are added to demonstrate:
-->SMART ANALYZE does not skip never-analyzed tables
-->Only modified tables are re-analyzed

This patch intentionally limits its scope to regular relations and
existing pg_stat statistics only. Partitioned tables, inheritance,
foreign tables, extended statistics, and statistics target changes are
not handled yet and can be considered in follow-up work based on
feedback.

The patch applies cleanly on current master and passes:
make distclean
./configure
make -j$(nproc)
make install
make check

See this:

analyze_test=# create table sa6 (id int);
CREATE TABLE
Time: 3.917 ms
analyze_test=# analyze(smart) sa6;
DEBUG: ANALYZE processing relation "sa6" (OID 131324)
ANALYZE
Time: 0.585 ms
analyze_test=# SELECT count(*) > 0 AS stats_created
FROM pg_stats
WHERE tablename = 'sa6';
stats_created
---------------
f
(1 row)

Time: 0.894 ms
analyze_test=# SELECT relname,
last_analyze,
n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'sa6';
relname | last_analyze | n_mod_since_analyze
---------+----------------------------------+---------------------
sa6 | 2026-01-22 10:35:23.005045+05:30 | 0
(1 row)

The empty table doesn't have any stats to show as pg_stat is column level
statistics;
these are created when rows exists ,it has 0 rows to make samples,most
common used values,etc,..so no data distribution

But when value is inserted ,

analyze_test=# CREATE TABLE sa4 (i int);
CREATE TABLE
Time: 10.290 ms
analyze_test=# INSERT INTO sa4 SELECT generate_series(1,10);
INSERT 0 10
Time: 45.373 ms
analyze_test=# analyze(smart) sa4;
DEBUG: ANALYZE processing relation "sa4" (OID 131310)
ANALYZE
Time: 47.771 ms
analyze_test=# SELECT count(*) > 0 AS stats_created
FROM pg_stats
WHERE tablename = 'sa4';
stats_created
---------------
t
(1 row)

Time: 0.945 ms

I would appreciate feedback on the overall approach.

Thanks for your time and review.

--
Best regards,
Vasuki M
C-DAC,Chennai

Attachment Content-Type Size
v2-0001-ANALYZE-Introduce-an-opt-in-SMART-option.patch text/x-patch 9.7 KB

From: Robert Treat <rob(at)xzilla(dot)net>
To: Christoph Berg <myon(at)debian(dot)org>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-22 19:42:52
Message-ID: CAJSLCQ2ZTLng_4vF9O74oxaOg_cAxD6=ixLMoW_+5CxCkNtXrA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 21, 2026 at 4:44 AM Christoph Berg <myon(at)debian(dot)org> wrote:
>
> Re: VASUKI M
> > VACUUM(SMART);
>
> IMHO it was a historical mistake to combine VACUUM and ANALYZE into a
> single command. We should not add any more options on that
> combination. If people want to pass options to ANALYZE, they should
> call ANALYZE and not VACUUM.
>

I don't know if I go that far, but if you are saying that you dont
think "smart analyze" should be an option for vacuum runs, I can get
onboard with that. We don't really know what any given vacuum run will
do with regards to the table, but if we are shuffling data / storage
around, it probably makes sense to update statistics info along the
way.

Robert Treat
https://xzilla.net


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-22 20:18:14
Message-ID: CAA5RZ0s-akwQTy5bBTVBgSHB9Sy-6jS7eFZXqLcTbvL13pb8eA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

> I would appreciate feedback on the overall approach.

I did not read through the patch in detail but by looking at the commit
message:

"A relation is skipped only if:
- it has been analyzed before (manual or auto-analyze), and
- n_mod_since_analyze == 0

Relations that have never been analyzed are always analyzed normally.
The default ANALYZE behavior is unchanged unless SMART is explicitly
specified.
"

I can't help but think that this SMART option is not as smart as it
should be to actually
be valuable.

I agree that we should never skip a table that has never been
analyzed. My concern
is that n_mod_since_analyze == 0 is not very useful. What if I modify
1 tuple? does
that really justify an ANALYZE to run on the table? Shouldn't the
decision be driven based
on some threshold calculation; similar to how autoanalyze makes the decision?

--
Sami Imseih
Amazon Web Services (AWS)


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>, VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-22 21:22:04
Message-ID: 6b5d7f83-7ac8-47b1-ab7f-0040b65ad02a@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

I spent some more time thinking about this new option.

On 22.01.2026 23:18, Sami Imseih wrote:
> I can't help but think that this SMART option is not as smart as it
> should be to actually
> be valuable.
>
> I agree that we should never skip a table that has never been
> analyzed. My concern
> is that n_mod_since_analyze == 0 is not very useful.

IMO, for the purpose of ensuring that we never skip relations that have
never been analyzed, checking last_analyze / last_autoanalyze being NULL
seems sufficient and reliable.

> What if I modify
> 1 tuple? does
> that really justify an ANALYZE to run on the table? Shouldn't the
> decision be driven based
> on some threshold calculation; similar to how autoanalyze makes the decision?

The primary purpose of ANALYZE is to allow users to explicitly
rebuildstatistics when they believe it is necessary. When a user
specifiesparticular tables or columns (e.g., ANALYZE table; or ANALYZE
table(i, j); ), I would not expect them to use this newoption - in that
case, the intent is usually to force statistics to berecollected.

However, the situation looks different when ANALYZE is run across
theentire database (i.e., plain ANALYZE;). In that context, havingan
option to skip relations that are known not to have changed sincetheir
last analyze seems useful, as it avoids doing work that is
clearlyunnecessary. That said, I think we still need to be precise about
what exactly "relations that have not changed" means in this context, in
order to understand where statistics would and would not be rebuilt. In
particular, relying solely on n_mod_since_analyze == 0 does not seem
sufficient, as we have already discussed several cases where ANALYZE may
still be required even without direct data modifications (e.g.
partitioned tables, inheritance, foreign tables, extended statistics, etc.)

About thresholds: I’m not convinced they make much sense for
manualANALYZE. autovacuum already exists to decide when statistics need
tobe refreshed based on thresholds, and if those conditions are met,
itwill run automatically. I’m not sure there is much value in
duplicatingthat logic for explicit ANALYZE commands.

What do you think?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-22 22:59:39
Message-ID: CAA5RZ0tbcydhZ5PwqLMXpMRKuWkwZtvxAiiq6=UH-j2Z7boC2w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> I can't help but think that this SMART option is not as smart as it
> should be to actually
> be valuable.
>
> I agree that we should never skip a table that has never been
> analyzed. My concern
> is that n_mod_since_analyze == 0 is not very useful.
>
> IMO, for the purpose of ensuring that we never skip relations that have never been analyzed,
> checking last_analyze / last_autoanalyze being NULL seems sufficient and reliable.

edba754f052 introduced --missing-stats-only for vacuumdb. Although
this was intended
for pg_upgrade, it does note in the commit message that "it might be
useful in other situations"
Perhaps, this is one of the situations.

So, instead of a smart mode, maybe we should be thinking about an
ANALYZE (missing_stats_only) option that follows what is done in
vacuumdb; and will skip tables that don't need to be analyzed.
Ultimately vacuumdb can just use this option.

The criteria for tables missing stats is more comprehensive than a simple
last_analyze / last_autoanalyze being NULL.

A followup commit 984d7165dd also mentions:

"
For v19, perhaps we could introduce a simple, inexpensive way to
discover which relations are missing statistics, such as a system
function or view with similar privilege requirements to ANALYZE.
Unfortunately, it is far too late for anything like that in v18.
"

What do you think?

--
Sami Imseih
Amazon Web Services (AWS)


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-23 06:33:27
Message-ID: CAE2r8H4+SoMrCXZx987em2VW5tR=N_0xtj28B8R6dzuLon=bzQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Thanks a lot for the detailed feedback — this has been very
helpful.Answering to all mails in one.

A few clarifications on intent and scope, and how this relates to the
points raised:

Autovacuum overlap
I agree there is some conceptual overlap with autovacuum’s analyze decision
logic. The intent here is not to replace or duplicate autovacuum
heuristics, but to reduce clearly redundant work during explicit ANALYZE
runs (especially plain ANALYZE; across the whole database). Autovacuum
already handles threshold-based decisions well; this option is meant to be
a lightweight, explicit opt-in for manual ANALYZE usage.

Thresholds vs n_mod_since_analyze
I agree that n_mod_since_analyze == 0 is a very simple condition and not
“smart” in the general sense. That is intentional for now. This option is
not trying to answer when statistics should be refreshed optimally, but
only to skip relations that are known to be unchanged since the last
analyze. If even a single tuple is modified, SMART ANALYZE will still
re-run, preserving conservative behavior.

Tables never analyzed
As Christoph and Ilia pointed out earlier, skipping tables that were never
analyzed would be incorrect. The current logic explicitly avoids that by
requiring last_analyze or last_autoanalyze to be present before skipping.
Tables without prior statistics are always analyzed.

Relation to vacuumdb --missing-stats-only
I agree this is related but slightly different in intent.
--missing-stats-only answers “does this table have any statistics at all?”,
while SMART ANALYZE answers “has this table changed since the last
statistics collection?”. Both seem useful, but they target different use
cases. I see SMART ANALYZE primarily as a performance optimization for
repeated manual ANALYZE runs on mostly-static schemas.

Extended statistics / partitions / inheritance
These are valid concerns. The current patch intentionally does not attempt
to handle extended statistics, partitioned tables, inheritance, foreign
tables, etc. I wanted to start with a minimal, explicit, and conservative
behavior for regular relations only. I agree these areas need careful
consideration before extending the logic further, and I plan to look into
them based on feedback.

VACUUM vs ANALYZE
I also agree with the concern about adding more options to VACUUM. The
current patch focuses on ANALYZE usage; I’m not proposing this as a VACUUM
option.

NAMING
Although as sami said this SMART is not smart enough as it should be , I
will change name accordingly in the further patches based on urs and
others opinion once it is decided.
Based on feedback, I’m happy to revise direction, naming, or scope before
taking this further.

Thanks again for the thoughtful discussion — really appreciate the guidance.

Best regards,
Vasuki M
C-DAC,Chennai.


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>, Sami Imseih <samimseih(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-23 10:44:18
Message-ID: 65da80c0-52fb-454e-b29e-b1d5a254ec38@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers


On 23.01.2026 09:33, VASUKI M wrote:
> Relation to vacuumdb --missing-stats-only
> I agree this is related but slightly different in intent.
> --missing-stats-only answers “does this table have any statistics at
> all?”, while SMART ANALYZE answers “has this table changed since the
> last statistics collection?”. Both seem useful, but they target
> different use cases. I see SMART ANALYZE primarily as a performance
> optimization for repeated manual ANALYZE runs on mostly-static schemas.

LGTM. Thanks to Sami for pointing this out.

It seems reasonable to start by introducing an option for plain ANALYZE
(without specifying tables or columns) that follows the same idea as
vacuumdb --missing-stats-only. While this flag was originally introduced
primarily to support pg_upgrade workflows, exposing similar
functionality at the ANALYZE level also seems useful on its own. That
would give us a clear and well-defined first step. At the SQL level, a
name such as ANALYZE (MISSING_STATS_ONLY) would be a good fit and remain
consistent with the vacuumdb option.

Thoughts?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-23 20:31:26
Message-ID: CAA5RZ0tYuhHapyVBTw8tVfrKp6fyS5YBTVdQhYGOcWFg-ERyFA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for the detailed summary!

It is important to point out that this feature is trying to do 2 distinct
things in 1 command. run analyze under when either one of these conditions
is true:

1/ Table has not been analyzed yet.
2/ Table has been modified.

> Thanks a lot for the detailed feedback — this has been very helpful.Answering to all mails in one.
>
> A few clarifications on intent and scope, and how this relates to the points raised:
>
> Autovacuum overlap
> I agree there is some conceptual overlap with autovacuum’s analyze decision logic.
> The intent here is not to replace or duplicate autovacuum heuristics, but to reduce

Yes, I agree with this.

> I agree that n_mod_since_analyze == 0 is a very simple condition
> and not “smart” in the general sense. That is intentional for now.
> This option is not trying to answer when statistics should be refreshed optimally,
> but only to skip relations that are known to be unchanged since the last analyze.
> If even a single tuple is modified, SMART ANALYZE will still re-run, preserving
> conservative behavior.

Yes, this is my concern. Why would I want to analyze if 1 row or a negligible
amount of rows are modified? I understand that this feature is trying to
keep the decision making very simple, but I think it's too simple to actually
be helpful in addressing the wasted effort of an ANALYZE command.

> Tables never analyzed
> As Christoph and Ilia pointed out earlier, skipping tables that were never analyzed would be incorrect.
> The current logic explicitly avoids that by requiring last_analyze or last_autoanalyze to be present
> before skipping. Tables without prior statistics are always analyzed.

I agree with this, but I think it's more than just tables that have
not been analyzed.
What if a new column is added after the last (auto)analyze. Would we not want to
trigger an analyze in that case?

> Relation to vacuumdb --missing-stats-only
> I agree this is related but slightly different in intent. --missing-stats-only
> answers “does this table have any statistics at all?”, while SMART ANALYZE
> answers “has this table changed since the last statistics collection?”. Both seem
> useful, but they target different use cases. I see SMART ANALYZE primarily
> as a performance optimization for repeated manual ANALYZE runs on mostly-static schemas.

SMART ANALYZE is trying to answer 2 questions "which table does not
have any statistics at all"
and "has this table changed since the last statistics collection?”, right?

So, maybe they need to be 2 separate options.

> Although as sami said this SMART is not smart enough as it should be ,
> I will change name accordingly in the further patches

Yup, I am not too fond of SMART in the name. Also, then name itself
is vague. SKIP_LOCKED and BUFFER_USAGE_LIMIT on the other
hand tell you exactly what they[re used for.

--
Sami Imseih
Amazon Web Services (AWS)


From: Robert Treat <rob(at)xzilla(dot)net>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-24 21:36:38
Message-ID: CAJSLCQ3Z9cM2eZNa4aOnLmLyiZmSDrZH2xQm1RfT4PdKWo0ZLg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 23, 2026 at 9:31 PM Sami Imseih <samimseih(at)gmail(dot)com> wrote:
>
> Thanks for the detailed summary!
>
> It is important to point out that this feature is trying to do 2 distinct
> things in 1 command. run analyze under when either one of these conditions
> is true:
>
> 1/ Table has not been analyzed yet.
> 2/ Table has been modified.
>

Maybe this is all an aside, but I don't think that was the vision for
what the OP was trying to do with his patch, in that sense he was
approaching it from a different angle, and I've been reading this
thread trying to decide if people are just talking past each other.
But after thinking about it some more, I think the above might be the
more useful mental model for the discussion.

> > Thanks a lot for the detailed feedback — this has been very helpful.Answering to all mails in one.
> >
> > A few clarifications on intent and scope, and how this relates to the points raised:
> >
> > Autovacuum overlap
> > I agree there is some conceptual overlap with autovacuum’s analyze decision logic.
> > The intent here is not to replace or duplicate autovacuum heuristics, but to reduce
>
> Yes, I agree with this.
>
> > I agree that n_mod_since_analyze == 0 is a very simple condition
> > and not “smart” in the general sense. That is intentional for now.
> > This option is not trying to answer when statistics should be refreshed optimally,
> > but only to skip relations that are known to be unchanged since the last analyze.
> > If even a single tuple is modified, SMART ANALYZE will still re-run, preserving
> > conservative behavior.
>
> Yes, this is my concern. Why would I want to analyze if 1 row or a negligible
> amount of rows are modified? I understand that this feature is trying to
> keep the decision making very simple, but I think it's too simple to actually
> be helpful in addressing the wasted effort of an ANALYZE command.
>
> > Tables never analyzed
> > As Christoph and Ilia pointed out earlier, skipping tables that were never analyzed would be incorrect.
> > The current logic explicitly avoids that by requiring last_analyze or last_autoanalyze to be present
> > before skipping. Tables without prior statistics are always analyzed.
>
> I agree with this, but I think it's more than just tables that have
> not been analyzed.
> What if a new column is added after the last (auto)analyze. Would we not want to
> trigger an analyze in that case?
>

Well, I don't know that we are "triggering" anything, but this is
definitely a case where we have "missing stats".

> > Relation to vacuumdb --missing-stats-only
> > I agree this is related but slightly different in intent. --missing-stats-only
> > answers “does this table have any statistics at all?”, while SMART ANALYZE
> > answers “has this table changed since the last statistics collection?”. Both seem
> > useful, but they target different use cases. I see SMART ANALYZE primarily
> > as a performance optimization for repeated manual ANALYZE runs on mostly-static schemas.
>
> SMART ANALYZE is trying to answer 2 questions "which table does not
> have any statistics at all"
> and "has this table changed since the last statistics collection?”, right?
>
> So, maybe they need to be 2 separate options.
>
> > Although as sami said this SMART is not smart enough as it should be ,
> > I will change name accordingly in the further patches
>
> Yup, I am not too fond of SMART in the name. Also, then name itself
> is vague. SKIP_LOCKED and BUFFER_USAGE_LIMIT on the other
> hand tell you exactly what they[re used for.
>

So, tossing out a new proposal here, which is to offer ANALYZE with 2
new options... MISSING_STATS and MODIFIED_STATS.

When MISSING_STATS is passed, we attempt to analyze only tables that
have missing stats, essentially implementing a version of
--missing-stats-only but for the ANALYZE command. In successive runs,
this should reduce towards a no-op, although we need to decide what to
do about system tables, which, iirc --missing-stats-only always
assumes to be true, but this version probably doesn't want to assume
that.

When MODIFIED_STATS is passed, we would instead only analyze tables
where some threshold of rows has been modified. I feel like the most
obvious choice for this calculation would be based on a formula like
"analyze threshold = analyze base threshold + analyze scale factor *
number of tuples". Astute observers will note that this is the same
threshold used by autoanalyze, which means if you had the same
defaults you are just doing the work manually that autoanalyze would
eventually get around to doing (which seems potentially useful on its
own). But also if these were based on gucs, the OP could modify those
gucs to achieve their desired behavior, ie.
set analyze_base_threshold=1; set analyze_scale_factor=0; analyze
(modified_stats); // this should analyze anything with 1 modified row
Granted, I don't like that it is both more wordy than the original
idea, and that we would need to add new gucs, but this would be pretty
flexible.

Robert Treat
https://xzilla.net


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Robert Treat <rob(at)xzilla(dot)net>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-25 16:52:09
Message-ID: CAA5RZ0uBRmE5k6Q=8PkrttDVwgDh9r=fN0TTHsU48sb_E7KfMw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> > > Relation to vacuumdb --missing-stats-only
> > > I agree this is related but slightly different in intent. --missing-stats-only
> > > answers “does this table have any statistics at all?”, while SMART ANALYZE
> > > answers “has this table changed since the last statistics collection?”. Both seem
> > > useful, but they target different use cases. I see SMART ANALYZE primarily
> > > as a performance optimization for repeated manual ANALYZE runs on mostly-static schemas.
> >
> > SMART ANALYZE is trying to answer 2 questions "which table does not
> > have any statistics at all"
> > and "has this table changed since the last statistics collection?”, right?
> >
> > So, maybe they need to be 2 separate options.
> >
> > > Although as sami said this SMART is not smart enough as it should be ,
> > > I will change name accordingly in the further patches
> >
> > Yup, I am not too fond of SMART in the name. Also, then name itself
> > is vague. SKIP_LOCKED and BUFFER_USAGE_LIMIT on the other
> > hand tell you exactly what they[re used for.
> >
>
> So, tossing out a new proposal here, which is to offer ANALYZE with 2
> new options... MISSING_STATS and MODIFIED_STATS.

Yes, that is what I am thinking as well.

> When MISSING_STATS is passed, we attempt to analyze only tables that
> have missing stats, essentially implementing a version of
> --missing-stats-only but for the ANALYZE command. In successive runs,
> this should reduce towards a no-op, although we need to decide what to
> do about system tables, which, iirc --missing-stats-only always
> assumes to be true, but this version probably doesn't want to assume
> that.

From a quick test, I don't see system tables being treated different

```
postgres=# select max(last_analyze), max(last_autoanalyze) from
pg_stat_all_tables where relname = 'pg_class';
max | max
-----+-----
|
(1 row)

postgres=# delete from pg_statistic;
DELETE 417
postgres=# \! vacuumdb --analyze-only --missing-stats-only postgres
vacuumdb: vacuuming database "postgres"
postgres=# select max(last_analyze), max(last_autoanalyze) from
pg_stat_all_tables where relname = 'pg_class';
max | max
-------------------------------+-----
2026-01-25 16:31:42.839329+00 |
(1 row)

postgres=# \! vacuumdb --analyze-only --missing-stats-only postgres
vacuumdb: vacuuming database "postgres"
postgres=#
postgres=# select max(last_analyze), max(last_autoanalyze) from
pg_stat_all_tables where relname = 'pg_class';
max | max
-------------------------------+-----
2026-01-25 16:31:42.839329+00 |
(1 row)
```

Tables that remain empty, will always be analyzed since they will
always have "missing stats".
For example pg_sequence, if a sequence is never created. The same
applies for normal
user tables.

> When MODIFIED_STATS is passed, we would instead only analyze tables
> where some threshold of rows has been modified. I feel like the most
> obvious choice for this calculation would be based on a formula like
> "analyze threshold = analyze base threshold + analyze scale factor *
> number of tuples". Astute observers will note that this is the same
> threshold used by autoanalyze, which means if you had the same
> defaults you are just doing the work manually that autoanalyze would
> eventually get around to doing (which seems potentially useful on its
> own).

Yes, we would want to use the same calculation as autoanalyze.

> But also if these were based on gucs, the OP could modify those
> gucs to achieve their desired behavior, ie.
> set analyze_base_threshold=1; set analyze_scale_factor=0; analyze
> (modified_stats); // this should analyze anything with 1 modified row
> Granted, I don't like that it is both more wordy than the original
> idea, and that we would need to add new gucs, but this would be pretty
> flexible.

We can either allow the threshold and scale_factor be an argument to
the option; but I do really think the GUC approach is much better.

Not in scope, but I can even see vacuum_threshold and vacuum_scale_factor
to allow us to control VACUUM the same way.

Overall, this becomes very handy for scripting of manual ANALYZE
and VACUUM.

--
Sami Imseih
Amazon Web Services (AWS)


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Robert Treat <rob(at)xzilla(dot)net>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-25 17:23:39
Message-ID: CAA5RZ0t=fGScwGL9=_HJzXf3808-U4zqne+qAob6R2PYcj-9YA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

>> I agree with this, but I think it's more than just tables that have
>> not been analyzed.
>> What if a new column is added after the last (auto)analyze. Would we not want to
>> trigger an analyze in that case?
>>

> Well, I don't know that we are "triggering" anything, but this is
> definitely a case where we have "missing stats".

> When MISSING_STATS is passed, we attempt to analyze only tables that
> have missing stats, essentially implementing a version of
> --missing-stats-only

I also want to do add that the benefit of implementing a --missing-stats
fo ANALYZE is that the timestamps in pg_stat_all_tables are cleared on
crash recovery, but pg_statistic is obviously persistent. So it is
better to look directly there as --missing-stats-only does for vacuumdb.

unfortunately, this is not the case for n_mod_since_analyze, because that
does not survive. There is discussion about improving this situation
however [0].

[0] https://www.postgresql.org/message-id/20240607033806.6gwgolihss72cj6r@awork3.anarazel.de

--
Sami Imseih
Amazon Web Services (AWS)


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-27 05:11:09
Message-ID: CAE2r8H5+kdis5JZxte_i7f1X8AVyUedh_PW1Kc5iOHaFV9=7Qw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

>Maybe this is all an aside, but I don't think that was the vision for
>what the OP was trying to do with his patch, in that sense he was
>approaching it from a different angle, and I've been reading this
>thread trying to decide if people are just talking past each other.
>But after thinking about it some more, I think the above might be the
>more useful mental model for the discussion.

Hi Robert,

Thanks for taking the time to step back and summarize the discussion —
that’s very helpful.

I agree that part of the thread drifted toward broader questions about
when ANALYZE should run optimally, whereas my original
intent was narrower: providing a simple, explicit way to skip work that is
known to be unnecessary when running ANALYZE
across many relations. I appreciate you calling out that distinction.

Also, just a small note: I’m a she, not a he :)

coming to other mails:

Thanks everyone, this discussion has been extremely helpful.

I agree with the framing that has emerged here: there are really two
separate questions involved:

1) Which relations are missing statistics entirely?
2) Which relations have statistics, but may need them refreshed due to
modifications?

My original SMART ANALYZE prototype was trying to approximate both with
a very simple rule, but I agree that this makes the option vague and
harder to reason about, especially once cases like new columns, crash
recovery, extended statistics, and persistence are considered.

The idea of introducing explicit options such as ANALYZE (MISSING_STATS)
and ANALYZE (MODIFIED_STATS) feels like a much cleaner direction.
In particular, starting with MISSING_STATS as a SQL-level equivalent of
vacuumdb --missing-stats-only seems like a well-scoped and low-risk
first step.

I also agree that relying solely on pg_stat counters (e.g.
n_mod_since_analyze) has limitations due to their non-persistence across
crashes, which further supports handling “missing stats” separately
via catalog inspection.

I’m happy to pivot in this direction and focus first on a clear,
well-defined MISSING_STATS option for ANALYZE, and then revisit
MODIFIED_STATS (possibly reusing autoanalyze-style thresholds) as a
follow-up, once there is agreement on the semantics.

Thanks again for the thoughtful feedback — it’s been very educational.

Best regards,
Vasuki M
C-DAC,Chennai.


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-27 18:16:41
Message-ID: CAA5RZ0s0y+tEQHiV4VWp+yopdX9hwyuS8GopHsP9YmK0peLFOA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> The idea of introducing explicit options such as ANALYZE (MISSING_STATS)
> and ANALYZE (MODIFIED_STATS) feels like a much cleaner direction.
> In particular, starting with MISSING_STATS as a SQL-level equivalent of
> vacuumdb --missing-stats-only seems like a well-scoped and low-risk
> first step.

> I’m happy to pivot in this direction and focus first on a clear,
> well-defined MISSING_STATS option for ANALYZE, and then revisit
> MODIFIED_STATS (possibly reusing autoanalyze-style thresholds) as a
> follow-up, once there is agreement on the semantics.

I agree with this.

Thanks!

Sami Imseih
Amazon Web Services (AWS)


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 12:19:47
Message-ID: CAE2r8H5ZYiFxYzhWDAyHO1jxSn7oB-9hPV8VdhAkkJjCwfg0HQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

I am back with a new patch.

As a follow-up to the previous discussion, I’m posting a revised v3 patch
that introduces an opt-in ANALYZE (MISSING_STATS) option.

The goal is to reduce unnecessary work when running manual ANALYZE over
many relations, while keeping the default behavior unchanged.

Overview:
ANALYZE (MISSING_STATS) analyzes only relations that currently have no
statistics entries in pg_statistic. Relations that already have statistics
are skipped.

This is conceptually similar to vacuumdb --missing-stats-only, but exposed
at the SQL ANALYZE level for interactive and scripted use.

The decision is intentionally table-level and conservative:

- If a relation has at least one pg_statistic entry, it is considered to
“have stats” and may be skipped.
- If no pg_statistic rows exist (new table, stats removed, crash reset),
the relation is analyzed.

No thresholds or modification counters are used in this option.

Behavior summary for ANALYZE(MISSING_STATS);

New empty table------------------------------->analyzed
Re-run on empty table------------------------> analyzed
Table with data but no stats-----------------> analyzed
Re-run after stats exist-----------------------> skipped
Add new column after ANALYZE----------> analyzed
Re-run after column stats exist-------------> skipped
Statistics manually deleted (pg_statistic) -> analyzed
Statistics lost after crash recovery-----------> analyzed
Regular ANALYZE -------------------------------> unchanged behavior

This ensures that ANALYZE (MISSING_STATS) converges toward a no-op on
subsequent runs, while still recovering from missing or invalid statistics.

Scope and limitations:
- Applies only to regular relations.
- Uses pg_statistic directly (not pg_stats or pg_stat views).
- Does not consider modification thresholds or autovacuum heuristics.
- Partitioned tables, inheritance, and extended statistics are not handled
yet and can be considered separately.

I would appreciate feedback on:

- Whether this behavior and naming align with expectations.
- Any edge cases I may have missed.
- Whether this is a reasonable first step before considering more advanced
options (e.g., modified-stats thresholds).

While testing i have noted this :
analyze_test=# ALTER TABLE ms1 ADD COLUMN b int;
ALTER TABLE
Time: 44.665 ms
analyze_test=# ANALYZE (MISSING_STATS);
DEBUG: ANALYZE processing relation "ms1" (OID 32791)
analyze_test=# SELECT attname
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_attribute a
ON a.attrelid = c.oid AND a.attnum = s.staattnum
WHERE c.relname = 'ms1'
ORDER BY attname;
attname
---------
a
b
(2 rows)

Time: 1.390 ms
analyze_test=# ANALYZE (MISSING_STATS);
DEBUG: ANALYZE (MISSING_STATS): skipping relation "ms1" (OID 32791)
analyze_test=# SELECT
a.attname,
s.stanullfrac,
s.stadistinct
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_attribute a
ON a.attrelid = c.oid AND a.attnum = s.staattnum
WHERE c.relname = 'ms1';
attname | stanullfrac | stadistinct
---------+-------------+-------------
a | 0 | -1
b | 1 | 0
(2 rows)

Time: 0.733 ms

Note:
[1] The empty table[if a table has 0 rows] --> no pg_statistic rows at all
[2] If a table has >= 1 row then postgres creates pg_statistic rows for
every user column,even if the
- Column is entirely null
- The column was added later
- The column has never had a non-null values

Thanks for your time and review, I will post the next patch for
modified_stats shortly.

*Vasuki MC-DAC,Chennai*

Attachment Content-Type Size
v3-0001-ANALYZE-add-MISSING_STATS-option.patch text/x-patch 8.4 KB

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>, Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 15:25:12
Message-ID: 81f859c4-aeb3-4405-b7de-7df9c3dcdac3@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 29.01.2026 15:19, VASUKI M wrote:
> I am back with a new patch.

1. It would be good to test this with assertions enabled. it would be
good to test this with assertions enabled. I noticed: MISSING_STATS is
currently defined withthe same bit value as FREEZE (0x08).

2. The current placement of the MISSING_STATS check insideanalyze_rel()
looks problematic. Right now it is executed immediately after
vacuum_open_relation(), beforewe verify that the relation was
successfully opened and is not pg_statistic, the user has privileges. It
seems more appropriate to move the MISSING_STATS check further
down,after the standard skip conditions.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 18:54:30
Message-ID: CAA5RZ0tcacqD_9VQUWk4-b-8BpDZ-CwPd=xuRQoxWBpStsHDSg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for the patch!

> This is conceptually similar to vacuumdb --missing-stats-only, but exposed
> at the SQL ANALYZE level for interactive and scripted use.

What is being proposed in v3 is not similar in behavior.

I am wondering if we should take the current SQL used by vacuumdb to
find missing stats and perform direct syscache lookups in C?

The benefit there is we can also change vacuumdb to use the ANALYZE
(MISSING_STATS) option
directly rather than deriving the SQL, and the MISSING_STATS behavior
will be the same
if we do this type of analyze from vacuumdb or manually via ANALYZE command.

Attached is the query used by vacuumdb on HEAD.

What do you think?

--
Sami Imseih
Amazon Web Services (AWS)

Attachment Content-Type Size
vacuumdb_missing_stats_only.sql application/octet-stream 3.5 KB

From: Sami Imseih <samimseih(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 21:14:18
Message-ID: CAA5RZ0uBeeM1psLPpWsw3rxRVpewKnMfs+jz+XAsrck7UW6YyQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Sorry for the quick follow-up.

> > This is conceptually similar to vacuumdb --missing-stats-only, but exposed
> > at the SQL ANALYZE level for interactive and scripted use.
>
> What is being proposed in v3 is not similar in behavior.
>
> I am wondering if we should take the current SQL used by vacuumdb to
> find missing stats and perform direct syscache lookups in C?
>
> The benefit there is we can also change vacuumdb to use the ANALYZE
> (MISSING_STATS) option
> directly rather than deriving the SQL, and the MISSING_STATS behavior
> will be the same
> if we do this type of analyze from vacuumdb or manually via ANALYZE command.
>
> Attached is the query used by vacuumdb on HEAD.
>
> What do you think?

After looking at this a bit more, it occurred to me there are 2 routines in
analyze.c and extended_stats.c that are called examine_attribute(), and
their purpose is to check if an attribute should be analyzed and if so, return
a VacAttrStats.

```
/*
* examine_attribute -- pre-analysis of a single column
*
* Determine whether the column is analyzable; if so, create and initialize
* a VacAttrStats struct for it. If not, return NULL.
*/
static VacAttrStats *
examine_attribute(Node *expr)
```

```
/*
* examine_attribute -- pre-analysis of a single column
*
* Determine whether the column is analyzable; if so, create and initialize
* a VacAttrStats struct for it. If not, return NULL.
*
* If index_expr isn't NULL, then we're trying to analyze an expression index,
* and index_expr is the expression tree representing the column's data.
*/
static VacAttrStats *
examine_attribute(Relation onerel, int attnum, Node *index_expr)
```

So, we may just need to implement 2 new helpers that check if column
stats or extended
stats are missing from pg_statistic/pg_statistic_ext, and we can then
check the following

1/ if missing_stats_only flag is set to TRUE
2/ and examine_attribute returns VacAttrStats ( attribute is analyzable )
3/ and we do not have stats in pg_statististic or pg_statistic_ext
for the attribute ( the 2 routines can determine this )

The attribute should then be analyzed.

This should also work without much more effort if specific columns are
specified in the
ANALYZE command:

```
analyze table1 (col1, col2);
```

Another comment I have:

> - Whether this behavior and naming align with expectations.

I think the option name should be called MISSING_STATS_ONLY,
which is both clearer in intention and matches the option in vacuumdb.

--
Sami Imseih
Amazon Web Services (AWS)


From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 21:40:28
Message-ID: CADkLM=eYG-ug_zcsr12BoD_i+bWhqnrCt8rXH4ULui42B5_tcQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> I am wondering if we should take the current SQL used by vacuumdb to
> find missing stats and perform direct syscache lookups in C?
>

So....about that. The exiting missing-stats-only queries test for a
corresponding pg_statistic_ext_data row for any pg_statistic_ext row that
meets the relation filters, but at this very moment we can restore all
types of extended stats _except_ expressions. That functionality could make
it into 19, but if it doesn't we're going to have to adjust vacuumdb to
probe pg_statistic_ext.stxkeys for expression indexes and look for matching
stxdexprs elements. I agree that those matches are better done with
syscache lookups, but the SQL that we're treating as a spec might be a
moving target in the near future.


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 22:24:02
Message-ID: CAA5RZ0vi7=pdB9xgu3U5pGkgXa_63+ZgE56OL-YdNTePn2oxtA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

>> I am wondering if we should take the current SQL used by vacuumdb to
>> find missing stats and perform direct syscache lookups in C?
>
>
> So....about that. The exiting missing-stats-only queries test for a corresponding
> pg_statistic_ext_data row for any pg_statistic_ext row that meets the relation filters,
> but at this very moment we can restore all types of extended stats _except_ expressions.
> That functionality could make it into 19, but if it doesn't we're going to have to adjust
> vacuumdb to probe pg_statistic_ext.stxkeys for expression indexes and look for
> matching stxdexprs elements. I agree that those matches are better done with
> syscache lookups, but the SQL that we're treating as a spec might be a moving
> target in the near future.

Eventually we will want vacuumdb to use the "ANALYZE (MISSING_STATS)" command
directly, rather than the SQL, but until the restore functionality
works for extended stats
of expressions, we will need to keep those separated. Did I understand
that correctly?

--
Sami Imseih
Amazon Web Services (AWS)


From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 22:48:50
Message-ID: CADkLM=di2r1FbDdL8=NWkcK3pE10JeBcOKg8978wU+fExgYz9A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 29, 2026 at 5:24 PM Sami Imseih <samimseih(at)gmail(dot)com> wrote:

> >> I am wondering if we should take the current SQL used by vacuumdb to
> >> find missing stats and perform direct syscache lookups in C?
> >
> >
> > So....about that. The exiting missing-stats-only queries test for a
> corresponding
> > pg_statistic_ext_data row for any pg_statistic_ext row that meets the
> relation filters,
> > but at this very moment we can restore all types of extended stats
> _except_ expressions.
> > That functionality could make it into 19, but if it doesn't we're going
> to have to adjust
> > vacuumdb to probe pg_statistic_ext.stxkeys for expression indexes and
> look for
> > matching stxdexprs elements. I agree that those matches are better done
> with
> > syscache lookups, but the SQL that we're treating as a spec might be a
> moving
> > target in the near future.
>
> Eventually we will want vacuumdb to use the "ANALYZE (MISSING_STATS)"
> command
> directly, rather than the SQL, but until the restore functionality
> works for extended stats
> of expressions, we will need to keep those separated. Did I understand
> that correctly?
>

Yes, but no, but yes (eventually). :)

Yes, if we implemented ANALYZE(MISSING_STATS_ONLY), then yes, we'd want to
leverage that that in vacuumdb once we know what versions it is available
for, as what constitutes "missing" will change from version to version, and
it would be nice to insulate vacuumdb from that. It will have to account
for what "missing" meant in past versions though.

But before we get there, we have to contend with the fact that what
constitutes "missing" has already subtly changed since v18, that change is
not yet reflected in vacuumdb, and ideally the definition would change back
to the v18 definition before v19 feature freeze, but that isn't guaranteed.


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-29 23:07:40
Message-ID: CAA5RZ0t8ZQRnenD39+922-25zLQKQWqTVewi0x0r9f9yBiX2-g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

>> >> I am wondering if we should take the current SQL used by vacuumdb to
>> >> find missing stats and perform direct syscache lookups in C?
>> >
>> >
>> > So....about that. The exiting missing-stats-only queries test for a corresponding
>> > pg_statistic_ext_data row for any pg_statistic_ext row that meets the relation filters,
>> > but at this very moment we can restore all types of extended stats _except_ expressions.
>> > That functionality could make it into 19, but if it doesn't we're going to have to adjust
>> > vacuumdb to probe pg_statistic_ext.stxkeys for expression indexes and look for
>> > matching stxdexprs elements. I agree that those matches are better done with
>> > syscache lookups, but the SQL that we're treating as a spec might be a moving
>> > target in the near future.
>>
>> Eventually we will want vacuumdb to use the "ANALYZE (MISSING_STATS)" command
>> directly, rather than the SQL, but until the restore functionality
>> works for extended stats
>> of expressions, we will need to keep those separated. Did I understand
>> that correctly?
>
>
> Yes, but no, but yes (eventually). :)
>
> Yes, if we implemented ANALYZE(MISSING_STATS_ONLY), then yes, we'd want to leverage
> that that in vacuumdb once we know what versions it is available for, as what constitutes "missing"
> will change from version to version, and it would be nice to insulate vacuumdb from that. It will have
> to account for what "missing" meant in past versions though.

of course, because vacuumdb is a client tool, and can connect to any
version. So, vacuumdb will need
to check for server version and either run
"ANALYZE(MISSING_STATS_ONLY)" or the full SQL.

> But before we get there, we have to contend with the fact that what constitutes "missing" has already
> subtly changed since v18, that change is not yet reflected in vacuumdb, and ideally the definition
> would change back to the v18 definition before v19 feature freeze, but that isn't guaranteed.

OK, I am confused a bit about the details of this point, but it looks
like this work is happening
in another thread, maybe [0] ?

So with regards to this thread, vacuumdb using this new option will be
out of scope. This could
be handled in a future thread.

[0] https://www.postgresql.org/message-id/aRrHBj7I4CeV21dj@paquier.xyz

--
Sami Imseih
Amazon Web Services (AWS)


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-30 04:59:03
Message-ID: CAE2r8H5kDpOTZv5sRGYe0BF2gN45fgCKOJ2GD-tpwsJKAxWsVg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Thanks a lot for the detailed and thoughtful feedback. I want to confirm my
understanding of the points raised and how I plan to align the work going
forward.

On the implementation details raised by Ilia:
I agree that the current flag definition and placement need correction. The
MISSING_STATS option must use a unique flag value, and the check should be
performed only after the standard relation validation, privilege checks,
and existing skip conditions in analyze_rel(). I will fix both and verify
the behavior with assertions enabled.

Regarding behavior and scope, I understand the concern that the current v3
implementation does not exactly match vacuumdb --missing-stats-only
semantics. My intent was not to claim behavioral equivalence, but to expose
a server-side ANALYZE option that identifies relations requiring statistics
collection. I agree that this distinction needs to be made clearer.

The suggestion to reuse existing ANALYZE internals, in particular
examine_attribute(), makes sense. Leveraging that logic to determine
whether analyzable attributes lack statistics should align the
implementation more closely with core ANALYZE behavior and avoid
re-defining missing-stats rules independently.[Thanks Sami for teaching me
this as i am an new contributor:) ]

On vacuumdb integration: I agree that this patch should focus only on
defining server-side ANALYZE semantics. As noted, vacuumdb is a client tool
that must handle multiple server versions, and the definition of “missing
stats” itself is evolving (especially for extended and expression
statistics). Adapting vacuumdb to use ANALYZE(MISSING_STATS_ONLY) can be
considered as follow-up work once the server-side behavior is finalized.

Finally, I agree that the option name should be MISSING_STATS_ONLY for
clarity and consistency with vacuumdb.

Thanks again for the guidance — it has been very helpful in understanding
the expected direction and design constraints. I will post an updated patch
addressing the above points.

*Vasuki MC-DAC,Chennai*


From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-30 05:28:03
Message-ID: CADkLM=eFRiQs6hZoPa+NrL8zUBjUF=BoSsLZOV-xG+LaRuL+5Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> > But before we get there, we have to contend with the fact that what
> constitutes "missing" has already
> > subtly changed since v18, that change is not yet reflected in vacuumdb,
> and ideally the definition
> > would change back to the v18 definition before v19 feature freeze, but
> that isn't guaranteed.
>
> OK, I am confused a bit about the details of this point, but it looks
> like this work is happening
> in another thread, maybe [0] ?
>

Yes, but that thread was about to close and it was in the process of being
moved to [1] as I was writing that message. The only thing to keep in mind
is that if the effort in [1] stalls, then the definition of missing in
vacuumdb will likely get marginally more complex. I hope that doesn't
happen, and I believe that it won't, but I don't want anybody blind-sided
if it does.

So with regards to this thread, vacuumdb using this new option will be
> out of scope. This could
> be handled in a future thread.
>

+1

[1]
https://www.postgresql.org/message-id/flat/CADkLM%3DfPcci6oPyuyEZ0F4bWqAA7HzaWO%2BZPptufuX5_uWt6kw%40mail.gmail.com


From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-30 05:34:09
Message-ID: CADkLM=dLGe9qD5XDV8ROBXmcOgVxPYgWGp27TBz3EhdPThundg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

The suggestion to reuse existing ANALYZE internals, in particular
> examine_attribute(), makes sense. Leveraging that logic to determine
> whether analyzable attributes lack statistics should align the
> implementation more closely with core ANALYZE behavior and avoid
> re-defining missing-stats rules independently.[Thanks Sami for teaching me
> this as i am an new contributor:) ]
>

You'll find some useful code for the syscache lookups in
attribute_stats_update() and extended_statistics_update() for determining
if attributes and stats objects have matching stats.


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-13 12:14:30
Message-ID: CAE2r8H7vnihrOE5i+RvpU72jcyRkxMkhCY9mzBb+PBrTzSmo5g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

First of all, sorry for the delay in posting the updated patch. I was
occupied with some other work and holidays, and it took me some time to
revisit the design carefully. After several iterations and reworking the
logic based on the earlier feedback, I have now arrived at what I believe
is a cleaner and more aligned implementation. I am attaching v4 of the
patch for review.

The intention is to provide a SQL-level equivalent of vacuumdb
--analyze-only --missing-stats-only, while keeping the default ANALYZE
behavior completely unchanged. When this option is specified, ANALYZE will
process only those relations that are missing statistics, and skip
relations that already have complete statistics.

A relation is considered to be missing statistics if at least one
analyzable attribute (as determined by examine_attribute()) does not have a
corresponding row in pg_statistic, or if there is an extended statistics
object in pg_statistic_ext without a matching row in pg_statistic_ext_data.
In other words, the implementation relies on catalog inspection rather than
counters or timestamps, and it reuses existing ANALYZE internals instead of
redefining missing-stats logic independently.

The check is placed inside analyze_rel(), after the standard relation
validation and skip conditions, so that it does not interfere with
privilege checks or special-relation handling. The default ANALYZE path
remains unchanged.

In terms of behavior:

* A brand new empty table is analyzed (since it has no statistics).
* Re-running on an empty table analyzes again, because there are still no
pg_statistic rows.
* A table with data but no statistics is analyzed.
* Re-running after statistics exist causes the table to be skipped.
* If a new column is added and lacks statistics, the table is analyzed
again.
* After statistics are created for that column, subsequent runs skip the
table.
* If statistics are manually deleted or effectively lost (e.g., crash
recovery scenarios affecting stats tracking), the table is analyzed again.

Repeated runs therefore converge toward a no-op once all relations have
complete statistics.

Regression tests are included.

As discussed earlier in the thread, I plan to start a new discussion and
patch series for a separate ANALYZE (MODIFIED_STATS) option that would
reuse autoanalyze-style thresholds. I believe keeping MISSING_STATS_ONLY
and MODIFIED_STATS as separate, clearly defined options makes the semantics
easier to reason about.

I would greatly appreciate further review and feedback on this version.
Thank you all for the detailed guidance and suggestions so far — especially
regarding reuse of examine_attribute() and alignment with vacuumdb
behavior. This process has been very educational for me.

Thanks,
Vasuki M
C-DAC, Chennai

Attachment Content-Type Size
v4-0001-ANALYZE-add-MISSING_STATS_ONLY-option.patch text/x-patch 10.8 KB

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>, Sami Imseih <samimseih(at)gmail(dot)com>
Cc: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-13 16:27:15
Message-ID: be509f90-74bd-4a81-83a7-d5afa0d17b38@proxel.se
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 2/13/26 1:14 PM, VASUKI M wrote:
> As discussed earlier in the thread, I plan to start a new discussion and
> patch series for a separate ANALYZE (MODIFIED_STATS) option that would
> reuse autoanalyze-style thresholds. I believe keeping MISSING_STATS_ONLY
> and MODIFIED_STATS as separate, clearly defined options makes the
> semantics easier to reason about.

When would a user ever want MODIFIED_STATS without also analyzing tables
without any stats? I had more thought of calling it something like
SKIP_UNMODIFIED but maybe there is a case for MISSING_STATS_ONLY,
especially since the command line tool supports it.

Andreas


From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-13 18:21:01
Message-ID: CADkLM=euYKx6aCLNf=Xp5=2hL+xTPQPqVN+AvUvpDW=XWotWDQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

>
> * A brand new empty table is analyzed (since it has no statistics).
> * Re-running on an empty table analyzes again, because there are still no
> pg_statistic rows.
> * A table with data but no statistics is analyzed.
> * Re-running after statistics exist causes the table to be skipped.
> * If a new column is added and lacks statistics, the table is analyzed
> again.
> * After statistics are created for that column, subsequent runs skip the
> table.
> * If statistics are manually deleted or effectively lost (e.g., crash
> recovery scenarios affecting stats tracking), the table is analyzed again.
>
> Repeated runs therefore converge toward a no-op once all relations have
> complete statistics.
>
> Regression tests are included.
>
> As discussed earlier in the thread, I plan to start a new discussion and
> patch series for a separate ANALYZE (MODIFIED_STATS) option that would
> reuse autoanalyze-style thresholds. I believe keeping MISSING_STATS_ONLY
> and MODIFIED_STATS as separate, clearly defined options makes the semantics
> easier to reason about.
>
> I would greatly appreciate further review and feedback on this version.
> Thank you all for the detailed guidance and suggestions so far — especially
> regarding reuse of examine_attribute() and alignment with vacuumdb
> behavior. This process has been very educational for me.
>
>
>
A few notes:

- attnum variables can be of type AttrNumber
- The SearchSysCache3 lookup can't assume inh = false. Partitioned tables
will ONLY have inh = true. Inheritance parents will have both.
- Same inh concerns for extended stats.


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-16 09:38:44
Message-ID: CAE2r8H61ZtT4ek3JmLKDPmr7ALQ0uE9WSwwJRFHbXm0WdOJnEQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi Andreas,

Thank you for raising this — it’s a very good design question.

You’re right that in many practical cases, a user invoking something like
ANALYZE (MODIFIED_STATS) would also want to include relations that
currently have no statistics. From an operational perspective, “missing
stats” and “modified stats” can overlap.

In my earlier prototype, I did attempt to handle both concerns together.
However, during the previous discussion in the thread, it became clear that
combining the semantics made the behavior less predictable and harder to
reason about. That led to splitting the functionality into two more clearly
defined options:

MISSING_STATS_ONLY → analyze relations lacking statistics.

MODIFIED_STATS (proposed) → analyze relations whose statistics may be stale
due to modifications.

The motivation for separation was semantic clarity:

MISSING_STATS_ONLY is catalog-based and persistent (derived from
pg_statistic / pg_statistic_ext).

MODIFIED_STATS would likely depend on modification counters or thresholds
(similar to autoanalyze logic), which are transient and not
crash-persistent.

Keeping them distinct allows each option to have a well-defined and
predictable contract.

That said, your naming suggestion is interesting. A name such as
SKIP_UNMODIFIED does express the behavior from the inverse perspective and
may indeed be clearer. Another possible direction could be:

ANALYZE (MISSING_STATS_ONLY)

ANALYZE (SKIP_UNMODIFIED)

Or potentially allowing both options together, if that proves semantically
consistent.

I’m very open to adjusting the naming and/or semantics if the consensus is
that a combined approach would be more practical.

Thank you again for the thoughtful feedback — it’s very helpful in refining
the direction of this work.

Best regards,
Vasuki M
C-DAC,Chennai


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-16 12:19:53
Message-ID: CAE2r8H4EE9wTwfoykooJpyL1qZZweJoRdm0x2OBHeRd+riOMWg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

On Fri, Feb 13, 2026 at 11:51 PM Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
wrote:

> A few notes:
>
> - attnum variables can be of type AttrNumber
> - The SearchSysCache3 lookup can't assume inh = false. Partitioned tables
> will ONLY have inh = true. Inheritance parents will have both.
> - Same inh concerns for extended stats.
>
Done.

I am attaching v5 of the MISSING_STATS_ONLY patch.

The main changes are in how we detect existing statistics for inherited and
partitioned tables.

Previously, the column stats check assumed stainherit = false in the
syscache lookup. As Corey pointed out, that assumption is incorrect:

Partitioned tables only have stainherit = true.

Inheritance parents can have both true and false entries.

In this version, the column stats check now tries both cases (stainherit =
false and stainherit = true) before concluding that stats are missing. A
column is considered missing stats only if neither entry exists.

The same logic has been applied to extended statistics as well.

Other than this correction, the overall semantics of ANALYZE
(MISSING_STATS_ONLY) remain unchanged:

Relations with at least one analyzable attribute or extended stats object
lacking statistics are analyzed.

Relations with complete statistics are skipped.

Repeated runs converge toward a no-op once stats are fully populated.

Regression tests were updated and re-run after this change.

Please let me know if the inheritance handling now matches expectations, or
if there are additional cases I should consider.

As discussed earlier, I will start a separate thread for the modified-stats
behavior once this direction is settled.

Thanks again for the review and guidance.

Regards,
Vasuki M
C-DAC,Chennai

Attachment Content-Type Size
v5-0001-ANALYZE-add-MISSING_STATS_ONLY-option.patch text/x-patch 11.7 KB

From: Robert Treat <rob(at)xzilla(dot)net>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, Sami Imseih <samimseih(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-18 19:32:06
Message-ID: CAJSLCQ3CoEjd=DiANwyBybFaOu24PZFXo5f8EQUbsZ+UL0wL0A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 16, 2026 at 4:38 AM VASUKI M <vasukianand0119(at)gmail(dot)com> wrote:
>
> Hi Andreas,
>
> Thank you for raising this — it’s a very good design question.
>
> You’re right that in many practical cases, a user invoking something like ANALYZE (MODIFIED_STATS) would also want to include relations that currently have no statistics. From an operational perspective, “missing stats” and “modified stats” can overlap.
>
> In my earlier prototype, I did attempt to handle both concerns together. However, during the previous discussion in the thread, it became clear that combining the semantics made the behavior less predictable and harder to reason about. That led to splitting the functionality into two more clearly defined options:
>
> MISSING_STATS_ONLY → analyze relations lacking statistics.
>
> MODIFIED_STATS (proposed) → analyze relations whose statistics may be stale due to modifications.
>
> The motivation for separation was semantic clarity:
>
> MISSING_STATS_ONLY is catalog-based and persistent (derived from pg_statistic / pg_statistic_ext).
>
> MODIFIED_STATS would likely depend on modification counters or thresholds (similar to autoanalyze logic), which are transient and not crash-persistent.
>
> Keeping them distinct allows each option to have a well-defined and predictable contract.
>
> That said, your naming suggestion is interesting. A name such as SKIP_UNMODIFIED does express the behavior from the inverse perspective and may indeed be clearer. Another possible direction could be:
>
> ANALYZE (MISSING_STATS_ONLY)
>
> ANALYZE (SKIP_UNMODIFIED)
>
> Or potentially allowing both options together, if that proves semantically consistent.
>
> I’m very open to adjusting the naming and/or semantics if the consensus is that a combined approach would be more practical.
>

Well, going back to the beginning of the thread, we have two distinct
use cases at the individual level. One (MISSING_STATS) is to quickly
go through the database and ensure they have added statistics for
anything that might be missing them, like new columns, new extended
statistics, etc... The other (MODIFIED_STATS) was having a way to
update statistics in active tables for databases with large numbers of
static tables in a way similar to how autoanalyze works, but available
on demand. While I suspect people will often run both of these
together, those are clearly separate concerns and based on the
original discussions where this was being hashed out, it is easier to
reason about them separately. And while I think you might be able to
argue that MODIFIED_STATS should also include MISSING_STATS (I do
wonder though, does autoanalyze do that?), given the use case of
integrating MISSING_STATS into vacuumdb , it absolutely needs to be a
stand alone flag for that scenario.

One bookkeeping note for VASUKI, I didn't see any commitfest entries
for either patch; I would create one for each of these features
separately within https://commitfest.postgresql.org/58/.

Robert Treat
https://xzilla.net


From: Sami Imseih <samimseih(at)gmail(dot)com>
To: Robert Treat <rob(at)xzilla(dot)net>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-19 18:23:35
Message-ID: CAA5RZ0v02tJ9uBnhR4Uf4EEte4fsLiL3uKXxO+Fcs_-D2kESUw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> I am attaching v5 of the MISSING_STATS_ONLY patch.

Thanks for the patch. Here are some comments I have:

1/
+relation_has_missing_column_stats(Relation rel)
+{
+ AttrNumber attnum;
+
+ for (attnum = 1; attnum <= rel->rd_att->natts; attnum++)
+ {
+ VacAttrStats *stats;
+ HeapTuple statstup;
+ bool has_stats = false;
+
+ /*
+ * Step 1: Is this attribute analyzable at all?
+ * (skips dropped columns, system cols, attstattarget=0, etc.)
+ */
+ stats = examine_attribute(rel, attnum, NULL);

- and -

+ /*
+ * ANALYZE (MISSING_STATS_ONLY):
+ * Skip relation if it has no missing column or extended statistics.
+ */
+ if (params.options & VACOPT_MISSING_STATS_ONLY)
+ {
+ if (!relation_has_missing_column_stats(onerel) &&
+ !relation_has_missing_extended_stats(onerel))
+ {
+ elog(DEBUG1,
+ "ANALYZE (MISSING_STATS_ONLY):
skipping relation \"%s\"",
+ RelationGetRelationName(onerel));
+

This approach seems off. relation_has_missing_column_stats() and
elation_has_missing_extended_stats()
are being called early, and relation_has_missing_column_stats() a also
calls examine_attribute() for all attributes.
Later on, examine_attribute() is called again for the required
attributes, so it ends up running twice.

I think the natural thing to do here is to make
relation_has_missing_column_stats() take in
an attnum to perform the lookup in the stats catalog on a specific
attribute. This call should occur inside
examine_attribute().
examine_attribute() can also take in a boolean, missing_stats_only, so
we only look up
the stats catalog only when the missing_stats_only option is
specified. Columns that already
have stats simply return NULL, causing them to be skipped naturally.

With that, inside do_analyze_rel(), we can then check if attr_cnt = 0,
the number of expr_indexes = 0
and we are not missing extended stats, we can just skip the relation
from analyze.

Something like this:

```
if (params.options & VACOPT_MISSING_STATS_ONLY &&
(attr_cnt == 0 && expr_cnt == 0 &&
!relation_has_missing_extended_stats(onerel)))
{
vac_close_indexes(nindexes, Irel, NoLock);
return;
}
```

2/
I think we need more test coverage, inheritance partitions,
declarative partitions,
index on expressions, ANALYZE with a column specified, etc.

3/

+ elog(DEBUG1,
+ "ANALYZE (MISSING_STATS_ONLY):
skipping relation \"%s\"",
+ RelationGetRelationName(onerel));

I think logging should be at info level, and we should only log when
skipping a relation

The following seems unnecessary.

+ elog(DEBUG1,
+ "ANALYZE (MISSING_STATS_ONLY): relation
eligible \"%s\"",
+ RelationGetRelationName(onerel));
+ }
+
+
/*
* OK, let's do it. First, initialize progress reporting.
*/
@@ -314,6 +451,10 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;

+ elog(DEBUG1, "ANALYZE processing relation \"%s\" (OID %u)",
+ RelationGetRelationName(onerel),
+ RelationGetRelid(onerel));

Also, I think it's better to use ereport like is being done for the other
ANALYZE logging, i.e.

```
appendStringInfo(&buf,
_("WAL usage:
%" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes,
%" PRIu64 " full page image bytes, %" PRId64 " buffers full\n"),
walusage.wal_records,
walusage.wal_fpi,
walusage.wal_bytes,
walusage.wal_fpi_bytes,

walusage.wal_buffers_full);
appendStringInfo(&buf, _("system usage: %s"),
pg_rusage_show(&ru0));

ereport(verbose ? INFO : LOG,
(errmsg_internal("%s", buf.data)));
```

4/
Make sure to run pgindent

> And while I think you might be able to
> argue that MODIFIED_STATS should also include MISSING_STATS (I do
> wonder though, does autoanalyze do that?)

No, autoanalyze decisions are driven purely off of thresholds.

--
Sami Imseih
Amazon Web Services (AWS)


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-02-20 11:52:53
Message-ID: c84ac92f-cf37-4398-a0f3-719274cb8c90@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers


On 2/19/26 21:23, Sami Imseih wrote:
> 3/
>
> + elog(DEBUG1,
> + "ANALYZE (MISSING_STATS_ONLY):
> skipping relation \"%s\"",
> + RelationGetRelationName(onerel));
>
> I think logging should be at info level, and we should only log when
> skipping a relation
>
>
> Also, I think it's better to use ereport like is being done for the other
> ANALYZE logging, i.e.
>
> ```
> appendStringInfo(&buf,
> _("WAL usage:
> %" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes,
> %" PRIu64 " full page image bytes, %" PRId64 " buffers full\n"),
> walusage.wal_records,
> walusage.wal_fpi,
> walusage.wal_bytes,
> walusage.wal_fpi_bytes,
>
> walusage.wal_buffers_full);
> appendStringInfo(&buf, _("system usage: %s"),
> pg_rusage_show(&ru0));
>
> ereport(verbose ? INFO : LOG,
> (errmsg_internal("%s", buf.data)));
> ```

+1

I agree that in VERBOSE mode we should display a message when the
relation is skipped. I suggest something along the lines of:

INFO:    Skipping analyzing "database.namespace.relation"

This would be consistent with the style already used in
do_analyze_rel(). Since this is executed ANALYZE, and no statistics are
actually collected , I don't think we need to display any additional
details beyond this single line.

--
Best regards.
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Andreas Karlsson <andreas(at)proxel(dot)se>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-03-02 12:48:05
Message-ID: CAE2r8H7hYGYi4QM85Q7bxs4RbT0Vn63c9ONFbwhAjuSGDzah_A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hello Robert, Sami, Ilia, and everyone,

Thank you all for the detailed review and thoughtful feedback. I have
carefully gone through the comments and revised the patch accordingly. I
truly appreciate the guidance — it helped clarify both the design and the
implementation details.

Regarding Robert’s point about the two distinct use cases, I agree that
MISSING_STATS_ONLY and MODIFIED_STATS represent separate concerns.
MISSING_STATS_ONLY is catalog-driven and persistent in nature, ensuring
that newly created tables, newly added columns, and newly defined extended
statistics are not left without statistics. MODIFIED_STATS, on the other
hand, is more closely aligned with modification thresholds and
autoanalyze-like behavior, which is transient and threshold-based.

Keeping these concerns separate makes the semantics clearer and easier to
reason about. In particular, as Robert mentioned, MISSING_STATS_ONLY must
remain a standalone flag for scenarios such as integration into vacuumdb,
where ensuring that missing statistics are generated is the primary goal.

To align with the CommitFest process, I have created a separate entry for
this feature here:[1]

I will treat MODIFIED_STATS as a separate patch and discussion thread.

Addressing Sami’s technical comments:

1. Duplicate examine_attribute() calls

You were absolutely correct that the earlier version of the patch caused
examine_attribute() to be invoked twice for each attribute: once in
relation_has_missing_column_stats() and again during normal attribute
processing. That approach was inefficient and awkward.
In the revised version, I have removed the early relation-level column
scanning logic. Instead, the missing-statistics check is now integrated
directly inside examine_attribute(). The function now takes a boolean
missing_stats_only parameter. When this option is specified,
examine_attribute() performs the pg_statistic lookup for that specific
attribute. If statistics already exist, the function simply returns NULL,
causing the column to be skipped naturally.

This ensures:
-Each attribute is examined only once.
-The missing-statistics logic fits naturally into the existing flow.
-We avoid redundant catalog lookups.

Inside do_analyze_rel(), after collecting vacattrstats, we now check
whether:
-attr_cnt == 0
-no expression index attributes remain
-and there are no missing extended statistics
If all of these are true under MISSING_STATS_ONLY, the relation is skipped
cleanly.This follows the structure you suggested and simplifies the overall
design.

2. Test coverage

I have expanded the regression tests to cover the scenarios you
mentioned.This ensures that MISSING_STATS_ONLY behaves correctly across
more complex schema configurations.

3. Logging behavior

Thank you for the guidance on logging.I have removed the DEBUG-level elog()
calls introduced in earlier versions. Logging is now consistent with
existing ANALYZE behavior and uses ereport().
When VERBOSE mode is enabled and a relation is skipped due to
MISSING_STATS_ONLY, the output now includes a single INFO-level message of
the form:

INFO: Skipping analyzing "database.namespace.relation"

This matches the style used elsewhere in do_analyze_rel() and avoids
unnecessary additional details when no statistics are collected.There is no
additional logging when a relation is processed normally, keeping behavior
aligned with existing ANALYZE semantics.

Regarding autoanalyze and MODIFIED_STATS
As Ilia clarified, autoanalyze decisions are purely threshold-driven and do
not include missing-statistics logic. That reinforces the decision to keep
MISSING_STATS_ONLY separate and explicit.[2]

If there are further suggestions regarding semantics, naming, or additional
edge cases that should be tested, I would be very happy to incorporate them.
Thank you again for your time and detailed review.

Regards,
Vasuki M
C-DAC,Chennai

[1] https://commitfest.postgresql.org/patch/6516/
[2]
https://www.postgresql.org/message-id/flat/aZSm77WEh8pxQYtf%40nathan#1ab5c06a7d2247d90e71fb995fa21a39

>

Attachment Content-Type Size
v6-0001-ANALYZE-add-MISSING_STATS_ONLY-option.patch text/x-patch 16.2 KB

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Andreas Karlsson <andreas(at)proxel(dot)se>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-03-04 11:21:38
Message-ID: b3fc5424-e993-4b3e-b9df-83f7f5f21e79@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Before posting patches, it would be very helpful to run at least 'make
check' and 'make check-world' on a build configured with
--enable-cassert and --enable-tap-tests, as this often helps to catch
issues early.

A few additional moments:

1. If we determine that statistics already exists, it is not enough to
only call vac_close_indexes(). We also need to make sure that everything
else is freed properly: memory contexts, restored privileges, etc.
Please take a look at the end of do_analyze_rel() for the full cleanup
sequence.

2. Static functions should be also have their declarations placed at the
top of the file, following the usual style.

3. The expression (params.options & VACOPT_MISSING_STATS_ONLY) appears
multiple times. It might be simpler to introduce a local variable such
as 'missing_stats_only' and use it instead.

4. The message about skipping analyzing for a relation should follow the
same logging pattern as the message about analyzing a relation, i.e.
using 'verbose ? INFO : LOG'. See the end of the function for reference.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: VASUKI M <vasukianand0119(at)gmail(dot)com>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Andreas Karlsson <andreas(at)proxel(dot)se>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-06-17 13:18:39
Message-ID: d18342e7-dfaf-4158-acfc-a60d6ab016f5@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

As there has been no feedback on this thread for more than three months,
I'm moving it to "Return with feedback".

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/