author Tom Lane 2010-10-08 00:00:28 +0000
committer Tom Lane 2010-10-08 00:00:28 +0000
commit 3ba11d3df2115b04171a8eda8e0389e02578d8d0
tree 9ae749f1499b9e0e00032272d7a5d1c3f1266c02 /doc/src
parent 694c56af2b586551afda624901d6dec951b58027
Teach CLUSTER to use seqscan-and-sort when it's faster than indexscan.
... or at least, when the planner's cost estimates say it will be faster.

Leonardo Francalanci, reviewed by Itagaki Takahiro and Tom Lane
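
As a rough illustration of the behavior this patch documents (not part of the commit itself), a session that follows the new guidance might look like the sketch below. The table name orders, the index name orders_created_idx, and the 1GB setting are hypothetical placeholders. Turning enable_sort off is only needed when the extra temporary sort space cannot be afforded.

```sql
-- Hypothetical names: orders, orders_created_idx. Adjust to your schema.

-- Give the sort more memory; 1GB is an assumed value, not a recommendation.
-- Do not exceed the RAM you can dedicate to the CLUSTER operation.
SET maintenance_work_mem = '1GB';

-- Optional: steer CLUSTER away from the seqscan-and-sort method when its
-- temporary sort file (up to about the table size again) will not fit on disk.
SET enable_sort = off;

CLUSTER VERBOSE orders USING orders_created_idx;

-- Restore the session defaults afterwards.
RESET enable_sort;
RESET maintenance_work_mem;
```

Leaving enable_sort at its default lets CLUSTER pick whichever method the planner's cost estimates favor, which is the behavior this commit introduces.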
Diffstat (limited to 'doc/src')
-rw-r--r-- doc/src/sgml/ref/cluster.sgml 67
1 files changed, 30 insertions, 37 deletions
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 4b641954efa..adba2678632 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -128,18 +128,33 @@ CLUSTER [VERBOSE]
</para>
<para>
- During the cluster operation, a temporary copy of the table is created
- that contains the table data in the index order. Temporary copies of
- each index on the table are created as well. Therefore, you need free
- space on disk at least equal to the sum of the table size and the index
- sizes.
+ <command>CLUSTER</> can re-sort the table using either an indexscan
+ on the specified index, or (if the index is a b-tree) a sequential
+ scan followed by sorting. It will attempt to choose the method that
+ will be faster, based on planner cost parameters and available statistical
+ information.
</para>
<para>
- Because <command>CLUSTER</command> remembers the clustering information,
- one can cluster the tables one wants clustered manually the first time, and
- setup a timed event similar to <command>VACUUM</command> so that the tables
- are periodically reclustered.
+ When an indexscan is used, a temporary copy of the table is created that
+ contains the table data in the index order. Temporary copies of each
+ index on the table are created as well. Therefore, you need free space on
+ disk at least equal to the sum of the table size and the index sizes.
+ </para>
+
+ <para>
+ When a sequential scan and sort is used, a temporary sort file is
+ also created, so that the peak temporary space requirement is as much
+ as double the table size, plus the index sizes. This method is often
+ faster than the indexscan method, but if the disk space requirement is
+ intolerable, you can disable this choice by temporarily setting <xref
+ linkend="guc-enable-sort"> to <literal>off</>.
+ </para>
+
+ <para>
+ It is advisable to set <xref linkend="guc-maintenance-work-mem"> to
+ a reasonably large value (but not more than the amount of RAM you can
+ dedicate to the <command>CLUSTER</> operation) before clustering.
</para>
<para>
@@ -150,35 +165,13 @@ CLUSTER [VERBOSE]
</para>
<para>
- There is another way to cluster data. The
- <command>CLUSTER</command> command reorders the original table by
- scanning it using the index you specify. This can be slow
- on large tables because the rows are fetched from the table
- in index order, and if the table is disordered, the
- entries are on random pages, so there is one disk page
- retrieved for every row moved. (<productname>PostgreSQL</productname> has
- a cache, but the majority of a big table will not fit in the cache.)
- The other way to cluster a table is to use:
-
-<programlisting>
-CREATE TABLE <replaceable class="parameter">newtable</replaceable> AS
- SELECT * FROM <replaceable class="parameter">table</replaceable> ORDER BY <replaceable class="parameter">columnlist</replaceable>;
-</programlisting>
-
- which uses the <productname>PostgreSQL</productname> sorting code
- to produce the desired order;
- this is usually much faster than an index scan for disordered data.
- Then you drop the old table, use
- <command>ALTER TABLE ... RENAME</command>
- to rename <replaceable class="parameter">newtable</replaceable> to the
- old name, and recreate the table's indexes.
- The big disadvantage of this approach is that it does not preserve
- OIDs, constraints, foreign key relationships, granted privileges, and
- other ancillary properties of the table &mdash; all such items must be
- manually recreated. Another disadvantage is that this way requires a sort
- temporary file about the same size as the table itself, so peak disk usage
- is about three times the table size instead of twice the table size.
+ Because <command>CLUSTER</command> remembers which indexes are clustered,
+ one can cluster the tables one wants clustered manually the first time,
+ then set up a periodic maintenance script that executes
+ <command>CLUSTER</> without any parameters, so that the desired tables
+ are periodically reclustered.
</para>
+
</refsect1>
<refsect1>