Revise hash join and hash aggregation code to use the same datatype-specific hash functions used by hash indexes, rather than the old not-datatype-aware ComputeHashFunc routine. This makes it safe to do hash joining on several datatypes that previously couldn't use hashing. The sets of datatypes that are hash indexable and hash joinable are now exactly the same, whereas before each had some that weren't in the other.
author: Tom Lane 2003-06-22 22:04:55 +0000
committer: Tom Lane 2003-06-22 22:04:55 +0000
commit: bff0422b6c8f65b2f8210d8690a7f63f8d6e2782
tree: a3ec649b7c6251efdae2be1b923462979ad7184e /doc/src
parent: 0dda75f6eb4bb9d65a7c2ad729fbf21d616c1bb1
3 files changed, 30 insertions, 43 deletions
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index a8f7190856c..835739d81bb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1,6 +1,6 @@
 <!--
  Documentation of the system catalogs, directed toward PostgreSQL developers
- $Header: /cvsroot/pgsql/doc/src/sgml/catalogs.sgml,v 2.71 2003/05/28 16:03:55 tgl Exp $
+ $Header: /cvsroot/pgsql/doc/src/sgml/catalogs.sgml,v 2.72 2003/06/22 22:04:54 tgl Exp $
  -->
 
 <chapter id="catalogs">
@@ -2525,7 +2525,7 @@
       <entry><structfield>oprcanhash</structfield></entry>
       <entry><type>bool</type></entry>
       <entry></entry>
-      <entry>This operator supports hash joins.</entry>
+      <entry>This operator supports hash joins</entry>
      </row>
 
      <row>
diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index f6298a0ecca..b64aa011138 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/xfunc.sgml,v 1.68 2003/05/29 20:40:36 tgl Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/xfunc.sgml,v 1.69 2003/06/22 22:04:54 tgl Exp $
 -->
 
  <sect1 id="xfunc">
@@ -1442,11 +1442,10 @@ concat_text(PG_FUNCTION_ARGS)
       <listitem>
        <para>
         Always zero the bytes of your structures using
-        <function>memset</function> or <function>bzero</function>.
-        Several routines (such as the hash access method, hash joins,
-        and the sort algorithm) compute functions of the raw bits
-        contained in your structure.  Even if you initialize all
-        fields of your structure, there may be several bytes of
+	<function>memset</function>.  Without this, it's difficult to
+	support hash indexes or hash joins, as you must pick out only
+	the significant bits of your data structure to compute a hash.
+        Even if you initialize all fields of your structure, there may be
         alignment padding (holes in the structure) that may contain
         garbage values.
        </para>
diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml
index 22d214623ba..a2705eb6636 100644
--- a/doc/src/sgml/xoper.sgml
+++ b/doc/src/sgml/xoper.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/xoper.sgml,v 1.23 2003/04/10 01:22:45 petere Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/xoper.sgml,v 1.24 2003/06/22 22:04:54 tgl Exp $
 -->
 
  <sect1 id="xoper">
@@ -315,46 +315,34 @@ table1.column1 OP table2.column2
      same hash code.  If two values get put in different hash buckets, the
      join will never compare them at all, implicitly assuming that the
      result of the join operator must be false.  So it never makes sense
-     to specify <literal>HASHES</literal> for operators that do not represent equality.
+     to specify <literal>HASHES</literal> for operators that do not represent
+     equality.
     </para>
 
     <para>
-     In fact, logical equality is not good enough either; the operator
-     had better represent pure bitwise equality, because the hash
-     function will be computed on the memory representation of the
-     values regardless of what the bits mean.  For example, the
-     polygon operator <literal>~=</literal>, which checks whether two
-     polygons are the same, is not bitwise equality, because two
-     polygons can be considered the same even if their vertices are
-     specified in a different order.  What this means is that a join
-     using <literal>~=</literal> between polygon fields would yield
-     different results if implemented as a hash join than if
-     implemented another way, because a large fraction of the pairs
-     that should match will hash to different values and will never be
-     compared by the hash join.  But if the optimizer chooses to use a
-     different kind of join, all the pairs that the operator
-     <literal>~=</literal> says are the same will be found.  We don't
-     want that kind of inconsistency, so we don't mark the polygon
-     operator <literal>~=</literal> as hashable.
+     To be marked <literal>HASHES</literal>, the join operator must appear
+     in a hash index operator class.  This is not enforced when you create
+     the operator, since of course the referencing operator class couldn't
+     exist yet.  But attempts to use the operator in hash joins will fail
+     at runtime if no such operator class exists.  The system needs the
+     operator class to find the datatype-specific hash function for the
+     operator's input datatype.  Of course, you must also supply a suitable
+     hash function before you can create the operator class.
     </para>
 
     <para>
-     There are also machine-dependent ways in which a hash join might fail
-     to do the right thing.  For example, if your data type
-     is a structure in which there may be uninteresting pad bits, it's unsafe
-     to mark the equality operator <literal>HASHES</>.  (Unless you write
-     your other operators and functions to ensure that the unused bits are always zero, which is the recommended strategy.)
-     Another example is that the floating-point data types are unsafe for hash
-     joins.  On machines that meet the <acronym>IEEE</> floating-point standard, negative
-     zero and positive zero are different values (different bit patterns) but
-     they are defined to compare equal.  So, if the equality operator on floating-point data types were marked
-     <literal>HASHES</>, a negative zero and a positive zero would probably not be matched up
-     by a hash join, but they would be matched up by any other join process.
-    </para>
-
-    <para>
-     The bottom line is that you should probably only use <literal>HASHES</literal> for
-     equality operators that are (or could be) implemented by <function>memcmp()</function>.
+     Care should be exercised when preparing a hash function, because there
+     are machine-dependent ways in which it might fail to do the right thing.
+     For example, if your data type is a structure in which there may be
+     uninteresting pad bits, you can't simply pass the whole structure to
+     <function>hash_any</>.  (Unless you write your other operators and
+     functions to ensure that the unused bits are always zero, which is the
+     recommended strategy.)
+     Another example is that on machines that meet the <acronym>IEEE</>
+     floating-point standard, negative zero and positive zero are different
+     values (different bit patterns) but they are defined to compare equal.
+     If a float value might contain negative zero then extra steps are needed
+     to ensure it generates the same hash value as positive zero.
     </para>
 
     <note>
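
Illustrative note (not part of the commit): the revised xoper.sgml text names two pitfalls for a datatype-specific hash function, pad bytes in a structure and IEEE negative zero. The following is a minimal C sketch of how each might be handled. It is not PostgreSQL's implementation. All names here (demo_hash_bytes, demo_hash_float8, DemoKey, demo_key_init, demo_hash_key) are hypothetical, and demo_hash_bytes is a plain FNV-1a loop standing in for a byte-oriented hash such as hash_any, whose real signature is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a byte-oriented hash routine: 32-bit FNV-1a.
 * Like any hash over raw memory, it sees every byte it is handed,
 * including pad bytes and the sign bit of a floating-point zero.
 */
uint32_t
demo_hash_bytes(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    h = 2166136261u;
    size_t      i;

    assert(data != NULL);
    for (i = 0; i < len; i++)
    {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Hash a double so that negative zero and positive zero, which compare
 * equal but have different bit patterns, produce the same hash code.
 */
uint32_t
demo_hash_float8(double key)
{
    if (key == 0.0)             /* true for both +0.0 and -0.0 */
    {
        double      zero = 0.0; /* canonical positive zero */

        return demo_hash_bytes(&zero, sizeof(zero));
    }
    return demo_hash_bytes(&key, sizeof(key));
}

/* A structure that usually has alignment padding after "tag". */
typedef struct DemoKey
{
    char        tag;
    int32_t     value;
} DemoKey;

/*
 * Zero the whole structure before filling in the fields, so that any
 * pad bytes start out as zero rather than as leftover garbage.
 */
void
demo_key_init(DemoKey *k, char tag, int32_t value)
{
    memset(k, 0, sizeof(*k));
    k->tag = tag;
    k->value = value;
}

/* Hashing the whole structure relies on demo_key_init zeroing the padding. */
uint32_t
demo_hash_key(const DemoKey *k)
{
    return demo_hash_bytes(k, sizeof(*k));
}
```

With these definitions, demo_hash_float8(-0.0) and demo_hash_float8(0.0) return the same value, and two DemoKey values built by demo_key_init with the same fields are expected to hash alike regardless of what the memory held beforehand. Hashing each significant field separately and combining the results would avoid depending on the padding at all, at the cost of a slightly longer hash function.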