author Tom Lane 2010-08-25 21:42:55 +0000
committer Tom Lane 2010-08-25 21:42:55 +0000
commit 9389ac8928866eb4ab19b2f3892531e798e34f24
tree c2dccdc27600682949e47fefcea0c2d96afbd6f0 /doc/src
parent acac35adca6e039e2288c5253079b128c1742b5e
Document filtering dictionaries in textsearch.sgml.
While at it, copy-edit the description of prefix-match marker support in synonym dictionaries, and clarify the description of the default unaccent dictionary a bit more.
Diffstat (limited to 'doc/src')
-rw-r--r-- doc/src/sgml/textsearch.sgml 126
-rw-r--r-- doc/src/sgml/unaccent.sgml 8
2 files changed, 79 insertions, 55 deletions
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index fb7f2050917..60fac102df7 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.58 2010/08/20 13:59:45 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.59 2010/08/25 21:42:55 tgl Exp $ -->
<chapter id="textsearch">
<title>Full Text Search</title>
@@ -112,7 +112,7 @@
as a sorted array of normalized lexemes. Along with the lexemes it is
often desirable to store positional information to use for
<firstterm>proximity ranking</firstterm>, so that a document that
- contains a more <quote>dense</> region of query words is
+ contains a more <quote>dense</> region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
@@ -1151,13 +1151,13 @@ MaxFragments=0, FragmentDelimiter=" ... "
<screen>
SELECT ts_headline('english',
'The most common type of search
-is to find all documents containing given query terms
+is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query &amp; similarity'));
ts_headline
------------------------------------------------------------
- containing given &lt;b&gt;query&lt;/b&gt; terms
+ containing given &lt;b&gt;query&lt;/b&gt; terms
and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the
&lt;b&gt;query&lt;/b&gt;.
@@ -1166,7 +1166,7 @@ SELECT ts_headline('english',
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
- to_tsquery('query &amp; similarity'),
+ to_tsquery('query &amp; similarity'),
'StartSel = &lt;, StopSel = &gt;');
ts_headline
-------------------------------------------------------
@@ -2066,6 +2066,14 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
</listitem>
<listitem>
<para>
+ a single lexeme with the <literal>TSL_FILTER</> flag set, to replace
+ the original token with a new token to be passed to subsequent
+ dictionaries (a dictionary that does this is called a
+ <firstterm>filtering dictionary</>)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
an empty array if the dictionary knows the token, but it is a stop word
</para>
</listitem>
@@ -2096,6 +2104,13 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for.
+ Normally, the first dictionary that returns a non-<literal>NULL</>
+ output determines the result, and any remaining dictionaries are not
+ consulted; but a filtering dictionary can replace the given word
+ with a modified word, which is then passed to subsequent dictionaries.
+ </para>
+
+ <para>
The general rule for configuring a list of dictionaries
is to place first the most narrow, most specific dictionary, then the more
general dictionaries, finishing with a very general dictionary, like
@@ -2112,6 +2127,16 @@ ALTER TEXT SEARCH CONFIGURATION astro_en
</programlisting>
</para>
+ <para>
+ A filtering dictionary can be placed anywhere in the list, except at the
+ end where it'd be useless. Filtering dictionaries are useful to partially
+ normalize words to simplify the task of later dictionaries. For example,
+ a filtering dictionary could be used to remove accents from accented
+ letters, as is done by the
+ <link linkend="unaccent"><filename>contrib/unaccent</></link>
+ extension module.
+ </para>
+
<sect2 id="textsearch-stopwords">
<title>Stop Words</title>
@@ -2184,7 +2209,7 @@ CREATE TEXT SEARCH DICTIONARY public.simple_dict (
Here, <literal>english</literal> is the base name of a file of stop words.
The file's full name will be
<filename>$SHAREDIR/tsearch_data/english.stop</>,
- where <literal>$SHAREDIR</> means the
+ where <literal>$SHAREDIR</> means the
<productname>PostgreSQL</productname> installation's shared-data directory,
often <filename>/usr/local/share/postgresql</> (use <command>pg_config
--sharedir</> to determine it if you're not sure).
@@ -2295,17 +2320,39 @@ SELECT * FROM ts_debug('english', 'Paris');
asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
</screen>
</para>
-
+
<para>
- An asterisk (<literal>*</literal>) at the end of definition word indicates
- that definition word is a prefix, and <function>to_tsquery()</function>
- function will transform that definition to the prefix search format (see
- <xref linkend="textsearch-parsing-queries">).
- Notice that it is ignored in <function>to_tsvector()</function>.
+ The only parameter required by the <literal>synonym</> template is
+ <literal>SYNONYMS</>, which is the base name of its configuration file
+ &mdash; <literal>my_synonyms</> in the above example.
+ The file's full name will be
+ <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
+ (where <literal>$SHAREDIR</> means the
+ <productname>PostgreSQL</> installation's shared-data directory).
+ The file format is just one line
+ per word to be substituted, with the word followed by its synonym,
+ separated by white space. Blank lines and trailing spaces are ignored.
+ </para>
+
+ <para>
+ The <literal>synonym</> template also has an optional parameter
+ <literal>CaseSensitive</>, which defaults to <literal>false</>. When
+ <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
+ are folded to lower case, as are input tokens. When it is
+ <literal>true</>, words and tokens are not folded to lower case,
+ but are compared as-is.
</para>
<para>
- Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
+ An asterisk (<literal>*</literal>) can be placed at the end of a synonym
+ in the configuration file. This indicates that the synonym is a prefix.
+ The asterisk is ignored when the entry is used in
+ <function>to_tsvector()</function>, but when it is used in
+ <function>to_tsquery()</function>, the result will be a query item with
+ the prefix match marker (see
+ <xref linkend="textsearch-parsing-queries">).
+ For example, suppose we have these entries in
+ <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
<programlisting>
postgres pgsql
postgresql pgsql
@@ -2313,67 +2360,42 @@ postgre pgsql
gogle googl
indices index*
</programlisting>
- </para>
-
- <para>
- Results:
+ Then we will get these results:
<screen>
-=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
-=# SELECT ts_lexize('syn','indices');
+mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
+mydb=# SELECT ts_lexize('syn','indices');
ts_lexize
-----------
{index}
(1 row)
-=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
-=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
-=# SELECT to_tsquery('tst','indices');
+mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
+mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
+mydb=# SELECT to_tsvector('tst','indices');
+ to_tsvector
+-------------
+ 'index':1
+(1 row)
+
+mydb=# SELECT to_tsquery('tst','indices');
to_tsquery
------------
'index':*
(1 row)
-=# SELECT 'indexes are very useful'::tsvector;
+mydb=# SELECT 'indexes are very useful'::tsvector;
tsvector
---------------------------------
'are' 'indexes' 'useful' 'very'
(1 row)
-=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
+mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
?column?
----------
t
(1 row)
-
-=# SELECT to_tsvector('tst','indices');
- to_tsvector
--------------
- 'index':1
-(1 row)
</screen>
</para>
-
- <para>
- The only parameter required by the <literal>synonym</> template is
- <literal>SYNONYMS</>, which is the base name of its configuration file
- &mdash; <literal>my_synonyms</> in the above example.
- The file's full name will be
- <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
- (where <literal>$SHAREDIR</> means the
- <productname>PostgreSQL</> installation's shared-data directory).
- The file format is just one line
- per word to be substituted, with the word followed by its synonym,
- separated by white space. Blank lines and trailing spaces are ignored.
- </para>
-
- <para>
- The <literal>synonym</> template also has an optional parameter
- <literal>CaseSensitive</>, which defaults to <literal>false</>. When
- <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
- are folded to lower case, as are input tokens. When it is
- <literal>true</>, words and tokens are not folded to lower case,
- but are compared as-is.
- </para>
</sect2>
<sect2 id="textsearch-thesaurus">
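
The dictionary-chain rule documented in the textsearch.sgml hunks above can be sketched as a small Python model. This is only an illustration under stated assumptions, not PostgreSQL's implementation: the names `unaccent_like`, `stop_words`, `simple_like`, and `lexize_chain` are hypothetical, the boolean in each result stands in for the `TSL_FILTER` flag, and accent stripping is approximated with Unicode decomposition rather than a rules file.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple

# Model of a dictionary's answer for one token:
#   None            -> the dictionary does not recognize the token
#   ([], False)     -> the token is a known stop word
#   ([lex], False)  -> normal result; the chain stops here
#   ([tok], True)   -> stand-in for TSL_FILTER: replace the token, keep going
Result = Optional[Tuple[List[str], bool]]
Dictionary = Callable[[str], Result]


def unaccent_like(token: str) -> Result:
    """Hypothetical filtering dictionary that strips combining accents."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped == token:
        return None  # nothing changed; leave the token to later dictionaries
    return ([stripped], True)


def stop_words(token: str) -> Result:
    """Hypothetical stop-word dictionary with a two-word list."""
    return ([], False) if token.lower() in {"de", "la"} else None


def simple_like(token: str) -> Result:
    """Hypothetical catch-all dictionary: lower-cases every token."""
    return ([token.lower()], False)


def lexize_chain(token: str, dictionaries: List[Dictionary]) -> List[str]:
    """Consult dictionaries in order, as the documented rule describes."""
    for dictionary in dictionaries:
        result = dictionary(token)
        if result is None:
            continue  # unknown here; try the next dictionary
        lexemes, is_filter = result
        if is_filter and lexemes:
            token = lexemes[0]  # filtered token goes to later dictionaries
            continue
        return lexemes  # first non-filter answer determines the result
    # Assumption of this model: a token nobody resolves is discarded.
    return []


chain = [unaccent_like, stop_words, simple_like]
print(lexize_chain("Hôtel", chain))
print(lexize_chain("la", chain))
```

Placing `unaccent_like` after `simple_like` in this model means it is never consulted, because `simple_like` answers for every token. That mirrors the advice in the patch that a filtering dictionary is useless at the end of the list.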
diff --git a/doc/src/sgml/unaccent.sgml b/doc/src/sgml/unaccent.sgml
index 6c73c3f2986..135fcdb6dc6 100644
--- a/doc/src/sgml/unaccent.sgml
+++ b/doc/src/sgml/unaccent.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.6 2010/08/25 02:12:00 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.7 2010/08/25 21:42:55 tgl Exp $ -->
<sect1 id="unaccent">
<title>unaccent</title>
@@ -75,8 +75,10 @@
<para>
Running the installation script <filename>unaccent.sql</> creates a text
search template <literal>unaccent</> and a dictionary <literal>unaccent</>
- based on it, with default parameters. You can alter the
- parameters, for example
+ based on it. The <literal>unaccent</> dictionary has the default
+ parameter setting <literal>RULES='unaccent'</>, which makes it immediately
+ usable with the standard <filename>unaccent.rules</> file.
+ If you wish, you can alter the parameter, for example
<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');