author | Tom Lane | 2010-08-25 21:42:55 +0000 |
---|---|---|
committer | Tom Lane | 2010-08-25 21:42:55 +0000 |
commit | 9389ac8928866eb4ab19b2f3892531e798e34f24 (patch) | |
tree | c2dccdc27600682949e47fefcea0c2d96afbd6f0 /doc/src | |
parent | acac35adca6e039e2288c5253079b128c1742b5e (diff) |
Document filtering dictionaries in textsearch.sgml.
While at it, copy-edit the description of prefix-match marker support in
synonym dictionaries, and clarify the description of the default unaccent
dictionary a bit more.
Diffstat (limited to 'doc/src')
-rw-r--r-- | doc/src/sgml/textsearch.sgml | 126 |
-rw-r--r-- | doc/src/sgml/unaccent.sgml | 8 |
2 files changed, 79 insertions, 55 deletions
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index fb7f2050917..60fac102df7 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.58 2010/08/20 13:59:45 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.59 2010/08/25 21:42:55 tgl Exp $ -->
 
 <chapter id="textsearch">
  <title>Full Text Search</title>
@@ -112,7 +112,7 @@
       as a sorted array of normalized lexemes. Along with the lexemes it is
       often desirable to store positional information to use for
       <firstterm>proximity ranking</firstterm>, so that a document that
-      contains a more <quote>dense</> region of query words is 
+      contains a more <quote>dense</> region of query words is
       assigned a higher rank than one with scattered query words.
      </para>
     </listitem>
@@ -1151,13 +1151,13 @@ MaxFragments=0, FragmentDelimiter=" ... "
 <screen>
 SELECT ts_headline('english',
   'The most common type of search
-is to find all documents containing given query terms 
+is to find all documents containing given query terms
 and return them in order of their similarity to the
 query.',
   to_tsquery('query & similarity'));
                         ts_headline
 ------------------------------------------------------------
- containing given <b>query</b> terms 
+ containing given <b>query</b> terms
  and return them in order of their <b>similarity</b> to the
  <b>query</b>.
 
@@ -1166,7 +1166,7 @@ SELECT ts_headline('english',
 is to find all documents containing given query terms
 and return them in order of their similarity to the
 query.',
-  to_tsquery('query & similarity'), 
+  to_tsquery('query & similarity'),
   'StartSel = <, StopSel = >');
                       ts_headline
 -------------------------------------------------------
@@ -2066,6 +2066,14 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
     </listitem>
     <listitem>
      <para>
+      a single lexeme with the <literal>TSL_FILTER</> flag set, to replace
+      the original token with a new token to be passed to subsequent
+      dictionaries (a dictionary that does this is called a
+      <firstterm>filtering dictionary</>)
+     </para>
+    </listitem>
+    <listitem>
+     <para>
       an empty array if the dictionary knows the token, but it is a stop word
      </para>
     </listitem>
@@ -2096,6 +2104,13 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
    until some dictionary recognizes it as a known word. If it is identified
    as a stop word, or if no dictionary recognizes the token, it will be
    discarded and not indexed or searched for.
+   Normally, the first dictionary that returns a non-<literal>NULL</>
+   output determines the result, and any remaining dictionaries are not
+   consulted; but a filtering dictionary can replace the given word
+   with a modified word, which is then passed to subsequent dictionaries.
+  </para>
+
+  <para>
    The general rule for configuring a list of dictionaries
    is to place first the most narrow, most specific dictionary, then the more
    general dictionaries, finishing with a very general dictionary, like
@@ -2112,6 +2127,16 @@ ALTER TEXT SEARCH CONFIGURATION astro_en
 </programlisting>
   </para>
 
+  <para>
+   A filtering dictionary can be placed anywhere in the list, except at the
+   end where it'd be useless. Filtering dictionaries are useful to partially
+   normalize words to simplify the task of later dictionaries. For example,
+   a filtering dictionary could be used to remove accents from accented
+   letters, as is done by the
+   <link linkend="unaccent"><filename>contrib/unaccent</></link>
+   extension module.
+  </para>
+
   <sect2 id="textsearch-stopwords">
    <title>Stop Words</title>
 
@@ -2184,7 +2209,7 @@ CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    Here, <literal>english</literal> is the base name of a file of stop words.
    The file's full name will be
    <filename>$SHAREDIR/tsearch_data/english.stop</>,
-   where <literal>$SHAREDIR</> means the 
+   where <literal>$SHAREDIR</> means the
    <productname>PostgreSQL</productname> installation's shared-data directory,
    often <filename>/usr/local/share/postgresql</> (use <command>pg_config
    --sharedir</> to determine it if you're not sure).
@@ -2295,17 +2320,39 @@ SELECT * FROM ts_debug('english', 'Paris');
  asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
 </screen>
   </para>
-  
+
   <para>
-   An asterisk (<literal>*</literal>) at the end of definition word indicates
-   that definition word is a prefix, and <function>to_tsquery()</function>
-   function will transform that definition to the prefix search format (see
-   <xref linkend="textsearch-parsing-queries">).
-   Notice that it is ignored in <function>to_tsvector()</function>.
+   The only parameter required by the <literal>synonym</> template is
+   <literal>SYNONYMS</>, which is the base name of its configuration file
+   &mdash; <literal>my_synonyms</> in the above example.
+   The file's full name will be
+   <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
+   (where <literal>$SHAREDIR</> means the
+   <productname>PostgreSQL</> installation's shared-data directory).
+   The file format is just one line
+   per word to be substituted, with the word followed by its synonym,
+   separated by white space. Blank lines and trailing spaces are ignored.
+  </para>
+
+  <para>
+   The <literal>synonym</> template also has an optional parameter
+   <literal>CaseSensitive</>, which defaults to <literal>false</>. When
+   <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
+   are folded to lower case, as are input tokens. When it is
+   <literal>true</>, words and tokens are not folded to lower case,
+   but are compared as-is.
   </para>
 
   <para>
-   Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
+   An asterisk (<literal>*</literal>) can be placed at the end of a synonym
+   in the configuration file. This indicates that the synonym is a prefix.
+   The asterisk is ignored when the entry is used in
+   <function>to_tsvector()</function>, but when it is used in
+   <function>to_tsquery()</function>, the result will be a query item with
+   the prefix match marker (see
+   <xref linkend="textsearch-parsing-queries">).
+   For example, suppose we have these entries in
+   <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
 <programlisting>
 postgres        pgsql
 postgresql      pgsql
@@ -2313,67 +2360,42 @@ postgre pgsql
 gogle   googl
 indices index*
 </programlisting>
-  </para>
-
-  <para>
-   Results:
+   Then we will get these results:
 <screen>
-=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
-=# SELECT ts_lexize('syn','indices');
+mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
+mydb=# SELECT ts_lexize('syn','indices');
  ts_lexize
 -----------
  {index}
 (1 row)
 
-=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
-=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
-=# SELECT to_tsquery('tst','indices');
+mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
+mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
+mydb=# SELECT to_tsvector('tst','indices');
+ to_tsvector
+-------------
+ 'index':1
+(1 row)
+
+mydb=# SELECT to_tsquery('tst','indices');
  to_tsquery
 ------------
  'index':*
 (1 row)
 
-=# SELECT 'indexes are very useful'::tsvector;
+mydb=# SELECT 'indexes are very useful'::tsvector;
             tsvector
 ---------------------------------
  'are' 'indexes' 'useful' 'very'
 (1 row)
 
-=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
+mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
  ?column?
 ----------
  t
 (1 row)
-
-=# SELECT to_tsvector('tst','indices');
- to_tsvector
--------------
- 'index':1
-(1 row)
 </screen>
   </para>
-
-  <para>
-   The only parameter required by the <literal>synonym</> template is
-   <literal>SYNONYMS</>, which is the base name of its configuration file
-   &mdash; <literal>my_synonyms</> in the above example.
-   The file's full name will be
-   <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
-   (where <literal>$SHAREDIR</> means the
-   <productname>PostgreSQL</> installation's shared-data directory).
-   The file format is just one line
-   per word to be substituted, with the word followed by its synonym,
-   separated by white space. Blank lines and trailing spaces are ignored.
-  </para>
-
-  <para>
-   The <literal>synonym</> template also has an optional parameter
-   <literal>CaseSensitive</>, which defaults to <literal>false</>. When
-   <literal>CaseSensitive</> is <literal>false</>, words in the synonym file
-   are folded to lower case, as are input tokens. When it is
-   <literal>true</>, words and tokens are not folded to lower case,
-   but are compared as-is.
-  </para>
  </sect2>
 
  <sect2 id="textsearch-thesaurus">
diff --git a/doc/src/sgml/unaccent.sgml b/doc/src/sgml/unaccent.sgml
index 6c73c3f2986..135fcdb6dc6 100644
--- a/doc/src/sgml/unaccent.sgml
+++ b/doc/src/sgml/unaccent.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.6 2010/08/25 02:12:00 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.7 2010/08/25 21:42:55 tgl Exp $ -->
 
 <sect1 id="unaccent">
  <title>unaccent</title>
@@ -75,8 +75,10 @@
   <para>
    Running the installation script <filename>unaccent.sql</> creates a text
    search template <literal>unaccent</> and a dictionary <literal>unaccent</>
-   based on it, with default parameters. You can alter the
-   parameters, for example
+   based on it. The <literal>unaccent</> dictionary has the default
+   parameter setting <literal>RULES='unaccent'</>, which makes it immediately
+   usable with the standard <filename>unaccent.rules</> file.
+   If you wish, you can alter the parameter, for example
 
 <programlisting>
 mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
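
Note on usage: the filtering-dictionary behavior documented by this patch can be tried with a configuration along the following lines. This is a minimal sketch, not part of the commit. It assumes the unaccent dictionary is already installed in the current database (via the unaccent.sql script mentioned in the diff), and the configuration name fr is only an illustrative choice.

```sql
-- Assumption: the "unaccent" dictionary already exists in this database.
-- "fr" is a hypothetical configuration name used only for this sketch.
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

-- List the filtering dictionary first, so that its accent-stripped output
-- is handed to the stemmer that follows it. Placed last, it would have
-- no later dictionary to feed.
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

-- Accented and unaccented spellings should now normalize to the same lexemes.
SELECT to_tsvector('fr', 'Hôtels de la Mer');
SELECT to_tsvector('fr', 'Hôtels de la Mer') @@ to_tsquery('fr', 'Hotels');
```

Because unaccent returns its result with TSL_FILTER set, french_stem sees the unaccented word rather than the original token, which is the chaining the new paragraphs describe.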