<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.7 2010/08/25 21:42:55 tgl Exp $ -->

<sect1 id="unaccent">
 <title>unaccent</title>

 <indexterm zone="unaccent">
  <primary>unaccent</primary>
 </indexterm>

 <para>
  <filename>unaccent</> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It is a filtering dictionary: unlike a normal dictionary, which ends
  processing of a token once it recognizes it, <filename>unaccent</> always
  passes its output to the next dictionary (if any).  This allows
  accent-insensitive processing for full text search.
 </para>

 <para>
  The current implementation of <filename>unaccent</> cannot be used as a
  normalizing dictionary for the <filename>thesaurus</filename> dictionary.
 </para>

 <sect2>
  <title>Configuration</title>

  <para>
   An <literal>unaccent</> dictionary accepts the following options:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     <literal>RULES</> is the base name of the file containing the list of
     translation rules.  This file must be stored in
     <filename>$SHAREDIR/tsearch_data/</> (where <literal>$SHAREDIR</> means
     the <productname>PostgreSQL</> installation's shared-data directory).
     Its name must end in <literal>.rules</> (this extension is not included
     in the <literal>RULES</> parameter).
    </para>
   </listitem>
  </itemizedlist>
  <para>
   The rules file has the following format:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Each line represents a pair, consisting of an accented character
     followed by the unaccented character that replaces it.  For example,
<programlisting>
&Agrave;        A
&Aacute;        A
&Acirc;        A
&Atilde;        A
&Auml;        A
&Aring;        A
&AElig;        A
</programlisting>
    </para>
   </listitem>
  </itemizedlist>

  <para>
   A more complete example, which is directly useful for most European
   languages, can be found in <filename>unaccent.rules</>, which is installed
   in <filename>$SHAREDIR/tsearch_data/</> when the <filename>unaccent</>
   module is installed.
  </para>
 </sect2>

 <sect2>
  <title>Usage</title>

  <para>
   Running the installation script <filename>unaccent.sql</> creates a text
   search template <literal>unaccent</> and a dictionary <literal>unaccent</>
   based on it.  The <literal>unaccent</> dictionary has the default
   parameter setting <literal>RULES='unaccent'</>, which makes it immediately
   usable with the standard <filename>unaccent.rules</> file.
   If you wish, you can alter the parameter, for example

<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>

   or create new dictionaries based on the template.
  </para>
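
  <para>
   As a minimal sketch of creating a new dictionary based on the template,
   the following uses <command>CREATE TEXT SEARCH DICTIONARY</>.  The names
   <literal>my_unaccent</> and <literal>my_rules</> are hypothetical, and the
   command assumes that a file <filename>my_rules.rules</> has already been
   placed in <filename>$SHAREDIR/tsearch_data/</>:
<programlisting>
-- my_unaccent and my_rules are hypothetical names used for illustration
mydb=# CREATE TEXT SEARCH DICTIONARY my_unaccent
        ( TEMPLATE = unaccent, RULES = 'my_rules' );
</programlisting>
   Unlike altering the parameter, this leaves the default
   <literal>unaccent</> dictionary unchanged.
  </para>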

  <para>
   To test the dictionary, you can try:
<programlisting>
mydb=# SELECT ts_lexize('unaccent','H&ocirc;tel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
  </para>

  <para>
   Here is an example showing how to insert the
   <filename>unaccent</> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# SELECT to_tsvector('fr','H&ocirc;tels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# SELECT to_tsvector('fr','H&ocirc;tel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# SELECT ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;H&ocirc;tel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
  </para>
 </sect2>

 <sect2>
 <title>Functions</title>

 <para>
  The <function>unaccent()</> function removes accents (diacritic signs) from
  a given string.  It is essentially a wrapper around the
  <filename>unaccent</> dictionary, but it can be used outside normal
  text search contexts.
 </para>

 <indexterm>
  <primary>unaccent</primary>
 </indexterm>

<synopsis>
unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>, </optional> <replaceable class="PARAMETER">string</replaceable>) returns <type>text</type>
</synopsis>

 <para>
  For example:
<programlisting>
SELECT unaccent('unaccent', 'H&ocirc;tel');
SELECT unaccent('H&ocirc;tel');
</programlisting>
 </para>
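
 <para>
  Because the function returns ordinary <type>text</type>, it can also be
  applied to both sides of a comparison so that accents are ignored.  The
  following is only a minimal sketch; it assumes that the one-argument form
  resolves to the <literal>unaccent</> dictionary created by the
  installation script and that the standard
  <filename>unaccent.rules</> file is in use:
<programlisting>
-- Both operands are reduced to their unaccented form before comparison
SELECT unaccent('H&ocirc;tel') = unaccent('Hotel');
</programlisting>
 </para>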
 </sect2>

</sect1>