<sect1 id="unaccent">
<title>unaccent</title>
<indexterm zone="unaccent">
<primary>unaccent</primary>
</indexterm>
<para>
<filename>unaccent</> removes accents (diacritic signs) from a lexeme.
It is a filtering dictionary, which means that its output is
always passed to the next dictionary (if any), unlike the standard
behavior of dictionaries. Currently, it supports the most important
accents from European languages.
</para>
<para>
Limitation: The current implementation of the <filename>unaccent</>
dictionary cannot be used as a normalizing dictionary for the
<filename>thesaurus</filename> dictionary.
</para>
<sect2>
<title>Configuration</title>
<para>
An <literal>unaccent</> dictionary accepts the following options:
</para>
<itemizedlist>
<listitem>
<para>
<literal>RULES</> is the base name of the file containing the list of
translation rules. This file must be stored in
<filename>$SHAREDIR/tsearch_data/</> (where <literal>$SHAREDIR</> means
the <productname>PostgreSQL</> installation's shared-data directory).
Its name must end in <literal>.rules</> (which is not to be included in
the <literal>RULES</> parameter).
</para>
</listitem>
</itemizedlist>
<para>
The rules file has the following format:
</para>
<itemizedlist>
<listitem>
<para>
Each line represents a pair: a character with an accent, followed by the
corresponding character without an accent.
<programlisting>
À A
Á A
Â A
Ã A
Ä A
Å A
Æ A
</programlisting>
</para>
</listitem>
</itemizedlist>
<para>
Look at <filename>unaccent.rules</>, which is installed in
<filename>$SHAREDIR/tsearch_data/</>, for an example.
</para>
</sect2>
<sect2>
<title>Usage</title>
<para>
Running the installation script creates a text search template
<literal>unaccent</> and a dictionary <literal>unaccent</>
based on it, with default parameters. You can alter the
parameters, for example
<programlisting>
=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>
or create new dictionaries based on the template.
</para>
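<para>
To create a new dictionary based on the template instead of altering the
default one, <command>CREATE TEXT SEARCH DICTIONARY</command> can be used.
The following is a minimal sketch: the names <literal>my_unaccent</> and
<literal>my_rules</> are placeholders chosen for illustration, and the
example assumes that a file <filename>my_rules.rules</> exists in
<filename>$SHAREDIR/tsearch_data/</>.
<programlisting>
-- my_unaccent and my_rules are hypothetical names
=# CREATE TEXT SEARCH DICTIONARY my_unaccent (TEMPLATE = unaccent, RULES = 'my_rules');
</programlisting>
This leaves the default <literal>unaccent</> dictionary unchanged, which
may be preferable when other configurations already depend on it.
</para>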
<para>
To test the dictionary, you can try:
<programlisting>
=# select ts_lexize('unaccent','Hôtel');
ts_lexize
-----------
{Hotel}
(1 row)
</programlisting>
</para>
<para>
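Filtering dictionaries are useful for making the
<function>ts_headline</function> function work correctly. The following
example places <literal>unaccent</> ahead of <literal>french_stem</> in a
text search configuration, so that accented words in the document match an
unaccented query and are highlighted in the headline: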
<programlisting>
=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
=# ALTER TEXT SEARCH CONFIGURATION fr
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, french_stem;
=# select to_tsvector('fr','Hôtels de la Mer');
to_tsvector
-------------------
'hotel':1 'mer':4
(1 row)
=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
?column?
----------
t
(1 row)
=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
ts_headline
------------------------
&lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
</para>
</sect2>
<sect2>
<title>Function</title>
<para>
The <function>unaccent</> function removes accents (diacritic signs) from
the argument string. Basically, it is a wrapper around the
<filename>unaccent</> dictionary.
</para>
<indexterm>
<primary>unaccent</primary>
</indexterm>
<synopsis>
unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>,
</optional> <replaceable class="PARAMETER">string</replaceable>)
returns <type>text</type>
</synopsis>
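<para>
If the <replaceable>dictionary</replaceable> argument is omitted, the
dictionary named <literal>unaccent</> is used, so with the default
installation the following two calls are equivalent: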
<programlisting>
SELECT unaccent('unaccent','Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
</para>
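<para>
Because <function>unaccent</> can be called like any other SQL function, it
may also be used for accent-insensitive comparisons outside of text search.
The following is only a sketch: the table <literal>hotels</> and its column
<literal>name</> are hypothetical, and which rows match depends on the rules
file in use.
<programlisting>
-- hotels and name are hypothetical; both sides are unaccented before comparing
=# SELECT name FROM hotels WHERE unaccent(name) = unaccent('Hôtel de la Mer');
</programlisting>
</para>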
</sect2>
</sect1>