Age | Commit message (Collapse) | Author |
|
Previously we searched for code points where the Unicode data file
listed an equivalent combining character sequence that added accents.
Some codepoints redirect to a single other codepoint, instead of doing
any combining. We can follow those references recursively to get the
answer.
Per bug report #18362, which reported missing Ancient Greek characters.
Specifically, precomposed characters with oxia (from the polytonic
accent system used for old Greek) just point to precomposed characters
with tonos (from the monotonic accent system for modern Greek), and we
have to follow the extra hop to find out that they are composed with
an acute accent.
Besides those, the new rule also:
* pulls in a lot of 'Mathematical Alphanumeric Symbols', which are
copies of the Latin and Greek alphabets and numbers rendered
in different typefaces, and
* corrects a single mathematical letter that previously came from the
CLDR transliteration file, but the new rule extracts from the main
Unicode database file, where clearly the latter is right and the
former is a wrong (reported to CLDR).
Reported-by: Cees van Zeeland <[email protected]>
Reviewed-by: Robert Haas <[email protected]>
Reviewed-by: Peter Eisentraut <[email protected]>
Reviewed-by: Michael Paquier <[email protected]>
Discussion: https://postgr.es/m/18362-be6d0cfe122b6354%40postgresql.org
|
|
As reported in bug #18057, the extension unaccent removes in its rule
file whitespace characters that are intentionally specified when
building unaccent.rules from UnicodeData.txt, causing an incorrect
translation for some characters like numeric symbols. This is caused by
the fact that all whitespaces before and after the origin and target
characters are all discarded (this limitation is documented).
This commit makes possible the use of quotes around target characters,
so as whitespaces can be considered part of target characters. Some
target characters use a double quote, these require an extra double
quote.
The documentation is updated to show how to use quoted areas,
generate_unaccent_rules.py is updated to generate unaccent.rules and a
couple of tests are added for numeric symbols. While working on this
patch, I have implemented a fake rule file to test the parsing logic
implemented, which is not included here as it would just consume extra
cycles in the tests, and it requires the manipulation of an installation
tree to be able to work correctly.
As this requires a change of format in unaccent.rules, this cannot be
backpatched, unfortunately. The idea to use double quotes as escaped
characters comes from Tom Lane.
Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/[email protected]
|
|
Apparently, the previous update
(2e0e0666790e48cec716d4947f89d067ef53490c) must have used a stale
input file and missed a few additions that were added shortly before
the CLDR release. Update this now so that the next update really only
changes things new in that version.
|
|
|
|
|
|
Strip certain classes of combining characters, so that accents encoded
this way are removed.
Author: Hugh Ranalli
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f%40postgresql.org
|
|
This has required an update of the python script generating the rules,
as its format has changed in release 29. This release has also added
new punctuation and symbols, and a new set of rules has been generated
to include them. The way to find newest versions of Latin-ASCII gets
also more clearly documented.
Author: Hugh Ranalli, Michael Paquier
Discussion: https://postgr.es/m/[email protected]
|
|
Author: Tasos Maschalidis
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/153495048900.1368.11566580687623014380%40wrigleys.postgresql.org
Discussion: https://postgr.es/m/VI1PR01MB38537EBD529FE5EE3FE9A5FEB5370%40VI1PR01MB3853.eurprd01.prod.exchangelabs.com
|
|
Improve generate_unaccent_rules.py to handle composed characters whose base
is another composed character rather than a plain letter. The net effect
of this is to add a bunch of multi-accented Vietnamese characters to
unaccent.rules.
Original complaint from Kha Nguyen, diagnosis of the script's shortcoming
by Thomas Munro.
Dang Minh Huong and Michael Paquier
Discussion: https://postgr.es/m/CALo3sF6EC8cy1F2JUz=GRf5h4LMUJTaG3qpdoiLrNbWEXL-tRg@mail.gmail.com
|
|
Script now use the standard Unicode transliterator Latin-ASCII.
Author: Leonard Benedetti
|
|
correctly
Add Python script for buiding unaccent.rules from Unicode data. Don't
backpatch because unaccent changes may require tsvector/index
rebuild.
Thomas Munro <[email protected]>
|
|
|