path: root/contrib/unaccent/unaccent.rules
Age | Commit message | Author
2024-07-05 | Add simple codepoint redirections to unaccent.rules. | Thomas Munro
Previously we searched for code points where the Unicode data file listed an equivalent combining character sequence that added accents. Some code points redirect to a single other code point instead of doing any combining. We can follow those references recursively to get the answer.

Per bug report #18362, which reported missing Ancient Greek characters. Specifically, precomposed characters with oxia (from the polytonic accent system used for old Greek) just point to precomposed characters with tonos (from the monotonic accent system for modern Greek), and we have to follow the extra hop to find out that they are composed with an acute accent.

Besides those, the new rule also:

* pulls in a lot of 'Mathematical Alphanumeric Symbols', which are copies of the Latin and Greek alphabets and numbers rendered in different typefaces, and

* corrects a single mathematical letter that previously came from the CLDR transliteration file but is now extracted from the main Unicode database file; the latter is clearly right and the former is wrong (reported to CLDR).

Reported-by: Cees van Zeeland <[email protected]>
Reviewed-by: Robert Haas <[email protected]>
Reviewed-by: Peter Eisentraut <[email protected]>
Reviewed-by: Michael Paquier <[email protected]>
Discussion: https://postgr.es/m/18362-be6d0cfe122b6354%40postgresql.org
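The "extra hop" described in this commit can be illustrated with Python's standard `unicodedata` module. This is only a minimal sketch, not the code in generate_unaccent_rules.py: the function name `base_letter` is hypothetical, and the sketch assumes that only canonical decompositions matter, so it ignores the tagged compatibility mappings (such as `<font>`) that the Mathematical Alphanumeric Symbols use.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Follow canonical decompositions until an undecomposable base remains.

    Hypothetical helper for illustration; it is not part of PostgreSQL.
    """
    decomp = unicodedata.decomposition(ch)
    # No decomposition, or a tagged compatibility mapping such as
    # "<font> 0041": this sketch leaves the character unchanged.
    if not decomp or decomp.startswith("<"):
        return ch
    parts = [chr(int(cp, 16)) for cp in decomp.split()]
    if len(parts) == 1:
        # Simple redirection to a single other code point: follow it.
        return base_letter(parts[0])
    # Base character followed by combining marks: recurse on the base,
    # which may itself be a precomposed character.
    if all(unicodedata.combining(p) for p in parts[1:]):
        return base_letter(parts[0])
    return ch


# U+1F71 (alpha with oxia) redirects to U+03AC (alpha with tonos),
# which in turn decomposes to U+03B1 (alpha) plus U+0301 (acute accent).
print(base_letter("\u1f71") == "\u03b1")
```

Without the single-code-point branch, the oxia form would be skipped because its decomposition contains no combining character at all.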
2023-09-20 | unaccent: Add support for quoted translated characters | Michael Paquier
As reported in bug #18057, the unaccent extension strips from its rule file whitespace characters that are intentionally specified when building unaccent.rules from UnicodeData.txt, causing an incorrect translation for some characters such as numeric symbols. This happens because all whitespace before and after the origin and target characters is discarded (this limitation is documented).

This commit makes it possible to use quotes around target characters, so that whitespace can be considered part of the target characters. Some target characters contain a double quote; these require an extra double quote.

The documentation is updated to show how to use quoted areas, generate_unaccent_rules.py is updated to generate unaccent.rules, and a couple of tests are added for numeric symbols.

While working on this patch, I implemented a fake rule file to test the parsing logic. It is not included here, as it would just consume extra cycles in the tests and it requires manipulating an installation tree to work correctly.

As this requires a change of format in unaccent.rules, it cannot be backpatched, unfortunately.

The idea to use double quotes as escape characters comes from Tom Lane.

Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/[email protected]
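The quoting rule described in this commit (a quoted target keeps its whitespace, and a literal double quote is written as two double quotes) can be sketched as follows. This is an illustration in Python, not the extension's actual parser: `parse_rule` is a hypothetical name, and the sketch assumes one rule per line with an unquoted source, an optional target, and whitespace separating the two.

```python
def parse_rule(line: str) -> tuple[str, str]:
    """Split one rule line into (source, target).

    Hypothetical helper for illustration. An absent target yields an empty
    string, meaning the source character is simply removed.
    """
    stripped = line.strip()
    if not stripped:
        raise ValueError("empty rule line")
    parts = stripped.split(None, 1)
    source = parts[0]
    if len(parts) == 1:
        return source, ""
    rest = parts[1]
    if not rest.startswith('"'):
        # Unquoted target: surrounding whitespace has already been discarded.
        return source, rest
    out = []
    i = 1  # skip the opening quote
    while i < len(rest):
        if rest[i] == '"':
            if i + 1 < len(rest) and rest[i + 1] == '"':
                out.append('"')  # doubled quote stands for one literal quote
                i += 2
                continue
            if rest[i + 1:].strip():
                raise ValueError("unexpected text after closing quote")
            return source, "".join(out)
        out.append(rest[i])
        i += 1
    raise ValueError("unterminated quoted target")


# The leading space inside the quotes survives parsing.
print(parse_rule('\u00bc " 1/4"') == ("\u00bc", " 1/4"))
```

Keeping the space matters for numeric symbols: with the quoted target, a fraction sign that directly follows a digit does not run into that digit once translated.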
2022-03-10 | Re-update Unicode data to CLDR 39 | Peter Eisentraut
Apparently, the previous update (2e0e0666790e48cec716d4947f89d067ef53490c) must have used a stale input file and missed a few additions that were made shortly before the CLDR release. Update this now so that the next update really only changes things new in that version.
2021-04-08 | Update Unicode data to CLDR 39 | Peter Eisentraut
2020-04-24 | Update Unicode data to Unicode 13.0.0 and CLDR 37 | Peter Eisentraut
2019-02-01 | Add combining characters to unaccent.rules. | Thomas Munro
Strip certain classes of combining characters, so that accents encoded this way are removed.

Author: Hugh Ranalli
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f%40postgresql.org
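To show what "accents encoded this way" means, here is a small Python sketch that drops combining marks from a string. It is an assumption-laden illustration: the commit does not say which classes of combining characters are stripped, so this sketch arbitrarily uses the Unicode general category Mn (nonspacing mark), and `strip_combining` is a hypothetical name.

```python
import unicodedata


def strip_combining(text: str) -> str:
    """Drop nonspacing combining marks (general category Mn).

    Hypothetical helper; the set of marks removed by unaccent.rules may differ.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "e" followed by U+0301 COMBINING ACUTE ACCENT renders like a precomposed
# e-acute, but it is two code points; removing the mark leaves a plain "e".
print(strip_combining("cafe\u0301") == "cafe")
```

A precomposed character such as U+00E9 passes through this function unchanged, which is why rules for standalone combining marks are needed in addition to rules for precomposed letters.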
2019-01-10 | Update unaccent rules with release 34 of CLDR for Latin-ASCII.xml | Michael Paquier
This required an update of the Python script generating the rules, as the format of Latin-ASCII.xml changed in release 29. This release also added new punctuation and symbols, and a new set of rules has been generated to include them. The way to find the newest versions of Latin-ASCII is also documented more clearly.

Author: Hugh Ranalli, Michael Paquier
Discussion: https://postgr.es/m/[email protected]
2018-09-01 | Add Greek characters to unaccent.rules. | Thomas Munro
Author: Tasos Maschalidis
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/153495048900.1368.11566580687623014380%40wrigleys.postgresql.org
Discussion: https://postgr.es/m/VI1PR01MB38537EBD529FE5EE3FE9A5FEB5370%40VI1PR01MB3853.eurprd01.prod.exchangelabs.com
2017-08-16 | Extend the default rules file for contrib/unaccent with Vietnamese letters. | Tom Lane
Improve generate_unaccent_rules.py to handle composed characters whose base is another composed character rather than a plain letter. The net effect of this is to add a bunch of multi-accented Vietnamese characters to unaccent.rules.

Original complaint from Kha Nguyen, diagnosis of the script's shortcoming by Thomas Munro.

Dang Minh Huong and Michael Paquier

Discussion: https://postgr.es/m/CALo3sF6EC8cy1F2JUz=GRf5h4LMUJTaG3qpdoiLrNbWEXL-tRg@mail.gmail.com
2016-03-16 | Improve script generating unaccent rules | Teodor Sigaev
The script now uses the standard Unicode transliterator Latin-ASCII.

Author: Leonard Benedetti
2015-09-04 | Make unaccent handle all diacritics known to Unicode, and expand ligatures correctly | Teodor Sigaev
Add a Python script for building unaccent.rules from Unicode data. Don't backpatch because unaccent changes may require a tsvector/index rebuild.

Thomas Munro <[email protected]>
2009-08-18 | Unaccent dictionary. | Teodor Sigaev