BUG #15548: Unaccent does not remove combining diacritical characters

Lists: pgsql-bugs pgsql-hackers
From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: hugh(at)whtc(dot)ca
Subject: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-12 20:00:45
Message-ID: [email protected]

The following bug has been logged on the website:

Bug reference: 15548
Logged by: Hugh Ranalli
Email address: hugh(at)whtc(dot)ca
PostgreSQL version: 11.1
Operating system: Ubuntu 18.04
Description:

Apparently Unicode has two ways of accenting a character: as a separate code
point, which represents the base character and the accent, or as a
"combining diacritical mark"
(https://en.wikipedia.org/wiki/Combining_Diacritical_Marks) in which case
the mark applies itself to the preceding character. For example, A followed
by U+0300 displays À. However, unaccent is not removing these accents.
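
The two representations can be made concrete with a short Python sketch. It is only an illustration that uses the standard unicodedata module, and is not part of the original report; the variable names are chosen here for the example.

```python
import unicodedata

precomposed = "\u00c0"  # À as a single code point
decomposed = "A\u0300"  # A followed by COMBINING GRAVE ACCENT

# The two strings render the same but are different code point sequences.
print(precomposed == decomposed)  # False

# Normalization converts between the composed and decomposed forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# A non-zero combining class marks U+0300 as a combining character.
print(unicodedata.combining("\u0300") != 0)  # True
```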

SELECT unaccent(U&'A\0300'); should result in 'A', but instead results in
'À'. I'm running PostgreSQL 11.1, installed from the PostgreSQL
repositories. I've read bug report #13440, and have tried with both the
installed unaccent.rules as well as a new set generated by the
generate_unaccent_rules.py distributed with the 11.1 source code:
wget http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt
wget https://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml
python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I see there have been some updates to generate_unaccent_rules.py to handle
Greek and Vietnamese characters, but neither of them seems to address this
issue. I'm happy to contribute a patch to handle these cases, but of course
wanted to make sure this is desired behaviour, or if I am misunderstanding
something somewhere.

Thank you,
Hugh Ranalli


From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: hugh(at)whtc(dot)ca,pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-13 13:19:51
Message-ID: [email protected]

PG Bug reporting form wrote:

> Apparently Unicode has two ways of accenting a character: as a separate code
> point, which represents the base character and the accent, or as a
> "combining diacritical mark"
> (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence

In general, PostgreSQL leaves it to applications to normalize
Unicode strings so that they are all in the same canonical form,
either composed or decomposed.

> the mark applies itself to the preceding character. For example, A
> followed by U+0300 displays À. However, unaccent is not removing
> these accents.

Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
unaccent(unicode_NFC(string))

Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.
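
As a rough illustration of the normalize-first idea, here is a Python sketch. RULES is a toy stand-in for a couple of precomposed entries in unaccent.rules, and both function names are hypothetical; unicode_NFC above is likewise a proposed function, not an existing one.

```python
import unicodedata

# Toy stand-in for a few precomposed entries (illustrative only).
RULES = {"\u00c0": "A", "\u00e9": "e"}


def toy_unaccent(text):
    """Replace each character that has a rule; leave everything else alone."""
    return "".join(RULES.get(ch, ch) for ch in text)


def toy_unaccent_nfc(text):
    """Compose to NFC first so decomposed input can match precomposed rules."""
    return toy_unaccent(unicodedata.normalize("NFC", text))


decomposed = "A\u0300"
print(toy_unaccent(decomposed) == decomposed)  # True: the combining mark survives
print(toy_unaccent_nfc(decomposed))            # A
```

This only helps for sequences that have a precomposed equivalent; a base letter plus a mark with no composed form would still pass through unchanged.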

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
Cc: hugh(at)whtc(dot)ca, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-13 15:05:42
Message-ID: [email protected]

"Daniel Verite" <daniel(at)manitou-mail(dot)org> writes:
> PG Bug reporting form wrote:
>> ... For example, A
>> followed by U+0300 displays À. However, unaccent is not removing
>> these accents.

> Short of having the input normalized by the application, ISTM that the
> best solution would be to provide functions to do it in Postgres, so
> you'd just write for example:
> unaccent(unicode_NFC(string))

That might be worthwhile, but it seems independent of this issue.

> Otherwise unaccent.rules can be customized. You may add replacements
> for letter+diacritical sequences that are missing for the languages
> you have to deal with. But doing it in general for all diacriticals
> multiplied by all base characters seems unrealistic.

Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.
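
A minimal Python sketch of that context-independent approach follows. It assumes the marks of interest are exactly the Combining Diacritical Marks block (U+0300 to U+036F); the function name is hypothetical and this is not how unaccent itself is implemented.

```python
# Combining Diacritical Marks block, assumed here to be the range of interest.
COMBINING_FIRST, COMBINING_LAST = 0x0300, 0x036F


def drop_combining_marks(text):
    """Delete every code point in the block, whatever character precedes it."""
    return "".join(
        ch for ch in text
        if not COMBINING_FIRST <= ord(ch) <= COMBINING_LAST
    )


print(drop_combining_marks("A\u0300"))     # A
print(drop_combining_marks("cafe\u0301"))  # cafe
```

The number of rules grows with the number of marks (112 code points in this block), not with marks multiplied by base characters.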

regards, tom lane


From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: hugh(at)whtc(dot)ca,pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-13 16:26:48
Message-ID: [email protected]

Tom Lane wrote:

> Hm, I thought the OP's proposal was just to make unaccent drop
> combining diacriticals independently of context, which'd avoid the
> combinatorial-growth problem.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

"Alternatively, if only one character is given on a line, instances
of that character are deleted; this is useful in languages where
accents are represented by separate characters"

Incidentally we may want to improve this bit of doc to mention
explicitly the Unicode decomposed forms as a use case for
removing characters. In fact I wonder if that's not what it's
already trying to express, but confusing "languages" with "forms".
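
For illustration, deletion rules of that single-character form could be generated with a few lines of Python. This is a sketch under assumptions: the range U+0300 to U+036F and the function name are chosen here, and it is not the patch discussed in this thread.

```python
import unicodedata


def combining_rule_lines(first=0x0300, last=0x036F):
    """Yield one rule line per nonspacing mark in the range.

    A line holding only the character itself means "delete this character".
    """
    for cp in range(first, last + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Mn":  # Mn = Mark, nonspacing
            yield ch


lines = list(combining_rule_lines())
print(len(lines), "deletion rules")
print("U+%04X" % ord(lines[0]))  # U+0300
```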

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-13 18:50:37
Message-ID: CAAhbUMOHkoN3Jeti4dp1jz3VY=XZPcCqpX=sW=mgmJbdMS--ng@mail.gmail.com

On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel(at)manitou-mail(dot)org> wrote:

> Tom Lane wrote:
>
> > Hm, I thought the OP's proposal was just to make unaccent drop
> > combining diacriticals independently of context, which'd avoid the
> > combinatorial-growth problem.
>

That's what I was thinking. Given that the accent is separate from the
characters, simply dropping it should result in the correct unaccented
character.

>
> In that case, this could be achieved by simply appending the
> diacriticals themselves to unaccent.rules, since replacement of a
> string by an empty string is already supported as a rule.
> It doesn't seem like the current file has any of these, but from
> https://www.postgresql.org/docs/11/unaccent.html :
>
> "Alternatively, if only one character is given on a line, instances
> of that character are deleted; this is useful in languages where
> accents are represented by separate characters"
>

Yes, I had read that in the docs, and that's the approach I planned to
take. I'll go ahead and develop a patch, then.

Best wishes,
Hugh

>


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-14 22:42:05
Message-ID: CAAhbUMNqJXTN+_vYdi5L4CLjoq9OCG29V597RKrCQ7xKsCAejA@mail.gmail.com

I've attached a patch that removes combining diacriticals. As with Latin and
Greek letters, it uses ranges to restrict its activity.

I have not submitted a patch for unaccent.rules, as it seems that a rules
file generated from generate_unaccent_rules.py will actually remove a large
number of rules (even before my changes), such as replacing the copyright
symbol © with (C), as well as other accented characters. It's probably
worth asking if the shipped unaccent.rules should correspond to what the
shipped generation utility produces, or not. I was surprised to see that it
didn't.

Please let me know if you see anything I need to change.

Best wishes,
Hugh

--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh(at)whtc(dot)ca
c: +01-416-994-7957
w: www.whtc.ca

Attachment Content-Type Size
remove-combining-diacritical-accents-in-unaccent.rules.patch text/x-patch 2.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-14 22:50:03
Message-ID: [email protected]

Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> I've attached a patch that removes combining diacriticals. As with Latin and
> Greek letters, it uses ranges to restrict its activity.

Cool. Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/

> I have not submitted a patch for unaccent.rules, as it seems that a rules
> file generated from generate_unaccent_rules.py will actually remove a large
> number of rules (even before my changes), such as replacing the copyright
> symbol © with (C), as well as other accented characters. It's probably
> worth asking if the shipped unaccent.rules should correspond to what the
> shipped generation utility produces, or not. I was surprised to see that it
> didn't.

Me too -- seems like that bears looking into. Perhaps the script's
results are platform dependent -- what were you testing on?

regards, tom lane


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 18:08:00
Message-ID: CAAhbUMN1n=ZVns-OeCbaVRYPS0oj7tTnmJrzw7Az-op4DHC+JA@mail.gmail.com

On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Cool. Please add it to the current CF so we don't forget about it:
> https://commitfest.postgresql.org/21/

Done.

> Me too -- seems like that bears looking into. Perhaps the script's
> results are platform dependent -- what were you testing on?
>
I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
that's it. The program's decisions come from the two data files, the
Unicode data set and the Latin-ASCII transliteration file. The script uses
categories (
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
to identify letters (and now combining marks) and if they are in range,
performs a substitution. It then uses the transliteration file to find
rules for particular character substitutions (for example, that file seems
to handle the copyright symbol substitution). I don't see anything platform
dependent in there.
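
The category-plus-range idea can be sketched in Python as below. This is an illustration built on the standard unicodedata module rather than on a parsed UnicodeData.txt, and PLAIN_RANGES and the function names are assumptions made for the example, not the script's actual definitions.

```python
import unicodedata

# Assumed ranges for illustration; the real script defines its own.
PLAIN_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    return unicodedata.category(ch).startswith("M")


def base_letter(ch):
    """Return the plain base letter of an accented letter, or None."""
    if not is_letter(ch):
        return None
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    if marks and all(is_mark(m) for m in marks) and in_ranges(ord(base), PLAIN_RANGES):
        return base
    return None


print(base_letter("\u00c0"))  # A
print(base_letter("\u00a9"))  # None: the copyright sign is a symbol, not a letter
print(is_mark("\u0300"))      # True
```

Characters such as the copyright sign never reach the letter path in this sketch, which is consistent with them being handled by the transliteration file instead.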

In looking more closely, I also see that the script isn't generating ligatures,
even though it should, because although the program can generate them, none
of the ligatures are in the ranges defined in PLAIN_LETTER_RANGES, and so
they are skipped.
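
One way to see the mismatch, sketched in Python: a ligature has a compatibility decomposition into plain letters, but its own code point lies far outside ASCII letter ranges. The PLAIN_LETTER_RANGES value below is an assumed stand-in, and how the script applies its range test is not shown in this thread.

```python
import unicodedata

# Assumed stand-in for the script's ranges (ASCII letters only).
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))

ligature = "\ufb01"
print(unicodedata.name(ligature))               # LATIN SMALL LIGATURE FI
print(unicodedata.decomposition(ligature))      # <compat> 0066 0069
print(unicodedata.normalize("NFKD", ligature))  # fi

# The ligature's own code point is not inside any of the assumed ranges.
in_range = any(lo <= ord(ligature) <= hi for lo, hi in PLAIN_LETTER_RANGES)
print(in_range)  # False
```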

This could probably be handled by adding the ligature ranges to the defined
ranges. Symbol types could be added to the types it looks at, and perhaps
the codepoint ranges collapsed into one list, as the IDs are unique across
all categories. I don't think we'd want to just rely on ranges, as that
could include control characters, punctuation, etc.

There are a number of other characters that appear in unaccent.rules that
aren't generated by the script. I've attached a diff of the output of
generate_unaccent_rules (using the version before my changes, to simplify
matters) and unaccent.rules. Unfortunately, I don't know how to interpret
most of these characters.

I suppose it's valid to ask if changing © to (C) is even something an
"unaccent" function should do. Given that it's in the existing rules file,
should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)

Attachment Content-Type Size
unaccent.diff text/x-patch 5.6 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 18:44:48
Message-ID: [email protected]

Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Me too -- seems like that bears looking into. Perhaps the script's
>> results are platform dependent -- what were you testing on?

> I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
> that's it. The program's decisions come from the two data files, the
> Unicode data set and the Latin-ASCII transliteration file. The script uses
> categories (
> ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
> to identify letters (and now combining marks) and if they are in range,
> performs a substitution. It then uses the transliteration file to find
> rules for particular character substitutions (for example, that file seems
> to handle the copyright symbol substitution). I don't see anything platform
> dependent in there.

Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 19:03:58
Message-ID: [email protected]

I wrote:
> ... I get something that's bit-for-bit the same as what's in unaccent.rules.
> So there's clearly a platform difference between here and there.
> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> it on anything newer.

A few minutes later on a Fedora 28 box: python 2.7.15 also gives me the
expected results, while python 3.6.6 fails with "SyntaxError: invalid
syntax".

So updating that script to also work with python3 might be a worthwhile
TODO item. But I'm at a loss to explain why you get different results.

regards, tom lane


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, thomas(dot)munro(at)enterprisedb(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 19:05:07
Message-ID: CAAhbUMO+fTZTwigDJ=tB3qFvHMe-xAJO5QpsFPH6Vb2oDYAU6w@mail.gmail.com

On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hm. Something funny is going on here. When I fetch the two reference
> files from the URLs cited in the script, and do
>

> python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
> --latin-ascii-file Latin-ASCII.xml >newrules
>
> I get something that's bit-for-bit the same as what's in unaccent.rules.
> So there's clearly a platform difference between here and there.
>
> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
> it on anything newer.
>
Well, that's embarrassing. When I looked I couldn't see anything that
looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17.
We use other versions of 2.7 on our production platforms. I'll take another
look, and check the URLs I am using.


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, thomas(dot)munro(at)enterprisedb(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 21:03:33
Message-ID: CAAhbUMMmXnj0YSD+fr5hSqeC+D6PAG+0kXJwMMhK2DCdwQVoxQ@mail.gmail.com

On Sat, 15 Dec 2018 at 14:05, Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:

> On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Hm. Something funny is going on here. When I fetch the two reference
>> files from the URLs cited in the script, and do
>>
>
>> python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
>> --latin-ascii-file Latin-ASCII.xml >newrules
>>
>> I get something that's bit-for-bit the same as what's in unaccent.rules.
>> So there's clearly a platform difference between here and there.
>>
>> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
>> it on anything newer.
>>
> Well, that's embarrassing. When I looked I couldn't see anything that
> looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17.
> We use other versions of 2.7 on our production platforms. I'll take another
> look, and check the URLs I am using.
>

The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34 rather than the r28 specified in the URL). Over 3
years ago (in r29, of course) they changed the file format (
https://unicode.org/cldr/trac/ticket/5873) so that
parse_cldr_latin_ascii_transliterator loads an empty rules set. I'd be
happy to either a) support both formats, or b), support just the newest and
update the URL. Option b) is cleaner, and I can't imagine why anyone would
want to use an older rule set (then again, struggling with Unicode always
makes my head hurt; I am not an expert on it). Thoughts?
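
A silently empty rule set is the kind of failure a simple guard can surface. The Python sketch below is hypothetical: the "source → target ;" rule syntax, the regular expression, and the parse_rules name are assumptions for illustration, and it does not reproduce parse_cldr_latin_ascii_transliterator or either CLDR file format.

```python
import re

# Assumed rule syntax for illustration: "source → target ;"
RULE_RE = re.compile(r"^\s*(\S+)\s*→\s*(.*?)\s*;")


def parse_rules(text):
    """Collect rules line by line and refuse to return an empty rule set."""
    rules = {}
    for line in text.splitlines():
        match = RULE_RE.match(line)
        if match:
            rules[match.group(1)] = match.group(2)
    if not rules:
        raise ValueError("no transliteration rules found; has the file format changed?")
    return rules


sample = "Æ → AE ;\nß → ss ;\n"
print(parse_rules(sample))  # {'Æ': 'AE', 'ß': 'ss'}
```

With a check like this, feeding the parser a file in an unexpected layout would raise an error instead of quietly producing a shorter unaccent.rules.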


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, thomas(dot)munro(at)enterprisedb(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 21:20:11
Message-ID: [email protected]

Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> The problem is that I downloaded the latest version of the Latin-ASCII
> transliteration file (r34 rather than the r28 specified in the URL). Over 3
> years ago (in r29, of course) they changed the file format (
> https://unicode.org/cldr/trac/ticket/5873) so that
> parse_cldr_latin_ascii_transliterator loads an empty rules set.

Ah-hah.

> I'd be
> happy to either a) support both formats, or b), support just the newest and
> update the URL. Option b) is cleaner, and I can't imagine why anyone would
> want to use an older rule set (then again, struggling with Unicode always
> makes my head hurt; I am not an expert on it). Thoughts?

(b) seems sufficient to me, but perhaps someone else has a different
opinion.

Whichever we do, I think it should be a separate patch from the feature
addition for combining diacriticals, just to keep the commit history
clear.

regards, tom lane


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: hugh(at)whtc(dot)ca, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-16 02:26:20
Message-ID: CAEepm=3gSmNWkteBxCEL-W+j1dmbcNzDin_iv+f_Om6o+1fAiA@mail.gmail.com

On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> > The problem is that I downloaded the latest version of the Latin-ASCII
> > transliteration file (r34 rather than the r28 specified in the URL). Over 3
> > years ago (in r29, of course) they changed the file format (
> > https://unicode.org/cldr/trac/ticket/5873) so that
> > parse_cldr_latin_ascii_transliterator loads an empty rules set.
>
> Ah-hah.
>
> > I'd be
> > happy to either a) support both formats, or b), support just the newest and
> > update the URL. Option b) is cleaner, and I can't imagine why anyone would
> > want to use an older rule set (then again, struggling with Unicode always
> > makes my head hurt; I am not an expert on it). Thoughts?
>
> (b) seems sufficient to me, but perhaps someone else has a different
> opinion.
>
> Whichever we do, I think it should be a separate patch from the feature
> addition for combining diacriticals, just to keep the commit history
> clear.

+1 for updating to the latest file from time to time. After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
extracted from UnicodeData.txt (or we could just forget about them), and
then we could drop special_cases().
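
For the two degree signs, the Unicode data does carry a compatibility decomposition, which a quick Python check with the standard unicodedata module shows. This is only an illustration of where the information lives, not a proposed change to the script.

```python
import unicodedata

for ch in ("\u2103", "\u2109"):
    print(
        unicodedata.name(ch),
        "|", unicodedata.decomposition(ch),
        "|", unicodedata.normalize("NFKD", ch),
    )
# DEGREE CELSIUS | <compat> 00B0 0043 | °C
# DEGREE FAHRENHEIT | <compat> 00B0 0046 | °F
```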

--
Thomas Munro
http://www.enterprisedb.com


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-17 20:22:37
Message-ID: CAAhbUMOX4QLj6c0O3GnjZYtR2dpAowss832Bq1n7oJyByeR7kQ@mail.gmail.com

On Sat, 15 Dec 2018 at 21:26, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> +1 for updating to the latest file from time to time. After
> http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
> our special_cases() function will have just the two Cyrillic
> characters, which should almost certainly be handled by adding
> Cyrillic to the ranges we handle via the usual code path, and DEGREE
> CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
> extracted from UnicodeData.txt (or we could just forget about them), and
> then we could drop special_cases().
>
Well, when I modified the code to handle the new version of the
transliteration file, I discovered that was sufficient to handle the old
version as well. That's not the way things usually go, but I'll take it. ;-)

I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file. I'll
be happy to add these to the CF. Does anyone need to review them and give
me approval before I do so?

Best wishes,
Hugh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-17 20:31:07
Message-ID: [email protected]

Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> I've attached two patches, one to update generate_unaccent_rules.py, and
> another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

> I'll
> be happy to add these to the CF. Does anyone need to review them and give
> me approval before I do so?

Nope.

regards, tom lane


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 01:03:13
Message-ID: CAAhbUMOe_VeJBxVqPe4Op2XKrT+xCcCvsUYRH4v6G71NDGG=fg@mail.gmail.com

On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> > I've attached two patches, one to update generate_unaccent_rules.py, and
> > another that updates unaccent.rules from the v34 transliteration file.
>
> I think you forgot the patches?
>

Sigh, yes I did. That's what I get for trying to get this sent out before
heading to an appointment. Patches attached and will add to CF. Let me know
if you see anything amiss.

Hugh

Attachment Content-Type Size
unaccent.rules-update-to-Latin-ASCII-CDLR-v34.patch text/x-patch 338 bytes
generate_unaccent_rules-handle-all-Latin-ASCII-versions.patch text/x-patch 1.3 KB

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: hugh(at)whtc(dot)ca
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 04:05:00
Message-ID: CAEepm=0qb_nx-f8cACS1=1NdmCj-3D9zXFU+RJHsFbZEztcqjg@mail.gmail.com

On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
>> > I've attached two patches, one to update generate_unaccent_rules.py, and
>> > another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
>
> Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.

+ʹ '
+ʺ "
+ʻ '
+ʼ '
+ʽ '
+˂ <
+˃ >
+˄ ^
+ˆ ^
+ˈ '
+ˋ `
+ː :
+˖ +
+˗ -
+˜ ~

I don't think this is quite right. Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all. Here is my attempt to come up with a test case using
combining characters:

select unaccent('un café crème s''il vous plaît');

It's not stripping the accents. I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:

$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)
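
The same bytes can be cross-checked without a hex editor. The Python sketch below is an illustration only: it encodes the three combining code points and rebuilds the query with explicit escapes, so the result does not depend on how a mail client or editor normalizes the text.

```python
marks = {"acute": "\u0301", "grave": "\u0300", "circumflex": "\u0302"}
for name, ch in marks.items():
    print(name, "U+%04X" % ord(ch), ch.encode("utf-8").hex())
# acute U+0301 cc81
# grave U+0300 cc80
# circumflex U+0302 cc82

# The test query written with explicit combining marks and two trailing newlines.
query = "select unaccent('un cafe\u0301 cre\u0300me s''il vous plai\u0302t');\n\n"
print(len(query.encode("utf-8")))  # 58
```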

[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
x.sql application/octet-stream 58 bytes

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: hugh(at)whtc(dot)ca
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 04:10:25
Message-ID: CAEepm=1vRrNyam3ietQQ6ZdJ5JktkUphCEB0=_mPAKz8mjBB-A@mail.gmail.com

On Tue, Dec 18, 2018 at 3:05 PM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:
> +ʹ '
> +ʺ "
> +ʻ '
> +ʼ '
> +ʽ '
> +˂ <
> +˃ >
> +˄ ^
> +ˆ ^
> +ˈ '
> +ˋ `
> +ː :
> +˖ +
> +˗ -
> +˜ ~
>
> I don't think this is quite right. Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all. Here is my attempt to come up with a test case using
> combining characters:
>
> select unaccent('un café crème s''il vous plaît');

Oh, I see now that that was just the v34 ASCII transliteration update,
and perhaps the diacritic stripping will be posted separately.

--
Thomas Munro
http://www.enterprisedb.com


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: hugh(at)whtc(dot)ca, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 04:57:08
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Tue, Dec 18, 2018 at 03:05:00PM +1100, Thomas Munro wrote:
> I don't think this is quite right. Those don't seem to be the
> combining codepoints[1], and in any case they are being replaced with
> ASCII characters, whereas I thought we wanted to replace them with
> nothing at all. Here is my attempt to come up with a test case using
> combining characters:
>
> select unaccent('un café crème s''il vous plaît');
>
> It's not stripping the accents. I've attached that in a file for
> reference so you can run it with psql -f x.sql, and you can see that
> it's using combining code points (code points 0301, 0300, 0302 which
> come out as cc81, cc80, cc82 in UTF-8) like so:

Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time? That would be nice to check easily the extent of the
patches proposed on this thread.
--
Michael


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, hugh(at)whtc(dot)ca, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 05:36:02
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
> the same time? That would be nice to check easily the extent of the
> patches proposed on this thread.

I wonder why unaccent.sql is set up to run its tests in KOI8 client
encoding rather than UTF8. It doesn't seem like it's the business
of this test script to be verifying transcoding from KOI8 to UTF8
(and if it were meant to do that, it's a pretty incomplete test...).
But having it set up like that means that we can't directly add
such tests to unaccent.sql, because there are no combining diacritics
in the KOI8 character set. We have two unattractive options:

* Change client encodings partway through unaccent.sql. I think this
would be disastrous for editability of that file; no common tools
will understand the encoding change.

* Put the new test cases into a separate file with a different client
encoding. This is workable, I suppose, but it seems pretty silly
when the tests are only a few queries apiece.
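
The point that KOI8 has no combining diacritics can be checked quickly. This is a minimal sketch, assuming Python 3 and assuming the KOI8-R variant (Python's "koi8_r" codec); the helper name encodable is hypothetical.

```python
def encodable(ch, encoding):
    """Return True if ch can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# A Cyrillic letter used by the existing tests exists in KOI8-R ...
print(encodable("\u0451", "koi8_r"))   # CYRILLIC SMALL LETTER IO
# ... but a combining mark has no KOI8-R representation.
print(encodable("\u0301", "koi8_r"))   # COMBINING ACUTE ACCENT
# UTF-8 can represent both.
print(encodable("\u0301", "utf-8"))
```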

Another problem I've got with the current setup is that it seems
unlikely that many people's editors default to an assumption of
KOI8 encoding. Mine guesses that these files are UTF8, and so
the test cases look perfectly insane. They do make sense if
I transcode the files to UTF8, but I wonder why we're not shipping
them as UTF8 in the first place.

tl;dr: I think we should convert unaccent.sql and unaccent.out
to UTF8 encoding. Then, adding more test cases for this patch
will be easy.
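
For illustration, converting a test file's contents from KOI8 to UTF8 amounts to a decode followed by an encode. A minimal sketch, assuming Python 3 and the KOI8-R variant; the function name transcode is hypothetical and this is not the conversion that was actually committed.

```python
def transcode(data, src="koi8_r", dst="utf-8"):
    """Re-encode bytes from src to dst, failing loudly on invalid input."""
    return data.decode(src).encode(dst)

# One query from the existing tests, as it would be stored in a KOI8-R file.
koi8_bytes = "SELECT unaccent('\u0451\u043b\u043a\u0430');\n".encode("koi8_r")
utf8_bytes = transcode(koi8_bytes)

print(len(koi8_bytes), "bytes in KOI8-R,", len(utf8_bytes), "bytes in UTF-8")
print(utf8_bytes.decode("utf-8"), end="")
```

The same bytes read as UTF-8 without transcoding are invalid, which is consistent with an editor that guesses UTF8 showing nonsense.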

regards, tom lane


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, hugh(at)whtc(dot)ca, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 06:07:35
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
> tl;dr: I think we should convert unaccent.sql and unaccent.out
> to UTF8 encoding. Then, adding more test cases for this patch
> will be easy.

Do you think that we could also remove the non-ASCII characters from the
tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in
input, and show the output with bytea. That's harder to read, but the
last time this was touched (5e8d670) we discussed not using UTF-8 in the
Python script so that folks with simple terminals could still touch the
code, and the characters used could be documented as comments in the
tests.
--
Michael


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, hugh(at)whtc(dot)ca, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 06:23:57
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:
>> tl;dr: I think we should convert unaccent.sql and unaccent.out
>> to UTF8 encoding. Then, adding more test cases for this patch
>> will be easy.

> Do you think that we could also remove the non-ASCII characters from the
> tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in
> input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder
to verify by eyeball. With the current setup --- other than the
uncommon-outside-Russia encoding choice --- you don't really need
to read or speak Russian to see that this:

SELECT unaccent('ёлка');
 unaccent
----------
 елка
(1 row)

probably represents unaccent doing what it ought to. If everything
is in hex then it's a lot harder.
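
To make the readability trade-off concrete, here is a minimal sketch, assuming Python 3, that prints the UTF-8 hex form of the input and expected output of that test. The helper name utf8_hex is hypothetical.

```python
def utf8_hex(s):
    """Return the UTF-8 bytes of s as space-separated hex pairs."""
    return " ".join("%02x" % b for b in s.encode("utf-8"))

# Input and expected output of the test above.
for word in ("\u0451\u043b\u043a\u0430", "\u0435\u043b\u043a\u0430"):
    print(word, "->", utf8_hex(word))
```

In hex the two strings differ only in their first two bytes, which is much harder to spot than the missing dots on the first letter.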

Ten years ago I might've agreed with your point, but today it's
hard to believe that anyone who takes any interest at all in
unaccent's functionality would not have a UTF8-capable terminal.

> That's harder to read, but the
> last time this was touched (5e8d670) we discussed not using UTF-8 in the
> Python script so that folks with simple terminals could still touch the
> code, and the characters used could be documented as comments in the
> tests.

Maybe I'm misremembering, but I thought that discussion was about the
code files. I am still mistrustful of non-ASCII in our code files.
But for data and test files, we've been accepting UTF8 ever since the
text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

regards, tom lane


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, hugh(at)whtc(dot)ca, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 06:33:04
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Tue, Dec 18, 2018 at 01:23:57AM -0500, Tom Lane wrote:
> Maybe I'm misremembering, but I thought that discussion was about the
> code files. I am still mistrustful of non-ASCII in our code files.

Yes, that was in generate_unaccent_rules.py:
https://www.postgresql.org/message-id/[email protected]

> But for data and test files, we've been accepting UTF8 ever since the
> text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

Okay, fine by me.
--
Michael


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 13:01:00
Message-ID: CAAhbUMMzPERSe3KfKKQfR4COJCZSrss1G7KRyUraYJyvrVyOUg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, 17 Dec 2018 at 23:05, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> +ʹ '
> +ʺ "
> +ʻ '
> +ʼ '
> +ʽ '
> +˂ <
> +˃ >
> +˄ ^
> +ˆ ^
> +ˈ '
> +ˋ `
> +ː :
> +˖ +
> +˗ -
> +˜ ~
>
These aren't the combining codepoints. They're new substitutions defined in
r34 of the Latin-ASCII transliteration file. I had wondered about those,
too, and did some testing.
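
The distinction can be checked against the Unicode character database. A minimal sketch, assuming Python 3; the helper name is_combining is hypothetical. It compares a few of the quoted r34 additions with the marks from the bug report.

```python
import unicodedata

def is_combining(ch):
    """True if ch has a non-zero canonical combining class."""
    return unicodedata.combining(ch) != 0

# A few of the r34 additions quoted above, then the marks from the bug report.
for ch in ("\u02b9", "\u02c2", "\u02dc", "\u0300", "\u0301", "\u0302"):
    print("U+%04X %-32s combining=%s"
          % (ord(ch), unicodedata.name(ch), is_combining(ch)))
```

The first three are spacing modifier characters that stand on their own, so mapping them to ASCII look-alikes is a transliteration, not diacritic removal.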

> I don't think this is quite right.
>

However, you are correct that something isn't right. In testing why I was
getting a different output, I had reverted to the
generate_unaccent_rules.py BEFORE my changes. And then I applied my update
for the transliteration file format to the reverted version. The patch for
generate_unaccent_rules should still be good, but the generated rules file
didn't include the combining diacriticals. In generating that, I want to
double check some of the additions before re-submitting.

On Mon, 17 Dec 2018 at 23:57, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
> the same time? That would be nice to check easily the extent of the
> patches proposed on this thread.

That makes sense. I'm happy to do that. Let me look at that file and see
how extensive the other changes (encoding and removal of special characters)
would be.

Hugh


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-20 22:39:36
Message-ID: CAAhbUMNyZ+PhNr_mQ=G161K0-hvbq13Tz2is9M3WK+yX9cQOCw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Okay, I've tried to separate everything cleanly. The patches are numbered
in the order in which they should be applied. Each patch contains all the
updates appropriate to that version (i.e., if the change would modify
unaccent.rules, those changes are also in the patch):

01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.
The approach I have taken is "native" Python 3 compatibility with
adjustments for Python 2. There's a marked block at the beginning of the
file that can be removed whenever Python 2 support is dropped. I haven't
followed the recommended practice of importing the "past" or "future"
modules, as the changes are minimal, and these are just additional
dependencies that need to be installed separately, which didn't seem to
make sense for a utility script. This patch also updates sql/unaccent.sql
to UTF-8 format.

02 - Updates generate_unaccent_rules.py to work with all versions (I tested
r28 and r34) of the Latin-ASCII transliteration file. It also updates
unaccent.rules to have the output of the r34 transliteration file. This
patch should work without the 01 patch.

03 - Updates generate_unaccent_rules.py to remove combining diacritical
marks. It also updates unaccent.rules with the revised output, and adds
tests to sql/unaccent.sql. It will not work or apply if the 01 patch is not
applied. It should work without the 02 patch.
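
As an illustration of the approach described for patch 01 (code written for Python 3, with one marked block of Python 2 adjustments), here is a minimal sketch. It is not the content of the patch; the names PY2, to_char and codepoint_to_char are hypothetical.

```python
import sys

PY2 = sys.version_info[0] <= 2

# --- begin Python 2 compatibility block (removable once Python 2 is dropped) ---
if PY2:
    # On Python 2, chr() stops at 255; unichr() covers all code points.
    to_char = unichr  # noqa: F821 (this name exists only on Python 2)
else:
    to_char = chr
# --- end Python 2 compatibility block ---

def codepoint_to_char(hex_string):
    """Convert a hex code point string such as '00C0' to its character."""
    return to_char(int(hex_string, 16))

print(repr(codepoint_to_char("00C0")))
```

Keeping the adjustments in one block means that dropping Python 2 later is a deletion rather than a rewrite.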

When you look at unaccent.rules generated by the 03 version, there may
appear to be blank lines. I've checked and they're not blank. They are
characters which are only visible with other characters in front of them,
at least in my editor.
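
One way to confirm that such a line is not blank is to name its first character. A minimal sketch, assuming Python 3 and assuming a rules line consists of a source character optionally followed by whitespace and a replacement; the helper name and the sample lines are made up for illustration.

```python
import unicodedata

def describe_rule_line(line):
    """Describe the first character of a rules-file line, even if it renders as blank."""
    if not line.strip("\n"):
        return "empty line"
    ch = line[0]
    kind = "combining mark" if unicodedata.combining(ch) else "ordinary character"
    return "U+%04X %s (%s)" % (ord(ch), unicodedata.name(ch, "<unnamed>"), kind)

# Hypothetical lines: a lone combining mark, a normal rule, and a truly empty line.
for line in ("\u0300\n", "\u00c0\tA\n", "\n"):
    print(describe_rule_line(line))
```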

I'll go update the CommitFest now. I hope I've covered everything; please
let me know if there's anything I've missed.

Best wishes,
Hugh

Attachment                                                                Content-Type  Size
01-generate-unaccent-rules-python2-and-3-01.patch                         text/x-patch  4.2 KB
02-generate_unaccent_rules-handle-all-Latin-ASCII-versions-01.patch       text/x-patch  1.7 KB
03-generate_unaccent_rules-remove-combining-diacritical-accents-01.patch  text/x-patch  3.9 KB

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: thomas(dot)munro(at)enterprisedb(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-27 06:49:58
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Thu, Dec 20, 2018 at 05:39:36PM -0500, Hugh Ranalli wrote:
> I'll go update the CommitFest now. I hope I've covered everything; please
> let me know if there's anything I've missed.

-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+# (Ideally you should use the latest release).

I have begun playing with this patch set. For the record, this URL
is incorrect. Here is a more correct one:
https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml

For information, it is possible to get the latest released
versions by browsing the code (see the tags release-*):
https://unicode.org/cldr/trac/browser/tags
--
Michael


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: thomas(dot)munro(at)enterprisedb(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-27 16:21:52
Message-ID: CAAhbUMM=QVvjG7C_skSGyHHF6G_amkNRP-m+OYuK+4o=roFT3g@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, 27 Dec 2018 at 01:50, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> -# [2]
> http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
> +# [2]
> http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
> +# (Ideally you should use the latest release).
>
> I have begun playing with this patch set. For the record, this URL
> is incorrect. Here is a more correct one:
>
> https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml
>
> For information, it is possible to get the latest released
> versions by browsing the code (see the tags release-*):
> https://unicode.org/cldr/trac/browser/tags

Thank you. As I've said, I only pretend to be someone who knows something
about Unicode. ;-) I'll update once we've determined there is no further
feedback, so I'm not releasing too many changes, if that's okay.

Hugh


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>, thomas(dot)munro(at)enterprisedb(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-02 17:41:40
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On 20/12/2018 23:39, Hugh Ranalli wrote:
> 01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.

My opinion is that we should just convert the whole thing to Python 3
and be done. This script is only run rarely, on a developer's machine,
so it's not unreasonable to expect Python 3 to be available.

The only other Python script I can find in the source is
src/test/locale/sort-test.py, which we should similarly convert.

> This patch also updates sql/unaccent.sql to UTF-8 format.

I have committed that in the meantime.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: thomas(dot)munro(at)enterprisedb(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-02 17:58:03
Message-ID: CAAhbUMOG_S4DLQSDw3Jbqf8vipuvCAfxvH7T8+W2H0jFA-2K-g@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut <
peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:

> On 20/12/2018 23:39, Hugh Ranalli wrote:
> > 01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.
>
> My opinion is that we should just convert the whole thing to Python 3
> and be done. This script is only run rarely, on a developer's machine,
> so it's not unreasonable to expect Python 3 to be available.
>

Well, this is definitely an edge case, but I am actually running the
patched script from a complex application installer running a
custom-compiled version of Python 2.7. The installer runs under the same
Python instance as the application. I certainly could invoke Python 3 to
run this script, it's just a little more work, so I'm happy to go with the
team's decision. Just let me know.

Hugh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-02 19:32:32
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut <
> peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>> My opinion is that we should just convert the whole thing to Python 3
>> and be done. This script is only run rarely, on a developer's machine,
>> so it's not unreasonable to expect Python 3 to be available.

> Well, this is definitely an edge case, but I am actually running the
> patched script from a complex application installer running a
> custom-compiled version of Python 2.7. The installer runs under the same
> Python instance as the application. I certainly could invoke Python 3 to
> run this script, it's just a little more work, so I'm happy to go with the
> team's decision. Just let me know.

Seeing that supporting python 2 only adds a dozen lines of code,
I vote for retaining it for now. It'd be appropriate to drop that when
python 3 is the overwhelmingly more-installed version, but AFAICT that
isn't the case yet.

regards, tom lane


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-03 01:15:22
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Wed, Jan 02, 2019 at 02:32:32PM -0500, Tom Lane wrote:
> Seeing that supporting python 2 only adds a dozen lines of code,
> I vote for retaining it for now. It'd be appropriate to drop that when
> python 3 is the overwhelmingly more-installed version, but AFAICT that
> isn't the case yet.

As a side note, if I recall correctly Python 2.7 will be EOL'd in
2020 by the community, though I suspect that a couple of vendors will
still maintain compatibility for a couple of years in what they ship.
CentOS and RHEL perhaps fall into this category. Like Peter, I would
vote for just maintaining support for Python 3 in this script, as any
modern development machine has it anyway, and not a lot of commits
involve it (I am counting 4 since 2015).
--
Michael


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-03 16:19:43
Message-ID: CAAhbUMPg7szHquY_3czBgNkSs1sCoGEmgwZ8OSQ7hSDDBUXRaw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Wed, 2 Jan 2019 at 20:15, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> As a side note, if I recall correctly Python 2.7 will be EOL'd in
> 2020 by the community, though I suspect that a couple of vendors will
> still maintain compatibility for a couple of years in what they ship.
> CentOS and RHEL perhaps fall into this category. Like Peter, I would
> vote for just maintaining support for Python 3 in this script, as any
> modern development machine has it anyway, and not a lot of commits
> involve it (I am counting 4 since 2015).
>

I realise this is an incredibly minor component of the PostgreSQL
infrastructure, but as I don't want to hold up reviewers, may I ask:

- It seems we have two votes for Python 3 only, and one for Python 2/3.
I lean toward Python 2/3 myself because: a) many distributions still ship
with Python 2 as the default and b) it's a single code block that can
easily be removed. If the decision is for Python 3, I'd like at least to
add a check that catches this and prints a message, rather than leaving
someone with a cryptic runtime error that makes them think the script is
broken;
- Michael Paquier, do you have any other comments? If not, I'll adjust
the documentation to use the URLs you have indicated. If you are
downloading via curl or wget, the URL I used is the proper one. It gives
you the XML file, whereas the other saves the HTML interface, leading to
errors if you try to run it. I'll also add this to the documentation.
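
The version check suggested in the first point could look roughly like the following. This is a minimal sketch, not code from any of the patches; the function name python_version_error and the message wording are hypothetical.

```python
import sys

def python_version_error(version_info=None):
    """Return an explanatory message if the interpreter is too old, else None."""
    if version_info is None:
        version_info = sys.version_info
    if version_info[0] < 3:
        return ("This script requires Python 3; it was started with Python %d.%d."
                % (version_info[0], version_info[1]))
    return None

# Stop early with a readable message instead of a cryptic runtime error.
message = python_version_error()
if message:
    sys.exit(message)
```

Taking the version as a parameter keeps the check testable without an old interpreter at hand.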

Once I have clarification on these, I'll update the patches.

Thanks,
Hugh


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-03 18:19:58
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On 2019-Jan-03, Hugh Ranalli wrote:

> I realise this is an incredibly minor component of the PostgreSQL
> infrastructure, but as I don't want to hold up reviewers, may I ask:
>
> - It seems we have two votes for Python 3 only, and one for Python 2/3.
> I lean toward Python 2/3 myself because: a) many distributions still ship
> with Python 2 as the default and b) it's a single code block that can
> easily be removed. If the decision is for Python 3, I'd like at least to
> add a check that catches this and prints a message, rather than leaving
> someone with a cryptic runtime error that makes them think the script is
> broken;

I kinda agree with Peter that this is a fringe, rarely run program where
the python3 requirement is unlikely to be onerous, but since the 2/3
compatibility is so little code, I would opt for keeping it for the time
being. We can remove it in a couple of years.

> - Michael Paquier, do you have any other comments? If not, I'll adjust
> the documentation to use the URLs you have indicated. If you are
> downloading via curl or wget, the URL I used is the proper one. It gives
> you the XML file, whereas the other saves the HTML interface, leading to
> errors if you try to run it. I'll also add this to the documentation.

I think the point is that if the committee updates with a further
version of the file, how do you find the new version? We need a URL
that's one step removed from the final file, so that we can see if we
need to update it. Maybe we can provide both URLs for convenience.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-03 18:22:24
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> I think the point is that if the committee updates with a further
> version of the file, how do you find the new version? We need a URL
> that's one step removed from the final file, so that we can see if we
> need to update it. Maybe we can provide both URLs for convenience.

+1. Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.

regards, tom lane


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-03 21:48:33
Message-ID: CAAhbUMMrNL9uXXpwS+BPTGjKQv5Sxvr+OY2C73Ag7oFyjSOiEQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> > I think the point is that if the committee updates with a further
> > version of the file, how do you find the new version? We need a URL
> > that's one step removed from the final file, so that we can see if we
> > need to update it. Maybe we can provide both URLs for convenience.
>
> +1. Could be phrased along the lines of "documents are at URL1,
> currently synced with URL2" so that it's clear that URL2 should
> be updated when we re-sync with a newer release.
>

Yes, this is what I was thinking. I was integrating this into my installer,
used the "new" URL provided to download the file, and spent several minutes
wondering why the script was failing (and what I had broken in it), before
realising what had happened.
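
A guard against this kind of mix-up is to look at the root element of the downloaded file before processing it. A minimal sketch, assuming Python 3; the function name xml_root_tag and the sample inputs are made up, and nothing here is taken from the actual script.

```python
import xml.etree.ElementTree as ET

def xml_root_tag(data):
    """Return the root element's local tag name, or raise ValueError with a hint."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        raise ValueError("not well-formed XML (an HTML page?): %s" % exc)
    tag = root.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
    if tag.lower() == "html":
        raise ValueError("got an HTML page; download the raw file instead")
    return tag

# Made-up inputs standing in for a real download.
print(xml_root_tag(b"<data><rule/></data>"))
try:
    xml_root_tag(b"<html><body>repository browser</body></html>")
except ValueError as err:
    print("rejected:", err)
```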


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-04 10:32:31
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On 03/01/2019 19:19, Alvaro Herrera wrote:
> I kinda agree with Peter that this is a fringe, rarely run program where
> the python3 requirement is unlikely to be onerous, but since the 2/3
> compatibility is so little code, I would opt for keeping it for the time
> being. We can remove it in a couple of years.

OK, committed with the compat layer. I also fixed up sort-test.py for
Python 3, so now everything in the source should support Python 3.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-04 13:00:48
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Thu, Jan 03, 2019 at 04:48:33PM -0500, Hugh Ranalli wrote:
> On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
>>> I think the point is that if the committee updates with a further
>>> version of the file, how do you find the new version? We need a URL
>>> that's one step removed from the final file, so that we can see if we
>>> need to update it. Maybe we can provide both URLs for convenience.
>>
>> +1. Could be phrased along the lines of "documents are at URL1,
>> currently synced with URL2" so that it's clear that URL2 should
>> be updated when we re-sync with a newer release.
>>
>
> Yes, this is what I was thinking. I was integrating this into my installer,
> used the "new" URL provided to download the file, and spent several minutes
> wondering why the script was failing (and what I had broken in it), before
> realising what had happened.

I think that we could just use the URLs I am mentioning here:
https://www.postgresql.org/message-id/[email protected]

I haven't been able to finish what I wanted for the proposed patch set
yet, but what I was thinking about is to include:
1) The root URL where all the release folders are present
2) The full URL of the current Latin-ASCII.xml being used for the
generation, not as a URL pointing to the latest version, but as a URL
pointing to an exact version in time (I doubt that a released version
ever changes in this tree, but who knows..).
3) The version used to generate the rules.
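
The three items above could be recorded along these lines. This is only a sketch of the idea, assuming Python 3; the constant and function names are hypothetical, and the URL layout is copied from the links quoted earlier in this thread.

```python
# Hypothetical names; the URL layout is copied from links quoted in this thread.
CLDR_TAGS_ROOT = "https://unicode.org/cldr/trac/browser/tags"  # 1) root of all releases
CLDR_RELEASE = "release-34"                                    # 3) version used

def latin_ascii_url(release=CLDR_RELEASE):
    """2) URL of Latin-ASCII.xml pinned to one exact release."""
    return "%s/%s/common/transforms/Latin-ASCII.xml" % (CLDR_TAGS_ROOT, release)

def provenance_comment(release=CLDR_RELEASE):
    """Build a header comment recording where the input file came from."""
    return "\n".join([
        "# Releases are listed at: %s" % CLDR_TAGS_ROOT,
        "# Currently synced with:  %s" % latin_ascii_url(release),
    ])

print(provenance_comment())
```

Deriving the pinned URL from the release name means a re-sync only has to change one value.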
--
Michael


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-04 16:29:42
Message-ID: CAAhbUMOj8r4=V0nQpa21SBh7tHTZ37QxXKx5JSpYqdsA-V6mSA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, 4 Jan 2019 at 08:00, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> I haven't been able to finish what I wanted for the proposed patch set
> yet, but what I was thinking about is to include:
> 1) The root URL where all the release folders are present
> 2) The full URL of the current Latin-ASCII.xml being used for the
> generation, not as a URL pointing to the latest version, but as a URL
> pointing to an exact version in time (I doubt that a released version
> ever changes in this tree, but who knows...).
> 3) The version used to generate the rules.
>

Hi Michael,
I think we're on the same page. I'll wait for you to finish your review and
provide any further comments before I make any changes.

Thanks,
Hugh


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-09 03:52:53
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

Hi Hugh,

On Fri, Jan 04, 2019 at 11:29:42AM -0500, Hugh Ranalli wrote:
> I think we're on the same page. I'll wait for you to finish your review and
> provide any further comments before I make any changes.

I have been doing a bit more than a review: I studied the new and old
formats and how we could handle the XML parsing part, and hacked on the
code myself. On top of the incorrect URL for Latin-ASCII.xml, I noticed
that there should be only one transforms/transform/tRule block in the
source, so I think that we should add an assertion on that as a sanity
check. I have also changed the code to use splitlines(), which is more
portable across platforms, and added an extra regression test for the
new characters added to unaccent.rules. This does not close this
thread, but we can support the new format this way. I have also
documented how to browse the full set of releases of Latin-ASCII.xml,
and precisely which version has been used for this patch.
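
For readers following along, the single-block check and the splitlines()
call described above can be sketched roughly as follows. This is only an
illustration, not the committed patch: SAMPLE_XML is a made-up, heavily
trimmed stand-in for Latin-ASCII.xml, and extract_rule_lines is a
hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the real Latin-ASCII.xml.
SAMPLE_XML = u"""<supplementalData>
  <transforms>
    <transform source="Latin" target="ASCII" direction="forward">
      <tRule><![CDATA[
# comment line
\u00c6 \u2192 AE ;
\u00df \u2192 ss ;
]]></tRule>
    </transform>
  </transforms>
</supplementalData>"""


def extract_rule_lines(xml_text):
    """Return the non-empty lines of the single tRule block."""
    root = ET.fromstring(xml_text)
    blocks = root.findall("./transforms/transform/tRule")
    # Sanity check: the new format is expected to carry exactly one block.
    assert len(blocks) == 1, "expected exactly one tRule block"
    # splitlines() copes with \n, \r\n and \r line endings alike.
    return [line.strip() for line in blocks[0].text.splitlines() if line.strip()]


for line in extract_rule_lines(SAMPLE_XML):
    print(line)
```

If a future release of the file ever carried more than one tRule block,
the assertion would fail loudly instead of silently producing a partial
rule set.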

This does not yet close the part about diacritical characters, but
supporting the new format is a step in that direction. What do
you think?
--
Michael

Attachment Content-Type Size
unaccent-format-update.patch text/x-diff 3.8 KB

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-10 02:52:05
Message-ID: CAAhbUMNZ0ooK6SzLNdkxzdBsQHOJf_rg_EjwoNL8QHTwQuriRw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, 8 Jan 2019 at 22:53, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> I have been doing a bit more than a review: I studied the new and old
> formats and how we could handle the XML parsing part, and hacked on the
> code myself. On top of the incorrect URL for Latin-ASCII.xml, I noticed
> that there should be only one transforms/transform/tRule block in the
> source, so I think that we should add an assertion on that as a sanity
> check. I have also changed the code to use splitlines(), which is more
> portable across platforms, and added an extra regression test for the
> new characters added to unaccent.rules. This does not close this
> thread, but we can support the new format this way. I have also
> documented how to browse the full set of releases of Latin-ASCII.xml,
> and precisely which version has been used for this patch.
>
> This does not yet close the part about diacritical characters, but
> supporting the new format is a step in that direction. What do
> you think?
>
Hi Michael,
Thank you for putting so much effort into this. I think that looks great.
When I was doing this, I discovered that I could parse both pre- and post-
r29 versions, so I went with that, but I agree that there's probably no
good reason to do so.

And thank you for the information on splitlines; that's a method I've
overlooked. .split('\n') should be identical, if python is, as usual,
compiled with universal newlines support, but it's nice to have a method
guaranteed to work in all instances.
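
To make the difference concrete, here is a small sketch. The sample
string is made up; it mimics text whose carriage returns were not
translated on read (for example, a file opened in binary mode or with
newline=''). split_both_ways is a hypothetical helper name.

```python
def split_both_ways(text):
    """Return the results of split on LF and of splitlines() side by side."""
    return text.split("\n"), text.splitlines()


# Hypothetical sample with Windows-style line endings left in place.
sample = "A\tB\r\nC\tD\r\n"
by_split, by_splitlines = split_both_ways(sample)
print(by_split)       # ['A\tB\r', 'C\tD\r', '']
print(by_splitlines)  # ['A\tB', 'C\tD']
```

With split('\n') each element keeps a stray '\r' and a trailing empty
string appears, while splitlines() returns clean lines.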

Best wishes,
Hugh


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-10 06:09:45
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Wed, Jan 09, 2019 at 09:52:05PM -0500, Hugh Ranalli wrote:
> Thank you for putting so much effort into this. I think that looks great.
> When I was doing this, I discovered that I could parse both pre- and post-
> r29 versions, so I went with that, but I agree that there's probably no
> good reason to do so.

OK, committed then. I have yet to study the other part of the
proposal regarding diacritical characters. Patch 3 has a conflict in
the regression tests, so a rebase would be needed. Resolving the
conflict is not a big deal, though. I am also a bit confused by the
newly-generated unaccent.rules. Why does nothing show up in the second
column (around line 414, for example)? Shouldn't we have mapping
characters?
--
Michael


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-10 14:10:43
Message-ID: CAAhbUMMsJ-p4QNk_LEOG3L61Z92o=McXXfy8qA3b7=xXGKuvtw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, 10 Jan 2019 at 01:09, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> OK, committed then. I have yet to study the other part of the
> proposal regarding diacritical characters. Patch 3 has a conflict in
> the regression tests, so a rebase would be needed. Resolving the
> conflict is not a big deal, though. I am also a bit confused by the
> newly-generated unaccent.rules. Why does nothing show up in the second
> column (around line 414, for example)? Shouldn't we have mapping
> characters?
>

That concerned me, as well. I have confirmed the lines are not empty. If
you open the file in a text editor (I'm using KDE's Kate), and insert a
standard character at the beginning of one of those lines, the diacritic
then appears, combined with the character you just entered. The only
program I've found that wants to display them on their own is vi (and I
only just thought of trying that).
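
A way to confirm this without relying on any font is to look at the code
points directly. Below is a minimal sketch using Python's unicodedata
module; describe is a hypothetical helper, and the sample string is
illustrative rather than a line copied from unaccent.rules.

```python
import unicodedata


def describe(text):
    """List (code point, Unicode name, combining class) for each character."""
    return [
        (u"U+%04X" % ord(ch),
         unicodedata.name(ch, u"<unnamed>"),
         unicodedata.combining(ch))
        for ch in text
    ]


# 'A' followed by a combining grave accent: two code points, one glyph.
for entry in describe(u"A\u0300"):
    print(entry)
# ('U+0041', 'LATIN CAPITAL LETTER A', 0)
# ('U+0300', 'COMBINING GRAVE ACCENT', 230)
```

A non-zero combining class marks a character that attaches to the
preceding one, which is why such a character looks invisible when it
stands alone at the start of a line.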

From what I can tell, this is likely a font issue:

- http://unicode.org/faq/char_combmark.html#12b
-
https://superuser.com/questions/852901/why-are-some-combining-diacritics-shifted-to-the-right-in-some-programs

Hugh


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-28 06:45:45
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

Thomas,

On Thu, Jan 10, 2019 at 03:09:45PM +0900, Michael Paquier wrote:
> OK, committed then. I have yet to study the other part of the
> proposal regarding diacritical characters. Patch 3 has a conflict in
> the regression tests, so a rebase would be needed. Resolving the
> conflict is not a big deal, though. I am also a bit confused by the
> newly-generated unaccent.rules. Why does nothing show up in the second
> column (around line 414, for example)? Shouldn't we have mapping
> characters?

You are registered as a reviewer and committer of the last patch of
this thread:
https://commitfest.postgresql.org/21/1924/

Are you planning to look at it or should I jump in? I have not looked
at the patch status in depth yet.
--
Michael


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-28 07:26:29
Message-ID: CAEepm=0ZaLLK8BxeW76cbg6zDQJixi11SMejifhx9jF+tU09dw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, Jan 28, 2019 at 7:45 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> You are registered as a reviewer and committer of the last patch of
> this thread:
> https://commitfest.postgresql.org/21/1924/
>
> Are you planning to look at it or should I jump in? I have not looked
> at the patch status in depth yet.

Thanks for the reminder. I looked at this a couple of weeks ago when
you pinged me off-list, but I see we're still waiting for a rebase.
Hugh, can you please post a new patch? The approach looks right to me
(simply replace the combining diacritics with nothing), so if you post
a new version I'll double check with that test case I came up with
earlier, and then I'll be happy to commit it.
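
As a rough sketch of that idea in Python (not the patch itself): drop
every character that falls in the Combining Diacritical Marks block. The
range U+0300 to U+036F is the whole Unicode block and is an assumption
here; the ranges used by the actual patch may differ.
strip_combining_marks is a hypothetical name.

```python
# Assumed range: the full "Combining Diacritical Marks" Unicode block.
COMBINING_START = 0x0300
COMBINING_END = 0x036F


def strip_combining_marks(text):
    """Remove combining diacritical marks, leaving base characters alone."""
    return u"".join(
        ch for ch in text
        if not (COMBINING_START <= ord(ch) <= COMBINING_END)
    )


print(strip_combining_marks(u"A\u0300"))  # A
```

Precomposed characters such as U+00C0 are not touched by this step; they
are covered by the ordinary one-to-one rules.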

--
Thomas Munro
http://www.enterprisedb.com


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-28 20:26:12
Message-ID: CAAhbUMN5NiPSNcs+_Z+bGfwa7aX+__1WeFOmZiDcCWF9xOqRiw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, 28 Jan 2019 at 02:27, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> Thanks for the reminder. I looked at this a couple of weeks ago when
> you pinged me off-list, but I see we're still waiting for a rebase.
> Hugh, can you please post a new patch? The approach looks right to me
> (simply replace the combining diacritics with nothing), so if you post
> a new version I'll double check with that test case I came up with
> earlier, and then I'll be happy to commit it.
>

Hi Thomas,
My apologies; I hadn't realised I was supposed to do this. A rebased
version of patch 03 is attached. Let me know if you have any questions or
need any changes.

Best wishes,
Hugh

Attachment Content-Type Size
03-generate_unaccent_rules-remove-combining-diacritical-accents-02.patch text/x-patch 3.9 KB

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-01 03:44:05
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:
> My apologies; I hadn't realised I was supposed to do this. A rebased
> version of patch 03 is attached. Let me know if you have any questions or
> need any changes.

Moved to next CF.
--
Michael


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-01 14:25:28
Message-ID: CAEepm=2qfx8Q9PEq3H5KJ74a6RDgpsaJf0kK+PyGEBneC0Sr1g@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, Feb 1, 2019 at 2:44 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:
> > My apologies; I hadn't realised I was supposed to do this. A rebased
> > version of patch 03 is attached. Let me know if you have any questions or
> > need any changes.
>
> Moved to next CF.

I checked that the script generates identical output on my machine.
Committed. Thanks!

--
Thomas Munro
http://www.enterprisedb.com


From: raam narayana <raam(dot)soft(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-10 20:06:25
Message-ID: 154982918542.11785.1374991294537224097.pgcf@coridan.postgresql.org
Lists: pgsql-bugs pgsql-hackers

Hi,

After the latest commit on the master branch, I tried to test the Python script. I still see that the output of the script is completely different from the content of the unaccent.rules file. Am I missing anything? My testing consisted of the following.

Downloaded the following files:

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt

http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Ran the Python script as follows:

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am using Python 3.7.1 on Windows 10.

The new status of this patch is: Needs review


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: raam narayana <raam(dot)soft(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Hugh Ranalli <hugh(at)whtc(dot)ca>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-10 20:44:01
Message-ID: CAEepm=3GtcMM3+_DEAmM5X=xtDwVo7C9mPTY04vkLCmQoT6zCw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, Feb 11, 2019 at 7:07 AM raam narayana <raam(dot)soft(at)gmail(dot)com> wrote:
> After the latest commit on the master branch, I tried to test the Python script. I still see that the output of the script is completely different from the content of the unaccent.rules file. Am I missing anything? My testing consisted of the following.
>
> Downloaded the following files
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Executed the below python script
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using python 3.7.1 and running on Windows 10 Platform
>
> The new status of this patch is: Needs review

Hi Raam,

How does it differ? Can you please share the output you get? I used
Python 2.7 on a Mac, exactly those input files, and my output matched
Hugh's.

--
Thomas Munro
http://www.enterprisedb.com


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: raam narayana <raam(dot)soft(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-11 19:20:42
Message-ID: CAAhbUMODj1cCHjCpZ-=kxJxnVWyTsqu6ZnWe8+gCsb5SGnv=zA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Sun, 10 Feb 2019 at 15:07, raam narayana <raam(dot)soft(at)gmail(dot)com> wrote:

> Hi,
>
> After the latest commit on the master branch, I tried to test the Python
> script. I still see that the output of the script is completely different
> from the content of the unaccent.rules file. Am I missing anything? My
> testing consisted of the following.
>
> Downloaded the following files
>
> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>
>
> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>
> Executed the below python script
>
> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
> --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>
> I am using python 3.7.1 and running on Windows 10 Platform
>
> The new status of this patch is: Needs review
>

Hi Raam,
I just ran generate_unaccent_rules.py under two environments, using the
data files given above:
- Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)
- Python 3.6.7 on Ubuntu 18.04

In both cases, the output was identical to that generated by the program
under Python 2.7. So yes, more information would help. Unfortunately I
don't have a Windows Python environment readily available, but could set
one up if I had to.

Thanks,
Hugh


From: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-11 20:57:31
Message-ID: CAKm4Xs7CBuCW_XQtrVX6ThwSMiL0WK7Cj3nZx2Jymb9eJ=YdMQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Hi Hugh,

I tested the script with Python 2.7 and it works perfectly. The problem is
with Python 3.7 (and maybe only on Windows, as you were not getting the
issue), where I was getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
position 0: character maps to <undefined>

I went through the python script and found that the stdout encoding is set
to utf-8 only if python version is <=2.
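
The error can be reproduced without the script at all. A minimal sketch,
assuming the Windows console or redirect picked a legacy code page such
as cp1252 (the actual code page on a given machine may differ);
can_encode is a hypothetical helper:

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) has no slot in cp1252.
print(can_encode(u"\u0100", "cp1252"))  # False
print(can_encode(u"\u0100", "utf-8"))   # True
```

Any character outside the legacy code page triggers the same failure,
which is why the output has to be encoded as UTF-8 explicitly.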

I have made the same change for Python 3 as well. Please find the patch
attached. Let me know if it makes sense.

Regards,
Ram.

--
Cheers
Ram 4.0

Attachment Content-Type Size
generate_unaccent_rules-remove-combining-diacritical-accents-03.patch application/octet-stream 544 bytes

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-12 04:18:19
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:
> I tested the script with Python 2.7 and it works perfectly. The problem is
> with Python 3.7 (and maybe only on Windows, as you were not getting the
> issue), where I was getting the following error:
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> position 0: character maps to <undefined>
>
> I went through the python script and found that the stdout encoding is set
> to utf-8 only if python version is <=2.
>
> I have made the same change for Python 3 as well. Please find the patch
> attached. Let me know if it makes sense.

Isn't that because Windows encoding becomes cp1252, utf16 or such?
FWIW, on Debian SID with Python 3.7, I get the correct output, and no
diffs on HEAD. Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
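
To illustrate the suggestion, here is a minimal sketch of reading an
input file with an explicit encoding. It uses io.open(), which accepts
encoding= on both Python 2 and Python 3; read_text_utf8 is a
hypothetical helper name, not something from the script.

```python
import io


def read_text_utf8(path):
    """Read a whole file as text, decoding it as UTF-8 regardless of locale."""
    # Without encoding=, Python 3 falls back to the locale's preferred
    # encoding, which is typically a legacy code page on Windows.
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()
```

The "with" block also guarantees that the file is closed once the data
has been read.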
--
Michael


From: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-12 13:54:20
Message-ID: CAKm4Xs4zKcNYW=-E9C8h_o74xhOrw4miZRK0krya1puEqKAECA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Hi Michael,
The issue was that the Python script was working with Python 2 but not
with Python 3 on Windows. This is because the script writes its final
output to stdout, and the stdout encoding is set to utf-8 only for
Python 2, not for Python 3. If no encoding is set for stdout, it takes
the encoding from the operating system. The default encoding on Linux and
Windows might be different, hence this issue.
Regards,
Ram.

--
Cheers
Ram 4.0


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-12 16:21:35
Message-ID: CAAhbUMN-XjGQthPg5jU0Gf-+09EHVTw4m5wEUaw17QhheAEPsg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, 12 Feb 2019 at 08:54, Ramanarayana <raam(dot)soft(at)gmail(dot)com> wrote:

> Hi Michael,
> The issue was that the Python script was working with Python 2 but not
> with Python 3 on Windows. This is because the script writes its final
> output to stdout, and the stdout encoding is set to utf-8 only for
> Python 2, not for Python 3. If no encoding is set for stdout, it takes
> the encoding from the operating system. The default encoding on Linux and
> Windows might be different, hence this issue.
> Regards,
> Ram.
>
I can't look at this today, but will fire up Windows and Python tomorrow,
look at Ram's patch, and see what is going on. I'll also look at how we
open the input files, to see if we should supply an encoding. Those
input files only make sense in UTF-8 anyway.

Ram, thanks for catching this issue.

Hugh


From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-17 00:51:08
Message-ID: CAAhbUMOieimkZrCjpw2vQJ-k3p_jzzNsimdi0aq7dwTvKy2isA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, 11 Feb 2019 at 15:57, Ramanarayana <raam(dot)soft(at)gmail(dot)com> wrote:

> Hi Hugh,
>
> I tested the script with Python 2.7 and it works perfectly. The problem is
> with Python 3.7 (and maybe only on Windows, as you were not getting the
> issue), where I was getting the following error:
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> position 0: character maps to <undefined>
>
> I went through the python script and found that the stdout encoding is
> set to utf-8 only if python version is <=2.
>
> I have made the same change for Python 3 as well. Please find the patch
> attached. Let me know if it makes sense.
>
> Regards,
> Ram
>

Hi Ram,
I took a look at this, and unfortunately the proposed fix breaks Python 2
(sys.stdout.encoding isn't a writable attribute in Python 2) :-(. I've
attached a patch which is compatible with both versions, and have confirmed
that the output is identical across Python 2 and 3, and across Windows and
Linux once the difference in line endings is accounted for.
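
For anyone who wants the general shape of a version-neutral approach
without opening the attachment, here is a minimal sketch. It is an
illustration under my own assumptions, not the attached patch:
utf8_writer and utf8_stdout are hypothetical names, and the only library
call relied upon is codecs.getwriter().

```python
import codecs
import sys


def utf8_writer(binary_stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream)


def utf8_stdout():
    """Return a UTF-8 text writer for stdout on Python 2 or Python 3."""
    if sys.version_info[0] <= 2:
        # Python 2: sys.stdout itself accepts byte strings.
        return utf8_writer(sys.stdout)
    # Python 3: the underlying byte stream is sys.stdout.buffer.
    return utf8_writer(sys.stdout.buffer)
```

A caller would then write rules through the returned object, for example
out = utf8_stdout() followed by out.write(u"\u0100\tA\n"), instead of
relying on whatever encoding the platform chose for sys.stdout.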

I've also opened the Unicode data file in UTF-8 and added a "with" block
which ensures we close the file when we are done with it. The change makes
the Python2 compatibility a little more complex (2 blocks to remove), but
it's the cleanest I could achieve.

The attached patch goes on top of patch 02 (not on top of the broken,
committed 03). I'm hoping that's not a problem. If it is, let me know and
I'll factor out the changes.

Please let me know if you have any questions.

Best wishes,
Hugh

Attachment Content-Type Size
generate_unaccent_rules-remove-combining-diacritical-accents-04.patch text/x-patch 6.8 KB

From: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
To: Hugh Ranalli <hugh(at)whtc(dot)ca>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-17 07:15:39
Message-ID: CAKm4Xs6JrgX-fb-WpNMMJSz1kAp3HP8N7wnUKbU3ZHW7eHoG4A@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Hi Hugh,

The patch I submitted was tested with both Python 2 and 3, and it worked
for me. The single line of code added in the patch runs only in Python 3,
so I don't think it can break Python 2. I would like to see the error you
got in Python 2. Good to know the reported issue is a valid one on
Windows. I tested your patch as well, and it is also working fine.
--
Cheers
Ram 4.0


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Ramanarayana <raam(dot)soft(at)gmail(dot)com>
Cc: Hugh Ranalli <hugh(at)whtc(dot)ca>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-02-18 03:36:48
Message-ID: [email protected]
Lists: pgsql-bugs pgsql-hackers

On Sun, Feb 17, 2019 at 12:45:39PM +0530, Ramanarayana wrote:
> The patch I submitted was tested with both Python 2 and 3, and it worked
> for me. The single line of code added in the patch runs only in Python 3,
> so I don't think it can break Python 2. I would like to see the error you
> got in Python 2. Good to know the reported issue is a valid one on
> Windows. I tested your patch as well, and it is also working fine.

I can see that the commit fest entry associated with this thread has
been switched back from "committed" to "Needs Review", with Thomas
Munro still associated as committer. The thing is that we have
already committed all the bits discussed here, so I am switching the
status back to "committed", which reflects the state of the thread. If
you have a set of fixes for what has been pushed regarding Windows and
Python 2/3 capabilities, I would suggest creating a new entry with
yourself as the author. Spawning a new thread would also be nice, so
that you attract the correct audience; this thread, initially about
diacritical character support for unaccent, has been used more than
enough now.

Python 2/3 support for this script is easy enough to check on Linux,
and now you are adding Windows into the mix...

Thanks,
--
Michael


From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, hugh(at)whtc(dot)ca, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-12-03 09:01:57
Message-ID: CA+hUKG+OG4bkwe6hn0yEBq2eY=HKuy9D_z2UgXeKjbrav7db5g@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Dec 3, 2019 at 9:57 PM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> > > The problem is that I downloaded the latest version of the Latin-ASCII
> > > transliteration file (r34 rather than the r28 specified in the URL). Over 3
> > > years ago (in r29, of course) they changed the file format (
> > > https://unicode.org/cldr/trac/ticket/5873) so that
> > > parse_cldr_latin_ascii_transliterator loads an empty rules set.
> >
> > Ah-hah.
> >
> > > I'd be
> > > happy to either a) support both formats, or b), support just the newest and
> > > update the URL. Option b) is cleaner, and I can't imagine why anyone would
> > > want to use an older rule set (then again, struggling with Unicode always
> > > makes my head hurt; I am not an expert on it). Thoughts?
> >
> > (b) seems sufficient to me, but perhaps someone else has a different
> > opinion.
> >
> > Whichever we do, I think it should be a separate patch from the feature
> > addition for combining diacriticals, just to keep the commit history
> > clear.
>
> +1 for updating to the latest file from time to time. After
> http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
> our special_cases() function will have just the two Cyrillic
> characters, which should almost certainly be handled by adding
> Cyrillic to the ranges we handle via the usual code path, and DEGREE
> CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
> extracted from UnicodeData.txt (or we could just forget about them), and
> then we could drop special_cases().

Aha, CLDR 36 included that change, so when we update we can drop a special case.
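
For illustration, here is a minimal sketch of how the two degree signs
could be derived from Unicode character data rather than hard-coded. It
is not the code in generate_unaccent_rules.py. It assumes Python 3 and
uses the interpreter's bundled unicodedata module as a stand-in for
parsing UnicodeData.txt directly, and the helper name compat_expansion
is hypothetical.

```python
import unicodedata


def compat_expansion(ch):
    """Return the <compat> decomposition of ch as a string, or None.

    Hypothetical helper: reads the decomposition mapping that Python's
    unicodedata module exposes (the same field found in UnicodeData.txt).
    """
    fields = unicodedata.decomposition(ch).split()
    # Only compatibility mappings are of interest here; canonical
    # decompositions and characters with no mapping are skipped.
    if not fields or fields[0] != "<compat>":
        return None
    return "".join(chr(int(cp, 16)) for cp in fields[1:])


# DEGREE CELSIUS (U+2103) and DEGREE FAHRENHEIT (U+2109)
for ch in ("\u2103", "\u2109"):
    print(unicodedata.name(ch), "->", compat_expansion(ch))
```

Both characters carry a compatibility mapping to DEGREE SIGN followed by
a plain letter, so a generic pass over such mappings could cover them
without a dedicated special case.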