[ruby-core:70853] [Ruby trunk - Bug #6258] [Feedback] String#succ has suprising behavior for "\u1036" (MYANMAR SIGN ANUSVARA), producing "\u1000" instead of "\u1037"
From:
duerst@...
Date:
2015-09-18 09:18:23 UTC
List:
ruby-core #70853
Issue #6258 has been updated by Martin Dürst.
Status changed from Assigned to Feedback
Some information gathered during today's committers' meeting:
This is the relevant information from http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt:
1035;MYANMAR VOWEL SIGN E ABOVE;Mn;0;NSM;;;;;N;;;;;
1036;MYANMAR SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;
1037;MYANMAR SIGN DOT BELOW;Mn;7;NSM;;;;;N;;;;;
1038;MYANMAR SIGN VISARGA;Mc;0;L;;;;;N;;;;;
1039;MYANMAR SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
103A;MYANMAR SIGN ASAT;Mn;9;NSM;;;;;N;;;;;
The only difference between U+1036 and U+1037 is the Canonical Combining Class (fourth item, 0 vs. 7).
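The field comparison above can be checked mechanically. The following is a minimal sketch that parses only the six records quoted in this message; the names UCD_SAMPLE and parse_ucd are made up for illustration, and the field positions assume the usual semicolon-separated UnicodeData.txt layout (code point, name, General_Category, Canonical_Combining_Class, Bidi_Class, ...).

```ruby
# Records copied verbatim from the message above (UnicodeData.txt format).
UCD_SAMPLE = <<~DATA
  1035;MYANMAR VOWEL SIGN E ABOVE;Mn;0;NSM;;;;;N;;;;;
  1036;MYANMAR SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;
  1037;MYANMAR SIGN DOT BELOW;Mn;7;NSM;;;;;N;;;;;
  1038;MYANMAR SIGN VISARGA;Mc;0;L;;;;;N;;;;;
  1039;MYANMAR SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
  103A;MYANMAR SIGN ASAT;Mn;9;NSM;;;;;N;;;;;
DATA

# Hypothetical helper: turn each record into a small hash.
# Only the first five fields are kept; the rest are ignored here.
def parse_ucd(text)
  text.each_line(chomp: true).map do |line|
    f = line.split(";", -1) # -1 keeps trailing empty fields
    { code: f[0].to_i(16), name: f[1], category: f[2], ccc: f[3].to_i, bidi: f[4] }
  end
end

by_code = parse_ucd(UCD_SAMPLE).map { |r| [r[:code], r] }.to_h

a = by_code[0x1036]
b = by_code[0x1037]

# Compare every parsed property except the code point and the name.
diff = (a.keys - [:code, :name]).reject { |k| a[k] == b[k] }
puts "properties that differ: #{diff.inspect}"
```

Among the properties parsed here, only the combining class differs between the two records.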
The code chart for Myanmar is at http://www.unicode.org/charts/PDF/U1000.pdf.
Relevant information about the script in the Unicode Standard is at http://www.unicode.org/versions/Unicode8.0.0/ch12.pdf (pp. 11ff, in particular the table at p. 13).
The idea behind the behavior of String#succ is to use each character as a digit and cycle through the characters in the same alphabet. The simplest case is a..z or A..Z. The implementation works to some extent for many other scripts, but is dependent on things such as whether the characters appear contiguously in the relevant character encoding, ...
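For readers less familiar with the "character as a digit" idea, here is a short illustration restricted to ASCII, where the behavior is well documented. It deliberately avoids Myanmar input, since the result there is exactly what this issue is about.

```ruby
# Each pair is [input, value returned by String#succ].
pairs = %w[a az zz Zz a9].map { |s| [s, s.succ] }

pairs.each do |input, output|
  puts "#{input.inspect}.succ => #{output.inspect}"
end
# "az" carries into the left position like a digit rolling over,
# and "zz" grows by one character because the leftmost position overflowed.
```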
It is unclear what characters 'ideally' should be looped through for Myanmar. For example, the W3C does not (yet?) have an alphabetic list style for Myanmar (see http://www.w3.org/TR/predefined-counter-styles/#myanmar-styles); the same applies for most related scripts (Indic/South East Asian). There are good arguments for looping only through the (base) consonants (U+1000..U+1020). Some variations might include independent vowels, and language-specific variants may include the relevant extension characters.
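To make the "base consonants only" option concrete, here is a rough sketch of what such a successor could look like. This is not how String#succ is implemented and not a proposal for the fix; consonant_succ and MYANMAR_CONSONANTS are hypothetical names, and the choice of U+1000..U+1020 as the alphabet is just the range mentioned above.

```ruby
# Assumed alphabet: the base consonants U+1000..U+1020 (33 characters).
MYANMAR_CONSONANTS = (0x1000..0x1020).map { |cp| [cp].pack("U") }.freeze

# Hypothetical successor: treats each character as a digit over the given
# alphabet, carries to the left on overflow, and grows the string when the
# leftmost position overflows (mirroring "zz".succ => "aaa").
def consonant_succ(str, alphabet = MYANMAR_CONSONANTS)
  return "" if str.empty?

  chars = str.chars
  i = chars.length - 1
  while i >= 0
    idx = alphabet.index(chars[i])
    raise ArgumentError, "outside alphabet: #{chars[i].inspect}" if idx.nil?

    if idx < alphabet.length - 1
      chars[i] = alphabet[idx + 1] # no overflow: bump this digit and stop
      return chars.join
    end

    chars[i] = alphabet.first      # overflow: wrap this digit and carry left
    i -= 1
  end
  alphabet.first + chars.join      # carried out of the leftmost digit
end

puts consonant_succ("\u1000").ord.to_s(16)
puts consonant_succ("\u1020").codepoints.map { |cp| cp.to_s(16) }.inspect
```

In this sketch a combining sign such as U+1036 is rejected with an ArgumentError instead of being mapped silently to U+1000; whether that is desirable is part of the open question.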
In the current implementation, the behavior observed seems to be a consequence of how the String#succ method uses character data provided by Oniguruma/Onigmo. As the subject of the bug says, the current behavior is indeed surprising. But the current implementation isn't really of any use for any but some very selected scripts, and Myanmar is definitely not among them.
Once we have information from some reliable source about which characters are most suitable to loop through in Myanmar, we can think about how to fix this problem. So I'm going to set this to "feedback".
----------------------------------------
Bug #6258: String#succ has suprising behavior for "\u1036" (MYANMAR SIGN ANUSVARA), producing "\u1000" instead of "\u1037"
https://bugs.ruby-lang.org/issues/6258#change-54226
* Author: Devin Ben-Hur
* Status: Feedback
* Priority: Normal
* Assignee: Martin Dürst
* ruby -v: ruby 1.9.3p125, ruby 1.9.2p180,
* Backport:
----------------------------------------
"\u1036".succ.ord.to_s(16) # => "1000"
Discovered when investigating StackOverflow question http://stackoverflow.com/questions/10020230/anomalous-behavior-while-comparing-a-unicode-character-to-a-unicode-character-range
Range#=== ultimately invokes String#upto which uses String#succ
("\u1036".."\u1037").to_a.map{|c| c.ord.to_s(16)}
=> ["1036"] # expected ["1036","1037"]
Also, once #succ! proceeds past U+1036, it continues to produce U+1000 indefinitely:
irb(main):115:0> c = "\u1036"
=> "ဵ"
irb(main):116:0> c.ord.to_s(16)
=> "1035"
irb(main):117:0> c.succ!.ord.to_s(16)
=> "1036"
irb(main):118:0> c.succ!.ord.to_s(16)
=> "1000"
irb(main):119:0> c.succ!.ord.to_s(16)
=> "1000"
But if one starts naturally at U+1000, #succ! increments as expected:
irb(main):001:0> c = "\u1000"
=> "က"
irb(main):002:0> c.ord.to_s(16)
=> "1000"
irb(main):003:0> c.succ!.ord.to_s(16)
=> "1001"
irb(main):004:0> c.succ!.ord.to_s(16)
=> "1002"
-- 
https://bugs.ruby-lang.org/