-
Notifications
You must be signed in to change notification settings - Fork 7.8k
mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Confirmed: https://3v4l.org/eYEWE. Looks like MBString doesn't cater to trailing Σs. Note that PHP 7.4 is no longer actively maintained, so a fix can target PHP-8.0 at most. |
Well, it seems that none of the conditional mappings are implemented in MBString. As such, this is more like a feature request. However, I'm not sure whether MBString should implement these conditional mappings at all; the language-sensitive mappings would require some way to specify the language, and while |
Ideally mbstring would implement final sigma handling, I didn't do that at the time for implementation complexity reasons. |
I'd like to apologize that I have taken a long time to investigate this issue. Just looking at it now... |
Hmm, so according to the Unicode 'special casing' table, the only languages which have unique casing requirements are Lithuanian, Turkish, and Azeri. For U+03A3, there is no language condition mentioned... the condition for special casing is just I wonder how many Lithuanian, Turkish, and Azeri PHP users there are? PHP is a popular programming language, so there are almost certainly a few. Because the "final sigma" rule is not indicated as being language-dependent in the special casing table, should we make it apply regardless of language or locale? Also, it looks like the "final sigma" rule only impacts the lowercase conversion code. (U+03A3 maps to itself for title and upper case; it's only lowercase where it changes.) We are currently moving towards doing all operations "a buffer at a time" rather than "a codepoint at a time". Working in this way saves an enormous amount of overhead and makes everything several times faster. After we convert the upper/lower casing code to follow this pattern, I think lookahead would not be such a big problem. We just need to be careful about the case where sigma appears as the last codepoint in a buffer. Sounds good? |
If we have a significant number of e.g. Lithuanian users, I wonder if we might introduce a variant of I think auto-detecting the user's locale and changing behavior based on it is fraught with problems. First, the locale on the computer where software is tested may not be the same as the locale where it runs. Second, for web-based software, the locale on the web server may not be the same as the locale of the users. When you change behavior based on the machine's locale, you are introducing an extra piece of implicit global state, with all the problems which implicit global state presents in general. Better to let the programmer be explicit about what they want. If we did have Of course, said Lithuanians could also implement their language's casing rules in user code, if they wanted to. Maybe I trust @nikic and @cmb69 on whether implementing these features is useful or not, and what the interface should be like. I'm happy to do the actual implementation. By the way, @nikic and @cmb69, since you are both more experienced C programmers than I, can I ask whether you agree that including "locales" in POSIX/ANSI C was probably a bad idea? It seems to me that localization should probably have been handled at a higher level of the stack than libc. It would also be appreciated if you guys can review #8257, which will help to keep |
Yes, we certainly do not depend on locales for this purpose. In fact, we have made good progress on eliminating locale-dependent functionality, with the most recent step being https://wiki.php.net/rfc/strtolower-ascii. I think basing this off mbstring.language is viable, though we'd want to make sure that the language can also be passed explicitly to the relevant case-conversion functions, so as to avoid dependence on global state. I don't think that adding per-language functions makes sense. Supporting Turkish case folding rules at least would probably be of some value, as the way dotted i is handled under default case folding is often perceived as undesirable (creates I with an additional combining mark). I believe we already special-case one Turkish legacy encoding in this regard.
Yes, locales are easily one of the contenders for "worst design mistake in C", though of course there is a lot of competition in this space ;) |
OK, cool. Would adding an optional 3rd argument to The languages recognized by mbstring are English, German, Armenian, Japanese, Korean, Russian, Turkish, Ukrainian, and Chinese. Then we have languages called "neutral" and "universal". 🤷♂️ So we would need to add Lithuanian and Azeri there?
OK. More homework for me to look into that. I guess we would keep the same behavior to avoid a BC break, even if mbstring.language is not Turkish?
OK, thanks for confirming my feelings on that issue. Side note: Have you ever seen the implementation of the POSIX/C locale APIs in musl libc? It's not to be missed. |
If we are not sure about the language-sensitive mappings, maybe we should just go for the language-insensitive mappings for now. |
I'm also cool with that. Again, so as to move forward with this one, it will be much appreciated if my currently open PR for mbstring can be reviewed. I think it's in the critical path. |
This affected me also. <?php
/**
* Splits given text to words seperated by space ,
* then checks if there is σ instead of ς in last character of each word.
* Returns corrected text.
*/
function fixTrailingSigmas($textToCorrect){
$replacement = "ς"; // What you want to replace the last character with
$words = explode( ' ', $textToCorrect );
$corrected = $textToCorrect;
$ans = [];
foreach ($words as $w ){
$fixed = $w;
if ( mb_strpos( mb_substr($w, -1) , 'σ') === 0 ){
$fixed = mb_substr($w, 0, -1).$replacement;
}
array_push( $ans , $fixed);
}
$corrected = implode( ' ', $ans);
return $corrected;
} |
For full case mapping, you have:
Even if there isn't any PHP dev in those countries (and I believe there are), international companies like Facebook have users all around the worlds including in those countries.
In other languages you have either a language independent and a language dependent version of the function or you have only one function that is language dependent. In both case you most always have a locale autodetection and an optional argument where you can specify the locale to override the autodetection. Actually in PHP none of the |
The capital Greek letter sigma (Σ) should be lowercased as σ except when it appears at the end of a word; in that case, it should be lowercased as the special form ς. This rule is included in the Unicode data file SpecialCasing.txt. The condition for applying the rule is called "Final_Sigma" and is defined in Unicode technical report 21. The rule is: • For the special casing form to apply, the capital letter sigma must be preceded by 0 or more "case-ignorable" characters, preceded by at least 1 "cased" character. • Further, capital sigma must NOT be followed by 0 or more case-ignorable characters and then at least 1 cased character. "Case-ignorable" characters include certain punctuation marks, like the apostrophe, as well as various accent marks. There are actually close to 500 different case-ignorable characters, including accent marks from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati, Telugu, Tibetan, and many other alphabets. This category also includes zero-width spaces, codepoints which indicate RTL/LTR text direction, certain musical symbols, etc. Since the rule involves scanning over "0 or more" of such case-ignorable characters, it may be necessary to scan arbitrarily far to the left and right of capital sigma to determine whether the special lowercase form should be used or not. However, since we are trying to be both memory-efficient and CPU-efficient, this implementation limits how far to the left we will scan. Generally, we scan up to 63 characters to the left looking for a "cased" character, but not more. When scanning to the right, we go up to the end of the string if necessary, even if it means scanning over thousands of characters. Anyways, it is almost impossible to imagine that natural text will include "words" with more than 63 successive apostrophes (for example) followed by a capital sigma. Closes phpGH-8096.
The capital Greek letter sigma (Σ) should be lowercased as σ except when it appears at the end of a word; in that case, it should be lowercased as the special form ς. This rule is included in the Unicode data file SpecialCasing.txt. The condition for applying the rule is called "Final_Sigma" and is defined in Unicode technical report 21. The rule is: • For the special casing form to apply, the capital letter sigma must be preceded by 0 or more "case-ignorable" characters, preceded by at least 1 "cased" character. • Further, capital sigma must NOT be followed by 0 or more case-ignorable characters and then at least 1 cased character. "Case-ignorable" characters include certain punctuation marks, like the apostrophe, as well as various accent marks. There are actually close to 500 different case-ignorable characters, including accent marks from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati, Telugu, Tibetan, and many other alphabets. This category also includes zero-width spaces, codepoints which indicate RTL/LTR text direction, certain musical symbols, etc. Since the rule involves scanning over "0 or more" of such case-ignorable characters, it may be necessary to scan arbitrarily far to the left and right of capital sigma to determine whether the special lowercase form should be used or not. However, since we are trying to be both memory-efficient and CPU-efficient, this implementation limits how far to the left we will scan. Generally, we scan up to 63 characters to the left looking for a "cased" character, but not more. When scanning to the right, we go up to the end of the string if necessary, even if it means scanning over thousands of characters. Anyways, it is almost impossible to imagine that natural text will include "words" with more than 63 successive apostrophes (for example) followed by a capital sigma. Closes phpGH-8096.
The capital Greek letter sigma (Σ) should be lowercased as σ except when it appears at the end of a word; in that case, it should be lowercased as the special form ς. This rule is included in the Unicode data file SpecialCasing.txt. The condition for applying the rule is called "Final_Sigma" and is defined in Unicode technical report 21. The rule is: • For the special casing form to apply, the capital letter sigma must be preceded by 0 or more "case-ignorable" characters, preceded by at least 1 "cased" character. • Further, capital sigma must NOT be followed by 0 or more case-ignorable characters and then at least 1 cased character. "Case-ignorable" characters include certain punctuation marks, like the apostrophe, as well as various accent marks. There are actually close to 500 different case-ignorable characters, including accent marks from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati, Telugu, Tibetan, and many other alphabets. This category also includes zero-width spaces, codepoints which indicate RTL/LTR text direction, certain musical symbols, etc. Since the rule involves scanning over "0 or more" of such case-ignorable characters, it may be necessary to scan arbitrarily far to the left and right of capital sigma to determine whether the special lowercase form should be used or not. However, since we are trying to be both memory-efficient and CPU-efficient, this implementation limits how far to the left we will scan. Generally, we scan up to 63 characters to the left looking for a "cased" character, but not more. When scanning to the right, we go up to the end of the string if necessary, even if it means scanning over thousands of characters. Anyways, it is almost impossible to imagine that natural text will include "words" with more than 63 successive apostrophes (for example) followed by a capital sigma. Closes phpGH-8096.
Description
The following code:
Resulted in this output:
But I expected this output instead:
PHP Version
7.4.3
Operating System
ubuntu 20.4
The text was updated successfully, but these errors were encountered: