mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096

ThanasisMpalatsoukas · 2022-02-14T16:49:11Z

Description

The following code:

<?php
echo mb_convert_case("ΚΑΛΗΣΠΕΡΑ ΣΑΣ", MB_CASE_TITLE, "UTF-8");

Resulted in this output:

Καλησπερα Σασ

But I expected this output instead:

Καλησπερα Σας

PHP Version

7.4.3

Operating System

ubuntu 20.4

The text was updated successfully, but these errors were encountered:

cmb69 · 2022-02-15T08:57:26Z

Confirmed: https://3v4l.org/eYEWE. Looks like MBString doesn't cater to trailing Σs.

Note that PHP 7.4 is no longer actively maintained, so a fix can target PHP-8.0 at most.

cmb69 · 2022-02-17T10:32:32Z

Well, it seems that none of the conditional mappings are implemented in MBString. As such, this is more like a feature request. However, I'm not sure whether MBString should implement these conditional mappings at all; the language-sensitive mappings would require some way to specify the language, and while mb_language() could be used for that, it is currently only used for email messages, and as such could introduce subtle BC breaks. The language-insensitive mappings would require some kind of lookahead, and as such might not fit well to mbfl. @alexdowad, thoughts?

nikic · 2022-02-22T22:01:52Z

Ideally mbstring would implement final sigma handling, I didn't do that at the time for implementation complexity reasons.

alexdowad · 2022-04-09T10:07:59Z

I'd like to apologize that I have taken a long time to investigate this issue. Just looking at it now...

alexdowad · 2022-04-09T10:16:11Z

Hmm, so according to the Unicode 'special casing' table, the only languages which have unique casing requirements are Lithuanian, Turkish, and Azeri. For U+03A3, there is no language condition mentioned... the condition for special casing is just Final_Sigma.

I wonder how many Lithuanian, Turkish, and Azeri PHP users there are? PHP is a popular programming language, so there are almost certainly a few.

Because the "final sigma" rule is not indicated as being language-dependent in the special casing table, should we make it apply regardless of language or locale?

Also, it looks like the "final sigma" rule only impacts the lowercase conversion code. (U+03A3 maps to itself for title and upper case; it's only lowercase where it changes.)

We are currently moving towards doing all operations "a buffer at a time" rather than "a codepoint at a time". Working in this way saves an enormous amount of overhead and makes everything several times faster. After we convert the upper/lower casing code to follow this pattern, I think lookahead would not be such a big problem. We just need to be careful about the case where sigma appears as the last codepoint in a buffer.

Sounds good?

alexdowad · 2022-04-09T10:49:49Z

If we have a significant number of e.g. Lithuanian users, I wonder if we might introduce a variant of mb_strtolower called mb_strtolower_lithuanian or something like that.

I think auto-detecting the user's locale and changing behavior based on it is fraught with problems. First, the locale on the computer where software is tested may not be the same as the locale where it runs. Second, for web-based software, the locale on the web server may not be the same as the locale of the users. When you change behavior based on the machine's locale, you are introducing an extra piece of implicit global state, with all the problems which implicit global state presents in general.

Better to let the programmer be explicit about what they want. If we did have mb_strtolower_lithuanian, I don't expect that most mbstring users would ever take notice of it, but Lithuanians who were writing web apps intended for use by other Lithuanians might.

Of course, said Lithuanians could also implement their language's casing rules in user code, if they wanted to.

Maybe mb_strtolower_lithuanian is a terrible idea. But if we are going to implement those language-specific rules, definitely the choice of rules needs to be made explicit. mb_language is explicit, but I don't know if it is the right choice for this purpose. As @cmb69 said, it is already used for another purpose.

I trust @nikic and @cmb69 on whether implementing these features is useful or not, and what the interface should be like. I'm happy to do the actual implementation.

By the way, @nikic and @cmb69, since you are both more experienced C programmers than I, can I ask whether you agree that including "locales" in POSIX/ANSI C was probably a bad idea? It seems to me that localization should probably have been handled at a higher level of the stack than libc.

It would also be appreciated if you guys can review #8257, which will help to keep mbstring moving forward.

nikic · 2022-04-09T16:41:09Z

Yes, we certainly do not depend on locales for this purpose. In fact, we have made good progress on eliminating locale-dependent functionality, with the most recent step being https://wiki.php.net/rfc/strtolower-ascii.

I think basing this off mbstring.language is viable, though we'd want to make sure that the language can also be passed explicitly to the relevant case-conversion functions, so as to avoid dependence on global state. I don't think that adding per-language functions makes sense.

Supporting Turkish case folding rules at least would probably be of some value, as the way dotted i is handled under default case folding is often perceived as undesirable (creates I with an additional combining mark). I believe we already special-case one Turkish legacy encoding in this regard.

By the way, @nikic and @cmb69, since you are both more experienced C programmers than I, can I ask whether you agree that including "locales" in POSIX/ANSI C was probably a bad idea? It seems to me that localization should probably have been handled at a higher level of the stack than libc.

Yes, locales are easily one of the contenders for "worst design mistake in C", though of course there is a lot of competition in this space ;)

alexdowad · 2022-04-09T18:10:09Z

I think basing this off mbstring.language is viable, though we'd want to make sure that the language can also be passed explicitly to the relevant case-conversion functions, so as to avoid dependence on global state. I don't think that adding per-language functions makes sense.

OK, cool. Would adding an optional 3rd argument to mb_strtolower et al require an RFC or something?

The languages recognized by mbstring are English, German, Armenian, Japanese, Korean, Russian, Turkish, Ukrainian, and Chinese. Then we have languages called "neutral" and "universal". 🤷‍♂️

So we would need to add Lithuanian and Azeri there?

Supporting Turkish case folding rules at least would probably be of some value, as the way dotted i is handled under default case folding is often perceived as undesirable (creates I with an additional combining mark). I believe we already special-case one Turkish legacy encoding in this regard.

OK. More homework for me to look into that. I guess we would keep the same behavior to avoid a BC break, even if mbstring.language is not Turkish?

Yes, locales are easily one of the contenders for "worst design mistake in C", though of course there is a lot of competition in this space ;)

OK, thanks for confirming my feelings on that issue.

Side note: Have you ever seen the implementation of the POSIX/C locale APIs in musl libc? It's not to be missed.

cmb69 · 2022-04-12T10:14:58Z

If we are not sure about the language-sensitive mappings, maybe we should just go for the language-insensitive mappings for now.

alexdowad · 2022-04-12T10:59:47Z

If we are not sure about the language-sensitive mappings, maybe we should just go for the language-insensitive mappings for now.

I'm also cool with that.

Again, so as to move forward with this one, it will be much appreciated if my currently open PR for mbstring can be reviewed. I think it's in the critical path.

Kos-M · 2022-05-30T08:35:42Z

This affected me also.
Quick workarround:

<?php

	/**
	 * Splits given text to words seperated by space ,
	 * then checks if there is σ instead of ς in last character of each word.
	 * Returns corrected text.
	 */
	function fixTrailingSigmas($textToCorrect){
		$replacement = "ς";    // What you want to replace the last character with
		$words  = explode(  ' ', $textToCorrect );
		$corrected = $textToCorrect;
		$ans = [];
		foreach ($words as $w ){
			$fixed = $w;
			if ( mb_strpos( mb_substr($w, -1)  , 'σ') === 0 ){
				$fixed =  mb_substr($w, 0, -1).$replacement;
			} 
			array_push( $ans , $fixed);
		}
		$corrected = implode( ' ', $ans);
		return $corrected;
	}

noraj · 2023-01-06T14:20:20Z

I'm not sure whether MBString should implement these conditional mappings at all; the language-sensitive mappings would require some way to specify the language

For full case mapping, you have:

context-sensitive stuff, that not necessarily depends on the languages, the sigma example is locale / language independent
language-sensitive stuff, that depends on the locale / language but not necessarily on the context

I wonder how many Lithuanian, Turkish, and Azeri PHP users there are? PHP is a popular programming language, so there are almost certainly a few.

Even if there isn't any PHP dev in those countries (and I believe there are), international companies like Facebook have users all around the worlds including in those countries.

If we have a significant number of e.g. Lithuanian users, I wonder if we might introduce a variant of mb_strtolower called mb_strtolower_lithuanian or something like that.

In other languages you have either a language independent and a language dependent version of the function or you have only one function that is language dependent. In both case you most always have a locale autodetection and an optional argument where you can specify the locale to override the autodetection. Actually in PHP none of the mb_ string manipulation functions take into consideration the locale so you would have to call locale_get_default before and after each mb_ call that wouldn't be viable, so most mb_ functions would need to be updated to have a locale optional argument added. I don't think adding a variant per language like mb_strtolower_lithuanian is reasonable.

The capital Greek letter sigma (Σ) should be lowercased as σ except when it appears at the end of a word; in that case, it should be lowercased as the special form ς. This rule is included in the Unicode data file SpecialCasing.txt. The condition for applying the rule is called "Final_Sigma" and is defined in Unicode technical report 21. The rule is: • For the special casing form to apply, the capital letter sigma must be preceded by 0 or more "case-ignorable" characters, preceded by at least 1 "cased" character. • Further, capital sigma must NOT be followed by 0 or more case-ignorable characters and then at least 1 cased character. "Case-ignorable" characters include certain punctuation marks, like the apostrophe, as well as various accent marks. There are actually close to 500 different case-ignorable characters, including accent marks from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati, Telugu, Tibetan, and many other alphabets. This category also includes zero-width spaces, codepoints which indicate RTL/LTR text direction, certain musical symbols, etc. Since the rule involves scanning over "0 or more" of such case-ignorable characters, it may be necessary to scan arbitrarily far to the left and right of capital sigma to determine whether the special lowercase form should be used or not. However, since we are trying to be both memory-efficient and CPU-efficient, this implementation limits how far to the left we will scan. Generally, we scan up to 63 characters to the left looking for a "cased" character, but not more. When scanning to the right, we go up to the end of the string if necessary, even if it means scanning over thousands of characters. Anyways, it is almost impossible to imagine that natural text will include "words" with more than 63 successive apostrophes (for example) followed by a capital sigma. Closes phpGH-8096.

ThanasisMpalatsoukas added Bug Status: Needs Triage labels Feb 14, 2022

cmb69 added Extension: mbstring Status: Verified and removed Status: Needs Triage labels Feb 15, 2022

cmb69 added Feature and removed Bug labels Feb 17, 2022

alexdowad mentioned this issue Jan 9, 2023

Implement Unicode conditional casing rules for Greek letter sigma #10268

Closed

alexdowad closed this as completed in 39b46a5 Jan 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096

mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096

ThanasisMpalatsoukas commented Feb 14, 2022 •

edited

Loading

cmb69 commented Feb 15, 2022

cmb69 commented Feb 17, 2022

nikic commented Feb 22, 2022

alexdowad commented Apr 9, 2022

alexdowad commented Apr 9, 2022

alexdowad commented Apr 9, 2022

nikic commented Apr 9, 2022 •

edited

Loading

alexdowad commented Apr 9, 2022

cmb69 commented Apr 12, 2022

alexdowad commented Apr 12, 2022

Kos-M commented May 30, 2022 •

edited

Loading

noraj commented Jan 6, 2023

mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096

mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096

Comments

ThanasisMpalatsoukas commented Feb 14, 2022 • edited Loading

Description

PHP Version

Operating System

cmb69 commented Feb 15, 2022

cmb69 commented Feb 17, 2022

nikic commented Feb 22, 2022

alexdowad commented Apr 9, 2022

alexdowad commented Apr 9, 2022

alexdowad commented Apr 9, 2022

nikic commented Apr 9, 2022 • edited Loading

alexdowad commented Apr 9, 2022

cmb69 commented Apr 12, 2022

alexdowad commented Apr 12, 2022

Kos-M commented May 30, 2022 • edited Loading

noraj commented Jan 6, 2023

ThanasisMpalatsoukas commented Feb 14, 2022 •

edited

Loading

nikic commented Apr 9, 2022 •

edited

Loading

Kos-M commented May 30, 2022 •

edited

Loading