-
Notifications
You must be signed in to change notification settings - Fork 7.8k
Bug 69079 - enhancement for mb_substitute_character #1094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|| no_enc == mbfl_no_encoding_utf8_sb | ||
) { | ||
if ((cp > 0 && 0xd800 > cp) || (cp > 0xdfff && 0x110000 > cp)) { | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like PHP, not C :)
Please fix the build, now it fails. |
I wish to have mb_chr()/mb_ord() someday. Thank you. Please fix build, then Stas or I merge it after checking code. |
Could you fetch current master branch and merge them? |
@yohgaki sorry for late reply. I spent my summer vacation and thanks for your efforts. |
Merged f9a435a Thanks PS. sorry about the long wait. |
* pull-request/1094: added php_mb_check_code_point for mb_substitute_character news entry for PR #1094
After investigating #1100 (comment) a bit more, I've noticed that this PR has a similar issue. However in this case it's not just a design problem: the codepoint check does not match the way it ends up being used. The substitute_character is always treated as a Unicode codepoint by the mbfl_convert module, but the check introduced here makes the same unicode/non-unicode distinction as in #1100. As an example, consider this code: <?php
$enc = 'EUC-JP-2004';
mb_internal_encoding($enc);
mb_substitute_character(0x8fa1ef); // EUC-JP-2004 encoding of U+50AA
var_dump(bin2hex(mb_convert_encoding("\x8d", $enc, $enc)));
mb_substitute_character(0x50aa);
var_dump(bin2hex(mb_convert_encoding("\x8d", $enc, $enc))); Under PHP 7.1 this produces:
Under PHP 7.2 with this patch it produces:
The behavior in PHP 7.1 is correct, because mb_substitute_character() accepts a Unicode codepoint -- passing U+50AA is accepted and properly encodes to 0x8FA1EF in EUC-JP-2004. The behavior in PHP 7.2 is broken, as passing 0x8fa1ef passes the check in mb_substitute_character(), but is later ignored in mbfl_convert because it's not a legal unicode codepoint, while passing 0x50aa doesn't work either, because the check in mb_substitute_character() rejects it. However even if this is fixed, there is still a more fundamental problem here: The check is performed against the internal_encoding. However the substitute_character must be legal in the target encoding of the encoding conversion (note: mb_convert_encoding uses the internal encoding as the default source encoding, not target encoding). As such, it is not actually possible to check whether the substitute character is going to be legal -- aside from the very general check that was already implemented previously. |
I've fixed this in a8a9e93. The check is now again a simple range check. This still resolves the original report from bug #69079 -- the only real problem there was that our range checks were restricted to the BMP. |
This pull reques adds function for checking the argument of range.