Skip to content

Bug 69079 - enhancement for mb_substitute_character #1094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jan 4, 2017
Merged

Bug 69079 - enhancement for mb_substitute_character #1094

merged 20 commits into from
Jan 4, 2017

Conversation

masakielastic
Copy link
Contributor

This pull reques adds function for checking the argument of range.

|| no_enc == mbfl_no_encoding_utf8_sb
) {
if ((cp > 0 && 0xd800 > cp) || (cp > 0xdfff && 0x110000 > cp)) {
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like PHP, not C :)

@smalyshev
Copy link
Contributor

Please fix the build, now it fails.

@yohgaki
Copy link
Contributor

yohgaki commented Feb 22, 2015

I wish to have mb_chr()/mb_ord() someday. Thank you. Please fix build, then Stas or I merge it after checking code.

@yohgaki
Copy link
Contributor

yohgaki commented Aug 10, 2016

Could you fetch current master branch and merge them?
I changed your other patch to optimize a little. There should be conflict because of this. Thank you.

@masakielastic
Copy link
Contributor Author

@yohgaki sorry for late reply. I spent my summer vacation and thanks for your efforts.

@php-pulls php-pulls merged commit c28a6f4 into php:master Jan 4, 2017
@krakjoe
Copy link
Member

krakjoe commented Jan 4, 2017

Merged f9a435a

Thanks

PS. sorry about the long wait.

php-pulls pushed a commit that referenced this pull request Jan 4, 2017
* pull-request/1094:
  added php_mb_check_code_point for mb_substitute_character
  news entry for PR #1094
@nikic
Copy link
Member

nikic commented Aug 3, 2017

After investigating #1100 (comment) a bit more, I've noticed that this PR has a similar issue. However in this case it's not just a design problem: the codepoint check does not match the way it ends up being used. The substitute_character is always treated as a Unicode codepoint by the mbfl_convert module, but the check introduced here makes the same unicode/non-unicode distinction as in #1100.

As an example, consider this code:

<?php

$enc = 'EUC-JP-2004';
mb_internal_encoding($enc);

mb_substitute_character(0x8fa1ef); // EUC-JP-2004 encoding of U+50AA
var_dump(bin2hex(mb_convert_encoding("\x8d", $enc, $enc)));

mb_substitute_character(0x50aa);
var_dump(bin2hex(mb_convert_encoding("\x8d", $enc, $enc)));

Under PHP 7.1 this produces:

Warning: mb_substitute_character(): Unknown character.
string(2) "3f"
string(6) "8fa1ef"

Under PHP 7.2 with this patch it produces:

string(0) ""
Warning: mb_substitute_character(): Unknown character.
string(0) ""

The behavior in PHP 7.1 is correct, because mb_substitute_character() accepts a Unicode codepoint -- passing U+50AA is accepted and properly encodes to 0x8FA1EF in EUC-JP-2004. The behavior in PHP 7.2 is broken, as passing 0x8fa1ef passes the check in mb_substitute_character(), but is later ignored in mbfl_convert because it's not a legal unicode codepoint, while passing 0x50aa doesn't work either, because the check in mb_substitute_character() rejects it.

However even if this is fixed, there is still a more fundamental problem here: The check is performed against the internal_encoding. However the substitute_character must be legal in the target encoding of the encoding conversion (note: mb_convert_encoding uses the internal encoding as the default source encoding, not target encoding). As such, it is not actually possible to check whether the substitute character is going to be legal -- aside from the very general check that was already implemented previously.

@nikic
Copy link
Member

nikic commented Aug 3, 2017

I've fixed this in a8a9e93. The check is now again a simple range check. This still resolves the original report from bug #69079 -- the only real problem there was that our range checks were restricted to the BMP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants