Bug 69079 - enhancement for mb_substitute_character #1094

masakielastic · 2015-02-19T13:54:38Z

This pull request adds a function for checking the range of the argument.

smalyshev · 2015-02-19T17:44:16Z

ext/mbstring/mbstring.c

+		|| no_enc == mbfl_no_encoding_utf8_sb
+	) {
+		if ((cp > 0 && 0xd800 > cp) || (cp > 0xdfff && 0x110000 > cp)) {
+			return true;


This looks like PHP, not C :)

smalyshev · 2015-02-19T17:44:37Z

Please fix the build; it currently fails.

yohgaki · 2015-02-22T02:12:41Z

I have been hoping to have mb_chr()/mb_ord() someday. Thank you. Please fix the build; Stas or I will then merge it after checking the code.

yohgaki · 2016-08-10T06:30:33Z

Could you fetch the current master branch and merge it into your branch?
I changed your other patch to optimize it a little, so there should be conflicts. Thank you.

masakielastic · 2016-08-13T23:41:17Z

@yohgaki Sorry for the late reply; I was on summer vacation. Thank you for your efforts.

krakjoe · 2017-01-04T06:58:20Z

Merged f9a435a

Thanks

PS. sorry about the long wait.

* pull-request/1094:
  added php_mb_check_code_point for mb_substitute_character
  news entry for PR #1094

nikic · 2017-08-03T18:32:44Z

After investigating #1100 (comment) a bit more, I've noticed that this PR has a similar issue. However, in this case it's not just a design problem: the codepoint check does not match how the value ends up being used. The substitute_character is always treated as a Unicode codepoint by the mbfl_convert module, but the check introduced here makes the same Unicode/non-Unicode distinction as in #1100.

As an example, consider this code:

<?php

$enc = 'EUC-JP-2004';
mb_internal_encoding($enc);

mb_substitute_character(0x8fa1ef); // EUC-JP-2004 encoding of U+50AA
var_dump(bin2hex(mb_convert_encoding("\x8d", $enc, $enc)));

mb_substitute_character(0x50aa);
var_dump(bin2hex(mb_convert_encoding("\x8d", $enc, $enc)));

Under PHP 7.1 this produces:

Warning: mb_substitute_character(): Unknown character.
string(2) "3f"
string(6) "8fa1ef"

Under PHP 7.2 with this patch it produces:

string(0) ""
Warning: mb_substitute_character(): Unknown character.
string(0) ""

The behavior in PHP 7.1 is correct, because mb_substitute_character() accepts a Unicode codepoint -- passing U+50AA is accepted and properly encodes to 0x8FA1EF in EUC-JP-2004. The behavior in PHP 7.2 is broken: 0x8fa1ef passes the check in mb_substitute_character() but is later ignored in mbfl_convert because it's not a legal Unicode codepoint, while 0x50aa doesn't work either, because the check in mb_substitute_character() rejects it.

However, even if this is fixed, there is still a more fundamental problem: the check is performed against the internal_encoding, but the substitute_character must be legal in the target encoding of the conversion (note: mb_convert_encoding uses the internal encoding as the default source encoding, not the target encoding). As such, it is not actually possible to check whether the substitute character is going to be legal -- aside from the very general check that was already implemented previously.

nikic · 2017-08-03T19:21:15Z

I've fixed this in a8a9e93. The check is now again a simple range check. This still resolves the original report from bug #69079 -- the only real problem there was that our range checks were restricted to the BMP.
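
For readers following along, here is a minimal sketch of what an encoding-independent range check could look like. It is not the code from a8a9e93: the function name is_unicode_scalar is hypothetical, and the bounds are simply the condition quoted in the review diff above (non-zero, not a surrogate, below 0x110000).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (name and signature are assumptions, not taken from
 * mbstring.c). Accepts any non-zero Unicode scalar value: everything up to
 * U+10FFFF except the surrogate range U+D800..U+DFFF. The check does not
 * depend on the internal encoding and is not limited to the BMP.
 */
bool is_unicode_scalar(uint32_t cp)
{
    return (cp > 0 && cp < 0xd800) || (cp > 0xdfff && cp < 0x110000);
}
```

With a check of this shape, 0x50aa (U+50AA) is accepted and 0x8fa1ef (the EUC-JP-2004 byte sequence read as an integer) is rejected regardless of mb_internal_encoding(), which matches the PHP 7.1 behavior described above. Codepoints outside the BMP, such as 0x1f600, are accepted as well.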

masakielastic added 13 commits February 19, 2015 12:42

added php_mb_check_encoding

cb02838

added mb_ord

c948e44

added utf32 and ucs4 for available encodings

b065956

added check for forbidden encodings

ec4f74b

added utf16 and ucs2 for supported encodings

674b67c

added support for various encodings other than unicode

b9b47c8

added php_mb_check_forbidden_encoding

a0890c7

added mb_chr

c6e94bf

added check by php_mb_check_forbidden_encoding

a8ef8a2

added various encoding support other than unicode

f10a182

use php_mb_convert_encoding instead of php_mb_check_encoding

89e9746

changed the position of calling php_mb_check_forbidden_encoding

f303f59

added php_mb_check_code_point for mb_substitute_character

aa7eceb

smalyshev reviewed Feb 19, 2015
smalyshev added the Feature label Feb 19, 2015

masakielastic added 5 commits February 22, 2015 16:13

fix memory leak

99d90f1

fix return type

6be0f8d

delete unnecessary functions

32de1cf

update the functions for checking the names of encodings

45a8a9c

replace emalloc with safe_emalloc

a22d27d

masakielastic added 2 commits August 14, 2016 06:29

delete duplicate functions

f49a5a6

add declaration of functions

c28a6f4

php-pulls merged commit c28a6f4 into php:master Jan 4, 2017

php-pulls pushed a commit that referenced this pull request Jan 4, 2017

Merge branch 'pull-request/1094'

f9a435a

* pull-request/1094:
  added php_mb_check_code_point for mb_substitute_character
  news entry for PR #1094

This was referenced Aug 3, 2017

Request #69086 - enhancement for mb_convert_encoding #1098

Closed

Request #66024 - mb_chr() and mb_ord() #1100

Merged
