mb_strpos matches illegal character when needle is '?' #9613

youkidearitai · 2022-09-26T14:02:49Z

Description

The following code:

<?php
var_dump(mb_strpos("AあいうえおBBcC", "?", 0, "ISO-2022-JP"));
var_dump(mb_stripos("AあいうえおBBcC", "?", 0, "ISO-2022-JP"));
var_dump(mb_strpos("AあいうえおBBcC", "?", 0, "SJIS"));
var_dump(mb_stripos("AあいうえおBBcC", "?", 0, "SJIS"));

Resulted in this output:

int(1)
int(5)
int(1)
int(5)

But I expected this output instead:

bool(false)
bool(false)
bool(false)
bool(false)

note: mb_stripos is affect mb_substitute_character, but mb_strpos is not affect it is (3v4l: https://3v4l.org/872nB )

PHP Version

PHP 8.1.x

Operating System

Ubuntu 20.04 on WSL2

The text was updated successfully, but these errors were encountered:

cmb69 · 2022-09-26T15:10:44Z

This does not look like a bug, because the actual character encoding of "AあいうえおBBcC" is UTF-8, but you tell mb_strpos() to treat that as ISO-2022-JP and SJIS, respectively. If you convert to the desired encodings, you get what you've expected: https://3v4l.org/f4oW8.

youkidearitai · 2022-09-26T15:27:49Z

@cmb69 Thanks for checking. Please read below🙇

I'm talking about bad bytes, not different encodings. For example, if you send a control character, why? will match. This can be used for some attacks.

I create illegal byte from bin2hex("1b204422433222ebbb1b2842")

https://3v4l.org/5VFTv

youkidearitai · 2022-09-26T15:35:03Z

https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfl_convert.c#L108
As you can see, the ? is hardcoded in illegal_substchar. I'm guessing that this is the reason why invalid bytes are caught.

cmb69 · 2022-09-26T15:49:56Z

Oh, indeed, there is something wrong as of PHP 8.1.0.

alexdowad · 2022-10-02T13:22:48Z

@youkidearitai Thanks for discovering this interesting issue.

First, let's be clear about what @youkidearitai is kindly suggesting here. He feels that mb_strpos and mb_stripos should not "find" error markers in an erroneous string. However, historically this is what mb_strpos and mb_stripos have done. If there is to be a change in behavior, it would be good to discuss this more.

@youkidearitai further points out a very interesting issue; when mb_substitute_character is "none", then mb_stripos does not "find" error markers, but mb_strpos still does. Here is the reason:

php-src/ext/mbstring/libmbfl/mbfl/mbfilter.c

Lines 586 to 595 in 63a0775

    
           if (haystack->encoding->no_encoding != mbfl_no_encoding_utf8) { 
        
           	mbfl_string_init(&_haystack_u8); 
        
           	haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, &mbfl_encoding_utf8); 
        
           	if (haystack_u8 == NULL) { 
        
           		result = MBFL_ERROR_ENCODING; 
        
           		goto out; 
        
           	} 
        
           } else { 
        
           	haystack_u8 = haystack; 
        
           }

Note that before performing a search, mbfl_strpos converts the haystack and needle to UTF-8. This is because UTF-8 has the extremely useful property that it is not possible for part of a code unit (or a suffix of one code unit concatenated with a prefix of the next one) to "accidentally" match a valid UTF-8 needle.

With, say, UTF-16BE, this would be very possible. If we took U+1234 U+5678 as haystack and U+3456 as needle, a naive byte match would "find" U+3456 by taking the second byte of one code unit and the first byte of the next. With UTF-8, that can't happen. So we can use a simple byte-for-byte match.

Note one critical thing here: mbfl_strpos does not respect the value of mb_substitute_character when doing the conversion to UTF-8. That explains why it "finds" error markers even when mb_substitute_character is "none".

So why is mb_stripos different?

php-src/ext/mbstring/mbstring.c

Lines 4778 to 4792 in 63a0775

    
           MBSTRING_API size_t php_mb_stripos(int mode, const char *old_haystack, size_t old_haystack_len, const char *old_needle, size_t old_needle_len, zend_long offset, const mbfl_encoding *enc) 
        
           { 
        
           	size_t n = (size_t) -1; 
        
           	mbfl_string haystack, needle; 
        
           	mbfl_string_init_set(&haystack, enc); 
        
           	mbfl_string_init_set(&needle, enc); 
        
           	do { 
        
           		/* We're using simple case-folding here, because we'd have to deal with remapping of 
        
           		 * offsets otherwise. */ 
        
           		size_t len = 0; 
        
           		haystack.val = (unsigned char *)mbstring_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, (char *)old_haystack, old_haystack_len, &len, enc); 
        
           		haystack.len = len;

That's why.

It first converts the haystack and needle to "fold case". The case conversion operation does respect the value of mb_substitute_character, and aside from folding case, it gives us a clean UTF-8 string with no error markers in it. So there are no error markers for mbfl_strpos to find.

There is one question remaining which I haven't investigated yet. @youkidearitai discovered that the following code gives different results before and after PHP 8.1.0:

mb_substitute_character("none");
var_dump(mb_stripos("AあいうえおBBcC", "?", 0, "SJIS"));

In PHP 8.1, this behaves as expected. However, in PHP <= 8.0, it still "finds" a question mark. To figure out the reason, I will need to build a copy of PHP 8.0 and step through this code in a debugger. It appears that for some reason, in PHP 8.0, a question mark is still inserted by the case conversion operation, even though it should respect the value of mb_substitute_character.

Will post an update if I find the answer.

hormus · 2022-10-02T15:44:21Z

mb_strpos mb_convert_encoding

<?php

$input_string = urldecode('%1B');
$needle_exists = '?';
var_dump(
    urlencode($input_string), // 27 decimal escape valid UTF-8 or ISO-2022-JP
    mb_detect_encoding($input_string, 'ISO-2022-JP'),
    urlencode(mb_convert_encoding($input_string, "UTF-8", "ISO-2022-JP")),
    mb_substitute_character(),
    mb_strpos($input_string, $needle_exists, 0, "ISO-2022-JP")
);

?>

Expected result:

string(3) "%1B"
string(11) "ISO-2022-JP"
string(3) "%1B"
int(63)
bool(false)

PHP >= 8.2rc1:

string(3) "%1B"
string(11) "ISO-2022-JP"
string(3) "%3F"
int(63)
int(0)

PHP < 8.2rc1:

string(3) "%1B"
string(11) "ISO-2022-JP"
string(0) ""
int(63)
bool(false)

strict true mb_detect_encoding 4.3.3 - 4.3.11, 4.4.0 - 4.4.9, 5.0.0 - 5.0.5, 5.1.0 - 5.1.6, 5.2.0 - 5.2.11, 5.3.0 - 5.3.1, 8.1.0 - 8.1.11

string(3) "%1B"
string(11) "ISO-2022-JP"
string(0) ""
int(63)
bool(false)

alexdowad · 2022-10-03T00:34:00Z

@hormus Thank you for joining this discussion. I guess you wanted to show another case where mb_strpos finds question marks in a string where it does not, at first glance, appear there should be any.

In this case, we have an ISO-2022-JP string with only a bare 0x1B byte. 0x1B introduces an escape sequence in ISO-2022-JP and would normally be followed by some special byte sequence like $B or (B. In PHP < 8.2, the implementation of ISO-2022-JP would just quietly drop such truncated escape sequences without giving any indication that something was wrong. That is why you get string(0) "" as the result of mb_convert_encoding.

For PHP >= 8.2... see b2f963f. That's where the difference comes from. The reason for this change was for consistency with mbstring's handling of other legacy text encodings (for which a truncated escape sequence is generally treated as an error).

Your post indicates that you feel the result of converting a bare 0x1B from ISO-2022-JP to UTF-8 should be another bare 0x1B. This seems hard to justify, and would be a significant change from historical behavior, but perhaps there are good reasons which I have not thought of. If so, please open a separate GH issue for that.

Anyways, I think the main issue raised in this thread is about the way mb_strpos and mb_stripos handle invalid strings.

@youkidearitai I have thought of another, rather strange, symptom of the way mbfl_strpos and php_mb_stripos are implemented. Because mbfl_strpos does not respect mb_substitute_character, if you set mb_substitute_character to some other character such as %, that character will be found in an invalid string by mb_stripos but not by mb_strpos. mb_strpos will still find ?, even when the substitution character is set to something completely different.

It's definitely a strange behavior, and as @youkidearitai mentioned, it's possible to imagine scenarios where it might even become a security issue.

One solution would be to say that mb_strpos and mb_stripos should refuse to operate on invalid input strings... just return an error code if the input string is invalid. This would bypass a lot of tricky questions about what mb_str{,i}pos should do on various types of invalid inputs. However, it is a very invasive change and would almost certainly affect production code in a way that would surprise our users.

Another "solution" is to say: the output of mb_str{,i}pos is "undefined" on invalid input strings. Any output is possible. So... don't do that. 😆 This "solution" amounts to just maintaining the current behavior and leaving it to users to deal with it, probably by using mb_check_encoding to verify their strings are valid before passing them through mb_str{,i}pos. One thing we could do which would be an improvement to the present state of affairs is to include a warning about this problem in the PHP manual.

And then the third option is to say that mbfl_strpos should always use the "none" conversion mode when converting its haystack and needle to UTF-8. Drop all invalid byte sequences from haystack and needle, rather than inserting error markers.

The problem with that is that it still leaves the door open for very surprising behavior, and possibly even security bugs. Say an application receives text from the "outside world" and wants to search for some special character sequence. Say the sequence we want to find is ABC=value. A devious user could send AB<junk bytes>C=value instead. If mb_strpos drops invalid byte sequences, then it would "find" ABC=value in that string!

Next thing, the application would probably use mb_substr to try to extract value. I haven't checked what mb_substr does with invalid strings, but it will probably not do what the user wants in this case.

Comments, please?

alexdowad · 2022-10-04T00:49:18Z

Just thought of another solution...

Instead of mbfl_strpos using our default UTF-8 conversion code (the same code used by mb_convert_encoding), it can use a special "convert wchar to UTF-8" function for the haystack, which will not follow any of mbstring's error-handling modes. Instead, this special conversion function will use an illegal byte sequence (one which cannot appear in any valid UTF-8 string) as an error marker. This will ensure that no needle can ever match against the error marker, even when using our fast and naive raw-byte match.

(Equivalently, we could add a new, internal error-handling mode which is not selectable by mb_substitute_character and which deliberately inserts the aforementioned byte sequence into the output.)

In this way, we avoid the problem of unwanted matches for ? or another error marker such as %, and also avoid the AB<junk bytes>C problem mentioned above.

Further, if the needle is an invalid string, we always return "no match".

Further, for mb_stripos, we fold case after converting the haystack to UTF-8, not before.

Is this a great idea, or is this a great idea?

@nikic @cmb69 @kamil-tekiela @Girgias

alexdowad · 2022-10-04T02:54:04Z

Uh... I guess folding case after converting to UTF-8 would require a completely new case-folding function, because the existing one works on wchars.

Hmm. But if we add a new, internal error-handling mode as mentioned above, we could use that for the case-folding operation as well, and it would still work.

Actually, what we are doing right now in mb_stripos is sub-optimal, in that it folds case from source encoding to source encoding, and then converts to UTF-8 as an additional step before doing the raw byte match. Really, it should convert to UTF-8 as part of the case-folding operation. That should give a nice performance boost, as well as cutting down on the amount of scratch memory used.

cmb69 · 2022-10-04T09:58:25Z

Thanks for the thorough analysis, @alexdowad!

Is this a great idea, or is this a great idea?

I think it is! At least for now we should try to have some sensible behavior wrt. invalid strings which closely matches previous behavior. In the long run, though, I'd prefer to be more strict when invalid strings are passed, but that would need some transition period (maybe deprecate invalid string arguments). As such, documenting the behavior as undefined now, makes sense to me.

alexdowad · 2022-10-05T03:56:24Z

So @cmb69 favors documenting the behavior in the current PHP release as 'undefined', and implementing more reasonable behavior in a future release.

I would like to say that rejecting invalid input strings would have an additional downside; it would impose an extra performance burden when working on UTF-8 strings. Currently, if the input strings are UTF-8, we operate on them without performing any checks or conversions, but if we guarantee that invalid strings will be rejected with an error, then we have to run over those UTF-8 strings and check them.

Maybe that overhead is worth it? Especially if it helps people to avoid security bugs which might lead to their sites getting hacked...

cmb69 · 2022-10-05T13:17:35Z

Currently, if the input strings are UTF-8, we operate on them without performing any checks or conversions, but if we guarantee that invalid strings will be rejected with an error, then we have to run over those UTF-8 strings and check them.

Hmm, maybe we could make use of IS_STR_VALID_UTF8 which so far is used only for PCRE?

alexdowad · 2022-10-05T23:07:36Z

Hmm, maybe we could make use of IS_STR_VALID_UTF8 which so far is used only for PCRE?

😮 I never knew about this!

We could use this to make mb_check_encoding, and probably a couple other functions, faster...

alexdowad · 2022-10-05T23:28:52Z

@cmb69's suggestion that mb_strpos should simply reject invalid input strings seems a lot more practical now.

Maybe I should create an RFC?

cmb69 · 2022-10-06T14:22:19Z

😮 I never knew about this!

I guess only few know this so far …

Maybe I should create an RFC?

… so this would be great! We need to ensure that both PCRE's and MBString's idea of valid UTF-8 matches (I'm not quite sure about which Unicode version are supported by these; likely depends on the version).

Not directly related to MBString, but it might make sense to introduce something like is_valid_utf8() for userland.

kocoten1992 · 2022-10-06T14:46:19Z

Not directly related to MBString, but it might make sense to introduce something like is_valid_utf8() for userland.

Not related, but you reminded me of json_validate (https://wiki.php.net/rfc/json_validate), php have a range of is_*, now I'm torn between is_json vs json_validate.

alexdowad · 2022-10-07T01:23:15Z

@cmb69,

These are all the types of errors which PCRE can detect in a UTF-8 string:

php-src/ext/pcre/pcre2lib/pcre2_valid_utf.c

Lines 111 to 131 in 0001ed2

    
           PCRE2_ERROR_UTF8_ERR1   Missing 1 byte at the end of the string 
        
           PCRE2_ERROR_UTF8_ERR2   Missing 2 bytes at the end of the string 
        
           PCRE2_ERROR_UTF8_ERR3   Missing 3 bytes at the end of the string 
        
           PCRE2_ERROR_UTF8_ERR4   Missing 4 bytes at the end of the string 
        
           PCRE2_ERROR_UTF8_ERR5   Missing 5 bytes at the end of the string 
        
           PCRE2_ERROR_UTF8_ERR6   2nd-byte's two top bits are not 0x80 
        
           PCRE2_ERROR_UTF8_ERR7   3rd-byte's two top bits are not 0x80 
        
           PCRE2_ERROR_UTF8_ERR8   4th-byte's two top bits are not 0x80 
        
           PCRE2_ERROR_UTF8_ERR9   5th-byte's two top bits are not 0x80 
        
           PCRE2_ERROR_UTF8_ERR10  6th-byte's two top bits are not 0x80 
        
           PCRE2_ERROR_UTF8_ERR11  5-byte character is not permitted by RFC 3629 
        
           PCRE2_ERROR_UTF8_ERR12  6-byte character is not permitted by RFC 3629 
        
           PCRE2_ERROR_UTF8_ERR13  4-byte character with value > 0x10ffff is not permitted 
        
           PCRE2_ERROR_UTF8_ERR14  3-byte character with value 0xd800-0xdfff is not permitted 
        
           PCRE2_ERROR_UTF8_ERR15  Overlong 2-byte sequence 
        
           PCRE2_ERROR_UTF8_ERR16  Overlong 3-byte sequence 
        
           PCRE2_ERROR_UTF8_ERR17  Overlong 4-byte sequence 
        
           PCRE2_ERROR_UTF8_ERR18  Overlong 5-byte sequence (won't ever occur) 
        
           PCRE2_ERROR_UTF8_ERR19  Overlong 6-byte sequence (won't ever occur) 
        
           PCRE2_ERROR_UTF8_ERR20  Isolated 0x80 byte (not within UTF-8 character) 
        
           PCRE2_ERROR_UTF8_ERR21  Byte with the illegal value 0xfe or 0xff

This looks like the same types of errors which mbstring can also detect... but to be safe, it will be better to fuzz and see if we can find any differences.

By the way, when I spoke of submitting an RFC, I wasn't talking about the issue of using IS_STR_VALID_UTF8 in mbstring (I would consider that an internal implementation detail), but rather the behavioral change of making mb_strpos and mb_stripos raise an error if the input string is invalid.

cmb69 · 2022-10-07T10:27:29Z

Not related, but you reminded me of json_validate (https://wiki.php.net/rfc/json_validate), php have a range of is_*, now I'm torn between is_json vs json_validate.

Naming things is hard :) I think in this case json_validate() is preferable, because the function is part of the JSON extension.

These are all the types of errors which PCRE can detect in a UTF-8 string:

Thank you for checking! That might already be sufficient (I was mostly concerned that either might not support all 17 planes). Additional fuzzing would be nice, though.

By the way, when I spoke of submitting an RFC, I wasn't talking about the issue of using IS_STR_VALID_UTF8 in mbstring (I would consider that an internal implementation detail), but rather the behavioral change of making mb_strpos and mb_stripos raise an error if the input string is invalid.

Right, that is what the RFC should be about. However, since performance matters, (not) being able to cache the information about already checked strings appears to be important though (and I wouldn't like to introduce another notion of "valid UTF-8" for the zvals).

alexdowad · 2022-11-14T12:41:38Z

Maybe I should create an RFC?

… so this would be great! We need to ensure that both PCRE's and MBString's idea of valid UTF-8 matches (I'm not quite sure about which Unicode version are supported by these; likely depends on the version).

I just tried fuzzing php_mb_check_encoding against valid_utf from pcre2lib. I only ran the fuzzer for a couple of minutes, but it tried over 1 million test cases without finding any where the output was different between the two functions.

Do you think more fuzzing should be done to make sure?

cmb69 · 2022-11-14T12:57:57Z

Do you think more fuzzing should be done to make sure?

I think this is good for now. Thanks for checking!

…es of mb_substitute_char In GitHub issue 9613, it was reported that mb_strpos wrongly matches the character '?' against any invalid string, even when the character '?' clearly does not appear in the invalid string. This behavior has existed at least since PHP 5.2. The reason for the behavior is that mb_strpos internally converts the haystack and needle to UTF-8 before performing a search. When converting to UTF-8, regardless of the setting of mb_substitute_character, libmbfl would use '?' as an error marker for invalid byte sequences. Once those invalid input sequences were replaced with '?', then naturally, they would match against occurrences of the actual character '?' (when it appeared as a 'normal' character, not as an error marker). This would happen regardless of whether the error was in the haystack and '?' was used in the needle, or whether the error was in the needle and '?' was used in the haystack. Why would libmbfl use '?' rather than the mb_substitute_character set by the user? Remember that libmbfl was originally a separate library which was imported into the PHP codebase. mb_substitute_character is an mbstring API function, not something built into libmbfl. When mbstring would call into libmbfl, it would provide the error replacement character to libmbfl as a parameter. However, when libmbfl would perform conversion operations internally, and not because of a direct call from mbstring, it would use its own error replacement character. Example: <?php $questionMark = "\x00?"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n"; mb_stripos had a similar issue, but instead of always using '?' as an error marker internally, it would use the selected mb_substitute_character. So, for example, if the mb_substitute_character was '%', then occurrences of '%' in the haystack would match invalid bytes in the needle, and vice versa. Example: <?php mb_substitute_character(0x25); // '%' $percent = "\x00%"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n"; It is not hard to think of scenarios where these strange and unintuitive behaviors could cause security vulnerabilities. In the discussion on GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should simply refuse to operate on invalid strings. However, this would almost certainly break existing production code. This commit mitigates the problem in a less intrusive way: it ensures that while invalid haystacks can match invalid needles (even if the specific invalid bytes are different), invalid bytes in the haystack will never match '?' OR occurrences of the mb_substitute_character in the needle, and vice versa. This does represent a backwards compatibility break, but a small one. Since it mitigates a potential security problem, I believe this is appropriate. Closes phpGH-9613.

…es of mb_substitute_char In GitHub issue 9613, it was reported that mb_strpos wrongly matches the character '?' against any invalid string, even when the character '?' clearly does not appear in the invalid string. This behavior has existed at least since PHP 5.2. The reason for the behavior is that mb_strpos internally converts the haystack and needle to UTF-8 before performing a search. When converting to UTF-8, regardless of the setting of mb_substitute_character, libmbfl would use '?' as an error marker for invalid byte sequences. Once those invalid input sequences were replaced with '?', then naturally, they would match against occurrences of the actual character '?' (when it appeared as a 'normal' character, not as an error marker). This would happen regardless of whether the error was in the haystack and '?' was used in the needle, or whether the error was in the needle and '?' was used in the haystack. Why would libmbfl use '?' rather than the mb_substitute_character set by the user? Remember that libmbfl was originally a separate library which was imported into the PHP codebase. mb_substitute_character is an mbstring API function, not something built into libmbfl. When mbstring would call into libmbfl, it would provide the error replacement character to libmbfl as a parameter. However, when libmbfl would perform conversion operations internally, and not because of a direct call from mbstring, it would use its own error replacement character. Example: <?php $questionMark = "\x00?"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n"; Incidentally, this behavior does not occur if the text encoding is UTF-8, because no conversion is needed in that base. mb_stripos had a similar issue, but instead of always using '?' as an error marker internally, it would use the selected mb_substitute_character. So, for example, if the mb_substitute_character was '%', then occurrences of '%' in the haystack would match invalid bytes in the needle, and vice versa. Example: <?php mb_substitute_character(0x25); // '%' $percent = "\x00%"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n"; This behavior (of mb_stripos) still occurs even if the text encoding is UTF-8, because case folding is still needed to make the search case-insensitive. It is not hard to think of scenarios where these strange and unintuitive behaviors could cause security vulnerabilities. In the discussion on GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should simply refuse to operate on invalid strings. However, this would almost certainly break existing production code. This commit mitigates the problem in a less intrusive way: it ensures that while invalid haystacks can match invalid needles (even if the specific invalid bytes are different), invalid bytes in the haystack will never match '?' OR occurrences of the mb_substitute_character in the needle, and vice versa. This does represent a backwards compatibility break, but a small one. Since it mitigates a potential security problem, I believe this is appropriate. Closes phpGH-9613.

…es of mb_substitute_char In GitHub issue 9613, it was reported that mb_strpos wrongly matches the character '?' against any invalid string, even when the character '?' clearly does not appear in the invalid string. This behavior has existed at least since PHP 5.2. The reason for the behavior is that mb_strpos internally converts the haystack and needle to UTF-8 before performing a search. When converting to UTF-8, regardless of the setting of mb_substitute_character, libmbfl would use '?' as an error marker for invalid byte sequences. Once those invalid input sequences were replaced with '?', then naturally, they would match against occurrences of the actual character '?' (when it appeared as a 'normal' character, not as an error marker). This would happen regardless of whether the error was in the haystack and '?' was used in the needle, or whether the error was in the needle and '?' was used in the haystack. Why would libmbfl use '?' rather than the mb_substitute_character set by the user? Remember that libmbfl was originally a separate library which was imported into the PHP codebase. mb_substitute_character is an mbstring API function, not something built into libmbfl. When mbstring would call into libmbfl, it would provide the error replacement character to libmbfl as a parameter. However, when libmbfl would perform conversion operations internally, and not because of a direct call from mbstring, it would use its own error replacement character. Example: <?php $questionMark = "\x00?"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n"; Incidentally, this behavior does not occur if the text encoding is UTF-8, because no conversion is needed in that case. mb_stripos had a similar issue, but instead of always using '?' as an error marker internally, it would use the selected mb_substitute_character. So, for example, if the mb_substitute_character was '%', then occurrences of '%' in the haystack would match invalid bytes in the needle, and vice versa. Example: <?php mb_substitute_character(0x25); // '%' $percent = "\x00%"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n"; This behavior (of mb_stripos) still occurs even if the text encoding is UTF-8, because case folding is still needed to make the search case-insensitive. It is not hard to think of scenarios where these strange and unintuitive behaviors could cause security vulnerabilities. In the discussion on GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should simply refuse to operate on invalid strings. However, this would almost certainly break existing production code. This commit mitigates the problem in a less intrusive way: it ensures that while invalid haystacks can match invalid needles (even if the specific invalid bytes are different), invalid bytes in the haystack will never match '?' OR occurrences of the mb_substitute_character in the needle, and vice versa. This does represent a backwards compatibility break, but a small one. Since it mitigates a potential security problem, I believe this is appropriate. Closes phpGH-9613.

youkidearitai added Bug Status: Needs Triage labels Sep 26, 2022

youkidearitai changed the title ~~mb_strpos matches illegal character when haystack is '?'~~ mb_strpos matches illegal character when needle is '?' Sep 26, 2022

cmb69 added the Extension: mbstring label Sep 26, 2022

cmb69 closed this as completed Sep 26, 2022

cmb69 added Status: Invalid and removed Status: Needs Triage labels Sep 26, 2022

cmb69 reopened this Sep 26, 2022

cmb69 added Status: Verified and removed Status: Invalid labels Sep 26, 2022

Girgias assigned alexdowad Sep 27, 2022

alexdowad mentioned this issue Nov 14, 2022

Cache UTF-8-validity status of strings in GC flags #9948

Closed

alexdowad closed this as completed in 7f44559 Dec 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mb_strpos matches illegal character when needle is '?' #9613

mb_strpos matches illegal character when needle is '?' #9613

youkidearitai commented Sep 26, 2022

cmb69 commented Sep 26, 2022

youkidearitai commented Sep 26, 2022

youkidearitai commented Sep 26, 2022

cmb69 commented Sep 26, 2022

alexdowad commented Oct 2, 2022

hormus commented Oct 2, 2022 •

edited

Loading

alexdowad commented Oct 3, 2022

alexdowad commented Oct 4, 2022

alexdowad commented Oct 4, 2022

cmb69 commented Oct 4, 2022

alexdowad commented Oct 5, 2022

cmb69 commented Oct 5, 2022

alexdowad commented Oct 5, 2022

alexdowad commented Oct 5, 2022

cmb69 commented Oct 6, 2022

kocoten1992 commented Oct 6, 2022

alexdowad commented Oct 7, 2022

cmb69 commented Oct 7, 2022

alexdowad commented Nov 14, 2022

cmb69 commented Nov 14, 2022

mb_strpos matches illegal character when needle is '?' #9613

mb_strpos matches illegal character when needle is '?' #9613

Comments

youkidearitai commented Sep 26, 2022

Description

PHP Version

Operating System

cmb69 commented Sep 26, 2022

youkidearitai commented Sep 26, 2022

youkidearitai commented Sep 26, 2022

cmb69 commented Sep 26, 2022

alexdowad commented Oct 2, 2022

hormus commented Oct 2, 2022 • edited Loading

alexdowad commented Oct 3, 2022

alexdowad commented Oct 4, 2022

alexdowad commented Oct 4, 2022

cmb69 commented Oct 4, 2022

alexdowad commented Oct 5, 2022

cmb69 commented Oct 5, 2022

alexdowad commented Oct 5, 2022

alexdowad commented Oct 5, 2022

cmb69 commented Oct 6, 2022

kocoten1992 commented Oct 6, 2022

alexdowad commented Oct 7, 2022

cmb69 commented Oct 7, 2022

alexdowad commented Nov 14, 2022

cmb69 commented Nov 14, 2022

hormus commented Oct 2, 2022 •

edited

Loading