Skip to content

mb_strpos matches illegal character when needle is '?' #9613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
youkidearitai opened this issue Sep 26, 2022 · 20 comments
Closed

mb_strpos matches illegal character when needle is '?' #9613

youkidearitai opened this issue Sep 26, 2022 · 20 comments

Comments

@youkidearitai
Copy link
Contributor

Description

The following code:

<?php
var_dump(mb_strpos("AあいうえおBBcC", "?", 0, "ISO-2022-JP"));
var_dump(mb_stripos("AあいうえおBBcC", "?", 0, "ISO-2022-JP"));
var_dump(mb_strpos("AあいうえおBBcC", "?", 0, "SJIS"));
var_dump(mb_stripos("AあいうえおBBcC", "?", 0, "SJIS"));

Resulted in this output:

int(1)
int(5)
int(1)
int(5)

But I expected this output instead:

bool(false)
bool(false)
bool(false)
bool(false)

note: mb_stripos is affect mb_substitute_character, but mb_strpos is not affect it is (3v4l: https://3v4l.org/872nB )

PHP Version

PHP 8.1.x

Operating System

Ubuntu 20.04 on WSL2

@youkidearitai youkidearitai changed the title mb_strpos matches illegal character when haystack is '?' mb_strpos matches illegal character when needle is '?' Sep 26, 2022
@cmb69
Copy link
Member

cmb69 commented Sep 26, 2022

This does not look like a bug, because the actual character encoding of "AあいうえおBBcC" is UTF-8, but you tell mb_strpos() to treat that as ISO-2022-JP and SJIS, respectively. If you convert to the desired encodings, you get what you've expected: https://3v4l.org/f4oW8.

@youkidearitai
Copy link
Contributor Author

@cmb69 Thanks for checking. Please read below🙇

I'm talking about bad bytes, not different encodings. For example, if you send a control character, why? will match. This can be used for some attacks.

I create illegal byte from bin2hex("1b204422433222ebbb1b2842")

https://3v4l.org/5VFTv

@youkidearitai
Copy link
Contributor Author

https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfl_convert.c#L108
As you can see, the ? is hardcoded in illegal_substchar. I'm guessing that this is the reason why invalid bytes are caught.

@cmb69
Copy link
Member

cmb69 commented Sep 26, 2022

Oh, indeed, there is something wrong as of PHP 8.1.0.

@alexdowad
Copy link
Contributor

@youkidearitai Thanks for discovering this interesting issue.

First, let's be clear about what @youkidearitai is kindly suggesting here. He feels that mb_strpos and mb_stripos should not "find" error markers in an erroneous string. However, historically this is what mb_strpos and mb_stripos have done. If there is to be a change in behavior, it would be good to discuss this more.

@youkidearitai further points out a very interesting issue; when mb_substitute_character is "none", then mb_stripos does not "find" error markers, but mb_strpos still does. Here is the reason:

if (haystack->encoding->no_encoding != mbfl_no_encoding_utf8) {
mbfl_string_init(&_haystack_u8);
haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, &mbfl_encoding_utf8);
if (haystack_u8 == NULL) {
result = MBFL_ERROR_ENCODING;
goto out;
}
} else {
haystack_u8 = haystack;
}

Note that before performing a search, mbfl_strpos converts the haystack and needle to UTF-8. This is because UTF-8 has the extremely useful property that it is not possible for part of a code unit (or a suffix of one code unit concatenated with a prefix of the next one) to "accidentally" match a valid UTF-8 needle.

With, say, UTF-16BE, this would be very possible. If we took U+1234 U+5678 as haystack and U+3456 as needle, a naive byte match would "find" U+3456 by taking the second byte of one code unit and the first byte of the next. With UTF-8, that can't happen. So we can use a simple byte-for-byte match.

Note one critical thing here: mbfl_strpos does not respect the value of mb_substitute_character when doing the conversion to UTF-8. That explains why it "finds" error markers even when mb_substitute_character is "none".

So why is mb_stripos different?

MBSTRING_API size_t php_mb_stripos(int mode, const char *old_haystack, size_t old_haystack_len, const char *old_needle, size_t old_needle_len, zend_long offset, const mbfl_encoding *enc)
{
size_t n = (size_t) -1;
mbfl_string haystack, needle;
mbfl_string_init_set(&haystack, enc);
mbfl_string_init_set(&needle, enc);
do {
/* We're using simple case-folding here, because we'd have to deal with remapping of
* offsets otherwise. */
size_t len = 0;
haystack.val = (unsigned char *)mbstring_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, (char *)old_haystack, old_haystack_len, &len, enc);
haystack.len = len;

That's why.

It first converts the haystack and needle to "fold case". The case conversion operation does respect the value of mb_substitute_character, and aside from folding case, it gives us a clean UTF-8 string with no error markers in it. So there are no error markers for mbfl_strpos to find.

There is one question remaining which I haven't investigated yet. @youkidearitai discovered that the following code gives different results before and after PHP 8.1.0:

mb_substitute_character("none");
var_dump(mb_stripos("AあいうえおBBcC", "?", 0, "SJIS"));

In PHP 8.1, this behaves as expected. However, in PHP <= 8.0, it still "finds" a question mark. To figure out the reason, I will need to build a copy of PHP 8.0 and step through this code in a debugger. It appears that for some reason, in PHP 8.0, a question mark is still inserted by the case conversion operation, even though it should respect the value of mb_substitute_character.

Will post an update if I find the answer.

@hormus
Copy link

hormus commented Oct 2, 2022

mb_strpos mb_convert_encoding

<?php

$input_string = urldecode('%1B');
$needle_exists = '?';
var_dump(
    urlencode($input_string), // 27 decimal escape valid UTF-8 or ISO-2022-JP
    mb_detect_encoding($input_string, 'ISO-2022-JP'),
    urlencode(mb_convert_encoding($input_string, "UTF-8", "ISO-2022-JP")),
    mb_substitute_character(),
    mb_strpos($input_string, $needle_exists, 0, "ISO-2022-JP")
);

?>

Expected result:

string(3) "%1B"
string(11) "ISO-2022-JP"
string(3) "%1B"
int(63)
bool(false)

PHP >= 8.2rc1:

string(3) "%1B"
string(11) "ISO-2022-JP"
string(3) "%3F"
int(63)
int(0)

PHP < 8.2rc1:

string(3) "%1B"
string(11) "ISO-2022-JP"
string(0) ""
int(63)
bool(false)

strict true mb_detect_encoding 4.3.3 - 4.3.11, 4.4.0 - 4.4.9, 5.0.0 - 5.0.5, 5.1.0 - 5.1.6, 5.2.0 - 5.2.11, 5.3.0 - 5.3.1, 8.1.0 - 8.1.11

string(3) "%1B"
string(11) "ISO-2022-JP"
string(0) ""
int(63)
bool(false)

@alexdowad
Copy link
Contributor

@hormus Thank you for joining this discussion. I guess you wanted to show another case where mb_strpos finds question marks in a string where it does not, at first glance, appear there should be any.

In this case, we have an ISO-2022-JP string with only a bare 0x1B byte. 0x1B introduces an escape sequence in ISO-2022-JP and would normally be followed by some special byte sequence like $B or (B. In PHP < 8.2, the implementation of ISO-2022-JP would just quietly drop such truncated escape sequences without giving any indication that something was wrong. That is why you get string(0) "" as the result of mb_convert_encoding.

For PHP >= 8.2... see b2f963f. That's where the difference comes from. The reason for this change was for consistency with mbstring's handling of other legacy text encodings (for which a truncated escape sequence is generally treated as an error).

Your post indicates that you feel the result of converting a bare 0x1B from ISO-2022-JP to UTF-8 should be another bare 0x1B. This seems hard to justify, and would be a significant change from historical behavior, but perhaps there are good reasons which I have not thought of. If so, please open a separate GH issue for that.

Anyways, I think the main issue raised in this thread is about the way mb_strpos and mb_stripos handle invalid strings.

@youkidearitai I have thought of another, rather strange, symptom of the way mbfl_strpos and php_mb_stripos are implemented. Because mbfl_strpos does not respect mb_substitute_character, if you set mb_substitute_character to some other character such as %, that character will be found in an invalid string by mb_stripos but not by mb_strpos. mb_strpos will still find ?, even when the substitution character is set to something completely different.

It's definitely a strange behavior, and as @youkidearitai mentioned, it's possible to imagine scenarios where it might even become a security issue.

One solution would be to say that mb_strpos and mb_stripos should refuse to operate on invalid input strings... just return an error code if the input string is invalid. This would bypass a lot of tricky questions about what mb_str{,i}pos should do on various types of invalid inputs. However, it is a very invasive change and would almost certainly affect production code in a way that would surprise our users.

Another "solution" is to say: the output of mb_str{,i}pos is "undefined" on invalid input strings. Any output is possible. So... don't do that. 😆 This "solution" amounts to just maintaining the current behavior and leaving it to users to deal with it, probably by using mb_check_encoding to verify their strings are valid before passing them through mb_str{,i}pos. One thing we could do which would be an improvement to the present state of affairs is to include a warning about this problem in the PHP manual.

And then the third option is to say that mbfl_strpos should always use the "none" conversion mode when converting its haystack and needle to UTF-8. Drop all invalid byte sequences from haystack and needle, rather than inserting error markers.

The problem with that is that it still leaves the door open for very surprising behavior, and possibly even security bugs. Say an application receives text from the "outside world" and wants to search for some special character sequence. Say the sequence we want to find is ABC=value. A devious user could send AB<junk bytes>C=value instead. If mb_strpos drops invalid byte sequences, then it would "find" ABC=value in that string!

Next thing, the application would probably use mb_substr to try to extract value. I haven't checked what mb_substr does with invalid strings, but it will probably not do what the user wants in this case.

Comments, please?

@alexdowad
Copy link
Contributor

Just thought of another solution...

Instead of mbfl_strpos using our default UTF-8 conversion code (the same code used by mb_convert_encoding), it can use a special "convert wchar to UTF-8" function for the haystack, which will not follow any of mbstring's error-handling modes. Instead, this special conversion function will use an illegal byte sequence (one which cannot appear in any valid UTF-8 string) as an error marker. This will ensure that no needle can ever match against the error marker, even when using our fast and naive raw-byte match.

(Equivalently, we could add a new, internal error-handling mode which is not selectable by mb_substitute_character and which deliberately inserts the aforementioned byte sequence into the output.)

In this way, we avoid the problem of unwanted matches for ? or another error marker such as %, and also avoid the AB<junk bytes>C problem mentioned above.

Further, if the needle is an invalid string, we always return "no match".

Further, for mb_stripos, we fold case after converting the haystack to UTF-8, not before.

Is this a great idea, or is this a great idea?

@nikic @cmb69 @kamil-tekiela @Girgias

@alexdowad
Copy link
Contributor

Uh... I guess folding case after converting to UTF-8 would require a completely new case-folding function, because the existing one works on wchars.

Hmm. But if we add a new, internal error-handling mode as mentioned above, we could use that for the case-folding operation as well, and it would still work.

Actually, what we are doing right now in mb_stripos is sub-optimal, in that it folds case from source encoding to source encoding, and then converts to UTF-8 as an additional step before doing the raw byte match. Really, it should convert to UTF-8 as part of the case-folding operation. That should give a nice performance boost, as well as cutting down on the amount of scratch memory used.

@cmb69
Copy link
Member

cmb69 commented Oct 4, 2022

Thanks for the thorough analysis, @alexdowad!

Is this a great idea, or is this a great idea?

I think it is! At least for now we should try to have some sensible behavior wrt. invalid strings which closely matches previous behavior. In the long run, though, I'd prefer to be more strict when invalid strings are passed, but that would need some transition period (maybe deprecate invalid string arguments). As such, documenting the behavior as undefined now, makes sense to me.

@alexdowad
Copy link
Contributor

So @cmb69 favors documenting the behavior in the current PHP release as 'undefined', and implementing more reasonable behavior in a future release.

I would like to say that rejecting invalid input strings would have an additional downside; it would impose an extra performance burden when working on UTF-8 strings. Currently, if the input strings are UTF-8, we operate on them without performing any checks or conversions, but if we guarantee that invalid strings will be rejected with an error, then we have to run over those UTF-8 strings and check them.

Maybe that overhead is worth it? Especially if it helps people to avoid security bugs which might lead to their sites getting hacked...

@cmb69
Copy link
Member

cmb69 commented Oct 5, 2022

Currently, if the input strings are UTF-8, we operate on them without performing any checks or conversions, but if we guarantee that invalid strings will be rejected with an error, then we have to run over those UTF-8 strings and check them.

Hmm, maybe we could make use of IS_STR_VALID_UTF8 which so far is used only for PCRE?

@alexdowad
Copy link
Contributor

Hmm, maybe we could make use of IS_STR_VALID_UTF8 which so far is used only for PCRE?

😮 I never knew about this!

We could use this to make mb_check_encoding, and probably a couple other functions, faster...

@alexdowad
Copy link
Contributor

@cmb69's suggestion that mb_strpos should simply reject invalid input strings seems a lot more practical now.

Maybe I should create an RFC?

@cmb69
Copy link
Member

cmb69 commented Oct 6, 2022

😮 I never knew about this!

I guess only few know this so far …

Maybe I should create an RFC?

… so this would be great! We need to ensure that both PCRE's and MBString's idea of valid UTF-8 matches (I'm not quite sure about which Unicode version are supported by these; likely depends on the version).

Not directly related to MBString, but it might make sense to introduce something like is_valid_utf8() for userland.

@kocoten1992
Copy link

Not directly related to MBString, but it might make sense to introduce something like is_valid_utf8() for userland.

Not related, but you reminded me of json_validate (https://wiki.php.net/rfc/json_validate), php have a range of is_*, now I'm torn between is_json vs json_validate.

@alexdowad
Copy link
Contributor

@cmb69,

These are all the types of errors which PCRE can detect in a UTF-8 string:

PCRE2_ERROR_UTF8_ERR1 Missing 1 byte at the end of the string
PCRE2_ERROR_UTF8_ERR2 Missing 2 bytes at the end of the string
PCRE2_ERROR_UTF8_ERR3 Missing 3 bytes at the end of the string
PCRE2_ERROR_UTF8_ERR4 Missing 4 bytes at the end of the string
PCRE2_ERROR_UTF8_ERR5 Missing 5 bytes at the end of the string
PCRE2_ERROR_UTF8_ERR6 2nd-byte's two top bits are not 0x80
PCRE2_ERROR_UTF8_ERR7 3rd-byte's two top bits are not 0x80
PCRE2_ERROR_UTF8_ERR8 4th-byte's two top bits are not 0x80
PCRE2_ERROR_UTF8_ERR9 5th-byte's two top bits are not 0x80
PCRE2_ERROR_UTF8_ERR10 6th-byte's two top bits are not 0x80
PCRE2_ERROR_UTF8_ERR11 5-byte character is not permitted by RFC 3629
PCRE2_ERROR_UTF8_ERR12 6-byte character is not permitted by RFC 3629
PCRE2_ERROR_UTF8_ERR13 4-byte character with value > 0x10ffff is not permitted
PCRE2_ERROR_UTF8_ERR14 3-byte character with value 0xd800-0xdfff is not permitted
PCRE2_ERROR_UTF8_ERR15 Overlong 2-byte sequence
PCRE2_ERROR_UTF8_ERR16 Overlong 3-byte sequence
PCRE2_ERROR_UTF8_ERR17 Overlong 4-byte sequence
PCRE2_ERROR_UTF8_ERR18 Overlong 5-byte sequence (won't ever occur)
PCRE2_ERROR_UTF8_ERR19 Overlong 6-byte sequence (won't ever occur)
PCRE2_ERROR_UTF8_ERR20 Isolated 0x80 byte (not within UTF-8 character)
PCRE2_ERROR_UTF8_ERR21 Byte with the illegal value 0xfe or 0xff

This looks like the same types of errors which mbstring can also detect... but to be safe, it will be better to fuzz and see if we can find any differences.

By the way, when I spoke of submitting an RFC, I wasn't talking about the issue of using IS_STR_VALID_UTF8 in mbstring (I would consider that an internal implementation detail), but rather the behavioral change of making mb_strpos and mb_stripos raise an error if the input string is invalid.

@cmb69
Copy link
Member

cmb69 commented Oct 7, 2022

Not related, but you reminded me of json_validate (https://wiki.php.net/rfc/json_validate), php have a range of is_*, now I'm torn between is_json vs json_validate.

Naming things is hard :) I think in this case json_validate() is preferable, because the function is part of the JSON extension.

These are all the types of errors which PCRE can detect in a UTF-8 string:

Thank you for checking! That might already be sufficient (I was mostly concerned that either might not support all 17 planes). Additional fuzzing would be nice, though.

By the way, when I spoke of submitting an RFC, I wasn't talking about the issue of using IS_STR_VALID_UTF8 in mbstring (I would consider that an internal implementation detail), but rather the behavioral change of making mb_strpos and mb_stripos raise an error if the input string is invalid.

Right, that is what the RFC should be about. However, since performance matters, (not) being able to cache the information about already checked strings appears to be important though (and I wouldn't like to introduce another notion of "valid UTF-8" for the zvals).

@alexdowad
Copy link
Contributor

Maybe I should create an RFC?

… so this would be great! We need to ensure that both PCRE's and MBString's idea of valid UTF-8 matches (I'm not quite sure about which Unicode version are supported by these; likely depends on the version).

I just tried fuzzing php_mb_check_encoding against valid_utf from pcre2lib. I only ran the fuzzer for a couple of minutes, but it tried over 1 million test cases without finding any where the output was different between the two functions.

Do you think more fuzzing should be done to make sure?

@cmb69
Copy link
Member

cmb69 commented Nov 14, 2022

Do you think more fuzzing should be done to make sure?

I think this is good for now. Thanks for checking!

alexdowad added a commit to alexdowad/php-src that referenced this issue Dec 15, 2022
…es of mb_substitute_char

In GitHub issue 9613, it was reported that mb_strpos wrongly matches the
character '?' against any invalid string, even when the character '?'
clearly does not appear in the invalid string. This behavior has existed
at least since PHP 5.2.

The reason for the behavior is that mb_strpos internally converts the
haystack and needle to UTF-8 before performing a search. When converting
to UTF-8, regardless of the setting of mb_substitute_character, libmbfl
would use '?' as an error marker for invalid byte sequences. Once those
invalid input sequences were replaced with '?', then naturally, they
would match against occurrences of the actual character '?' (when it
appeared as a 'normal' character, not as an error marker). This would
happen regardless of whether the error was in the haystack and '?' was
used in the needle, or whether the error was in the needle and '?' was
used in the haystack.

Why would libmbfl use '?' rather than the mb_substitute_character set
by the user? Remember that libmbfl was originally a separate library
which was imported into the PHP codebase. mb_substitute_character is an
mbstring API function, not something built into libmbfl. When mbstring
would call into libmbfl, it would provide the error replacement
character to libmbfl as a parameter. However, when libmbfl would perform
conversion operations internally, and not because of a direct call from
mbstring, it would use its own error replacement character.

Example:

    <?php
    $questionMark = "\x00?";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";

mb_stripos had a similar issue, but instead of always using '?' as an
error marker internally, it would use the selected
mb_substitute_character. So, for example, if the mb_substitute_character
was '%', then occurrences of '%' in the haystack would match invalid
bytes in the needle, and vice versa.

Example:

    <?php
    mb_substitute_character(0x25); // '%'
    $percent = "\x00%";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";

It is not hard to think of scenarios where these strange and unintuitive
behaviors could cause security vulnerabilities. In the discussion on
GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should
simply refuse to operate on invalid strings. However, this would almost
certainly break existing production code.

This commit mitigates the problem in a less intrusive way: it ensures
that while invalid haystacks can match invalid needles (even if the
specific invalid bytes are different), invalid bytes in the haystack
will never match '?' OR occurrences of the mb_substitute_character in
the needle, and vice versa.

This does represent a backwards compatibility break, but a small one.
Since it mitigates a potential security problem, I believe this is
appropriate.

Closes phpGH-9613.
alexdowad added a commit to alexdowad/php-src that referenced this issue Dec 15, 2022
…es of mb_substitute_char

In GitHub issue 9613, it was reported that mb_strpos wrongly matches the
character '?' against any invalid string, even when the character '?'
clearly does not appear in the invalid string. This behavior has existed
at least since PHP 5.2.

The reason for the behavior is that mb_strpos internally converts the
haystack and needle to UTF-8 before performing a search. When converting
to UTF-8, regardless of the setting of mb_substitute_character, libmbfl
would use '?' as an error marker for invalid byte sequences. Once those
invalid input sequences were replaced with '?', then naturally, they
would match against occurrences of the actual character '?' (when it
appeared as a 'normal' character, not as an error marker). This would
happen regardless of whether the error was in the haystack and '?' was
used in the needle, or whether the error was in the needle and '?' was
used in the haystack.

Why would libmbfl use '?' rather than the mb_substitute_character set
by the user? Remember that libmbfl was originally a separate library
which was imported into the PHP codebase. mb_substitute_character is an
mbstring API function, not something built into libmbfl. When mbstring
would call into libmbfl, it would provide the error replacement
character to libmbfl as a parameter. However, when libmbfl would perform
conversion operations internally, and not because of a direct call from
mbstring, it would use its own error replacement character.

Example:

    <?php
    $questionMark = "\x00?";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";

Incidentally, this behavior does not occur if the text encoding is
UTF-8, because no conversion is needed in that base.

mb_stripos had a similar issue, but instead of always using '?' as an
error marker internally, it would use the selected
mb_substitute_character. So, for example, if the mb_substitute_character
was '%', then occurrences of '%' in the haystack would match invalid
bytes in the needle, and vice versa.

Example:

    <?php
    mb_substitute_character(0x25); // '%'
    $percent = "\x00%";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";

This behavior (of mb_stripos) still occurs even if the text encoding is
UTF-8, because case folding is still needed to make the search
case-insensitive.

It is not hard to think of scenarios where these strange and unintuitive
behaviors could cause security vulnerabilities. In the discussion on
GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should
simply refuse to operate on invalid strings. However, this would almost
certainly break existing production code.

This commit mitigates the problem in a less intrusive way: it ensures
that while invalid haystacks can match invalid needles (even if the
specific invalid bytes are different), invalid bytes in the haystack
will never match '?' OR occurrences of the mb_substitute_character in
the needle, and vice versa.

This does represent a backwards compatibility break, but a small one.
Since it mitigates a potential security problem, I believe this is
appropriate.

Closes phpGH-9613.
alexdowad added a commit to alexdowad/php-src that referenced this issue Dec 15, 2022
…es of mb_substitute_char

In GitHub issue 9613, it was reported that mb_strpos wrongly matches the
character '?' against any invalid string, even when the character '?'
clearly does not appear in the invalid string. This behavior has existed
at least since PHP 5.2.

The reason for the behavior is that mb_strpos internally converts the
haystack and needle to UTF-8 before performing a search. When converting
to UTF-8, regardless of the setting of mb_substitute_character, libmbfl
would use '?' as an error marker for invalid byte sequences. Once those
invalid input sequences were replaced with '?', then naturally, they
would match against occurrences of the actual character '?' (when it
appeared as a 'normal' character, not as an error marker). This would
happen regardless of whether the error was in the haystack and '?' was
used in the needle, or whether the error was in the needle and '?' was
used in the haystack.

Why would libmbfl use '?' rather than the mb_substitute_character set
by the user? Remember that libmbfl was originally a separate library
which was imported into the PHP codebase. mb_substitute_character is an
mbstring API function, not something built into libmbfl. When mbstring
would call into libmbfl, it would provide the error replacement
character to libmbfl as a parameter. However, when libmbfl would perform
conversion operations internally, and not because of a direct call from
mbstring, it would use its own error replacement character.

Example:

    <?php
    $questionMark = "\x00?";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";

Incidentally, this behavior does not occur if the text encoding is
UTF-8, because no conversion is needed in that case.

mb_stripos had a similar issue, but instead of always using '?' as an
error marker internally, it would use the selected
mb_substitute_character. So, for example, if the mb_substitute_character
was '%', then occurrences of '%' in the haystack would match invalid
bytes in the needle, and vice versa.

Example:

    <?php
    mb_substitute_character(0x25); // '%'
    $percent = "\x00%";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";

This behavior (of mb_stripos) still occurs even if the text encoding is
UTF-8, because case folding is still needed to make the search
case-insensitive.

It is not hard to think of scenarios where these strange and unintuitive
behaviors could cause security vulnerabilities. In the discussion on
GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should
simply refuse to operate on invalid strings. However, this would almost
certainly break existing production code.

This commit mitigates the problem in a less intrusive way: it ensures
that while invalid haystacks can match invalid needles (even if the
specific invalid bytes are different), invalid bytes in the haystack
will never match '?' OR occurrences of the mb_substitute_character in
the needle, and vice versa.

This does represent a backwards compatibility break, but a small one.
Since it mitigates a potential security problem, I believe this is
appropriate.

Closes phpGH-9613.
alexdowad added a commit to alexdowad/php-src that referenced this issue Dec 15, 2022
…es of mb_substitute_char

In GitHub issue 9613, it was reported that mb_strpos wrongly matches the
character '?' against any invalid string, even when the character '?'
clearly does not appear in the invalid string. This behavior has existed
at least since PHP 5.2.

The reason for the behavior is that mb_strpos internally converts the
haystack and needle to UTF-8 before performing a search. When converting
to UTF-8, regardless of the setting of mb_substitute_character, libmbfl
would use '?' as an error marker for invalid byte sequences. Once those
invalid input sequences were replaced with '?', then naturally, they
would match against occurrences of the actual character '?' (when it
appeared as a 'normal' character, not as an error marker). This would
happen regardless of whether the error was in the haystack and '?' was
used in the needle, or whether the error was in the needle and '?' was
used in the haystack.

Why would libmbfl use '?' rather than the mb_substitute_character set
by the user? Remember that libmbfl was originally a separate library
which was imported into the PHP codebase. mb_substitute_character is an
mbstring API function, not something built into libmbfl. When mbstring
would call into libmbfl, it would provide the error replacement
character to libmbfl as a parameter. However, when libmbfl would perform
conversion operations internally, and not because of a direct call from
mbstring, it would use its own error replacement character.

Example:

    <?php
    $questionMark = "\x00?";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";

Incidentally, this behavior does not occur if the text encoding is
UTF-8, because no conversion is needed in that case.

mb_stripos had a similar issue, but instead of always using '?' as an
error marker internally, it would use the selected
mb_substitute_character. So, for example, if the mb_substitute_character
was '%', then occurrences of '%' in the haystack would match invalid
bytes in the needle, and vice versa.

Example:

    <?php
    mb_substitute_character(0x25); // '%'
    $percent = "\x00%";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";

This behavior (of mb_stripos) still occurs even if the text encoding is
UTF-8, because case folding is still needed to make the search
case-insensitive.

It is not hard to think of scenarios where these strange and unintuitive
behaviors could cause security vulnerabilities. In the discussion on
GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should
simply refuse to operate on invalid strings. However, this would almost
certainly break existing production code.

This commit mitigates the problem in a less intrusive way: it ensures
that while invalid haystacks can match invalid needles (even if the
specific invalid bytes are different), invalid bytes in the haystack
will never match '?' OR occurrences of the mb_substitute_character in
the needle, and vice versa.

This does represent a backwards compatibility break, but a small one.
Since it mitigates a potential security problem, I believe this is
appropriate.

Closes phpGH-9613.
alexdowad added a commit to alexdowad/php-src that referenced this issue Dec 15, 2022
…es of mb_substitute_char

In GitHub issue 9613, it was reported that mb_strpos wrongly matches the
character '?' against any invalid string, even when the character '?'
clearly does not appear in the invalid string. This behavior has existed
at least since PHP 5.2.

The reason for the behavior is that mb_strpos internally converts the
haystack and needle to UTF-8 before performing a search. When converting
to UTF-8, regardless of the setting of mb_substitute_character, libmbfl
would use '?' as an error marker for invalid byte sequences. Once those
invalid input sequences were replaced with '?', then naturally, they
would match against occurrences of the actual character '?' (when it
appeared as a 'normal' character, not as an error marker). This would
happen regardless of whether the error was in the haystack and '?' was
used in the needle, or whether the error was in the needle and '?' was
used in the haystack.

Why would libmbfl use '?' rather than the mb_substitute_character set
by the user? Remember that libmbfl was originally a separate library
which was imported into the PHP codebase. mb_substitute_character is an
mbstring API function, not something built into libmbfl. When mbstring
would call into libmbfl, it would provide the error replacement
character to libmbfl as a parameter. However, when libmbfl would perform
conversion operations internally, and not because of a direct call from
mbstring, it would use its own error replacement character.

Example:

    <?php
    $questionMark = "\x00?";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";

Incidentally, this behavior does not occur if the text encoding is
UTF-8, because no conversion is needed in that case.

mb_stripos had a similar issue, but instead of always using '?' as an
error marker internally, it would use the selected
mb_substitute_character. So, for example, if the mb_substitute_character
was '%', then occurrences of '%' in the haystack would match invalid
bytes in the needle, and vice versa.

Example:

    <?php
    mb_substitute_character(0x25); // '%'
    $percent = "\x00%";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";

This behavior (of mb_stripos) still occurs even if the text encoding is
UTF-8, because case folding is still needed to make the search
case-insensitive.

It is not hard to think of scenarios where these strange and unintuitive
behaviors could cause security vulnerabilities. In the discussion on
GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should
simply refuse to operate on invalid strings. However, this would almost
certainly break existing production code.

This commit mitigates the problem in a less intrusive way: it ensures
that while invalid haystacks can match invalid needles (even if the
specific invalid bytes are different), invalid bytes in the haystack
will never match '?' OR occurrences of the mb_substitute_character in
the needle, and vice versa.

This does represent a backwards compatibility break, but a small one.
Since it mitigates a potential security problem, I believe this is
appropriate.

Closes phpGH-9613.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants