mb_strpos matches illegal character when needle is '?' #9613
This does not look like a bug, because the actual character encoding of |
@cmb69 Thanks for checking. Please read below🙇 I'm talking about bad bytes, not different encodings. For example, if you send a control character, why? will match. This can be used for some attacks. I create illegal byte from |
https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfl_convert.c#L108 |
Oh, indeed, there is something wrong as of PHP 8.1.0. |
@youkidearitai Thanks for discovering this interesting issue. First, let's be clear about what @youkidearitai is kindly suggesting here. He feels that @youkidearitai further points out a very interesting issue; when php-src/ext/mbstring/libmbfl/mbfl/mbfilter.c Lines 586 to 595 in 63a0775
Note that before performing a search, With, say, UTF-16BE, this would be very possible. If we took U+1234 U+5678 as haystack and U+3456 as needle, a naive byte match would "find" U+3456 by taking the second byte of one code unit and the first byte of the next. With UTF-8, that can't happen. So we can use a simple byte-for-byte match. Note one critical thing here: So why is php-src/ext/mbstring/mbstring.c Lines 4778 to 4792 in 63a0775
That's why. It first converts the haystack and needle to "fold case". The case conversion operation does respect the value of There is one question remaining which I haven't investigated yet. @youkidearitai discovered that the following code gives different results before and after PHP 8.1.0:
In PHP 8.1, this behaves as expected. However, in PHP <= 8.0, it still "finds" a question mark. To figure out the reason, I will need to build a copy of PHP 8.0 and step through this code in a debugger. It appears that for some reason, in PHP 8.0, a question mark is still inserted by the case conversion operation, even though it should respect the value of Will post an update if I find the answer. |
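The UTF-16BE scenario described in this comment can be checked from userland. The following is only a minimal sketch, assuming the mbstring extension is loaded; the variable names are illustrative. It contrasts a naive byte search (plain `strpos`) with `mb_strpos` on the same bytes.

```php
<?php
// Haystack: U+1234 U+5678 encoded as UTF-16BE (two 2-byte code units).
$haystack = "\x12\x34\x56\x78";
// Needle: U+3456 encoded as UTF-16BE.
$needle = "\x34\x56";

// Byte-wise search: matches at byte offset 1, straddling two code units.
$byteMatch = strpos($haystack, $needle);

// Encoding-aware search: U+3456 does not occur in the haystack.
$charMatch = mb_strpos($haystack, $needle, 0, 'UTF-16BE');

var_dump($byteMatch, $charMatch);
```

The byte-wise match is a false positive because it begins in the middle of a code unit. Since a valid UTF-8 sequence cannot begin in the middle of another, this kind of misaligned match is what the conversion to UTF-8 rules out.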
mb_strpos mb_convert_encoding
<?php
$input_string = urldecode('%1B');
$needle_exists = '?';
var_dump(
    urlencode($input_string), // 27 decimal, escape: valid UTF-8 or ISO-2022-JP
    mb_detect_encoding($input_string, 'ISO-2022-JP'),
    urlencode(mb_convert_encoding($input_string, "UTF-8", "ISO-2022-JP")),
    mb_substitute_character(),
    mb_strpos($input_string, $needle_exists, 0, "ISO-2022-JP")
);
?>
Expected result:
PHP >= 8.2rc1:
PHP < 8.2rc1:
strict true mb_detect_encoding 4.3.3 - 4.3.11, 4.4.0 - 4.4.9, 5.0.0 - 5.0.5, 5.1.0 - 5.1.6, 5.2.0 - 5.2.11, 5.3.0 - 5.3.1, 8.1.0 - 8.1.11
|
@hormus Thank you for joining this discussion. I guess you wanted to show another case where In this case, we have an ISO-2022-JP string with only a bare For PHP >= 8.2... see b2f963f. That's where the difference comes from. The reason for this change was for consistency with mbstring's handling of other legacy text encodings (for which a truncated escape sequence is generally treated as an error). Your post indicates that you feel the result of converting a bare Anyways, I think the main issue raised in this thread is about the way @youkidearitai I have thought of another, rather strange, symptom of the way It's definitely a strange behavior, and as @youkidearitai mentioned, it's possible to imagine scenarios where it might even become a security issue. One solution would be to say that Another "solution" is to say: the output of And then the third option is to say that The problem with that is that it still leaves the door open for very surprising behavior, and possibly even security bugs. Say an application receives text from the "outside world" and wants to search for some special character sequence. Say the sequence we want to find is Next thing, the application would probably use Comments, please? |
Just thought of another solution... Instead of (Equivalently, we could add a new, internal error-handling mode which is not selectable by In this way, we avoid the problem of unwanted matches for Further, if the needle is an invalid string, we always return "no match". Further, for Is this a great idea, or is this a great idea? |
Uh... I guess folding case after converting to UTF-8 would require a completely new case-folding function, because the existing one works on wchars. Hmm. But if we add a new, internal error-handling mode as mentioned above, we could use that for the case-folding operation as well, and it would still work. Actually, what we are doing right now in |
Thanks for the thorough analysis, @alexdowad!
I think it is! At least for now we should try to have some sensible behavior wrt. invalid strings which closely matches previous behavior. In the long run, though, I'd prefer to be more strict when invalid strings are passed, but that would need some transition period (maybe deprecate invalid string arguments). As such, documenting the behavior as undefined now, makes sense to me. |
So @cmb69 favors documenting the behavior in the current PHP release as 'undefined', and implementing more reasonable behavior in a future release. I would like to say that rejecting invalid input strings would have an additional downside; it would impose an extra performance burden when working on UTF-8 strings. Currently, if the input strings are UTF-8, we operate on them without performing any checks or conversions, but if we guarantee that invalid strings will be rejected with an error, then we have to run over those UTF-8 strings and check them. Maybe that overhead is worth it? Especially if it helps people to avoid security bugs which might lead to their sites getting hacked... |
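For applications that cannot wait for a change in mbstring itself, the stricter behavior discussed here can be approximated in userland. The sketch below is an assumption-laden illustration, not anything mbstring provides: `strict_mb_strpos` is a hypothetical helper name, and it simply treats any invalid haystack or needle as "no match". The two `mb_check_encoding` calls are the extra pass over each string that the performance concern above refers to.

```php
<?php
/**
 * Hypothetical wrapper (not part of mbstring): refuse to search when either
 * argument is not valid in the given encoding.
 *
 * @return int|false Character offset of the first match, or false.
 */
function strict_mb_strpos(string $haystack, string $needle, string $encoding)
{
    if (!mb_check_encoding($haystack, $encoding) || !mb_check_encoding($needle, $encoding)) {
        return false; // treat invalid input as "no match"
    }
    return mb_strpos($haystack, $needle, 0, $encoding);
}

// Valid UTF-16BE "a?" contains "?" at character offset 1.
var_dump(strict_mb_strpos("\x00a\x00?", "\x00?", 'UTF-16BE'));
// The byte 0xFF never appears in valid UTF-8, so the search is refused.
var_dump(strict_mb_strpos("abc\xFF", '?', 'UTF-8'));
```

Returning false rather than throwing keeps the helper a drop-in replacement for existing `mb_strpos` call sites; an application that prefers to reject bad input outright could throw instead.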
Hmm, maybe we could make use of |
😮 I never knew about this! We could use this to make |
@cmb69's suggestion that Maybe I should create an RFC? |
I guess only few know this so far …
… so this would be great! We need to ensure that both PCRE's and MBString's idea of valid UTF-8 matches (I'm not quite sure about which Unicode version are supported by these; likely depends on the version). Not directly related to MBString, but it might make sense to introduce something like |
Not related, but you reminded me of |
These are all the types of errors which PCRE can detect in a UTF-8 string: php-src/ext/pcre/pcre2lib/pcre2_valid_utf.c Lines 111 to 131 in 0001ed2
This looks like the same types of errors which mbstring can also detect... but to be safe, it will be better to fuzz and see if we can find any differences. By the way, when I spoke of submitting an RFC, I wasn't talking about the issue of using |
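As a rough userland illustration of the comparison being discussed, the two validators can be run side by side on a few hand-picked byte strings. This is only a sketch under stated assumptions: the sample list is mine, it is no substitute for the fuzzing proposed above, and it relies on `preg_match` returning false for malformed subjects when the `/u` modifier is used.

```php
<?php
// Hand-picked samples; a real check would use a fuzzer, as suggested above.
$samples = [
    'ascii'            => "abc",
    'U+1F600'          => "\xF0\x9F\x98\x80",
    'U+10FFFF'         => "\xF4\x8F\xBF\xBF",
    'overlong NUL'     => "\xC0\x80",
    'lone surrogate'   => "\xED\xA0\x80",
    'above U+10FFFF'   => "\xF4\x90\x80\x80",
    'truncated 3-byte' => "\xE3\x81",
    'stray 0xFF'       => "\xFF",
];

$disagreements = [];
foreach ($samples as $label => $bytes) {
    $mbValid = mb_check_encoding($bytes, 'UTF-8');
    // With the /u modifier, preg_match() returns false on malformed UTF-8.
    $pcreValid = preg_match('//u', $bytes) !== false;
    if ($mbValid !== $pcreValid) {
        $disagreements[] = $label;
    }
}

// Labels of samples on which the two validators disagree.
var_dump($disagreements);
```

Any label that shows up in `$disagreements` would be a case where mbstring and PCRE hold different ideas of valid UTF-8, which is exactly what would make a shared "known valid" flag unsafe.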
Naming things is hard :) I think in this case
Thank you for checking! That might already be sufficient (I was mostly concerned that either might not support all 17 planes). Additional fuzzing would be nice, though.
Right, that is what the RFC should be about. However, since performance matters, (not) being able to cache the information about already checked strings appears to be important though (and I wouldn't like to introduce another notion of "valid UTF-8" for the zvals). |
I just tried fuzzing Do you think more fuzzing should be done to make sure? |
I think this is good for now. Thanks for checking! |
…es of mb_substitute_char

In GitHub issue 9613, it was reported that mb_strpos wrongly matches the character '?' against any invalid string, even when the character '?' clearly does not appear in the invalid string. This behavior has existed at least since PHP 5.2.

The reason for the behavior is that mb_strpos internally converts the haystack and needle to UTF-8 before performing a search. When converting to UTF-8, regardless of the setting of mb_substitute_character, libmbfl would use '?' as an error marker for invalid byte sequences. Once those invalid input sequences were replaced with '?', then naturally, they would match against occurrences of the actual character '?' (when it appeared as a 'normal' character, not as an error marker).

This would happen regardless of whether the error was in the haystack and '?' was used in the needle, or whether the error was in the needle and '?' was used in the haystack.

Why would libmbfl use '?' rather than the mb_substitute_character set by the user? Remember that libmbfl was originally a separate library which was imported into the PHP codebase. mb_substitute_character is an mbstring API function, not something built into libmbfl. When mbstring would call into libmbfl, it would provide the error replacement character to libmbfl as a parameter. However, when libmbfl would perform conversion operations internally, and not because of a direct call from mbstring, it would use its own error replacement character.

Example:

<?php
$questionMark = "\x00?";
$badUTF16 = "\xDB\x00"; // half of a surrogate pair
echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";

Incidentally, this behavior does not occur if the text encoding is UTF-8, because no conversion is needed in that case.

mb_stripos had a similar issue, but instead of always using '?' as an error marker internally, it would use the selected mb_substitute_character. So, for example, if the mb_substitute_character was '%', then occurrences of '%' in the haystack would match invalid bytes in the needle, and vice versa.

Example:

<?php
mb_substitute_character(0x25); // '%'
$percent = "\x00%";
$badUTF16 = "\xDB\x00"; // half of a surrogate pair
echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";

This behavior (of mb_stripos) still occurs even if the text encoding is UTF-8, because case folding is still needed to make the search case-insensitive.

It is not hard to think of scenarios where these strange and unintuitive behaviors could cause security vulnerabilities. In the discussion on GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should simply refuse to operate on invalid strings. However, this would almost certainly break existing production code.

This commit mitigates the problem in a less intrusive way: it ensures that while invalid haystacks can match invalid needles (even if the specific invalid bytes are different), invalid bytes in the haystack will never match '?' OR occurrences of the mb_substitute_character in the needle, and vice versa.

This does represent a backwards compatibility break, but a small one. Since it mitigates a potential security problem, I believe this is appropriate.

Closes phpGH-9613.
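The mechanism described in this commit message (an error marker that is indistinguishable from a real character) can be illustrated with the public `mb_convert_encoding` function. This is only a sketch of the general idea, assuming the mbstring extension is loaded: it does not exercise the internal libmbfl conversion path used by mb_strpos, and the variable names are illustrative.

```php
<?php
// The byte 0xFF is never valid in UTF-8, so conversion must substitute or drop it.
$invalid = "a\xFFb";

mb_substitute_character(0x3F); // use '?' as the error marker
$withMarker = mb_convert_encoding($invalid, 'UTF-8', 'UTF-8');

mb_substitute_character('none'); // drop invalid bytes instead
$dropped = mb_convert_encoding($invalid, 'UTF-8', 'UTF-8');

// After substitution, a plain search for '?' "finds" the invalid byte.
var_dump(strpos($withMarker, '?') !== false);
// With invalid bytes dropped, no '?' is introduced.
var_dump(strpos($dropped, '?') !== false);
```

Once the marker has been written into the converted string, nothing records that it came from an error rather than from the input, which is why a later search cannot tell the two apart.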
Description
The following code:
Resulted in this output:
But I expected this output instead:
Note: mb_stripos is affected by mb_substitute_character, but mb_strpos is not (3v4l: https://3v4l.org/872nB )
PHP Version
PHP 8.1.x
Operating System
Ubuntu 20.04 on WSL2