Skip to content

Commit 7f44559

Browse files
committed
mb_str{i,}pos does not match illegal byte sequences against occurrences of mb_substitute_char
In GitHub issue 9613, it was reported that mb_strpos wrongly matches the character '?' against any invalid string, even when the character '?' clearly does not appear in the invalid string. This behavior has existed at least since PHP 5.2. The reason for the behavior is that mb_strpos internally converts the haystack and needle to UTF-8 before performing a search. When converting to UTF-8, regardless of the setting of mb_substitute_character, libmbfl would use '?' as an error marker for invalid byte sequences. Once those invalid input sequences were replaced with '?', then naturally, they would match against occurrences of the actual character '?' (when it appeared as a 'normal' character, not as an error marker). This would happen regardless of whether the error was in the haystack and '?' was used in the needle, or whether the error was in the needle and '?' was used in the haystack. Why would libmbfl use '?' rather than the mb_substitute_character set by the user? Remember that libmbfl was originally a separate library which was imported into the PHP codebase. mb_substitute_character is an mbstring API function, not something built into libmbfl. When mbstring would call into libmbfl, it would provide the error replacement character to libmbfl as a parameter. However, when libmbfl would perform conversion operations internally, and not because of a direct call from mbstring, it would use its own error replacement character. Example: <?php $questionMark = "\x00?"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n"; Incidentally, this behavior does not occur if the text encoding is UTF-8, because no conversion is needed in that case. mb_stripos had a similar issue, but instead of always using '?' as an error marker internally, it would use the selected mb_substitute_character. So, for example, if the mb_substitute_character was '%', then occurrences of '%' in the haystack would match invalid bytes in the needle, and vice versa. Example: <?php mb_substitute_character(0x25); // '%' $percent = "\x00%"; $badUTF16 = "\xDB\x00"; // half of a surrogate pair echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n"; echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n"; This behavior (of mb_stripos) still occurs even if the text encoding is UTF-8, because case folding is still needed to make the search case-insensitive. It is not hard to think of scenarios where these strange and unintuitive behaviors could cause security vulnerabilities. In the discussion on GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should simply refuse to operate on invalid strings. However, this would almost certainly break existing production code. This commit mitigates the problem in a less intrusive way: it ensures that while invalid haystacks can match invalid needles (even if the specific invalid bytes are different), invalid bytes in the haystack will never match '?' OR occurrences of the mb_substitute_character in the needle, and vice versa. This does represent a backwards compatibility break, but a small one. Since it mitigates a potential security problem, I believe this is appropriate. Closes GH-9613.
1 parent 744ca16 commit 7f44559

File tree

3 files changed

+67
-14
lines changed

3 files changed

+67
-14
lines changed

ext/mbstring/mbstring.c

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1923,8 +1923,9 @@ static size_t mb_find_strpos(zend_string *haystack, zend_string *needle, const m
19231923
unsigned char *offset_pointer;
19241924

19251925
if (!php_mb_is_no_encoding_utf8(enc->no_encoding)) {
1926-
haystack_u8 = php_mb_convert_encoding_ex(ZSTR_VAL(haystack), ZSTR_LEN(haystack), &mbfl_encoding_utf8, enc);
1927-
needle_u8 = php_mb_convert_encoding_ex(ZSTR_VAL(needle), ZSTR_LEN(needle), &mbfl_encoding_utf8, enc);
1926+
unsigned int num_errors = 0;
1927+
haystack_u8 = mb_fast_convert((unsigned char*)ZSTR_VAL(haystack), ZSTR_LEN(haystack), enc, &mbfl_encoding_utf8, 0, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, &num_errors);
1928+
needle_u8 = mb_fast_convert((unsigned char*)ZSTR_VAL(needle), ZSTR_LEN(needle), enc, &mbfl_encoding_utf8, 0, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, &num_errors);
19281929
} else {
19291930
haystack_u8 = haystack;
19301931
needle_u8 = needle;
@@ -4858,8 +4859,8 @@ MBSTRING_API size_t php_mb_stripos(bool mode, zend_string *haystack, zend_string
48584859
{
48594860
/* We're using simple case-folding here, because we'd have to deal with remapping of
48604861
* offsets otherwise. */
4861-
zend_string *haystack_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(haystack), ZSTR_LEN(haystack), enc, &mbfl_encoding_utf8, MBSTRG(current_filter_illegal_mode), MBSTRG(current_filter_illegal_substchar));
4862-
zend_string *needle_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(needle), ZSTR_LEN(needle), enc, &mbfl_encoding_utf8, MBSTRG(current_filter_illegal_mode), MBSTRG(current_filter_illegal_substchar));
4862+
zend_string *haystack_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(haystack), ZSTR_LEN(haystack), enc, &mbfl_encoding_utf8, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, 0);
4863+
zend_string *needle_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(needle), ZSTR_LEN(needle), enc, &mbfl_encoding_utf8, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, 0);
48634864

48644865
size_t n = mb_find_strpos(haystack_conv, needle_conv, &mbfl_encoding_utf8, offset, mode);
48654866

ext/mbstring/tests/mb_stripos.phpt

Lines changed: 22 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,7 @@ print mb_stripos($euc_jp, 0, -43, 'EUC-JP') . "\n";
4141

4242

4343
// Out of range - should return false
44-
print ("== OUT OF RANGE ==\n");
44+
print "== OUT OF RANGE ==\n";
4545

4646
$r = mb_stripos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 40, 'EUC-JP');
4747
($r === FALSE) ? print "OK_OUT_RANGE\n" : print "NG_OUT_RANGE\n";
@@ -100,6 +100,22 @@ $r = mb_stripos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC");
100100
$r = mb_stripos($euc_jp, "\n");
101101
($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
102102

103+
echo "== INVALID STRINGS ==\n";
104+
105+
// Previously, mb_stripos would internally convert invalid byte sequences to the
106+
// mb_substitute_char BEFORE performing search
107+
// So invalid byte sequences would match the substitute char, both from haystack
108+
// to needle and needle to haystack
109+
110+
mb_substitute_character(0x25); // '%'
111+
var_dump(mb_stripos("abc%%", "\xFF", 0, "UTF-8")); // should be false
112+
var_dump(mb_stripos("abc\xFF", "%", 0, "UTF-8")); // should be false
113+
114+
// However, invalid byte sequences can still match invalid byte sequences:
115+
var_dump(mb_stripos("abc\x80\x80", "\xFF", 0, "UTF-8"));
116+
var_dump(mb_stripos("abc\xFF", "c\x80", 0, "UTF-8"));
117+
118+
103119
?>
104120
--EXPECT--
105121
String len: 43
@@ -144,3 +160,8 @@ OK_NEWLINE
144160
0
145161
OK_STR
146162
OK_NEWLINE
163+
== INVALID STRINGS ==
164+
bool(false)
165+
bool(false)
166+
int(3)
167+
int(2)

ext/mbstring/tests/mb_strpos.phpt

Lines changed: 40 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ include_once('common.inc');
1111

1212

1313
// Test string
14-
$euc_jp = '0123この文字列は日本語です。EUC-JPを使っています。0123日本語は面倒臭い。';
14+
$euc_jp = "0123\xA4\xB3\xA4\xCE\xCA\xB8\xBB\xFA\xCE\xF3\xA4\xCF\xC6\xFC\xCB\xDC\xB8\xEC\xA4\xC7\xA4\xB9\xA1\xA3EUC-JP\xA4\xF2\xBB\xC8\xA4\xC3\xA4\xC6\xA4\xA4\xA4\xDE\xA4\xB9\xA1\xA30123\xC6\xFC\xCB\xDC\xB8\xEC\xA4\xCF\xCC\xCC\xC5\xDD\xBD\xAD\xA4\xA4\xA1\xA3";
1515

1616
$slen = mb_strlen($euc_jp, 'EUC-JP');
1717
echo "String len: $slen\n";
@@ -21,11 +21,11 @@ mb_internal_encoding('UTF-8') or print("mb_internal_encoding() failed\n");
2121

2222
echo "== POSITIVE OFFSET ==\n";
2323

24-
print mb_strpos($euc_jp, '日本語', 0, 'EUC-JP') . "\n";
24+
print mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 0, 'EUC-JP') . "\n";
2525
print mb_strpos($euc_jp, '0', 0, 'EUC-JP') . "\n";
2626
print mb_strpos($euc_jp, 3, 0, 'EUC-JP') . "\n";
2727
print mb_strpos($euc_jp, 0, 0, 'EUC-JP') . "\n";
28-
print mb_strpos($euc_jp, '日本語', 15, 'EUC-JP') . "\n";
28+
print mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 15, 'EUC-JP') . "\n";
2929
print mb_strpos($euc_jp, '0', 15, 'EUC-JP') . "\n";
3030
print mb_strpos($euc_jp, 3, 15, 'EUC-JP') . "\n";
3131
print mb_strpos($euc_jp, 0, 15, 'EUC-JP') . "\n";
@@ -34,7 +34,7 @@ print mb_strpos($euc_jp, 0, 15, 'EUC-JP') . "\n";
3434
// Negative offset
3535
echo "== NEGATIVE OFFSET ==\n";
3636

37-
print mb_strpos($euc_jp, '日本語', -15, 'EUC-JP') . "\n";
37+
print mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", -15, 'EUC-JP') . "\n";
3838
print mb_strpos($euc_jp, '0', -15, 'EUC-JP') . "\n";
3939
print mb_strpos($euc_jp, 3, -15, 'EUC-JP') . "\n";
4040
print mb_strpos($euc_jp, 0, -15, 'EUC-JP') . "\n";
@@ -44,7 +44,7 @@ print mb_strpos($euc_jp, 0, -43, 'EUC-JP') . "\n";
4444
// Non-existent
4545
echo "== NON-EXISTENT ==\n";
4646

47-
$r = mb_strpos($euc_jp, '韓国語', 0, 'EUC-JP');
47+
$r = mb_strpos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC", 0, 'EUC-JP');
4848
($r === FALSE) ? print "OK_STR\n" : print "NG_STR\n";
4949
$r = mb_strpos($euc_jp, "\n", 0, 'EUC-JP');
5050
($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
@@ -55,12 +55,12 @@ echo "== NO ENCODING PARAMETER ==\n";
5555

5656
mb_internal_encoding('EUC-JP') or print("mb_internal_encoding() failed\n");
5757

58-
print mb_strpos($euc_jp, '日本語', 0) . "\n";
58+
print mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 0) . "\n";
5959
print mb_strpos($euc_jp, '0', 0) . "\n";
6060
print mb_strpos($euc_jp, 3, 0) . "\n";
6161
print mb_strpos($euc_jp, 0, 0) . "\n";
6262

63-
$r = mb_strpos($euc_jp, '韓国語', 0);
63+
$r = mb_strpos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC", 0);
6464
($r === FALSE) ? print "OK_STR\n" : print "NG_STR\n";
6565
$r = mb_strpos($euc_jp, "\n", 0);
6666
($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
@@ -70,16 +70,39 @@ echo "== NO OFFSET AND ENCODING PARAMETER ==\n";
7070

7171
mb_internal_encoding('EUC-JP') or print("mb_internal_encoding() failed\n");
7272

73-
print mb_strpos($euc_jp, '日本語') . "\n";
73+
print mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC") . "\n";
7474
print mb_strpos($euc_jp, '0') . "\n";
7575
print mb_strpos($euc_jp, 3) . "\n";
7676
print mb_strpos($euc_jp, 0) . "\n";
7777

78-
$r = mb_strpos($euc_jp, '韓国語');
78+
$r = mb_strpos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC");
7979
($r === FALSE) ? print "OK_STR\n" : print "NG_STR\n";
8080
$r = mb_strpos($euc_jp, "\n");
8181
($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
8282

83+
echo "== INVALID STRINGS ==\n";
84+
85+
// Previously, mb_strpos would internally convert invalid byte sequences to '?'
86+
// BEFORE performing search
87+
// (This was regardless of the setting of mb_substitute_char)
88+
// So invalid byte sequences would match '?', both from haystack to needle
89+
// and needle to haystack
90+
91+
var_dump(mb_strpos("abc??", "\xFF", 0, "UTF-8")); // should be false
92+
var_dump(mb_strpos("abc\xFF", "?", 0, "UTF-8")); // should be false
93+
94+
// However, invalid byte sequences can still match other invalid byte
95+
// sequences for non-UTF-8 encodings only:
96+
var_dump(mb_strpos("\x00a\x00b\x00c\xDF\xFF", "\xDB\x00", 0, "UTF-16BE"));
97+
98+
// For UTF-8, invalid byte sequences match the exact same invalid sequence,
99+
// but not a different one
100+
var_dump(mb_strpos("abc\x80\x80", "\xFF", 0, "UTF-8")); // should be false
101+
var_dump(mb_strpos("abc\xFF", "c\x80", 0, "UTF-8")); // should be false
102+
103+
var_dump(mb_strpos("abc\x80\x80", "\x80", 0, "UTF-8"));
104+
var_dump(mb_strpos("abc\xFF", "c\xFF", 0, "UTF-8"));
105+
83106
?>
84107
--EXPECT--
85108
String len: 43
@@ -115,3 +138,11 @@ OK_NEWLINE
115138
0
116139
OK_STR
117140
OK_NEWLINE
141+
== INVALID STRINGS ==
142+
bool(false)
143+
bool(false)
144+
int(3)
145+
bool(false)
146+
bool(false)
147+
int(3)
148+
int(2)

0 commit comments

Comments
 (0)