mb_str{i,}pos does not match illegal byte sequences against occurrences of mb_substitute_character

alexdowad · alexdowad · commit 7f4455951673 · 2022-12-18T15:31:20.000+02:00
In GitHub issue 9613, it was reported that mb_strpos wrongly matches the character '?' against any invalid string, even when the character '?' clearly does not appear in the invalid string. This behavior has existed at least since PHP 5.2.

The reason for the behavior is that mb_strpos internally converts the haystack and needle to UTF-8 before performing a search. When converting to UTF-8, regardless of the setting of mb_substitute_character, libmbfl would use '?' as an error marker for invalid byte sequences. Once those invalid input sequences were replaced with '?', then naturally, they would match against occurrences of the actual character '?' (when it appeared as a 'normal' character, not as an error marker). This would happen regardless of whether the error was in the haystack and '?' was used in the needle, or whether the error was in the needle and '?' was used in the haystack.

Why would libmbfl use '?' rather than the mb_substitute_character set by the user? Remember that libmbfl was originally a separate library which was imported into the PHP codebase. mb_substitute_character is an mbstring API function, not something built into libmbfl. When mbstring would call into libmbfl, it would provide the error replacement character to libmbfl as a parameter. However, when libmbfl would perform conversion operations internally, and not because of a direct call from mbstring, it would use its own error replacement character.

Example:

    <?php
    $questionMark = "\x00?";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";

Incidentally, this behavior does not occur if the text encoding is UTF-8, because no conversion is needed in that case.

mb_stripos had a similar issue, but instead of always using '?' as an error marker internally, it would use the selected mb_substitute_character. So, for example, if the mb_substitute_character was '%', then occurrences of '%' in the haystack would match invalid bytes in the needle, and vice versa.

Example:

    <?php
    mb_substitute_character(0x25); // '%'
    $percent = "\x00%";
    $badUTF16 = "\xDB\x00"; // half of a surrogate pair
    echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
    echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";

This behavior (of mb_stripos) still occurs even if the text encoding is UTF-8, because case folding is still needed to make the search case-insensitive.

It is not hard to think of scenarios where these strange and unintuitive behaviors could cause security vulnerabilities. In the discussion on GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should simply refuse to operate on invalid strings. However, this would almost certainly break existing production code.

This commit mitigates the problem in a less intrusive way: it ensures that while invalid haystacks can match invalid needles (even if the specific invalid bytes are different), invalid bytes in the haystack will never match '?' OR occurrences of the mb_substitute_character in the needle, and vice versa.

This does represent a backwards compatibility break, but a small one. Since it mitigates a potential security problem, I believe this is appropriate.

Closes GH-9613.
diff --git a/ext/mbstring/mbstring.c b/ext/mbstring/mbstring.c
@@ -1923,8 +1923,9 @@ static size_t mb_find_strpos(zend_string *haystack, zend_string *needle, const m
 	unsigned char *offset_pointer;
 
 	if (!php_mb_is_no_encoding_utf8(enc->no_encoding)) {
-		haystack_u8 = php_mb_convert_encoding_ex(ZSTR_VAL(haystack), ZSTR_LEN(haystack), &mbfl_encoding_utf8, enc);
-		needle_u8 = php_mb_convert_encoding_ex(ZSTR_VAL(needle), ZSTR_LEN(needle), &mbfl_encoding_utf8, enc);
+		unsigned int num_errors = 0;
+		haystack_u8 = mb_fast_convert((unsigned char*)ZSTR_VAL(haystack), ZSTR_LEN(haystack), enc, &mbfl_encoding_utf8, 0, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, &num_errors);
+		needle_u8 = mb_fast_convert((unsigned char*)ZSTR_VAL(needle), ZSTR_LEN(needle), enc, &mbfl_encoding_utf8, 0, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, &num_errors);
 	} else {
 		haystack_u8 = haystack;
 		needle_u8 = needle;
@@ -4858,8 +4859,8 @@ MBSTRING_API size_t php_mb_stripos(bool mode, zend_string *haystack, zend_string
 {
 	/* We're using simple case-folding here, because we'd have to deal with remapping of
 	 * offsets otherwise. */
-	zend_string *haystack_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(haystack), ZSTR_LEN(haystack), enc, &mbfl_encoding_utf8, MBSTRG(current_filter_illegal_mode), MBSTRG(current_filter_illegal_substchar));
-	zend_string *needle_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(needle), ZSTR_LEN(needle), enc, &mbfl_encoding_utf8, MBSTRG(current_filter_illegal_mode), MBSTRG(current_filter_illegal_substchar));
+	zend_string *haystack_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(haystack), ZSTR_LEN(haystack), enc, &mbfl_encoding_utf8, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, 0);
+	zend_string *needle_conv = php_unicode_convert_case(PHP_UNICODE_CASE_FOLD_SIMPLE, ZSTR_VAL(needle), ZSTR_LEN(needle), enc, &mbfl_encoding_utf8, MBFL_OUTPUTFILTER_ILLEGAL_MODE_BADUTF8, 0);
 
 	size_t n = mb_find_strpos(haystack_conv, needle_conv, &mbfl_encoding_utf8, offset, mode);
 
diff --git a/ext/mbstring/tests/mb_stripos.phpt b/ext/mbstring/tests/mb_stripos.phpt
@@ -41,7 +41,7 @@ print mb_stripos($euc_jp, 0, -43,       'EUC-JP') . "\n";
 
 
 // Out of range - should return false
-print ("== OUT OF RANGE ==\n");
+print "== OUT OF RANGE ==\n";
 
 $r =  mb_stripos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 40, 'EUC-JP');
 ($r === FALSE) ? print "OK_OUT_RANGE\n"     : print "NG_OUT_RANGE\n";
@@ -100,6 +100,22 @@ $r = mb_stripos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC");
 $r = mb_stripos($euc_jp, "\n");
 ($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
 
+echo "== INVALID STRINGS ==\n";
+
+// Previously, mb_stripos would internally convert invalid byte sequences to the
+// mb_substitute_char BEFORE performing search
+// So invalid byte sequences would match the substitute char, both from haystack
+// to needle and needle to haystack
+
+mb_substitute_character(0x25); // '%'
+var_dump(mb_stripos("abc%%", "\xFF", 0, "UTF-8")); // should be false
+var_dump(mb_stripos("abc\xFF", "%", 0, "UTF-8")); // should be false
+
+// However, invalid byte sequences can still match invalid byte sequences:
+var_dump(mb_stripos("abc\x80\x80", "\xFF", 0, "UTF-8"));
+var_dump(mb_stripos("abc\xFF", "c\x80", 0, "UTF-8"));
+
+
 ?>
 --EXPECT--
 String len: 43
@@ -144,3 +160,8 @@ OK_NEWLINE
 0
 OK_STR
 OK_NEWLINE
+== INVALID STRINGS ==
+bool(false)
+bool(false)
+int(3)
+int(2)
diff --git a/ext/mbstring/tests/mb_strpos.phpt b/ext/mbstring/tests/mb_strpos.phpt
@@ -11,7 +11,7 @@ include_once('common.inc');
 
 
 // Test string
-$euc_jp = '0123����ʸ��������ܸ�Ǥ���EUC-JP��ȤäƤ��ޤ���0123���ܸ�����ݽ�����';
+$euc_jp = "0123\xA4\xB3\xA4\xCE\xCA\xB8\xBB\xFA\xCE\xF3\xA4\xCF\xC6\xFC\xCB\xDC\xB8\xEC\xA4\xC7\xA4\xB9\xA1\xA3EUC-JP\xA4\xF2\xBB\xC8\xA4\xC3\xA4\xC6\xA4\xA4\xA4\xDE\xA4\xB9\xA1\xA30123\xC6\xFC\xCB\xDC\xB8\xEC\xA4\xCF\xCC\xCC\xC5\xDD\xBD\xAD\xA4\xA4\xA1\xA3";
 
 $slen = mb_strlen($euc_jp, 'EUC-JP');
 echo "String len: $slen\n";
@@ -21,11 +21,11 @@ mb_internal_encoding('UTF-8') or print("mb_internal_encoding() failed\n");
 
 echo  "== POSITIVE OFFSET ==\n";
 
-print  mb_strpos($euc_jp, '���ܸ�', 0, 'EUC-JP') . "\n";
+print  mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 0, 'EUC-JP') . "\n";
 print  mb_strpos($euc_jp, '0', 0,     'EUC-JP') . "\n";
 print  mb_strpos($euc_jp, 3, 0,       'EUC-JP') . "\n";
 print  mb_strpos($euc_jp, 0, 0,       'EUC-JP') . "\n";
-print  mb_strpos($euc_jp, '���ܸ�', 15, 'EUC-JP') . "\n";
+print  mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 15, 'EUC-JP') . "\n";
 print  mb_strpos($euc_jp, '0', 15,     'EUC-JP') . "\n";
 print  mb_strpos($euc_jp, 3, 15,       'EUC-JP') . "\n";
 print  mb_strpos($euc_jp, 0, 15,       'EUC-JP') . "\n";
@@ -34,7 +34,7 @@ print  mb_strpos($euc_jp, 0, 15,       'EUC-JP') . "\n";
 // Negative offset
 echo "== NEGATIVE OFFSET ==\n";
 
-print mb_strpos($euc_jp, '���ܸ�', -15, 'EUC-JP') . "\n";
+print mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", -15, 'EUC-JP') . "\n";
 print mb_strpos($euc_jp, '0', -15,     'EUC-JP') . "\n";
 print mb_strpos($euc_jp, 3, -15,       'EUC-JP') . "\n";
 print mb_strpos($euc_jp, 0, -15,       'EUC-JP') . "\n";
@@ -44,7 +44,7 @@ print mb_strpos($euc_jp, 0, -43,       'EUC-JP') . "\n";
 // Non-existent
 echo "== NON-EXISTENT ==\n";
 
-$r = mb_strpos($euc_jp, '�ڹ��', 0, 'EUC-JP');
+$r = mb_strpos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC", 0, 'EUC-JP');
 ($r === FALSE) ? print "OK_STR\n"     : print "NG_STR\n";
 $r = mb_strpos($euc_jp, "\n",     0, 'EUC-JP');
 ($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
@@ -55,12 +55,12 @@ echo "== NO ENCODING PARAMETER ==\n";
 
 mb_internal_encoding('EUC-JP')  or print("mb_internal_encoding() failed\n");
 
-print  mb_strpos($euc_jp, '���ܸ�', 0) . "\n";
+print  mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC", 0) . "\n";
 print  mb_strpos($euc_jp, '0', 0) . "\n";
 print  mb_strpos($euc_jp, 3, 0) . "\n";
 print  mb_strpos($euc_jp, 0, 0) . "\n";
 
-$r = mb_strpos($euc_jp, '�ڹ��', 0);
+$r = mb_strpos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC", 0);
 ($r === FALSE) ? print "OK_STR\n"     : print "NG_STR\n";
 $r = mb_strpos($euc_jp, "\n", 0);
 ($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
@@ -70,16 +70,39 @@ echo "== NO OFFSET AND ENCODING PARAMETER ==\n";
 
 mb_internal_encoding('EUC-JP')  or print("mb_internal_encoding() failed\n");
 
-print  mb_strpos($euc_jp, '���ܸ�') . "\n";
+print  mb_strpos($euc_jp, "\xC6\xFC\xCB\xDC\xB8\xEC") . "\n";
 print  mb_strpos($euc_jp, '0') . "\n";
 print  mb_strpos($euc_jp, 3) . "\n";
 print  mb_strpos($euc_jp, 0) . "\n";
 
-$r = mb_strpos($euc_jp, '�ڹ��');
+$r = mb_strpos($euc_jp, "\xB4\xDA\xB9\xF1\xB8\xEC");
 ($r === FALSE) ? print "OK_STR\n"     : print "NG_STR\n";
 $r = mb_strpos($euc_jp, "\n");
 ($r === FALSE) ? print "OK_NEWLINE\n" : print "NG_NEWLINE\n";
 
+echo "== INVALID STRINGS ==\n";
+
+// Previously, mb_strpos would internally convert invalid byte sequences to '?'
+// BEFORE performing search
+// (This was regardless of the setting of mb_substitute_char)
+// So invalid byte sequences would match '?', both from haystack to needle
+// and needle to haystack
+
+var_dump(mb_strpos("abc??", "\xFF", 0, "UTF-8")); // should be false
+var_dump(mb_strpos("abc\xFF", "?", 0, "UTF-8")); // should be false
+
+// However, invalid byte sequences can still match other invalid byte
+// sequences for non-UTF-8 encodings only:
+var_dump(mb_strpos("\x00a\x00b\x00c\xDF\xFF", "\xDB\x00", 0, "UTF-16BE"));
+
+// For UTF-8, invalid byte sequences match the exact same invalid sequence,
+// but not a different one
+var_dump(mb_strpos("abc\x80\x80", "\xFF", 0, "UTF-8")); // should be false
+var_dump(mb_strpos("abc\xFF", "c\x80", 0, "UTF-8")); // should be false
+
+var_dump(mb_strpos("abc\x80\x80", "\x80", 0, "UTF-8"));
+var_dump(mb_strpos("abc\xFF", "c\xFF", 0, "UTF-8"));
+
 ?>
 --EXPECT--
 String len: 43
@@ -115,3 +138,11 @@ OK_NEWLINE
 0
 OK_STR
 OK_NEWLINE
+== INVALID STRINGS ==
+bool(false)
+bool(false)
+int(3)
+bool(false)
+bool(false)
+int(3)
+int(2)