Skip to content

Commit 128768a

Browse files
committed
Adjust number of error markers emitted for truncated UTF-8 code units
In 04e59c9, I amended the UTF-8 conversion code, so that when given invalid input, it would emit a number of errors markers harmonizing with the WHATWG's specification of the standard UTF-8 decoding algorithm. (Which, gentle reader of commit logs, you can find online at https://encoding.spec.whatwg.org/#utf-8-decoder.) However, the code in 04e59c9 was faulty in the case that a truncated UTF-8 code unit starts with 0xF1. Then, in dc1ba61, when making a small refactoring to a different part of the UTF-8 conversion code, I inexplicably broke part of the working code, causing the same fault which was already present with truncated UTF-8 code units starting with 0xF1 to also occur with 0xF2 and 0xF3 as well. I don't remember what inane thoughts I was thinking when I pulled off this feat of utter mental confusion. None of these cases were covered by unit tests, by the way. Thankfully, my trusty fuzzer picked up on this when testing the new implementation of mb_parse_str (since the legacy UTF-8 conversion filter did not suffer from the same problem, and I was fuzzing to find any differences in behavior between the old and new implementations). Fortuitously, the fuzzer also picked up another issue which was present in 04e59c9. I was emitting only one error marker for truncated code units starting with 0xE0 or 0xED, in cases where the WHATWG standard indicates two should be emitted. Examples are 0xE0 0x9F <END OF STRING> or 0xED 0xA0 <END OF STRING>. Code units starting with 0xE0-0xED should have 3 bytes. If the first byte is 0xE0, the second MUST be 0xA0 or greater. (Otherwise, the codepoint could have fit in a two-byte code unit.) And if the first byte is 0xED, the second MUST be 0x9F or less. According to the WHATWG algorithm, step 4, if the second byte is outside the legal range, then the decoder should emit an error... AND reprocess the out-of-range byte. The reprocessing will then cause another error. That's why the decoder should indicate two errors and not one.
1 parent a465689 commit 128768a

File tree

4 files changed

+26
-4
lines changed

4 files changed

+26
-4
lines changed

ext/mbstring/libmbfl/filters/mbfilter_utf8.c

+5-2
Original file line numberDiff line numberDiff line change
@@ -256,8 +256,11 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
256256
}
257257
} else {
258258
*out++ = MBFL_BAD_INPUT;
259-
while (p < e && (*p & 0xC0) == 0x80) {
259+
if (p < e && (c != 0xE0 || *p >= 0xA0) && (c != 0xED || *p < 0xA0) && (*p & 0xC0) == 0x80) {
260260
p++;
261+
if (p < e && (*p & 0xC0) == 0x80) {
262+
p++;
263+
}
261264
}
262265
}
263266
} else if (c >= 0xF0 && c <= 0xF4) { /* 4 byte character */
@@ -285,7 +288,7 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
285288
*out++ = MBFL_BAD_INPUT;
286289
if (p < e) {
287290
unsigned char c2 = *p;
288-
if ((c == 0xF0 && c2 >= 0x90) || (c == 0xF4 && c2 < 0x90)) {
291+
if ((c == 0xF0 && c2 >= 0x90) || (c == 0xF4 && c2 < 0x90) || (c >= 0xF1 && c <= 0xF3)) {
289292
while (p < e && (*p & 0xC0) == 0x80) {
290293
p++;
291294
}

ext/mbstring/libmbfl/filters/mbfilter_utf8_mobile.c

+5-2
Original file line numberDiff line numberDiff line change
@@ -362,8 +362,11 @@ static size_t mb_mobile_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32
362362
} else if (c >= 0xE0 && c <= 0xEF) {
363363
if ((e - p) < 2) {
364364
*out++ = MBFL_BAD_INPUT;
365-
while (p < e && (*p & 0xC0) == 0x80) {
365+
if (p < e && (c != 0xE0 || *p >= 0xA0) && (c != 0xED || *p < 0xA0) && (*p & 0xC0) == 0x80) {
366366
p++;
367+
if (p < e && (*p & 0xC0) == 0x80) {
368+
p++;
369+
}
367370
}
368371
continue;
369372
}
@@ -386,7 +389,7 @@ static size_t mb_mobile_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32
386389
*out++ = MBFL_BAD_INPUT;
387390
if (p < e) {
388391
unsigned char c2 = *p;
389-
if ((c == 0xF0 && c2 >= 0x90) || (c == 0xF4 && c2 < 0x90)) {
392+
if ((c == 0xF0 && c2 >= 0x90) || (c == 0xF4 && c2 < 0x90) || (c >= 0xF1 && c <= 0xF3)) {
390393
while (p < e && (*p & 0xC0) == 0x80) {
391394
p++;
392395
}

ext/mbstring/tests/utf8_mobile_encodings.phpt

+8
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,14 @@ $badUTF8 = array(
2727
"\xDF" => "\x00\x00\x00%", // should have been 2-byte
2828
"\xEF\xBF" => "\x00\x00\x00%", // should have been 3-byte
2929
"\xF0\xBF\xBF" => "\x00\x00\x00%", // should have been 4-byte
30+
"\xF1\x96" => "\x00\x00\x00%",
31+
"\xF1\x96\x80" => "\x00\x00\x00%",
32+
"\xF2\x94" => "\x00\x00\x00%",
33+
"\xF2\x94\x80" => "\x00\x00\x00%",
34+
"\xF3\x94" => "\x00\x00\x00%",
35+
"\xF3\x94\x80" => "\x00\x00\x00%",
36+
"\xE0\x9F" => "\x00\x00\x00%\x00\x00\x00%",
37+
"\xED\xA6" => "\x00\x00\x00%\x00\x00\x00%",
3038

3139
// Multi-byte characters which end too soon and go to ASCII
3240
"\xDFA" => "\x00\x00\x00%\x00\x00\x00A",

ext/mbstring/tests/utf_encodings.phpt

+8
Original file line numberDiff line numberDiff line change
@@ -774,6 +774,14 @@ $invalid = array(
774774
"\xDF" => "\x00\x00\x00%", // should have been 2-byte
775775
"\xEF\xBF" => "\x00\x00\x00%", // should have been 3-byte
776776
"\xF0\xBF\xBF" => "\x00\x00\x00%", // should have been 4-byte
777+
"\xF1\x96" => "\x00\x00\x00%",
778+
"\xF1\x96\x80" => "\x00\x00\x00%",
779+
"\xF2\x94" => "\x00\x00\x00%",
780+
"\xF2\x94\x80" => "\x00\x00\x00%",
781+
"\xF3\x94" => "\x00\x00\x00%",
782+
"\xF3\x94\x80" => "\x00\x00\x00%",
783+
"\xE0\x9F" => "\x00\x00\x00%\x00\x00\x00%",
784+
"\xED\xA6" => "\x00\x00\x00%\x00\x00\x00%",
777785

778786
// Multi-byte characters which end too soon and go to ASCII
779787
"\xDFA" => "\x00\x00\x00%\x00\x00\x00A",

0 commit comments

Comments
 (0)