Commit 092ad3e

committed Jan 8, 2023
Optimize branch structure of UTF-8 decoder routine
I like the asm which gcc -O3 generates on this modified code... and guess what: my CPU likes it too!

(The asm is noticeably tighter, without any extra operations in the path which dispatches to the code for decoding a 1-byte, 2-byte, 3-byte, or 4-byte character. It's just CMP, conditional jump, CMP, conditional jump, CMP, conditional jump.

...Though I was admittedly impressed to see gcc could implement the boolean expression `c >= 0xC2 && c <= 0xDF` with just 3 instructions: add, CMP, then conditional jump. Pretty slick stuff there, guys.)

Benchmark results:

UTF-8, short - to UTF-16LE faster by 7.36% (0.0001 vs 0.0002)
UTF-8, short - to UTF-16BE faster by 6.24% (0.0001 vs 0.0002)
UTF-8, medium - to UTF-16BE faster by 4.56% (0.0003 vs 0.0003)
UTF-8, medium - to UTF-16LE faster by 4.00% (0.0003 vs 0.0003)
UTF-8, long - to UTF-16BE faster by 1.02% (0.0215 vs 0.0217)
UTF-8, long - to UTF-16LE faster by 1.01% (0.0209 vs 0.0211)
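The three-instruction sequence for `c >= 0xC2 && c <= 0xDF` is most likely the usual unsigned range-check trick: shift the range so that it starts at zero, then do a single unsigned comparison against its width. The sketch below is an assumption about what the compiler does, not something read off the actual gcc output, and the function names are hypothetical.

```c
#include <assert.h>

/* Two-comparison form, as the condition was written in the source. */
static int in_range_two_cmp(unsigned char c)
{
    return c >= 0xC2 && c <= 0xDF;
}

/* One-comparison form (assumed compiler strategy): subtract the lower
 * bound with 8-bit wraparound, then compare against the range width.
 * Bytes below 0xC2 wrap around to values larger than the width. */
static int in_range_one_cmp(unsigned char c)
{
    return (unsigned char)(c - 0xC2) <= (0xDF - 0xC2);
}

/* Exhaustively confirm that both forms agree for every byte value. */
static void check_forms_agree(void)
{
    for (int i = 0; i < 256; i++) {
        unsigned char c = (unsigned char)i;
        assert(in_range_two_cmp(c) == in_range_one_cmp(c));
    }
}
```

Even in this compact form the check costs an extra arithmetic instruction per branch, which is what the restructured `if` chain avoids.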
1 parent d8b5b9f commit 092ad3e

1 file changed, +5 -3 lines changed
ext/mbstring/libmbfl/filters/mbfilter_utf8.c

@@ -225,7 +225,9 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
 
 if (c < 0x80) {
     *out++ = c;
-} else if (c >= 0xC2 && c <= 0xDF) { /* 2 byte character */
+} else if (c < 0xC2) {
+    *out++ = MBFL_BAD_INPUT;
+} else if (c <= 0xDF) { /* 2 byte character */
     if (p < e) {
         unsigned char c2 = *p++;
         if ((c2 & 0xC0) != 0x80) {
@@ -237,7 +239,7 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
     } else {
         *out++ = MBFL_BAD_INPUT;
     }
-} else if (c >= 0xE0 && c <= 0xEF) { /* 3 byte character */
+} else if (c <= 0xEF) { /* 3 byte character */
     if ((e - p) >= 2) {
         unsigned char c2 = *p++;
         unsigned char c3 = *p++;
@@ -262,7 +264,7 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
             }
         }
     }
-} else if (c >= 0xF0 && c <= 0xF4) { /* 4 byte character */
+} else if (c <= 0xF4) { /* 4 byte character */
     if ((e - p) >= 3) {
         unsigned char c2 = *p++;
         unsigned char c3 = *p++;
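
Dropping the lower-bound checks is safe because each earlier branch has already excluded the smaller values, so only the upper bound still needs testing. A minimal sketch, reduced to classifying the lead byte: the helper names and the return convention (0 for a bad lead byte) are hypothetical, and it assumes that the final `else` of the chain, which the diff does not show, also treats the byte as bad input.

```c
#include <assert.h>

/* Branch structure before the commit: each multi-byte case tests both
 * bounds. Returns the sequence length, or 0 for an invalid lead byte. */
static int lead_len_before(unsigned char c)
{
    if (c < 0x80) return 1;
    else if (c >= 0xC2 && c <= 0xDF) return 2;
    else if (c >= 0xE0 && c <= 0xEF) return 3;
    else if (c >= 0xF0 && c <= 0xF4) return 4;
    else return 0; /* assumed: bad input, as for 0x80..0xC1 */
}

/* Branch structure after the commit: 0x80..0xC1 is rejected up front,
 * so every later case only needs its upper bound. */
static int lead_len_after(unsigned char c)
{
    if (c < 0x80) return 1;
    else if (c < 0xC2) return 0;  /* continuation or overlong lead byte */
    else if (c <= 0xDF) return 2;
    else if (c <= 0xEF) return 3;
    else if (c <= 0xF4) return 4;
    else return 0;                /* 0xF5..0xFF can never start a sequence */
}

/* Exhaustively confirm that both chains classify every byte identically. */
static void check_dispatch_agrees(void)
{
    for (int i = 0; i < 256; i++) {
        unsigned char c = (unsigned char)i;
        assert(lead_len_before(c) == lead_len_after(c));
    }
}
```

Under that assumption the two chains are equivalent for all 256 byte values, so the change affects only the generated code, not the decoder's behavior.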
