Commit 703725e

Optimize conversion of CP936 to Unicode
In the previous commit, the branch in mb_strlen which implements the function using the mblen_table (when one is available) was removed. This made mb_strlen faster for just about every legacy text encoding which had an mblen_table... except for CP936, which became much slower.

This indicated that our decoding filter for CP936 was slow. I checked and found iterating over the PUA table was a major bottleneck. After optimizing that bottleneck out, benchmarks for text encoding conversion speed were as follows:

CP936, short - to UTF-8 - faster by 10.44% (0.0003 vs 0.0003)
CP936, short - to UTF-16BE - faster by 11.45% (0.0003 vs 0.0003)
CP936, medium - to UTF-8 - faster by 139.09% (0.0012 vs 0.0005)
CP936, medium - to UTF-16BE - faster by 140.34% (0.0013 vs 0.0005)
CP936, long - to UTF-16BE - faster by 215.88% (0.0538 vs 0.0170)
CP936, long - to UTF-8 - faster by 232.41% (0.0528 vs 0.0159)

This does not fully express how much faster the CP936 decoder is now, since these conversion benchmarks are not only measuring the speed of decoding CP936, but then also re-encoding the codepoints as UTF-8 or UTF-16. For functions like mb_strlen, which just need to decode but not re-encode the text, the gain in performance is much larger.
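The bottleneck described in the message can be illustrated with a small sketch. It contrasts a linear scan over a range table (the shape of the old PUA loop) with a direct lookup into flat per-range tables (the shape of the new code). The table contents and the names `range_tbl`, `flat_tbl1`, `flat_tbl2`, `lookup_scan` and `lookup_flat` are made up for illustration; they are not the real mbstring tables. The column layout of `range_tbl` is inferred from how the old loop indexes `mbfl_cp936_pua_tbl`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range table, one row per range:
 * { first codepoint, last codepoint, first source value } */
static const uint16_t range_tbl[][3] = {
	{0xE000, 0xE002, 0x10},
	{0xE100, 0xE101, 0x20},
};
#define RANGE_TBL_LEN (sizeof(range_tbl) / sizeof(range_tbl[0]))

/* Old shape: walk every row until one contains w.
 * Returns 0 as a "not found" sentinel (for this sketch only). */
static uint32_t lookup_scan(unsigned int w)
{
	for (size_t k = 0; k < RANGE_TBL_LEN; k++) {
		unsigned int first = range_tbl[k][2];
		unsigned int last = first + range_tbl[k][1] - range_tbl[k][0];
		if (w >= first && w <= last) {
			return w - first + range_tbl[k][0];
		}
	}
	return 0;
}

/* Hypothetical flat tables, one entry per source value in each range */
static const uint16_t flat_tbl1[] = {0xE000, 0xE001, 0xE002}; /* 0x10..0x12 */
static const uint16_t flat_tbl2[] = {0xE100, 0xE101};         /* 0x20..0x21 */

/* New shape: a couple of comparisons, then one array access */
static uint32_t lookup_flat(unsigned int w)
{
	if (w >= 0x10 && w <= 0x12) {
		return flat_tbl1[w - 0x10];
	}
	if (w >= 0x20 && w <= 0x21) {
		return flat_tbl2[w - 0x20];
	}
	return 0;
}
```

Both functions return the same value for every input. The flat version costs a fixed number of comparisons per character, while the scan grows with the number of rows in the range table.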
1 parent 31e7d6e commit 703725e

2 files changed: +366 -143 lines changed


ext/mbstring/libmbfl/filters/mbfilter_cp936.c

+25 -17
@@ -291,37 +291,45 @@ static size_t mb_cp936_to_wchar(unsigned char **in, size_t *in_len, uint32_t *bu
 			}
 
 			unsigned char c2 = *p++;
+			if (c2 < 0x40 || c2 == 0x7F || c2 == 0xFF) {
+				*out++ = MBFL_BAD_INPUT;
+				continue;
+			}
 
-			if (((c >= 0xAA && c <= 0xAF) || (c >= 0xF8 && c <= 0xFE)) && (c2 >= 0xA1 && c2 <= 0xFE)) {
+			if (((c >= 0xAA && c <= 0xAF) || (c >= 0xF8 && c <= 0xFE)) && c2 >= 0xA1) {
 				/* UDA part 1, 2: U+E000-U+E4C5 */
 				*out++ = 94*(c >= 0xF8 ? c - 0xF2 : c - 0xAA) + (c2 - 0xA1) + 0xE000;
-			} else if (c >= 0xA1 && c <= 0xA7 && c2 >= 0x40 && c2 < 0xA1 && c2 != 0x7F) {
+			} else if (c >= 0xA1 && c <= 0xA7 && c2 < 0xA1) {
 				/* UDA part 3: U+E4C6-U+E765*/
 				*out++ = 96*(c - 0xA1) + c2 - (c2 >= 0x80 ? 0x41 : 0x40) + 0xE4C6;
 			} else {
-				unsigned int w = (c << 8) | c2;
-
-				if ((w >= 0xA2AB && w <= 0xA9FE) || (w >= 0xD7FA && w <= 0xD7FE) || (w >= 0xFE50 && w <= 0xFEA0)) {
-					for (int k = 0; k < mbfl_cp936_pua_tbl_max; k++) {
-						if (w >= mbfl_cp936_pua_tbl[k][2] && w <= mbfl_cp936_pua_tbl[k][2] + mbfl_cp936_pua_tbl[k][1] - mbfl_cp936_pua_tbl[k][0]) {
-							*out++ = w - mbfl_cp936_pua_tbl[k][2] + mbfl_cp936_pua_tbl[k][0];
-							goto next_iteration;
+				unsigned int w = (c - 0x81)*192 + c2 - 0x40; /* Convert c, c2 into GB 2312 table lookup index */
+
+				/* For CP936 and GB18030, certain GB 2312 byte combinations are mapped to PUA codepoints,
+				 * whereas the same combinations aren't mapped to any codepoint for HZ and EUC-CN
+				 * To avoid duplicating the entire GB 2312 -> Unicode lookup table, we have three
+				 * auxiliary tables which are consulted instead for specific ranges of lookup indices */
+				if (w >= 0x192B) {
+					if (w <= 0x1EBE) {
+						*out++ = cp936_pua_tbl1[w - 0x192B];
+						continue;
+					} else if (w >= 0x413A) {
+						if (w <= 0x413E) {
+							*out++ = cp936_pua_tbl2[w - 0x413A];
+							continue;
+						} else if (w >= 0x5DD0 && w <= 0x5E20) {
+							*out++ = cp936_pua_tbl3[w - 0x5DD0];
+							continue;
 						}
 					}
 				}
 
-				if (c < 0xFF && c > 0x80 && c2 >= 0x40 && c2 < 0xFF && c2 != 0x7F) {
-					w = (c - 0x81)*192 + c2 - 0x40;
-					ZEND_ASSERT(w < cp936_ucs_table_size);
-					*out++ = cp936_ucs_table[w];
-				} else {
-					*out++ = MBFL_BAD_INPUT;
-				}
+				ZEND_ASSERT(w < cp936_ucs_table_size);
+				*out++ = cp936_ucs_table[w];
 			}
 		} else {
 			*out++ = 0xF8F5;
 		}
-next_iteration: ;
 	}
 
 	*in_len = e - p;
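The arithmetic in this hunk can be checked in isolation. The sketch below copies the three expressions from the diff into standalone helpers; the names `cp936_index`, `cp936_uda12` and `cp936_uda3` are hypothetical and do not exist in mbstring. It assumes the caller has already rejected invalid bytes and selected the right branch, as the decoder does.

```c
#include <stdint.h>

/* Table lookup index used by the new code: 192 slots per lead byte
 * (trail bytes 0x40..0xFF), with lead bytes counted from 0x81 */
static unsigned int cp936_index(unsigned char c, unsigned char c2)
{
	return (c - 0x81) * 192 + c2 - 0x40;
}

/* UDA parts 1 and 2: lead 0xAA..0xAF or 0xF8..0xFE, trail 0xA1..0xFE.
 * Each lead byte contributes a row of 94 codepoints. */
static uint32_t cp936_uda12(unsigned char c, unsigned char c2)
{
	return 94 * (c >= 0xF8 ? c - 0xF2 : c - 0xAA) + (c2 - 0xA1) + 0xE000;
}

/* UDA part 3: lead 0xA1..0xA7, trail 0x40..0xA0 except 0x7F.
 * Each lead byte contributes a row of 96 codepoints; the 0x41 offset
 * for trail bytes >= 0x80 closes the gap left by the skipped 0x7F. */
static uint32_t cp936_uda3(unsigned char c, unsigned char c2)
{
	return 96 * (c - 0xA1) + c2 - (c2 >= 0x80 ? 0x41 : 0x40) + 0xE4C6;
}
```

With these helpers, the index bounds in the new code line up with the byte-pair bounds in the removed condition: `cp936_index(0xA2, 0xAB)` is `0x192B`, `cp936_index(0xA9, 0xFE)` is `0x1EBE`, `cp936_index(0xD7, 0xFA)` is `0x413A`, `cp936_index(0xD7, 0xFE)` is `0x413E`, `cp936_index(0xFE, 0x50)` is `0x5DD0`, and `cp936_index(0xFE, 0xA0)` is `0x5E20`. The UDA helpers reproduce the endpoints given in the code comments: `0xE000` and `0xE4C5` for parts 1 and 2, `0xE4C6` and `0xE765` for part 3.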
