Commit 8f318c3

Add specialized UTF-8 validation function for hosts with no SSE2/AVX2 support
In a GitHub thread, Michael Voříšek and Kamil Tekiela mentioned that the PCRE2 function `pcre_match` can be used to validate UTF-8, and that historically it was more efficient than mbstring's `mb_check_encoding`.

`mb_check_encoding` is now much faster on hosts with SSE2, and much faster again on hosts with AVX2. However, while all x86-64 CPUs support at least SSE2, not all PHP users run their code on x86-64 hardware. For example, some use recent Macs with ARM CPUs.

Therefore, borrow PCRE2's UTF-8 validation function as a fallback for hosts with no SSE2/AVX2 support. On long UTF-8 strings, this code is 50% faster than mbstring's existing fallback code.
1 parent 02bd52b commit 8f318c3

1 file changed: +77 -3 lines changed
ext/mbstring/mbstring.c

@@ -4620,8 +4620,8 @@ MBSTRING_API bool php_mb_check_encoding(const char *input, size_t length, const
  * A faster implementation which uses AVX2 instructions follows */
 static bool mb_fast_check_utf8_default(zend_string *str)
 {
-# ifdef __SSE2__
 	unsigned char *p = (unsigned char*)ZSTR_VAL(str);
+# ifdef __SSE2__
 	/* `e` points 1 byte past the last full 16-byte block of string content
 	 * Note that we include the terminating null byte which is included in each zend_string
 	 * as part of the content to check; this ensures that multi-byte characters which are
@@ -4810,8 +4810,82 @@ static bool mb_fast_check_utf8_default(zend_string *str)
 
 	return true;
 # else
-	/* No SSE2 support; we might add generic UTF-8 specific validation code here later */
-	return php_mb_check_encoding(ZSTR_VAL(str), ZSTR_LEN(str), &mbfl_encoding_utf8);
+	/* This UTF-8 validation function is derived from PCRE2 */
+	size_t length = ZSTR_LEN(str);
+	/* Table of the number of extra bytes, indexed by the first byte masked with
+	   0x3f. The highest number for a valid UTF-8 first byte is in fact 0x3d. */
+	static const uint8_t utf8_table[] = {
+		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+		2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
+		3,3,3,3,3,3,3,3
+	};
+
+	for (; length > 0; p++) {
+		uint32_t d;
+		unsigned char c = *p;
+		length--;
+
+		if (c < 128) {
+			/* ASCII character */
+			continue;
+		}
+
+		if (c < 0xc0) {
+			/* Isolated 10xx xxxx byte */
+			return false;
+		}
+
+		if (c >= 0xf5) {
+			return false;
+		}
+
+		uint32_t ab = utf8_table[c & 0x3f]; /* Number of additional bytes (1-3) */
+		if (length < ab) {
+			/* Missing bytes */
+			return false;
+		}
+		length -= ab;
+
+		/* Check top bits in the second byte */
+		if (((d = *(++p)) & 0xc0) != 0x80) {
+			return false;
+		}
+
+		/* For each length, check that the remaining bytes start with the 0x80 bit
+		 * set and not the 0x40 bit. Then check for an overlong sequence, and for the
+		 * excluded range 0xd800 to 0xdfff. */
+		switch (ab) {
+		case 1:
+			/* 2-byte character. No further bytes to check for 0x80. Check first byte
+			 * for xx00 000x (overlong sequence). */
+			if ((c & 0x3e) == 0) {
+				return false;
+			}
+			break;
+
+		case 2:
+			/* 3-byte character. Check third byte for 0x80. Then check first 2 bytes for
+			 * 1110 0000, xx0x xxxx (overlong sequence) or 1110 1101, 1010 xxxx (0xd800-0xdfff) */
+			if ((*(++p) & 0xc0) != 0x80 || (c == 0xe0 && (d & 0x20) == 0) || (c == 0xed && d >= 0xa0)) {
+				return false;
+			}
+			break;
+
+		case 3:
+			/* 4-byte character. Check 3rd and 4th bytes for 0x80. Then check first 2
+			 * bytes for 1111 0000, xx00 xxxx (overlong sequence), then check for a
+			 * character greater than 0x0010ffff (f4 8f bf bf) */
+			if ((*(++p) & 0xc0) != 0x80 || (*(++p) & 0xc0) != 0x80 || (c == 0xf0 && (d & 0x30) == 0) || (c > 0xf4 || (c == 0xf4 && d > 0x8f))) {
+				return false;
+			}
+			break;
+
+		EMPTY_SWITCH_DEFAULT_CASE();
+		}
+	}
+
+	return true;
 # endif
 }
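
For readers who want to try the fallback logic outside of php-src, the following is a minimal standalone sketch of the same table-driven approach. It is an adaptation, not the committed code: it takes a raw pointer and length instead of a `zend_string`, and the names `utf8_is_valid`, `utf8_is_valid_cstr`, `extra_bytes`, and `utf8_self_check` are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical standalone adaptation of the non-SSE2 fallback shown in the diff. */
static bool utf8_is_valid(const unsigned char *p, size_t length)
{
	/* Number of continuation bytes, indexed by (lead byte & 0x3f).
	 * Covers lead bytes 0xc0..0xf7; anything above 0xf4 is rejected earlier. */
	static const uint8_t extra_bytes[] = {
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
		3,3,3,3,3,3,3,3
	};

	while (length > 0) {
		unsigned char c = *p++;
		length--;

		if (c < 0x80) {
			continue; /* ASCII */
		}
		if (c < 0xc0 || c > 0xf4) {
			return false; /* stray continuation byte, or lead byte beyond U+10FFFF */
		}

		size_t ab = extra_bytes[c & 0x3f];
		if (length < ab) {
			return false; /* truncated sequence */
		}
		length -= ab;

		unsigned char d = *p++; /* second byte, needed for range checks below */
		if ((d & 0xc0) != 0x80) {
			return false;
		}

		switch (ab) {
		case 1: /* 2-byte sequence: 0xc0 and 0xc1 are always overlong */
			if ((c & 0x3e) == 0) return false;
			break;
		case 2: /* 3-byte sequence: reject overlong forms and surrogates */
			if ((*p++ & 0xc0) != 0x80) return false;
			if (c == 0xe0 && (d & 0x20) == 0) return false;
			if (c == 0xed && d >= 0xa0) return false;
			break;
		case 3: /* 4-byte sequence: reject overlong forms and > U+10FFFF */
			if ((*p++ & 0xc0) != 0x80 || (*p++ & 0xc0) != 0x80) return false;
			if (c == 0xf0 && (d & 0x30) == 0) return false;
			if (c == 0xf4 && d > 0x8f) return false;
			break;
		}
	}
	return true;
}

/* Convenience wrapper for NUL-terminated test strings (hypothetical helper). */
static bool utf8_is_valid_cstr(const char *s)
{
	return utf8_is_valid((const unsigned char *)s, strlen(s));
}

/* A few spot checks covering each rejection branch. */
void utf8_self_check(void)
{
	assert(utf8_is_valid_cstr(""));
	assert(utf8_is_valid_cstr("plain ASCII"));
	assert(utf8_is_valid_cstr("\xc3\xa9"));          /* U+00E9 */
	assert(utf8_is_valid_cstr("\xe2\x82\xac"));      /* U+20AC */
	assert(utf8_is_valid_cstr("\xed\x9f\xbf"));      /* U+D7FF, just below surrogates */
	assert(utf8_is_valid_cstr("\xf4\x8f\xbf\xbf"));  /* U+10FFFF */
	assert(!utf8_is_valid_cstr("\x80"));             /* isolated continuation byte */
	assert(!utf8_is_valid_cstr("\xc0\x80"));         /* overlong 2-byte form */
	assert(!utf8_is_valid_cstr("\xe0\x80\x80"));     /* overlong 3-byte form */
	assert(!utf8_is_valid_cstr("\xed\xa0\x80"));     /* surrogate U+D800 */
	assert(!utf8_is_valid_cstr("\xf4\x90\x80\x80")); /* above U+10FFFF */
	assert(!utf8_is_valid_cstr("\xe2\x82"));         /* truncated 3-byte sequence */
}
```

Calling `utf8_self_check()` from a `main` function exercises each branch. The sketch checks the remaining length before reading continuation bytes, so unlike the SSE2 path it does not rely on the terminating null byte that every `zend_string` carries.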