fix: support 4-byte UTF-8 characters (Emoji) in object keys, for issue #3928 #3929
base: main
Pull request overview
This PR fixes issue #3928 by adding support for 4-byte UTF-8 characters (emojis) in JSON object keys. Previously, when parsing JSON with emoji characters in field names, the parser would throw a JSONException for malformed input.
Key Changes
- Added case 15 handling in the getFieldName() method to decode 4-byte UTF-8 sequences into surrogate pairs
- Modified character array resizing logic to accommodate surrogate pairs
- Added comprehensive test coverage for emoji characters in object keys with various edge cases
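As a rough illustration of the first two changes, the sketch below decodes one 4-byte UTF-8 sequence into a code point and writes it out as a surrogate pair. The class and method names are hypothetical and are not taken from fastjson2. It also assumes the "case 15" label refers to the high nibble of the lead byte (0xF0 >> 4 == 15), which the PR does not state explicitly.

```java
class FourByteUtf8Sketch {
    // Hypothetical helper (not fastjson2 code): combines the payload bits of a
    // 4-byte UTF-8 sequence starting at offset into a single code point.
    static int decodeCodePoint(byte[] bytes, int offset) {
        int ch = bytes[offset] & 0xFF;      // lead byte, bit pattern 11110xxx
        int b1 = bytes[offset + 1] & 0xFF;  // continuation bytes, 10xxxxxx
        int b2 = bytes[offset + 2] & 0xFF;
        int b3 = bytes[offset + 3] & 0xFF;
        return ((ch & 0x07) << 18)
                | ((b1 & 0x3F) << 12)
                | ((b2 & 0x3F) << 6)
                | (b3 & 0x3F);
    }

    // Writes the UTF-16 surrogate pair for a supplementary code point into dst
    // and returns the next write position. Two chars are produced, which is why
    // the destination array needs room for one extra char per emoji.
    static int writeSurrogatePair(int codePoint, char[] dst, int pos) {
        dst[pos] = Character.highSurrogate(codePoint);
        dst[pos + 1] = Character.lowSurrogate(codePoint);
        return pos + 2;
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+1F607, the emoji used in the PR's test
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x87};
        System.out.println("lead nibble: " + ((utf8[0] & 0xFF) >> 4));

        int codePoint = decodeCodePoint(utf8, 0);
        char[] chars = new char[2];
        int next = writeSurrogatePair(codePoint, chars, 0);
        System.out.printf("U+%X -> %04X %04X (next=%d)%n",
                codePoint, (int) chars[0], (int) chars[1], next);
    }
}
```

The resulting pair matches the `"\uD83D\uDE07"` literal used in the PR's test case.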
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| core/src/main/java/com/alibaba/fastjson2/JSONReaderUTF8.java | Implements 4-byte UTF-8 decoding (case 15) in getFieldName method, converting to surrogate pairs and adjusting array handling |
| test-jdk17/src/test/java/com/alibaba/fastjson2/issues/Issue3928.java | Adds test cases covering emoji with backslashes, escaped unicode, mixed content, and double backslashes in object keys |
    int codePoint = ((ch & 0x07) << 18)
            | ((b1 & 0x3F) << 12)
            | ((b2 & 0x3F) << 6)
            | (b3 & 0x3F);
Copilot AI, Jan 2, 2026
The 4-byte UTF-8 decoding implementation is missing validation for the valid supplementary code point range. The decoded code point should be validated to ensure it falls within the valid range (0x10000 to 0x10FFFF). This validation is present in similar 4-byte UTF-8 handling code elsewhere in this file (see lines 2782-2787). Without this check, invalid UTF-8 sequences could be incorrectly decoded without throwing an exception.
Suggested change:

            | (b3 & 0x3F);
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new JSONException("malformed input around byte " + offset);
    }
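To make the effect of the suggested range check concrete, here is a minimal standalone sketch. It is not the fastjson2 implementation: the class and method names are hypothetical, and IllegalArgumentException stands in for JSONException so the example has no dependencies. It shows two byte sequences that have a 4-byte lead pattern but decode outside 0x10000 to 0x10FFFF.

```java
class CodePointRangeSketch {
    // Hypothetical helper: decodes four bytes and applies the range check
    // proposed in the review comment.
    static int decodeChecked(byte[] bytes, int offset) {
        int codePoint = ((bytes[offset] & 0x07) << 18)
                | ((bytes[offset + 1] & 0x3F) << 12)
                | ((bytes[offset + 2] & 0x3F) << 6)
                | (bytes[offset + 3] & 0x3F);
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            // Stand-in for the JSONException thrown in the suggested change
            throw new IllegalArgumentException("malformed input around byte " + offset);
        }
        return codePoint;
    }

    // Returns true when decodeChecked rejects the given sequence.
    static boolean isRejected(byte[] bytes) {
        try {
            decodeChecked(bytes, 0);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Overlong encoding: the payload bits are all zero, so the value is below 0x10000
        byte[] overlong = {(byte) 0xF0, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        // Decodes to 0x110000, one past the last valid code point
        byte[] tooLarge = {(byte) 0xF4, (byte) 0x90, (byte) 0x80, (byte) 0x80};
        // Valid: U+1F607
        byte[] emoji = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x87};

        System.out.println("overlong rejected: " + isRejected(overlong));
        System.out.println("too large rejected: " + isRejected(tooLarge));
        System.out.println("emoji rejected: " + isRejected(emoji));
    }
}
```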
    int b1 = bytes[offset + 1] & 0xFF;
    int b2 = bytes[offset + 2] & 0xFF;
    int b3 = bytes[offset + 3] & 0xFF;
Copilot AI, Jan 2, 2026
Missing bounds check before accessing bytes array at offset + 1, offset + 2, and offset + 3. If the byte array ends prematurely (i.e., offset + 3 >= bytes.length), this will cause an ArrayIndexOutOfBoundsException. Consider adding a check to ensure offset + 3 < bytes.length or offset + 3 < nameEnd before accessing these bytes.
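One way the suggested guard could look is sketched below, under the assumption that an exclusive `end` index (the array length or something like nameEnd) is available at the call site. The class, the method names, and the use of IllegalArgumentException in place of JSONException are all illustrative, not fastjson2 code.

```java
class TruncatedInputSketch {
    // Hypothetical helper: refuses to read a 4-byte sequence that would run
    // past end (exclusive), instead of letting the array access fail.
    static int decodeWithBoundsCheck(byte[] bytes, int offset, int end) {
        if (offset + 3 >= end) {
            // Stand-in for JSONException; reports truncated input explicitly
            throw new IllegalArgumentException("malformed input around byte " + offset);
        }
        int ch = bytes[offset] & 0xFF;
        int b1 = bytes[offset + 1] & 0xFF;
        int b2 = bytes[offset + 2] & 0xFF;
        int b3 = bytes[offset + 3] & 0xFF;
        return ((ch & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    // Describes what happens for a given input: "ok" or the exception type.
    static String outcome(byte[] bytes) {
        try {
            decodeWithBoundsCheck(bytes, 0, bytes.length);
            return "ok";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] complete = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x87};
        byte[] truncated = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98}; // last byte missing
        System.out.println("complete: " + outcome(complete));
        System.out.println("truncated: " + outcome(truncated));
    }
}
```

With the guard in place, truncated input surfaces as the parser's own "malformed input" error rather than an ArrayIndexOutOfBoundsException.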
    @Test
    public void testIssue() {
        Map<String, String> map = Map.of("\uD83D\uDE07\\", "");

        String jsonString = JSON.toJSONString(map);
        assertEquals(map, JSON.parseObject(jsonString, Map.class));

        byte[] jsonBytes = JSON.toJSONBytes(map, StandardCharsets.UTF_8);
        assertEquals(map, JSON.parseObject(jsonBytes, Map.class));
    }
Copilot AI, Jan 2, 2026
The test coverage is missing for the readFieldNameHashCode method with 4-byte UTF-8 characters in field names. The readFieldNameHashCode method at lines 2572-2590 does not handle case 15 (4-byte UTF-8 sequences) and will throw a JSONException when encountering emoji or other 4-byte UTF-8 characters in field names. This is the same issue that was fixed in getFieldName. Consider adding a test case that exercises readFieldNameHashCode with emoji characters in field names to ensure this method also handles 4-byte UTF-8 correctly.