Optimize String length computation. #1685

vbabanin · 2025-04-24T23:31:41Z

JAVA-5842

jyemin

Overall looks good, just a few minor change requests.

bson/src/main/org/bson/io/ByteBufferBsonInput.java

NathanQingyangXu · 2025-04-28T18:41:08Z

bson/src/main/org/bson/io/ByteBufferBsonInput.java

+            pos += 8;
+        }
+
+        // Process remaining bytes one-by-one.


It seems "one-by-one" is more commonly written as "one by one".
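For context, the loop under discussion processes 8 bytes per iteration and then falls back to a byte-by-byte scan for the tail. The following is a minimal sketch of that shape, not the driver's actual code: the class name `CStringLengthSketch`, the method `cStringLength`, the return value of -1 for a missing terminator, and the exact mask expression (including the `& ~word` step) are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch; names and error handling are assumptions, not the PR's code.
final class CStringLengthSketch {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    /**
     * Returns the C-string length including the terminating 0x00,
     * counted from the buffer's position, or -1 if no terminator is found.
     */
    static int cStringLength(ByteBuffer source) {
        // Use a little-endian view so the first buffer byte lands in bits 0-7 of the long.
        ByteBuffer buffer = source.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int start = buffer.position();
        int limit = buffer.limit();
        int pos = start;

        // Process 8 bytes at a time.
        while (pos + Long.BYTES <= limit) {
            long word = buffer.getLong(pos);
            long mask = (word - ONES) & ~word & HIGH_BITS;
            if (mask != 0) {
                // Lowest set bit marks the first 0x00 byte; divide the bit index by 8.
                int offset = Long.numberOfTrailingZeros(mask) >>> 3;
                return (pos - start) + offset + 1;
            }
            pos += 8;
        }

        // Process remaining bytes one by one.
        while (pos < limit) {
            if (buffer.get(pos++) == 0) {
                return pos - start;
            }
        }
        return -1;
    }
}
```

The tail loop is needed because `getLong` would read past the limit when fewer than 8 bytes remain.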

NathanQingyangXu · 2025-04-28T18:57:36Z

bson/src/main/org/bson/io/ByteBufferBsonInput.java

+            long word = buffer.getLong(pos);
+            /*
+              Subtract 0x0101010101010101L to cause a borrow on 0x00 bytes.
+              if original byte is 00000000 -> 00000000 - 00000001 = 11111111 (borrow causes high bit set to 1).


Optional. A little bit less clear than

if original byte is 00000000, then 00000000 - 00000001 = 11111111 (borrow causes high bit set to 1).
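The borrow described in that comment can be observed directly. This is a small sketch under assumed names (`BorrowDemo`, `byteAfterSubtract` are hypothetical): it subtracts 0x0101010101010101L from a word and extracts a single byte of the result.

```java
// Hypothetical helper for inspecting individual bytes after the subtraction.
final class BorrowDemo {
    private static final long ONES = 0x0101010101010101L;

    /** Returns byte number byteIndex (0 = least significant) of word - ONES, as 0..255. */
    static int byteAfterSubtract(long word, int byteIndex) {
        long diff = word - ONES;
        return (int) ((diff >>> (byteIndex * 8)) & 0xFF);
    }

    public static void main(String[] args) {
        // Byte 1 is 0x00; every other byte is 'A' (0x41).
        long word = 0x4141414141410041L;
        for (int i = 0; i < Long.BYTES; i++) {
            System.out.printf("byte %d: 0x%02X%n", i, byteAfterSubtract(word, i));
        }
    }
}
```

In this word, the 0x00 byte becomes 0xFF, and the byte directly above it loses one extra unit (0x41 becomes 0x3F instead of 0x40) because it absorbs the borrow.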

NathanQingyangXu · 2025-04-28T18:58:38Z

bson/src/main/org/bson/io/ByteBufferBsonInput.java

+               result:
+               00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000
+                                 ^^^^^^^^
+                          The high bit is set only at the 0x00 byte position.


Optional. The above comment seems duplicated and could be deleted.

NathanQingyangXu · 2025-04-28T18:59:35Z

bson/src/main/org/bson/io/ByteBufferBsonInput.java

+               00000000 00000000 11111111 00000000 00000001 00000001 00000000 00000111
+
+               ANDing mask with 0x8080808080808080 isolates the high bit (0x80) in positions where
+               the original byte was 0x00, setting the high bit to 1 only at the 0x00 byte position.


Optional. Maybe the following wording is a little bit clearer:

, by setting the high bit to 1 only at the 0x00 byte position.
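The three steps that the code comment walks through can be sketched as follows. The class name `ZeroByteMaskDemo` and the method `zeroByteMask` are hypothetical, and the middle `& ~word` step is an assumption based on the common form of this technique; it is not quoted in the excerpts above.

```java
// Hypothetical sketch of the mask computation described in the code comments.
final class ZeroByteMaskDemo {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    static long zeroByteMask(long word) {
        long mask = word - ONES; // 0x00 bytes borrow and become 0xFF
        mask &= ~word;           // assumption: drop bits that were already 1 in the original
        mask &= HIGH_BITS;       // keep only the high bit (0x80) of each byte
        return mask;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(zeroByteMask(0x4141414141004141L)));
        System.out.println(Long.toHexString(zeroByteMask(0x4141414141414141L)));
    }
}
```

The `& ~word` step is what keeps bytes such as 0xC3 or 0xA9 (common in multi-byte UTF-8) from being flagged: their high bit is already set in the original, so it is cleared from the mask.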

Co-authored-by: Jeff Yemin <[email protected]>

JAVA-5842

jyemin

LGTM!

JAVA-5842

stIncMale

I still have not internalized the technique, and, therefore, can't check that it's correct (but I don't think spending any more time on the review is warranted).

The code does look like https://jameshfisher.com/2017/01/24/bitwise-check-for-zero-byte/ on its surface. However, based on both Wikipedia and the above link, it may matter whether little- or big-endian byte order is used (ByteBufferBsonInput explicitly uses the little-endian byte order). Though when I tried playing with the big-endian byte order, the code seemed to still work.

The last reviewed commit is 55ba0b1.
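One part of the observation above is easy to check in isolation: whether a word contains any 0x00 byte does not depend on the byte order used to read it. The sketch below uses hypothetical names (`ByteOrderDetectionDemo`, `containsZeroByte`) and assumes the usual `(word - ONES) & ~word & HIGH_BITS` form of the mask. It says nothing about where the zero byte is, only whether one exists.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: existence check only, independent of byte order.
final class ByteOrderDetectionDemo {
    static long zeroByteMask(long word) {
        return (word - 0x0101010101010101L) & ~word & 0x8080808080808080L;
    }

    /** Reads exactly 8 bytes in the given order and reports whether any of them is 0x00. */
    static boolean containsZeroByte(byte[] eightBytes, ByteOrder order) {
        long word = ByteBuffer.wrap(eightBytes).order(order).getLong(0);
        return zeroByteMask(word) != 0;
    }

    public static void main(String[] args) {
        byte[] withZero = {'a', 'b', 0, 'c', 'd', 'e', 'f', 'g'};
        System.out.println(containsZeroByte(withZero, ByteOrder.LITTLE_ENDIAN));
        System.out.println(containsZeroByte(withZero, ByteOrder.BIG_ENDIAN));
    }
}
```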

bson/src/main/org/bson/io/ByteBufferBsonInput.java

Co-authored-by: Valentin Kovalenko <[email protected]>

JAVA-5842

stIncMale

Left an optional suggestion, feel free to ignore.

The last reviewed commit is 364a810.

NathanQingyangXu · 2025-04-30T13:22:39Z

bson/src/main/org/bson/io/ByteBufferBsonInput.java

+            mask &= 0x8080808080808080L;
+            if (mask != 0) {
+                /*
+                 *  Performing >>> 3 (i.e., dividing by 8) gives the byte offset from the LSB.


LSB is used to denote a bit in a byte, but here we are computing the offset in terms of bytes, so

... gives the byte offset.

is enough and less confusing.

I see the confusion. What I meant is that the UTF-8 bytes are ordered from right to left within the long (after getLong() in little-endian), so we effectively scan 'from the LSb' - the offset starts at the least significant bit.

Would this revised comment be clearer? A bit more detailed.

The UTF-8 data is endian-independent and stored in left-to-right order in the buffer, with the first byte at the lowest index. After calling getLong() in little-endian mode, the first UTF-8 byte ends up in the least significant byte of the long (bits 0–7), and the last one in the most significant byte (bits 56–63).

numberOfTrailingZeros scans from the least significant bit (LSb), which aligns with the position of the first UTF-8 byte. We then use >>> 3, which means dividing without remainder by Long.BYTES because Long.BYTES is 2^3, computing the byte offset of the NULL terminator in the original UTF-8 data.

Let me know what you think. @stIncMale @NathanQingyangXu
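The relationship between buffer index, bit position, and `>>> 3` described above can be sketched like this. The names `TrailingZerosOffsetDemo` and `firstZeroByteOffset` are hypothetical, and the mask expression is the assumed common form rather than a quote from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: maps the lowest set mask bit back to a buffer index.
final class TrailingZerosOffsetDemo {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    /** Returns the index (0..7) of the first 0x00 among exactly 8 bytes, or -1 if there is none. */
    static int firstZeroByteOffset(byte[] eightBytes) {
        // Little-endian: buffer index i occupies bits 8*i .. 8*i+7 of the long.
        long word = ByteBuffer.wrap(eightBytes).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
        long mask = (word - ONES) & ~word & HIGH_BITS;
        if (mask == 0) {
            return -1;
        }
        // The flagged bit for index i is bit 8*i+7, so an integer division by 8 yields i.
        return Long.numberOfTrailingZeros(mask) >>> 3;
    }

    public static void main(String[] args) {
        System.out.println(firstZeroByteOffset(new byte[]{'h', 'i', 0, 'x', 'x', 'x', 'x', 'x'}));
    }
}
```

For a zero byte at buffer index 2, the flagged bit is bit 23, and 23 >>> 3 is 2.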

LSB is used to denote the bit in a byte

LSB may mean both least significant byte and bit, which is why I advocated for using MSb in #1685 (comment). The phrase

Performing >>> 3 (i.e., dividing by 8) gives the byte offset from the LSB.

Is correct and clear, assuming that LSB means "least significant byte".

The new text is also good, and explains things in more detail.

It is confusing to me that we used MSB to mean a bit, but here LSB means a byte. We don't need to emphasize "byte", since the offset is definitely in terms of bytes.

I’ll update the terminology as suggested by @stIncMale: removed abbreviations for better clarity of bit/byte.

If there are no objections, I’ll also add a detailed comment, suggested above, explaining why we use trailingZeros instead of leadingZeros.

JAVA-5842

vbabanin · 2025-04-30T21:10:09Z

I’ve re-requested review since previous approvals became stale after recent changes.

bson/src/main/org/bson/io/ByteBufferBsonInput.java

stIncMale · 2025-04-30T21:52:49Z

bson/src/main/org/bson/io/ByteBufferBsonInput.java

+               result:
+               00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000
+                                 ^^^^^^^^
+               The most significant bit is set only at the 0x00 byte position.


"most significant bit" may refer to the most significant bit in a byte, or in a long (the chunk), and without clarification the text is still unclear. Note also how you used "significant byte of the long" below.

Suggested change

- The most significant bit is set only at the 0x00 byte position.

+ The most significant bit is set in each 0x00 byte, and only there.

Btw, I am not so sure that the "only" part is true. I suspect, after the first (left-most, closest to the least significant byte) 0x00 byte, bits may be set on other bytes, but we don't care about that, as we are looking only for the left-most 0x00 byte. That's why (again, I suspect) Wikipedia mentions that the technique is applicable only if "the goal is limited to finding the first zero byte on a little-endian processor".

Interesting, the claim on Wikipedia about "little-endian only" is misleading (maybe a documentation error or just not complete). The algorithm, before numberOfTrailingZeros, works on both systems (little or big endian), as the zero-byte detection relies on bitwise operations that are endian-agnostic. The only place where byte endianness matters is in numberOfTrailingZeros or numberOfLeadingZeros to correctly identify the offset, but Wikipedia does not seem to mention that.

We should probably remove Wikipedia link from Javadoc.

bits may be set on other bytes, but we don't care about that, as we are looking only for the left-most 0x00 byte.

Yes, if the original long contains multiple 0x00 bytes, the MSb will be set for each of those 0x00 bytes.
Here’s a relevant test case for multiple null terminators:
https://github.com/vbabanin/mongo-java-driver/blob/144e287d38697b7af31186cdd815f62735e580b2/driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonInputTest.java#L665

P.S. Applied the commit, thanks!
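The multiple-terminator case can also be illustrated outside the test suite. The sketch below is not the linked test; `MultipleZeroBytesDemo` and its methods are hypothetical, and the mask expression is the assumed common form. It builds a word with two 0x00 bytes and checks which bits are flagged.

```java
// Hypothetical sketch: a word with two 0x00 bytes, read as a little-endian long.
final class MultipleZeroBytesDemo {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    static long zeroByteMask(long word) {
        return (word - ONES) & ~word & HIGH_BITS;
    }

    /** Byte offset of the lowest flagged byte; the caller must ensure the mask is non-zero. */
    static int firstOffset(long mask) {
        return Long.numberOfTrailingZeros(mask) >>> 3;
    }

    public static void main(String[] args) {
        // Bytes 1 and 5 are 0x00; all others are 'A' (0x41).
        long mask = zeroByteMask(0x4141004141410041L);
        System.out.println(Long.toHexString(mask));
        System.out.println(firstOffset(mask));
    }
}
```

Both zero bytes are flagged here, and `numberOfTrailingZeros` selects the one closest to the least significant byte, which is the first terminator in a little-endian read.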

the zero-byte detection relies on bitwise operations that are endian-agnostic

I am not sure that's true, as chunk - 0x0101010101010101L is not a bitwise operation, and we actually treat it as if it's a bytewise operation. If it actually can be treated this way in our circumstances, then it also, obviously, does not depend on the byte order, but the conditions under which it may be treated as a bytewise operation are what matter here. https://jameshfisher.com/2017/01/24/bitwise-check-for-zero-byte/, for example, mentions:

But why were we able to treat the subtraction as byte-wise subtraction? The subtraction algorithm doesn’t work like that! Well, it does work like this for our particular case where there are no zero bytes and we are subtracting 1. It is only when doing 00000000 - 00000001 that the carry bit will be set when crossing the byte boundary.

As I said in #1685 (review), I don't really understand what's going on (for that, I'd have to spend much more time playing with the code and observing intermediate results, and also learn how subtraction works), but it is not impossible that "our particular case where there are no zero bytes and we are subtracting 1" holds only because we are looking for the left-most zero byte in a long, i.e., there are no other zero bytes before it when the - operation is being executed, and so the carry bits (I don't actually know what they are) from them do not exist and can't affect the outcome. The zero byte we need is going to be the left-most in a long only if the byte order is little endian, which is where the dependency on the byte order may be coming from.
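The suspicion about borrows can be tested with a small experiment. This is an editor's sketch, not code from the PR: `BorrowFalsePositiveDemo` and its methods are hypothetical, and the mask is assumed to be `(word - ONES) & ~word & HIGH_BITS`. A byte equal to 0x01 that sits directly above a 0x00 byte (one position more significant) absorbs the borrow and is flagged as well. Bytes below the first real 0x00 are never affected, so the lowest flagged byte is always a true zero byte.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: shows a borrow-induced extra flag and its effect per byte order.
final class BorrowFalsePositiveDemo {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    static long mask(byte[] eightBytes, ByteOrder order) {
        long word = ByteBuffer.wrap(eightBytes).order(order).getLong(0);
        return (word - ONES) & ~word & HIGH_BITS;
    }

    /** Little-endian read, lowest flagged byte: buffer index of the first 0x00. */
    static int littleEndianOffset(byte[] eightBytes) {
        return Long.numberOfTrailingZeros(mask(eightBytes, ByteOrder.LITTLE_ENDIAN)) >>> 3;
    }

    /** Big-endian read, highest flagged byte: can be wrong when 0x01 precedes 0x00. */
    static int bigEndianLeadingOffset(byte[] eightBytes) {
        return Long.numberOfLeadingZeros(mask(eightBytes, ByteOrder.BIG_ENDIAN)) >>> 3;
    }

    public static void main(String[] args) {
        byte[] zeroThenOne = {0x41, 0x00, 0x01, 0x41, 0x41, 0x41, 0x41, 0x41};
        byte[] oneThenZero = {0x41, 0x01, 0x00, 0x41, 0x41, 0x41, 0x41, 0x41};
        System.out.println(Long.toHexString(mask(zeroThenOne, ByteOrder.LITTLE_ENDIAN)));
        System.out.println(littleEndianOffset(zeroThenOne));
        System.out.println(littleEndianOffset(oneThenZero));
        System.out.println(bigEndianLeadingOffset(oneThenZero));
    }
}
```

In a little-endian read the extra flag always lands after the real terminator in buffer order, so `numberOfTrailingZeros` is unaffected. In a big-endian read with `numberOfLeadingZeros`, the extra flag lands before the terminator in buffer order, which is consistent with the caveat about finding the first zero byte on little-endian data.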

Co-authored-by: Valentin Kovalenko <[email protected]>

stIncMale · 2025-05-01T00:46:13Z

Approved. But I don't know if the tests fail because main is broken, or because of the changes in this PR.

The last reviewed commit is 176db5b.

jyemin

LGTM. I like the updated comments.

Add SWAR optimization for computing CString length.

85aae59

JAVA-5842

vbabanin assigned vbabanin and NathanQingyangXu Apr 25, 2025

vbabanin requested a review from jyemin April 28, 2025 17:09

vbabanin unassigned NathanQingyangXu Apr 28, 2025

vbabanin requested a review from NathanQingyangXu April 28, 2025 17:10

Change Javadoc.

ad83abc

JAVA-5842

jyemin requested changes Apr 28, 2025

NathanQingyangXu approved these changes Apr 28, 2025

vbabanin and others added 3 commits April 28, 2025 12:21

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

b1ed69f

Co-authored-by: Jeff Yemin <[email protected]>

Change code comments.

4ce70a2

JAVA-5842

Change code comments.

022f5c1

JAVA-5842

vbabanin requested a review from jyemin April 28, 2025 19:39

Merge branch 'main' into JAVA-5842

674c155

jyemin approved these changes Apr 28, 2025

katcharov requested a review from stIncMale April 28, 2025 20:49

Change 'by' to 'thereby' for correct implication.

55ba0b1

JAVA-5842

NathanQingyangXu approved these changes Apr 29, 2025

stIncMale requested changes Apr 29, 2025

vbabanin and others added 4 commits April 29, 2025 17:10

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

b8ccdfa

Co-authored-by: Valentin Kovalenko <[email protected]>

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

67e2bae

Co-authored-by: Valentin Kovalenko <[email protected]>

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

1bc094c

Co-authored-by: Valentin Kovalenko <[email protected]>

Rename variables and change loop structure.

7be6593

JAVA-5842

vbabanin requested a review from stIncMale April 30, 2025 01:52

Merge branch 'main' into JAVA-5842

364a810

stIncMale approved these changes Apr 30, 2025

NathanQingyangXu reviewed Apr 30, 2025

Clarify most significant bit and add detailed comment.

5461983

JAVA-5842

vbabanin requested review from NathanQingyangXu and stIncMale April 30, 2025 21:00

vbabanin requested a review from jyemin April 30, 2025 21:00

Remove abbreviations.

8bf0ab1

JAVA-5842

stIncMale requested changes Apr 30, 2025

vbabanin and others added 4 commits April 30, 2025 15:16

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

a8e8334

Co-authored-by: Valentin Kovalenko <[email protected]>

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

8a5ce2d

Co-authored-by: Valentin Kovalenko <[email protected]>

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

149f134

Co-authored-by: Valentin Kovalenko <[email protected]>

Update bson/src/main/org/bson/io/ByteBufferBsonInput.java

176db5b

Co-authored-by: Valentin Kovalenko <[email protected]>

vbabanin requested a review from stIncMale April 30, 2025 23:16

stIncMale approved these changes May 1, 2025

jyemin approved these changes May 1, 2025

Merge branch 'main' into JAVA-5842

2006773

NathanQingyangXu approved these changes May 2, 2025

vbabanin merged commit 290719d into mongodb:main May 2, 2025
237 of 245 checks passed
