Skip to content

Optimize String length computation. #1685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
May 2, 2025
Merged

Conversation

vbabanin
Copy link
Member

JAVA-5842
Copy link
Collaborator

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, just a few minor changes requests.

pos += 8;
}

// Process remaining bytes one-by-one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems one-by-one is more commonly used as one by one

long word = buffer.getLong(pos);
/*
Subtract 0x0101010101010101L to cause a borrow on 0x00 bytes.
if original byte is 00000000 -> 00000000 - 00000001 = 11111111 (borrow causes high bit set to 1).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional. A little bit less clear than

if original byte is 00000000, then 00000000 - 00000001 = 11111111 (borrow causes high bit set to 1).

result:
00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000
^^^^^^^^
The high bit is set only at the 0x00 byte position.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional. the above comment seems duplicated and could be deleted

00000000 00000000 11111111 00000000 00000001 00000001 00000000 00000111

ANDing mask with 0x8080808080808080 isolates the high bit (0x80) in positions where
the original byte was 0x00, setting the high bit to 1 only at the 0x00 byte position.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional. maybe the following verbiage is little bit clearer:

, by setting the high bit to 1 only at the 0x00 byte position.

@vbabanin vbabanin requested a review from jyemin April 28, 2025 19:39
Copy link
Collaborator

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@katcharov katcharov requested a review from stIncMale April 28, 2025 20:49
Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have not internalized the technique, and, therefore, can't check that it's correct (but I don't think spending any more time on the review is warranted).

The code does look like https://jameshfisher.com/2017/01/24/bitwise-check-for-zero-byte/ on its surface. However, based on both Wikipedia and the above link, it may matter whether little- or big-endian byte order is used (ByteBufferBsonInput explicitly uses the little-endian byte order). Though when I tried playing with the big-endian byte order, the code seemed to still work.

The last reviewed commit is 55ba0b1.

@vbabanin vbabanin requested a review from stIncMale April 30, 2025 01:52
Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left an optional suggestion, feel free to ignore.

The last reviewed commit is 364a810.

mask &= 0x8080808080808080L;
if (mask != 0) {
/*
* Performing >>> 3 (i.e., dividing by 8) gives the byte offset from the LSB.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LSB is used to denote the bit in a byte; but here we are computing the offset in terms of byte, so

... gives the byte offset.

is enough and less confusing.

Copy link
Member Author

@vbabanin vbabanin Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the confusion. What It meant is that the UTF-8 bytes are ordered from right to left within the Long (after getLong() in little-endian), so we effectively scan 'from the LSb' - the offset starts at the least significant bit.

Would this revised comment be clearer? A bit more detailed.

The UTF-8 data is endian-independent and stored in left-to-right order in the buffer, with the first byte at the lowest index. After calling getLong() in little-endian mode, the first UTF-8 byte ends up in the least significant byte of the long (bits 0–7), and the last one in the most significant byte (bits 56–63).

numberOfTrailingZeros scans from the least significant bit (LSb), which aligns with the position of the first UTF-8 byte. We then use >>> 3, which means dividing without remainder by Long.BYTES because Long.BYTES is 2^3, computing the byte offset of the NULL terminator in the original UTF-8 data.

Let me know what you think. @stIncMale @NathanQingyangXu

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LSB is used to denote the bit in a byte

LSB may mean both least significant byte and bit, which is why I advocated for using MSb in #1685 (comment). The phrase

Performing >>> 3 (i.e., dividing by 8) gives the byte offset from the LSB.

Is correct and clear, assuming that LSB means "least significant byte".

The new next is also good, and explains things in more details.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is confusing to me for MSB we used to mean bit, but here LSB means byte. We don't need to emphasize byte for the offset is definitively in terms of bytes.

Copy link
Member Author

@vbabanin vbabanin Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll update the terminology as suggested by @stIncMale : ̶L̶S̶b̶ ̶f̶o̶r̶ ̶l̶e̶a̶s̶t̶ ̶s̶i̶g̶n̶i̶f̶i̶c̶a̶n̶t̶ ̶b̶i̶t̶ ̶a̶n̶d̶ ̶L̶S̶B̶ ̶f̶o̶r̶ ̶l̶e̶a̶s̶t̶ ̶s̶i̶g̶n̶i̶f̶i̶c̶a̶n̶t̶ ̶b̶y̶t̶e̶ ̶f̶o̶r̶ ̶b̶e̶t̶t̶e̶r̶ ̶c̶l̶a̶r̶i̶f̶y̶.̶ ̶ removed abbreviations for better clarity of bit/byte.

If there are no objections, I’ll also add a detailed comment, suggested above, explaining why we use trailingZeros instead of leadingZeros.

@vbabanin vbabanin requested a review from jyemin April 30, 2025 21:00
@vbabanin
Copy link
Member Author

I’ve re-requested review since previous approvals became stale after recent changes.

result:
00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000
^^^^^^^^
The most significant bit is set only at the 0x00 byte position.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"most significant bit" may refer to the most significant bit in a byte, or in a long (the chunk), and without clarified the text is still unclear. Note also how you used "significant byte of the long" below.

Suggested change
The most significant bit is set only at the 0x00 byte position.
The most significant bit is set in each 0x00 byte, and only there.

Copy link
Member

@stIncMale stIncMale Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I am not so sure that the "only" part is true. I suspect, after the first (left-most, closest to the least significant byte) 0x00 byte, bits may be set on other bytes, but we don't care about that, as we are looking only for the left-most 0x00 byte. That's why (again, I suspect) Wikipedia mentions that the technique is applicable only if "the goal is limited to finding the first zero byte on a little-endian processor".

Copy link
Member Author

@vbabanin vbabanin Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, the claim on Wikipedia about "little-endian only" is misleading (maybe a documentation error or just not complete). The algorithm, before numberOfTrailingZeros, works on both systems (little or big endian), as the zero-byte detection relies on bitwise operations that are endian-agnostic. The only place where byte endianness matters is in numberOfTrailingZeros or numberOfLeadingZeros to correctly identify the offset, but Wikipedia does not seem to mention that.

We should probably remove Wikipedia link from Javadoc.

bits may be set on other bytes, but we don't care about that, as we are looking only for the left-most 0x00 byte.

Yes, if the original long contains multiple 0x00 bytes, the MSb will be set for each of those 0x00 bytes.
Here’s a relevant test case for multiple null terminators:
https://github.com/vbabanin/mongo-java-driver/blob/144e287d38697b7af31186cdd815f62735e580b2/driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonInputTest.java#L665

P.S Applied commit, thanks!

Copy link
Member

@stIncMale stIncMale May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the zero-byte detection relies on bitwise operations that are endian-agnostic

I am not sure that's true, as chunk - 0x0101010101010101L is not a bitwise operation, and we actually treat it as if it's a bytewise operation. If it actually can be treated this way in our circumstances, then it also, obviously, does not depend on the byte order, but the conditions under which it may be treated as a bytewise operation is what matters here. https://jameshfisher.com/2017/01/24/bitwise-check-for-zero-byte/, for example, mentions:

But why were we able to treat the subtraction as byte-wise subtraction? The subtraction algorithm doesn’t work like that! Well, it does work like this for our particular case where there are no zero bytes and we are subtracting 1. It is only when doing 00000000 - 00000001 that the carry bit will be set when crossing the byte boundary.

As I said in #1685 (review), I don't really understand what's going on (for that, I'd have to spend much more time playing with the code and observing intermediate results, and also learn how subtraction works), but it is not impossible that "our particular case where there are no zero bytes and we are subtracting 1" holds only because we are looking for the left-most zero byte in a long, i.e., there are no other zero bytes before it when the - operation is being executed, and so the carry bits (I don't actually know what they are) from them do not exist and can't affect the outcome. The zero byte we need is going to be the left-most in a long only if the byte order is little endian, which is where the dependency on the byte order may be coming from.

@vbabanin vbabanin requested a review from stIncMale April 30, 2025 23:16
@stIncMale
Copy link
Member

Approved. But I don't know if the tests fail because main is broken, or because of the changes in this PR.

The last reviewed commit is 176db5b.

Copy link
Collaborator

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like the updated comments.

@vbabanin vbabanin merged commit 290719d into mongodb:main May 2, 2025
237 of 245 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants