-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Optimize String length computation. #1685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
JAVA-5842
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, just a few minor changes requests.
pos += 8; | ||
} | ||
|
||
// Process remaining bytes one-by-one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems one-by-one
is more commonly used as one by one
long word = buffer.getLong(pos); | ||
/* | ||
Subtract 0x0101010101010101L to cause a borrow on 0x00 bytes. | ||
if original byte is 00000000 -> 00000000 - 00000001 = 11111111 (borrow causes high bit set to 1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional. A little bit less clear than
if original byte is 00000000, then 00000000 - 00000001 = 11111111 (borrow causes high bit set to 1).
result: | ||
00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000 | ||
^^^^^^^^ | ||
The high bit is set only at the 0x00 byte position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional. the above comment seems duplicated and could be deleted
00000000 00000000 11111111 00000000 00000001 00000001 00000000 00000111 | ||
|
||
ANDing mask with 0x8080808080808080 isolates the high bit (0x80) in positions where | ||
the original byte was 0x00, setting the high bit to 1 only at the 0x00 byte position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional. maybe the following verbiage is little bit clearer:
, by setting the high bit to 1 only at the 0x00 byte position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have not internalized the technique, and, therefore, can't check that it's correct (but I don't think spending any more time on the review is warranted).
The code does look like https://jameshfisher.com/2017/01/24/bitwise-check-for-zero-byte/ on its surface. However, based on both Wikipedia and the above link, it may matter whether little- or big-endian byte order is used (ByteBufferBsonInput
explicitly uses the little-endian byte order). Though when I tried playing with the big-endian byte order, the code seemed to still work.
The last reviewed commit is 55ba0b1.
Co-authored-by: Valentin Kovalenko <[email protected]>
Co-authored-by: Valentin Kovalenko <[email protected]>
Co-authored-by: Valentin Kovalenko <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left an optional suggestion, feel free to ignore.
The last reviewed commit is 364a810.
mask &= 0x8080808080808080L; | ||
if (mask != 0) { | ||
/* | ||
* Performing >>> 3 (i.e., dividing by 8) gives the byte offset from the LSB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LSB is used to denote the bit in a byte; but here we are computing the offset in terms of byte
, so
... gives the byte offset.
is enough and less confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the confusion. What It meant is that the UTF-8 bytes are ordered from right to left within the Long (after getLong()
in little-endian), so we effectively scan 'from the LSb' - the offset starts at the least significant bit.
Would this revised comment be clearer? A bit more detailed.
The UTF-8 data is endian-independent and stored in left-to-right order in the buffer, with the first byte at the lowest index. After calling getLong() in little-endian mode, the first UTF-8 byte ends up in the least significant byte of the long (bits 0–7), and the last one in the most significant byte (bits 56–63).
numberOfTrailingZeros scans from the least significant bit (LSb), which aligns with the position of the first UTF-8 byte. We then use
>>> 3
, which means dividing without remainder byLong.BYTES
becauseLong.BYTES
is 2^3, computing the byte offset of the NULL terminator in the original UTF-8 data.
Let me know what you think. @stIncMale @NathanQingyangXu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LSB is used to denote the bit in a byte
LSB may mean both least significant byte and bit, which is why I advocated for using MSb in #1685 (comment). The phrase
Performing >>> 3 (i.e., dividing by 8) gives the byte offset from the LSB.
Is correct and clear, assuming that LSB means "least significant byte".
The new next is also good, and explains things in more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is confusing to me for MSB we used to mean bit
, but here LSB means byte
. We don't need to emphasize byte
for the offset is definitively in terms of bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll update the terminology as suggested by @stIncMale : ̶L̶S̶b̶ ̶f̶o̶r̶ ̶l̶e̶a̶s̶t̶ ̶s̶i̶g̶n̶i̶f̶i̶c̶a̶n̶t̶ ̶b̶i̶t̶ ̶a̶n̶d̶ ̶L̶S̶B̶ ̶f̶o̶r̶ ̶l̶e̶a̶s̶t̶ ̶s̶i̶g̶n̶i̶f̶i̶c̶a̶n̶t̶ ̶b̶y̶t̶e̶ ̶f̶o̶r̶ ̶b̶e̶t̶t̶e̶r̶ ̶c̶l̶a̶r̶i̶f̶y̶.̶ ̶ removed abbreviations for better clarity of bit/byte
.
If there are no objections, I’ll also add a detailed comment, suggested above, explaining why we use trailingZeros
instead of leadingZeros
.
JAVA-5842
I’ve re-requested review since previous approvals became stale after recent changes. |
result: | ||
00000000 00000000 10000000 00000000 00000000 00000000 00000000 00000000 | ||
^^^^^^^^ | ||
The most significant bit is set only at the 0x00 byte position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"most significant bit" may refer to the most significant bit in a byte, or in a long (the chunk), and without clarified the text is still unclear. Note also how you used "significant byte of the long" below.
The most significant bit is set only at the 0x00 byte position. | |
The most significant bit is set in each 0x00 byte, and only there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I am not so sure that the "only" part is true. I suspect, after the first (left-most, closest to the least significant byte) 0x00
byte, bits may be set on other bytes, but we don't care about that, as we are looking only for the left-most 0x00
byte. That's why (again, I suspect) Wikipedia mentions that the technique is applicable only if "the goal is limited to finding the first zero byte on a little-endian processor".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, the claim on Wikipedia about "little-endian only" is misleading (maybe a documentation error or just not complete). The algorithm, before numberOfTrailingZeros
, works on both systems (little or big endian), as the zero-byte detection relies on bitwise operations that are endian-agnostic. The only place where byte endianness matters is in numberOfTrailingZeros
or numberOfLeadingZeros
to correctly identify the offset, but Wikipedia does not seem to mention that.
We should probably remove Wikipedia link from Javadoc.
bits may be set on other bytes, but we don't care about that, as we are looking only for the left-most 0x00 byte.
Yes, if the original long contains multiple 0x00
bytes, the MSb will be set for each of those 0x00
bytes.
Here’s a relevant test case for multiple null terminators:
https://github.com/vbabanin/mongo-java-driver/blob/144e287d38697b7af31186cdd815f62735e580b2/driver-core/src/test/unit/com/mongodb/internal/connection/ByteBufferBsonInputTest.java#L665
P.S Applied commit, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the zero-byte detection relies on bitwise operations that are endian-agnostic
I am not sure that's true, as chunk - 0x0101010101010101L
is not a bitwise operation, and we actually treat it as if it's a bytewise operation. If it actually can be treated this way in our circumstances, then it also, obviously, does not depend on the byte order, but the conditions under which it may be treated as a bytewise operation is what matters here. https://jameshfisher.com/2017/01/24/bitwise-check-for-zero-byte/, for example, mentions:
But why were we able to treat the subtraction as byte-wise subtraction? The subtraction algorithm doesn’t work like that! Well, it does work like this for our particular case where there are no zero bytes and we are subtracting 1. It is only when doing 00000000 - 00000001 that the carry bit will be set when crossing the byte boundary.
As I said in #1685 (review), I don't really understand what's going on (for that, I'd have to spend much more time playing with the code and observing intermediate results, and also learn how subtraction works), but it is not impossible that "our particular case where there are no zero bytes and we are subtracting 1" holds only because we are looking for the left-most zero byte in a long, i.e., there are no other zero bytes before it when the -
operation is being executed, and so the carry bits (I don't actually know what they are) from them do not exist and can't affect the outcome. The zero byte we need is going to be the left-most in a long only if the byte order is little endian, which is where the dependency on the byte order may be coming from.
Co-authored-by: Valentin Kovalenko <[email protected]>
Co-authored-by: Valentin Kovalenko <[email protected]>
Co-authored-by: Valentin Kovalenko <[email protected]>
Co-authored-by: Valentin Kovalenko <[email protected]>
Approved. But I don't know if the tests fail because The last reviewed commit is 176db5b. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I like the updated comments.
JAVA-5842