Cheaply derive code range for String#b return value

The result of String#b is a string with an ASCII_8BIT/BINARY encoding. That encoding is ASCII-compatible and has no byte sequences that are invalid for the encoding. If we know the receiver's code range, we can derive the resulting string's code range without needing to perform a full code range scan.
author: Kevin Menard 2022-07-25 21:04:03 -0400
committer: Jean Boussier 2022-07-26 09:03:44 +0200
commit: 9a8f6e392fbd9c145566ae18fa2128ef96369430 (patch)
tree: 67bcf5b308a88cc915a2b5d3ddbe42c8f506c8d3 /string.c
parent: 9e6d07f3462d29f340114650da9f13a36b866d5f (diff)
1 files changed, 17 insertions, 1 deletions
diff --git a/string.c b/string.c
index f5e089aa21..d9b5278eb6 100644
--- a/string.c
+++ b/string.c
@@ -10779,7 +10779,23 @@ rb_str_b(VALUE str)
         str2 = str_alloc_embed(rb_cString, RSTRING_EMBED_LEN(str) + TERM_LEN(str));
     }
     str_replace_shared_without_enc(str2, str);
-    ENC_CODERANGE_CLEAR(str2);
+
+    // BINARY strings can never be broken; they're either 7-bit ASCII or VALID.
+    // If we know the receiver's code range then we know the result's code range.
+    int cr = ENC_CODERANGE(str);
+    switch (cr) {
+        case ENC_CODERANGE_7BIT:
+            ENC_CODERANGE_SET(str2, ENC_CODERANGE_7BIT);
+            break;
+        case ENC_CODERANGE_BROKEN:
+        case ENC_CODERANGE_VALID:
+            ENC_CODERANGE_SET(str2, ENC_CODERANGE_VALID);
+            break;
+        default:
+            ENC_CODERANGE_CLEAR(str2);
+            break;
+    }
+
     return str2;
 }
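
The reasoning behind the switch above can be checked with a small standalone model. This is only a sketch, not Ruby's implementation: the `code_range` enum and both function names are hypothetical stand-ins for the `ENC_CODERANGE_*` machinery, and it assumes the receiver has an ASCII-compatible encoding, so that a VALID or BROKEN receiver contains at least one byte with the high bit set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Ruby's ENC_CODERANGE_* constants.
 * The real values live in Ruby's headers and are not reproduced here. */
typedef enum {
    CR_UNKNOWN,
    CR_7BIT,
    CR_VALID,
    CR_BROKEN
} code_range;

/* Full O(n) scan of a BINARY string: every byte is a valid character,
 * so the result is 7BIT when all bytes are below 0x80 and VALID otherwise.
 * BROKEN is never returned. */
static code_range
scan_binary_code_range(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (bytes[i] & 0x80) return CR_VALID;
    }
    return CR_7BIT;
}

/* O(1) derivation mirroring the switch in the patch.
 * Assumption: the receiver's encoding is ASCII-compatible. */
static code_range
derive_binary_code_range(code_range receiver_cr)
{
    switch (receiver_cr) {
      case CR_7BIT:
        return CR_7BIT;      /* all bytes are ASCII in either encoding */
      case CR_VALID:
      case CR_BROKEN:
        return CR_VALID;     /* some high-bit byte exists; BINARY accepts it */
      default:
        return CR_UNKNOWN;   /* nothing known; leave it for a later scan */
    }
}
```

Under that assumption, the derived value should agree with what the full scan would return over the same bytes. For example, a receiver holding the bytes 0x61 0xFF is BROKEN as UTF-8, and both functions give CR_VALID for its binary copy.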