Bug #10584
Closed
String.valid_encoding?, String.ascii_only? fails to account for BOM.
Description
IMO:
- A Unicode (UTF-16, UTF-32) string with a valid BOM should not be considered a valid encoding if endianness is changed.
- A UTF-8 string with BOM should not consider the BOM as a codepoint.
> file utf-16be-file
utf-16be-file: POSIX shell script, Big-endian UTF-16 Unicode text executable
> file utf-16le-file
utf-16le-file: POSIX shell script, Little-endian UTF-16 Unicode text executable
> file utf-8-with-bom-file
utf-8-with-bom-file: POSIX shell script, UTF-8 Unicode (with BOM) text executable
> ruby -e "p File.binread('utf-16le-file').force_encoding('UTF-16BE').valid_encoding?"
true # expected: false
> ruby -e "p File.binread('utf-16be-file').force_encoding('UTF-16LE').valid_encoding?"
true # expected: false
> ruby -e "p File.read('utf-8-with-bom-file').ascii_only?"
false # expected: true
> ruby -e "p File.read('utf-8-with-bom-file')[0]"
"" # expected: '#'
No?
Updated by duerst (Martin Dürst) over 10 years ago
This isn't as simple as you describe it. With respect to BOMs, there is a clear distinction between external data and internal data. A BOM is often very helpful in external data (e.g. a file). On the other hand, it's not only useless, but actually highly counterproductive for internal data (just think about concatenation).
The problem currently is that Ruby doesn't handle that difference itself; it leaves it to the programmer. The reason for this is that it's difficult to define a clear external/internal boundary (the file example is the easy one). Also, some cases require a BOM (e.g. UTF-16 in XML), others forbid it, others allow it, and so on. It might be possible to deal with some of this through options on the methods that read from files, but that would require careful analysis.
Because U+FFFE isn't a valid codepoint in Unicode, your first two examples could be made to return false, which might indeed catch some errors. For your third example, a string with a BOM is definitely not ASCII, so ascii_only? should definitely return false. This is not only the definition of ASCII, but is also tightly linked to Ruby's internals (including optimizations).
For your fourth example, once internal, it's unclear whether the BOM is actually a BOM or a zero-width non-breaking space. The latter can easily appear at the start of a piece of text. Although explicitly deprecated, it's still effective; I just used it recently in a Web page.
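As a rough illustration of the points above (not taken from the report itself), the following sketch builds two in-memory UTF-8 strings that each start with U+FEFF. The strings and variable names are made up for the example, and a UTF-8 source file is assumed. It shows that the BOM counts as an ordinary non-ASCII codepoint, that concatenation leaves a stray U+FEFF in the middle of the text, and that the programmer currently has to strip it by hand.

```ruby
# Hypothetical in-memory data; "\uFEFF" is encoded as EF BB BF in UTF-8.
bom = "\uFEFF"
a = bom + "#!/bin/sh\n"
b = bom + "echo hi\n"

# The BOM is a non-ASCII codepoint, so the string is not ASCII-only.
p a.ascii_only?          # => false
p a[0] == bom            # => true

# Concatenating two BOM-prefixed strings leaves a U+FEFF mid-text,
# where it can only be read as a zero-width no-break space.
joined = a + b
p joined.count(bom)      # => 2

# Stripping the leading BOM manually (String#delete_prefix, Ruby 2.5+).
stripped = a.delete_prefix(bom)
p stripped.ascii_only?   # => true
p stripped[0]            # => "#"
```

Whether stripping is the right thing to do depends on whether the leading U+FEFF really was a BOM, which is exactly the ambiguity described above.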
Updated by mame (Yusuke Endoh) almost 3 years ago
- Status changed from Open to Rejected
For the third and fourth examples, you can use the BOM|UTF-8 encoding.
$ ruby -e 'p File.read("utf-8-with-bom-file", encoding: "BOM|UTF-8").ascii_only?'
true
$ ruby -e 'p File.read("utf-8-with-bom-file", encoding: "BOM|UTF-8")[0]'
"#"
For the first and second examples, I think the issue is the definition of String#valid_encoding? rather than the BOM. Currently, "\uFFFE".valid_encoding? returns true. (Note that U+FFFE is not a character.) So I think this is considered part of the spec. If we were to change it as a new feature, we would need to evaluate its value and estimate the compatibility impact.
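The two points in this comment can be reproduced without the original files. The sketch below is only an illustration: the temporary file stands in for utf-8-with-bom-file, the UTF-16 bytes are built in memory, and byte_swapped_bom? is a hypothetical helper (not part of Ruby) showing one way a caller might detect a wrong-endian BOM, since valid_encoding? does not. UTF-32 is left out for brevity.

```ruby
require "tempfile"

# Hypothetical helper: true if a UTF-16 string starts with the BOM of the
# opposite byte order (bytes FF FE read as BE, or FE FF read as LE).
def byte_swapped_bom?(str)
  case str.encoding
  when Encoding::UTF_16BE then str.b.start_with?("\xFF\xFE".b)
  when Encoding::UTF_16LE then str.b.start_with?("\xFE\xFF".b)
  else false
  end
end

# Stand-in for utf-8-with-bom-file: EF BB BF followed by a shebang line.
file = Tempfile.create("utf-8-with-bom")
file.binmode
file.write("\xEF\xBB\xBF#!/bin/sh\n".b)
file.close

plain    = File.read(file.path, encoding: "UTF-8")      # BOM kept as U+FEFF
stripped = File.read(file.path, encoding: "BOM|UTF-8")  # BOM consumed on read
File.unlink(file.path)

p plain.ascii_only?      # => false
p stripped.ascii_only?   # => true
p stripped[0]            # => "#"

# UTF-16LE bytes (FF FE 23 00 21 00) tagged with the wrong byte order.
le_bytes = "\uFEFF#!".encode("UTF-16LE").b
swapped  = le_bytes.dup.force_encoding("UTF-16BE")
correct  = le_bytes.dup.force_encoding("UTF-16LE")

p "\uFFFE".valid_encoding?    # => true
p swapped.valid_encoding?     # => true, as reported in this issue
p byte_swapped_bom?(swapped)  # => true
p byte_swapped_bom?(correct)  # => false
```

A check like this has to live in application code for now, because changing valid_encoding? itself would be a behavior change with the compatibility cost mentioned above.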