Bug #10584: String.valid_encoding?, String.ascii_only? fails to account for BOM. - Ruby - Ruby Issue Tracking System

Bug #10584

closed

String.valid_encoding?, String.ascii_only? fails to account for BOM.

Added by geoff-codes (Geoff Nixon) over 10 years ago. Updated almost 3 years ago.

Status:

Rejected

Assignee:

Target version:

ruby -v:

ruby 2.2.0preview2 (2014-11-28 trunk 48628) [x86_64-darwin14]

Backport:

2.0.0: UNKNOWN, 2.1: UNKNOWN

[ruby-core:66761]

Tags:

maybenotbug, encoding

Description

IMO:

A Unicode (UTF-16, UTF-32) string with a valid BOM should not be considered a valid encoding if endianness is changed.
A UTF-8 string with BOM should not consider the BOM as a codepoint.

> file utf-16be-file
utf-16be-file: POSIX shell script, Big-endian UTF-16 Unicode text executable

> file utf-16le-file
utf-16le-file: POSIX shell script, Little-endian UTF-16 Unicode text executable

> file utf-8-with-bom-file
utf-8-with-bom-file: POSIX shell script, UTF-8 Unicode (with BOM) text executable

> ruby -e "p File.binread('utf-16le-file').force_encoding('UTF-16BE').valid_encoding?"
true # false

> ruby -e "p File.binread('utf-16be-file').force_encoding('UTF-16LE').valid_encoding?"
true # false

> ruby -e "p File.read('utf-8-with-bom-file').ascii_only?"
false # true

> ruby -e "p File.read('utf-8-with-bom-file')[0]"
"" # '#'

No?

Files

utf-8-with-bom-file (14 Bytes) utf-8-with-bom-file		geoff-codes (Geoff Nixon), 12/10/2014 12:54 AM
utf-16le-file (2.46 KB) utf-16le-file		geoff-codes (Geoff Nixon), 12/10/2014 12:54 AM
utf-16be-file (2.45 KB) utf-16be-file		geoff-codes (Geoff Nixon), 12/10/2014 12:54 AM

#1 [ruby-core:66765]

Updated by duerst (Martin Dürst) over 10 years ago

This isn't as simple as you describe it. With respect to BOMs, there is a clear distinction between external data and internal data. A BOM is often very helpful in external data (e.g. a file). On the other hand, it's not only useless, but actually highly counterproductive for internal data (just think about concatenation).

The problem currently is that Ruby doesn't absorb that difference, it leaves it to the programmer. The reason for this is that it's difficult to define a clear external/internal boundary (the file example is the easy one). Also, some cases require a BOM (e.g. UTF-16 in XML) whereas others forbid it and others allow it and so on. It might be possible to deal with some of this as options on methods reading from files, but that would require careful analysis.

Because U+FFFE isn't a valid character in Unicode (it is a noncharacter), your first two examples could be made to return false as you expect, and that might indeed catch some errors. For your third example, a string with a BOM is definitely not ASCII, so ascii_only? should definitely return false. This is not only the definition of ASCII, but also tightly linked to Ruby's internals (including optimizations).
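
To make the byte-level picture concrete, here is a minimal sketch that does not need the attached files. The sample bytes are made up for illustration: "#!" encoded as UTF-16LE with a leading BOM. Reinterpreting the same bytes as UTF-16BE turns the BOM into U+FFFE, and valid_encoding? still accepts it.

```ruby
# Illustrative bytes: BOM (FF FE) followed by "#!" in UTF-16LE.
le_bytes = [0xFF, 0xFE, 0x23, 0x00, 0x21, 0x00].pack("C*")

as_le = le_bytes.dup.force_encoding(Encoding::UTF_16LE)
as_be = le_bytes.dup.force_encoding(Encoding::UTF_16BE)

# Both interpretations pass, because no code unit is an unpaired surrogate.
p as_le.valid_encoding?                     # => true
p as_be.valid_encoding?                     # => true

# The first code point shows what the BOM bytes became in each case.
printf("U+%04X\n", as_le.codepoints.first)  # prints U+FEFF
printf("U+%04X\n", as_be.codepoints.first)  # prints U+FFFE

# In UTF-8 the BOM is three non-ASCII bytes, so the string is not ASCII-only.
utf8 = "\uFEFF" + "#!"
p utf8.bytes.first(3)                       # => [239, 187, 191]
p utf8.ascii_only?                          # => false
```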

For your fourth example, once internal, it's unclear whether the BOM is actually a BOM or a zero-width non-breaking space. The latter can easily appear at the start of a piece of text. Although explicitly deprecated, it's still effective; I just used it recently in a Web page.
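
Since Ruby leaves this decision to the programmer, code that knows its input came from a file can drop the leading U+FEFF itself. The following is only a sketch: strip_leading_bom is a hypothetical helper, not part of Ruby, and it assumes a UTF-8 string in which a leading U+FEFF really is a BOM.

```ruby
# Hypothetical helper (not a core method): remove a single leading U+FEFF.
# Any U+FEFF later in the text is left alone, since it may be an
# intentional zero-width non-breaking space.
def strip_leading_bom(text)
  text.sub(/\A\uFEFF/, "")
end

with_bom = "\uFEFF" + "#!/bin/sh"
inner    = "a\uFEFFb"

p strip_leading_bom(with_bom)            # => "#!/bin/sh"
p strip_leading_bom(with_bom)[0]         # => "#"
p strip_leading_bom(inner) == inner      # => true
```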

Updated by naruse (Yui NARUSE) over 7 years ago

Target version deleted (~~2.2.0~~)

#3 [ruby-core:108843]

Updated by mame (Yusuke Endoh) almost 3 years ago

Status changed from Open to Rejected

For the third and fourth examples, you can use BOM|UTF-8 encoding.

$ ruby -e 'p File.read("utf-8-with-bom-file", encoding: "BOM|UTF-8").ascii_only?'
true
$ ruby -e 'p File.read("utf-8-with-bom-file", encoding: "BOM|UTF-8")[0]'
"#"
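
For readers without the attachments, the same comparison can be reproduced with a temporary file. This is a sketch under assumptions: the file contents and the helper name read_both_ways are made up for illustration, and only the "BOM|UTF-8" encoding option comes from the comment above.

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temp file, then read them back
# once as plain UTF-8 and once with BOM detection enabled.
def read_both_ways(bytes)
  Tempfile.create("bom-demo") do |f|
    f.binmode
    f.write(bytes)
    f.flush
    [File.read(f.path, encoding: "UTF-8"),
     File.read(f.path, encoding: "BOM|UTF-8")]
  end
end

# UTF-8 BOM (EF BB BF) followed by a short shell script line.
bytes = [0xEF, 0xBB, 0xBF].pack("C*") + "#!/bin/sh\n".b
plain, stripped = read_both_ways(bytes)

p plain.ascii_only?        # => false
p plain[0] == "\uFEFF"     # => true
p stripped.ascii_only?     # => true
p stripped[0]              # => "#"
```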

For the first and second examples, I think it is a problem of the definition of String#valid_encoding? rather than a BOM. Currently, "\uFFFE".valid_encoding? returns true. (Note that U+FFFE is not a character.) So I think the current behavior is considered part of the spec. If we change it as a new feature, we need to evaluate its value and estimate the impact on compatibility.
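
Until then, a caller who wants the stricter behavior can check for noncharacters explicitly. The sketch below is an assumption about how that could look, not an existing Ruby API: noncharacter? and contains_noncharacter? are hypothetical helpers, using the Unicode noncharacter set (U+FDD0..U+FDEF plus the last two code points of every plane).

```ruby
# Hypothetical helpers, not part of Ruby.
def noncharacter?(codepoint)
  (0xFDD0..0xFDEF).cover?(codepoint) || (codepoint & 0xFFFE) == 0xFFFE
end

# True when the string decodes cleanly but contains a noncharacter,
# e.g. a byte-swapped BOM (U+FFFE).
def contains_noncharacter?(text)
  text.valid_encoding? && text.each_codepoint.any? { |cp| noncharacter?(cp) }
end

swapped = [0xFF, 0xFE, 0x23, 0x00].pack("C*").force_encoding(Encoding::UTF_16BE)

p "\uFFFE".valid_encoding?                 # => true
p contains_noncharacter?("\uFFFE")         # => true
p contains_noncharacter?("\uFEFF" + "#")   # => false
p contains_noncharacter?(swapped)          # => true
```

A check like this could flag UTF-16 data read with the wrong endianness without changing what valid_encoding? itself returns.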
