Unicode in JavaScript

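This article takes a close look at how JavaScript handles Unicode: source file encoding, internal use of Unicode, Unicode in strings, normalization, emoji handling, computing string length, ES6 Unicode code point escapes, and encoding ASCII characters.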

Unicode encoding of source files

Unless told otherwise, the browser assumes that the source code of a program is written in the local charset, which varies by country and can cause unexpected issues. For this reason, it's important to set the charset of any JavaScript document.

How do you specify another encoding, in particular UTF-8, the most common file encoding on the web?

If the file contains a BOM character, that takes priority in determining the encoding. Opinions online differ: some say a BOM in UTF-8 is discouraged, and some editors won't even add one.

This is what the Unicode standard says:

… Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

This is what the W3C says:

In HTML5 browsers are required to recognize the UTF-8 BOM and use it to detect the encoding of the page, and recent versions of major browsers handle the BOM as expected when used for UTF-8 encoded pages. – https://www.w3.org/International/questions/qa-byte-order-mark
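
To make the BOM handling concrete, here is a small sketch using the standard TextDecoder API, which is available in browsers and in Node.js. The byte values and variable names are illustrative assumptions, not taken from the original article.

```javascript
// UTF-8 bytes for "hi", preceded by the UTF-8 BOM (EF BB BF).
const bytesWithBom = Uint8Array.of(0xef, 0xbb, 0xbf, 0x68, 0x69)

// By default, TextDecoder treats the BOM as a signature and drops it.
const stripped = new TextDecoder('utf-8').decode(bytesWithBom)

// With ignoreBOM: true, the BOM is kept as the character U+FEFF.
const kept = new TextDecoder('utf-8', { ignoreBOM: true }).decode(bytesWithBom)

console.log(stripped.length) // 2
console.log(kept.length) // 3
```

A leftover U+FEFF at the start of a string is invisible when printed, which is why it tends to show up as a surprising extra character in length checks or comparisons.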

If the file is fetched using HTTP (or HTTPS), the Content-Type header can specify the encoding:

Content-Type: application/javascript; charset=utf-8

If this is not set, the fallback is to check the charset attribute of the script tag:

<script src="./app.js" charset="utf-8"></script>

If this is not set, the document charset meta tag is used:

...
<head>
  <meta charset="utf-8" />
</head>
...

The charset attribute is case insensitive in both cases (see the spec).

All this is defined in RFC 4329 “Scripting Media Types”.

Public libraries should generally avoid characters outside the ASCII set in their code, so that users who load them with an encoding different from the original one do not run into issues.
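
As a rough illustration of what goes wrong when a file is read with the wrong encoding, the sketch below encodes a non-ASCII character as UTF-8 and then decodes the same bytes as Latin-1. It uses the standard TextEncoder and TextDecoder APIs; the variable names are only for this example.

```javascript
// "é" written with an escape so the example itself stays ASCII-only.
const original = '\u00E9'

// UTF-8 stores this character as two bytes: 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode(original)

// Reading those bytes back as UTF-8 round-trips correctly.
const correct = new TextDecoder('utf-8').decode(utf8Bytes)

// Reading them as Latin-1 turns each byte into its own character.
const misread = new TextDecoder('latin1').decode(utf8Bytes)

console.log(correct === original) // true
console.log(misread.length) // 2
```

ASCII-only source avoids this class of problem because ASCII bytes mean the same thing in UTF-8 and in the common single-byte charsets.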

How JavaScript uses Unicode internally

While a JavaScript source file can have any kind of encoding, JavaScript will then convert it internally to UTF-16 before executing it.

JavaScript strings are all UTF-16 sequences, as the ECMAScript standard says:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
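
One way to see those code units is to read them out with charCodeAt. The helper below, codeUnits, is a hypothetical name introduced for this sketch.

```javascript
// Hypothetical helper: lists the UTF-16 code units of a string as 4-digit hex.
function codeUnits(str) {
  const units = []
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0'))
  }
  return units
}

console.log(codeUnits('a\u00E9')) // [ '0061', '00E9' ]
console.log(codeUnits('\uD83D\uDC36')) // [ 'D83D', 'DC36' ]
```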

Using Unicode in a string

A Unicode escape sequence can be added inside any string using the format \uXXXX, where XXXX is four hexadecimal digits:

const s1 = '\u00E9' //é

The same character can also be created by combining two Unicode escape sequences, a base letter followed by a combining mark:

const s2 = '\u0065\u0301' //é

Notice that while both generate an accented e, they are two different strings, and s2 is considered to be 2 characters long:

s1.length //1
s2.length //2

And when you try to select that character in a text editor, you need to press the arrow key twice, because the first press selects only half of the element.

You can also write a string that combines a plain character with a Unicode escape, as internally it's the same thing:

const s3 = 'e\u0301' //é
s3.length === 2 //true
s2 === s3 //true
s1 !== s3 //true
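
To see why these strings compare the way they do, it can help to print the code points each one contains. The codePoints helper below is a hypothetical name used only for this sketch.

```javascript
// Hypothetical helper: lists the code points of a string in U+XXXX notation.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  )
}

console.log(codePoints('\u00E9')) // [ 'U+00E9' ]
console.log(codePoints('e\u0301')) // [ 'U+0065', 'U+0301' ]
```

The first string holds one precomposed code point, the second holds a base letter plus the combining acute accent, so === sees two different sequences even though they render the same.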

Normalization

Unicode normalization is the process of removing ambiguities in how a character can be represented, to aid in comparing strings, for example.

Like in the example above:

const s1 = '\u00E9' //é
const s3 = 'e\u0301' //é
s1 !== s3

ES6/ES2015 introduced the normalize() method on the String prototype, so we can do:

s1.normalize() === s3.normalize() //true
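
normalize() also accepts a form argument. As a short sketch, NFC (the default) composes characters where possible and NFD decomposes them; the sameText helper is a hypothetical name for this example.

```javascript
const composed = '\u00E9' // single code point
const decomposed = 'e\u0301' // base letter + combining acute accent

console.log(composed.normalize('NFD') === decomposed) // true
console.log(decomposed.normalize('NFC') === composed) // true
console.log(decomposed.normalize().length) // 1, since NFC is the default form

// Hypothetical helper: compares two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC')
}

console.log(sameText(composed, decomposed)) // true
```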

Emojis

Emojis are fun, and they are Unicode characters, and as such they are perfectly valid to be used in strings:

const s4 = '🐶'

Emojis live in the astral planes, outside the Basic Multilingual Plane (BMP). Since code points outside the BMP cannot be represented in 16 bits, JavaScript uses a combination of 2 UTF-16 code units to represent each of them.

The 🐶 symbol, which is U+1F436, is traditionally encoded as \uD83D\uDC36 (called a surrogate pair). There is a formula to calculate this, but it's a rather advanced topic.
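
For the curious, the formula is short enough to sketch. The function below follows the standard UTF-16 encoding rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The name toSurrogatePair is made up for this example.

```javascript
// Hypothetical helper: computes the UTF-16 surrogate pair for an astral code point.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('Only code points above the BMP need a surrogate pair')
  }
  const offset = codePoint - 0x10000 // a 20-bit value
  const high = 0xd800 + (offset >> 10) // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff) // bottom 10 bits
  return [high, low]
}

const [high, low] = toSurrogatePair(0x1f436)
console.log(high.toString(16), low.toString(16)) // d83d dc36
```

In everyday code there is rarely a need to do this by hand, since String.fromCodePoint(0x1F436) produces the same two code units.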

Some emojis are also created by combining other emojis. You can find those by looking at this list https://unicode.org/emoji/charts/full-emoji-list.html and noticing the ones that have more than one item in the Unicode symbol column.

👩‍❤️‍👩 is created by combining 👩 (\uD83D\uDC69), ❤️‍ (\u200D\u2764\uFE0F\u200D) and another 👩 (\uD83D\uDC69) in a single string: \uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC69

Whether you count UTF-16 code units or code points, this emoji is never counted as 1 character.

Get the proper length of a string

If you try to perform

'👩‍❤️‍👩'.length

You'll get 8 in return, as length counts UTF-16 code units, not the characters you see on screen.

Also, iterating over it is kind of funny:

[Figure: iterating an emoji]
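
As a text stand-in for that iteration, here is a small sketch. String iteration walks the string by code point, so the emoji falls apart into its 6 building blocks. The emoji is written with escapes and the variable names are illustrative.

```javascript
// The couple emoji: woman, ZWJ, heart, variation selector, ZWJ, woman.
const couple = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F469}'

// Array.from iterates by code point, like for...of and the spread operator.
const parts = Array.from(couple, (ch) => ch.codePointAt(0).toString(16))

console.log(couple.length) // 8 (UTF-16 code units)
console.log(parts.length) // 6 (code points)
console.log(parts) // [ '1f469', '200d', '2764', 'fe0f', '200d', '1f469' ]
```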

Curiously, when you paste this emoji into a password field it is counted as 8 characters, possibly making it a valid password in some systems.

How to get the “real” length of a string containing unicode characters?

One easy way in ES6+ is to use the spread operator:

;[...'🐶'].length //1

You can also use the Punycode library by Mathias Bynens:

require('punycode').ucs2.decode('🐶').length //1

(Punycode is also great to convert Unicode to ASCII)

Note that emojis that are built by combining other emojis will still give a bad count:

require('punycode').ucs2.decode('👩‍❤️‍👩').length //6
[...'👩‍❤️‍👩'].length //6

If the string has combining marks, however, this still will not give the right count. Check this Glitch https://glitch.com/edit/#!/node-unicode-ignore-marks-in-length as an example.
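
A different approach, not covered in the original article, is to count grapheme clusters with Intl.Segmenter, which is available in recent browsers and Node.js versions. The sketch below assumes the runtime ships that API with reasonably current Unicode data; graphemeLength is a hypothetical helper name.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
function graphemeLength(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' })
  let count = 0
  for (const _segment of segmenter.segment(str)) {
    count++
  }
  return count
}

const accented = 'e\u0301' // e + combining acute accent
const coupleEmoji = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F469}'

console.log(graphemeLength(accented)) // 1
console.log(graphemeLength(coupleEmoji)) // 1
```

The result depends on the segmentation rules built into the engine, so very new emoji sequences may be counted differently on older runtimes.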

(You can generate your own weird text with marks here: https://lingojam.com/WeirdTextGenerator)

Length is not the only thing to pay attention to. Reversing a string is also error prone if not handled correctly.
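
A quick sketch of the reversal problem: splitting on the empty string works on UTF-16 code units and tears surrogate pairs apart, while spreading the string works on code points. The variable names are illustrative.

```javascript
const text = 'a\u{1F436}b' // a, dog emoji, b

// Naive reversal swaps the two halves of the surrogate pair.
const naive = text.split('').reverse().join('')

// Code point aware reversal keeps the emoji intact.
const reversedByCodePoint = [...text].reverse().join('')

console.log(naive === 'b\uDC36\uD83Da') // true, the pair is now broken
console.log(reversedByCodePoint === 'b\u{1F436}a') // true

// Combining marks still end up detached from their base letter.
console.log([...'e\u0301'].reverse().join('') === '\u0301e') // true
```

Handling combining marks and joined emoji correctly requires working on grapheme clusters rather than code points.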

ES6 Unicode code point escapes

ES6/ES2015 introduced a way to represent Unicode code points in the astral planes (any code point requiring more than 4 hexadecimal digits), by wrapping the code in curly braces:

'\u{XXXXX}'

The dog 🐶 symbol, which is U+1F436, can be represented as \u{1F436} instead of having to combine the two surrogate code units we showed before: \uD83D\uDC36.

But length calculation still does not work correctly, because internally it’s converted to the surrogate pair shown above.
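
A brief sketch of that equivalence, using only standard String methods:

```javascript
const dog = '\u{1F436}'

console.log(dog === '\uD83D\uDC36') // true, both spell the same two code units
console.log(dog.length) // 2
console.log(dog.codePointAt(0).toString(16)) // 1f436
console.log(String.fromCodePoint(0x1f436) === dog) // true
console.log([...dog].length) // 1
```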

Encoding ASCII chars

Characters in the range U+0000 to U+00FF, which includes all 128 ASCII characters, can be encoded using the special escape sequence \x, which accepts exactly 2 hexadecimal digits:

'\x61' // a
'\x2A' // *

This only works from \x00 to \xFF. ASCII covers \x00 to \x7F; the values from \x80 to \xFF map to the Latin-1 Supplement block (U+0080 to U+00FF).
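
As a small sketch, \x escapes are interchangeable with the equivalent \u escapes, and a string can be turned back into \x notation by hand. The toHexEscape helper is a hypothetical name for this example.

```javascript
console.log('\x61' === 'a') // true
console.log('\xE9' === '\u00E9') // true, outside ASCII but within \x00 to \xFF

// Hypothetical helper: returns the \xNN escape text for a single character.
function toHexEscape(ch) {
  const code = ch.charCodeAt(0)
  if (code > 0xff) {
    throw new RangeError('\\x escapes only cover U+0000 to U+00FF')
  }
  return '\\x' + code.toString(16).toUpperCase().padStart(2, '0')
}

console.log(toHexEscape('*')) // \x2A
```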

Translated from: https://flaviocopes.com/javascript-unicode/
