Unicode Demystified
A Practical Programmer’s Guide to the Encoding Standard
by Richard Gillam
Copyright ©2000–2002 by Richard T. Gillam. All rights reserved.
Table of Contents

Preface
About this book
How this book was produced
The author’s journey
Acknowledgements
A personal appeal

Unicode in Essence: An Architectural Overview of the Unicode Standard

CHAPTER 1 Language, Computers, and Unicode
What Unicode Is
What Unicode Isn’t
The challenge of representing text in computers
What This Book Does
How this book is organized
Section I: Unicode in Essence
Section II: Unicode in Depth
Section III: Unicode in Action
Dealing properly with combining character sequences
Canonical decompositions
Canonical accent ordering
Double diacritics
Compatibility decompositions
Singleton decompositions
Hangul
Unicode normalization forms
Grapheme clusters

Unicode in Depth: A Guided Tour of the Character Repertoire

CHAPTER 7 Scripts of Europe
The Western alphabetic scripts
The Latin alphabet
The Latin-1 characters
The Latin Extended A block
The Latin Extended B block
The Latin Extended Additional block
The International Phonetic Alphabet
Diacritical marks
Isolated combining marks
Spacing modifier letters
The Greek alphabet
The Greek block
The Greek Extended block
The Coptic alphabet
The Cyrillic alphabet
The Cyrillic block
The Cyrillic Supplementary block
The Armenian alphabet
The Georgian alphabet
Western positional notation
Alphabetic numerals
Roman numerals
Han characters as numerals
Other numeration systems
Numeric presentation forms
National and nominal digit shapes
Punctuation
Script-specific punctuation
The General Punctuation block
The CJK Symbols and Punctuation block
Spaces
Dashes and hyphens
Quotation marks, apostrophes, and similar-looking characters
Paired punctuation
Dot leaders
Bullets and dots
Special characters
Line and paragraph separators
Segment and page separators
Control characters
Characters that control word wrapping
Characters that control glyph selection
The grapheme joiner
Bidirectional formatting characters
Deprecated characters
Interlinear annotation
The object-replacement character
The general substitution character
Tagging characters
Non-characters
Symbols used with numbers
Numeric punctuation
Currency symbols
Unit markers
Math symbols
Mathematical alphanumeric symbols
Other symbols and miscellaneous characters
Musical notation
Braille
Other symbols
Presentation forms
Miscellaneous characters
Glossary
Bibliography
The Unicode Standard
Other Standards Documents
Books and Magazine Articles
Unicode Conference papers
Other papers
Online resources
As the economies of the world become more interconnected, and as the American computer market becomes more and more saturated, computer-related businesses are looking more and more to markets outside the United States to grow their businesses. At the same time, companies
in other industries are not only beginning to do the same thing (or, in fact, have been for a long time),
but are increasingly turning to computer technology, especially the Internet, to grow their businesses
and streamline their operations.
The convergence of these two trends means that it’s no longer just an English-only market for
computer software. More and more, computer software is being used not only by people outside the
United States or by people whose first language isn’t English, but by people who don’t speak English
at all. As a result, interest in software internationalization is growing in the software development
community.
A lot of things are involved in software internationalization: displaying text in the user’s native
language (and in different languages depending on the user), accepting input in the user’s native
language, altering window layouts to accommodate expansion or contraction of text or differences in
writing direction, displaying numeric values according to local customs, indicating events in time
according to the local calendar systems, and so on.
This book isn’t about any of these things. It’s about something more basic, and which underlies most
of the issues listed above: representing written language in a computer. There are many different
ways to do this; in fact, there are several for just about every language that’s been represented in computers. And that’s the whole problem. Designing software that’s flexible enough to handle
data in multiple languages (at least multiple languages that use different writing systems) has
traditionally meant not just keeping track of the text, but also keeping track of which encoding
scheme is being used to represent it. And if you want to mix text in multiple writing systems, this
bookkeeping becomes more and more cumbersome.
The Unicode standard was designed specifically to solve this problem. It aims to be the universal
character encoding standard, providing unique, unambiguous representations for every character in
virtually every writing system and language in the world. The most recent version of Unicode
provides representations for over 90,000 characters.
Unicode has been around for twelve years now and is in its third major revision, adding support for
more languages with each revision. It has gained widespread support in the software community and
is now supported in a wide variety of operating systems, programming languages, and application
programs. Each of the semiannual International Unicode Conferences is better-attended than the
previous one, and the number of presenters and sessions at the Conferences grows correspondingly.
Representing text isn’t as straightforward as it appears at first glance: it’s not merely as simple as
picking out a bunch of characters and assigning numbers to them. First you have to decide what a
“character” is, which isn’t as obvious in many writing systems as it is in English. You have to
contend with things such as how to represent characters with diacritical marks applied to them, how to
represent clusters of marks that represent syllables, when differently-shaped marks on the page are
different “characters” and when they’re just different ways of writing the same “character,” what
order to store the characters in when they don’t proceed in a straightforward manner from one side of
the page to the other (for example, some characters stack on top of each other, or you have two
parallel lines of characters, or the reading order of the text on the page zigzags around the line
because of differences in natural reading direction), and many similar issues.
The decisions you make on each of these issues for every character affect how various processes,
such as comparing strings or analyzing a string for word boundaries, are performed, making them
more complicated. In addition, the sheer number of different characters representable using the Unicode standard makes many processes on text more complicated.
For all of these reasons, the Unicode standard is a large, complicated affair. Unicode 3.0, the last
version published as a book, is 1,040 pages long. Even at this length, many of the explanations are
fairly concise and assume the reader already has some degree of familiarity with the problems to be
solved. It can be kind of intimidating.
The aim of this book is to provide an easier entrée into the world of Unicode. It arranges things in a
more pedagogical manner, takes more time to explain the various issues and how they’re solved, fills
in various pieces of background information, and adds implementation information and information
on what Unicode support is already out there. It is this author’s hope that this book will be a worthy
companion to the standard itself, and will provide the average programmer and the
internationalization specialist alike with all the information they need to effectively handle Unicode
in their software.
• This book assumes the reader either is a professional computer programmer or is familiar with
most computer-programming concepts and terms. Most general computer-science jargon isn’t
defined or explained here.
In the spring of 1995, that changed when I went to work for Taligent. Taligent, you may remember,
was the ill-fated joint venture between Apple Computer and IBM (later joined by Hewlett-Packard)
that was originally formed to create a new operating system for personal computers using state-of-
the-art object-oriented technology. The fruit of our labors, CommonPoint, turned out to be too little
too late, but it spawned a lot of technologies that found their places in other products.
For a while there, Taligent enjoyed a cachet in the industry as the place where Apple and IBM had
sent many of their best and brightest. If you managed to get a job at Taligent, you had “made it.”
I almost didn’t “make it.” I had wanted to work at Taligent for some time and eventually got the
chance, but turned in a rather unimpressive interview performance (a couple coworkers kidded me
about that for years afterward) and wasn’t offered the job. About that same time, a friend of mine did get a job there, and after the person who had been offered the job I interviewed for turned it down for family reasons, my friend put in a good word for me and I got a second chance.
I probably would have taken almost any job there, but the specific opening was in the text and
internationalization group, and thus began my long association with Unicode.
One thing pretty much everybody who ever worked at Taligent will agree on is that working there
was a wonderful learning experience: an opportunity, as it were, to “sit at the feet of the masters.”
Personally, the Taligent experience made me into the programmer I am today. My C++ and OOD
skills improved dramatically, I became proficient in Java, and I went from knowing virtually nothing
about written language and software internationalization to… well, I’ll let you be the judge.
My team was eventually absorbed into IBM, and I enjoyed a little over two years as an IBMer before
deciding to move on in early 2000. During my time at Taligent/IBM, I worked on four different sets
of Unicode-related text handling routines: the text-editing frameworks in CommonPoint, various text-
storage and internationalization frameworks in IBM’s Open Class Library, various
internationalization facilities in Sun’s Java Class Library (which IBM wrote under contract to Sun),
and the libraries that eventually came to be known as the International Components for Unicode.
International Components for Unicode, or ICU, began life as an IBM developer-tools package based
on the Java internationalization libraries, but has since morphed into an open-source project and
taken on a life of its own. It’s gaining increasing popularity and showing up in more operating
systems and software packages, and it’s acquiring a reputation as a great demonstration of how to
implement the various features of the Unicode standard. I had the twin privileges of contributing
frameworks to Java and ICU and of working alongside those who developed the other frameworks
and learning from them. I got to watch the Unicode standard develop, work with some of those who
were developing it, occasionally rub shoulders with the others, and occasionally contribute a tidbit or
two to the effort myself. It was a fantastic experience, and I hope that at least some of their expertise
rubbed off on me.
Acknowledgements
It’s been said that it takes a village to raise a child. Well, I don’t really know about that, but it
definitely takes a village to write a book like this. The person whose name is on the cover gets to
take the credit, but there’s an army of people who contribute to the content.
Acknowledgements sections have a bad tendency to sound like acceptance speeches at the Oscars
as the authors go on forever thanking everyone in sight. This set of acknowledgements will be no different. If you’re bored or annoyed by that sort of thing, I recommend you skip to the next section
now. You have been warned.
Here goes: First and foremost, I’m indebted to the various wonderful and brilliant people I worked
with on the internationalization teams at Taligent and IBM: Mark Davis, Kathleen Wilson, Laura
Werner, Doug Felt, Helena Shih, John Fitzpatrick, Alan Liu, John Raley, Markus Scherer, Eric
Mader, Bertrand Damiba, Stephen Booth, Steven Loomis, Vladimir Weinstein, Judy Lin, Thomas
Scott, Brian Beck, John Jenkins, Deborah Goldsmith, Clayton Lewis, Chen-Lieh Huang, and
Norbert Schatz. Whatever I know about either Unicode or software internationalization I learned
from these people. I’d also like to thank the crew at Sun that we worked with: Brian Beck, Norbert
Lindenberg, Stuart Gill, and John O’Conner.
I’d also like to thank my management and coworkers at Trilogy, particularly Robin Williamson,
Chris Hyams, Doug Spencer, Marc Surplus, Dave Griffith, John DeRegnaucourt, Bryon Jacob, and
Zach Roadhouse, for their understanding and support as I worked on this book, and especially for
letting me continue to participate in various Unicode-related activities, especially the conferences,
on company time and with company money.
Numerous people have helped me directly in my efforts to put this book together, by reviewing
parts of it and offering advice and corrections, by answering questions and giving advice, by
helping me put together examples or letting me use examples they’d already put together, or simply
by offering an encouraging word or two. I’m tremendously indebted to all who helped out in these
ways: Jim Agenbroad, Matitiahu Allouche, Christian Cooke, John Cowan, Simon Cozens, Mark
Davis, Roy Daya, Andy Deitsch, Martin Dürst, Tom Emerson, Franklin Friedmann, David
Gallardo, Tim Greenwood, Jarkko Hietaniemi, Richard Ishida, John Jenkins, Kent Karlsson, Koji
Kodama, Alain LaBonté, Ken Lunde, Rick McGowan, John O’Conner, Chris Pratley, John Raley,
Jonathan Rosenne, Yair Sarig, Dave Thomas, and Garret Wilson. [Be sure to add the names of
anyone who sends me feedback between 12/31/01 and RTP.]
And of course I’d like to acknowledge all the people at Addison-Wesley who’ve had a hand in
putting this thing together: Ross Venables, my current editor; Julie DiNicola, my former editor;
John Fuller, the production manager; [name], who copy-edited the manuscript; [name], who had
the unenviable task of designing and typesetting the manuscript and cleaning up all my diagrams
and examples; [name], who put together the index; [name], who designed the cover; and Mike
Hendrickson, who oversaw the whole thing. I greatly appreciate their professionalism and their
patience in dealing with this first-time author.
And last but not least, I’d like to thank the family members and friends who’ve had to sit and listen
to me talk about this project for the last couple years: especially my parents, Ted and Joyce
Gillam; Leigh Anne and Ken Boynton, Ken Cancelosi, Kelly Bowers, Bruce Rittenhouse, and
many of the people listed above.
As always with these things, I couldn’t have done it without all these people. If this book is good,
they deserve the lion’s share of the credit. If it isn’t, I deserve the blame and owe them my
apologies.
A personal appeal
For a year and a half, I wrote a column on Java for C++ Report magazine, and for much of that
time, I wondered if anyone was actually reading the thing and what they thought of it. I would
occasionally get an email from a reader commenting on something I’d written, and I was always
grateful, whether the feedback was good or bad, because it meant someone was reading the thing
and took it seriously enough to let me know what they thought.
I’m hoping there will be more than one edition of this book, and I really want it to be as good as
possible. If you read it and find it less than helpful, I hope you won’t just throw it on a shelf
somewhere and grumble about the money you threw away on it. Please, if this book fails to
adequately answer your questions about Unicode, or if it wastes too much time answering
questions you don’t care about, I want to know. The more specific you can be about just what isn’t
doing it for you, the better. Please write me at [email protected] with your comments
and criticisms.
For that matter, if you like what you see here, I wouldn’t mind hearing from you either. God
knows, I can use the encouragement.
—R. T. G.
Austin, Texas
January 2002
Unicode in Essence
An Architectural Overview of the
Unicode Standard
CHAPTER 1 Language, Computers, and
Unicode
Words are amazing, marvelous things. They have the power both to build up and to tear down great
civilizations. Words make us who we are. Indeed, many have observed that our use of language is
what separates humankind from the rest of the animal kingdom. Humans have the capacity for
symbolic thought, the ability to think about and discuss things we cannot immediately see or touch.
Language is our chief symbolic system for doing this. Consider even a fairly simple concept such as
“There is water over the next hill.” Without language, this would be an extraordinarily difficult thing
to convey.
There’s been a lot of talk in recent years about “information.” We face an “information explosion”
and live in an “information age.” Maybe this is true, but when it comes right down to it, we are
creatures of information. And language is one of our main ways both of sharing information with one another and, often, of processing it.
We often hear that we are in the midst of an “information revolution,” surrounded by new forms of
“information technology,” a phrase that didn’t even exist a generation or two ago. These days, the
term “information technology” is generally used to refer to technology that helps us perform one or
more of three basic processes: the storage and retrieval of information, the extraction of higher levels
of meaning from a collection of information, and the transmission of information over large
distances. The telegraph, and later the telephone, was a quantum leap in the last of these three
processes, and the digital computer in the first two, and these two technologies form the cornerstone
of the modern “information age.”
Yet by far the most important advance in information technology occurred many thousands of years
ago and can’t be credited to a single inventor. That advance (like so many technological revolutions,
really a series of smaller advances) was written language. Think about it: before written language,
storing and retrieving information over a long period of time relied mostly on human memory, or on
precursors of writing, such as making notches in sticks. Human memory is unreliable, and storage
and retrieval (or storage over a time longer than a single person’s lifetime) involved direct oral
contact between people. Notches in sticks and the like avoid this problem, but don’t allow for a lot of nuance or depth in the information being stored. Likewise, transmission of information over a
long distance either also required memorization, or relied on things like drums or smoke signals that
also had limited range and bandwidth.
Writing made both of these processes vastly more powerful and reliable. It enabled storage of
information in dramatically greater concentrations and over dramatically longer time spans than was
ever thought possible, and made possible transmission of information over dramatically greater
distances, with greater nuance and fidelity, than had ever been thought possible.
In fact, most of today’s data processing and telecommunications technologies have written language
as their basis. Much of what we do with computers today is use them to store, retrieve, transmit,
produce, analyze, and print written language.
Information technology didn’t begin with the computer, and it didn’t begin with the telephone or
telegraph. It began with written language.
This is a book about how computers are used to store, retrieve, transmit, manipulate, and analyze
written language.
***
Language makes us who we are. The words we choose speak volumes about who we are and about
how we see the world and the people around us. They tell others who “our people” are: where we’re
from, what social class we belong to, possibly even who we’re related to.
The world is made up of hundreds of ethnic groups, who constantly do business with, create alliances
with, commingle with, and go to war with each other. And the whole concept of “ethnic group” is
rooted in language. Who “your people” are is rooted not only in which language you speak, but in
which language or languages your ancestors spoke. As a group’s subgroups become separated, the languages those subgroups speak will begin to diverge, eventually giving rise to multiple languages
that share a common heritage (for example, classical Latin diverging into modern Spanish and
French), and as different groups come into contact with each other, their respective languages will
change under each other’s influence (for example, much of modern English vocabulary was borrowed from French). Much about a group’s history is encoded in its language.
We live in a world of languages. There are some 6,000 to 7,000 different languages spoken in the
world today, each with countless dialects and regional variations.¹ We may be united by language, but we’re divided by our languages.

¹ SIL International’s Ethnologue Web site (www.ethnologue.com) lists 6,800 “main” languages and 41,000 variants and dialects.

And yet the world of computing is strangely homogeneous. For decades now, the language of computing has been English. Specifically, American English. Thankfully, this is changing. One of the things that information technology is increasingly making possible is contact between people in different parts of the world. In particular, information technology is making it more and more
possible to do business in different parts of the world. And as information technology invades more
and more of the world, people are increasingly becoming unwilling to speak the language of
information technology—they want information technology to speak their language.
This is a book about how all, or at least most, written languages—not just English or Western
European languages—can be used with information technology.
***
Language makes us who we are. Almost every human activity has its own language, a specialized
version of a normal human language adapted to the demands of discussing a particular activity. So
the language you speak also says much about what you do.
Every profession has its jargon, which is used both to provide a more precise method of discussing
the various aspects of the profession and to help tell “insiders” from “outsiders.” The information
technology industry has always had a reputation as one of the most jargon-laden professional groups
there is. This probably isn’t true, but it looks that way because of the way computers have come to
permeate our lives: now that non-computer people have to deal with computers, they also have to deal with the language that computer people use.
In fact, it’s interesting to watch as the language of information technology starts to infect the
vernacular: “I don’t have the bandwidth to deal with that right now.” “Joe, can you spare some cycles
to talk to me?” And my personal favorite, from a TV commercial from a few years back: “No
shampoo helps download dandruff better.”
What’s interesting is that subspecialties within information technology each have their own jargon as
well that isn’t shared by computer people outside their subspecialty. In the same way that there’s
been a bit of culture clash as the language of computers enters the language of everyday life, there’s
been a bit of culture clash as the language of software internationalization enters the language of
general computing.
This is a good development, because it shows the increasing interest of the computing community in
developing computers, software, and other products that can deal with people in their native
languages. We’re slowly moving from (apologies to Henry Ford) “your native language, as long as
it’s English” to “your native language.” The challenge of writing one piece of software that can deal
with users in multiple human languages involves many different problems that need to be solved, and
each of those problems has its own terminology.
***
One of the biggest problems to be dealt with in software internationalization is that the ways human language has traditionally been represented inside computers often don’t lend themselves to many human languages, and they lend themselves especially badly to multiple human languages at the same time.
Over time, systems have been developed for representing quite a few different written languages in
computers, but each scheme is generally designed for only a single language, or at best a small
collection of related languages, and these systems are mutually incompatible. Interpreting a series of
bits encoded with one standard using the rules of another yields gibberish, so software that handles
multiple languages has traditionally had to do a lot of extra bookkeeping to keep track of the various
different systems used to encode the characters of those languages. This is difficult to do well, and
few pieces of software attempt it, leading to a Balkanization of computer software and the data it
manipulates.
Unicode solves this problem by providing a unified representation for the characters in the various
written languages. By providing a unique bit pattern for every single character, you eliminate the
problem of having to keep track of which of many different characters this particular instance of a
particular bit pattern is supposed to represent.
Of course, each language has its own peculiarities, and presents its own challenges for computerized
representation. Dealing with Unicode doesn’t necessarily mean dealing with all the peculiarities of
the various languages, but it can—it depends on how many languages you actually want to support in
your software, and how much you can rely on other software (such as the operating system) to do that
work for you.
In addition, because of the sheer number of characters it encodes, there are challenges to dealing with
Unicode-encoded text in software that go beyond those of dealing with the various languages it
allows you to represent. The aim of this book is to help the average programmer find his way
through the jargon and understand what goes into dealing with Unicode.
What Unicode Is
Unicode is a standard method for representing written language in computers. So why do we need
this? After all, there are probably dozens, if not hundreds, of ways of doing this already. Well, this is
exactly the point. Unicode isn’t just another in the endless parade of text-encoding standards; it’s an
attempt to do away with all the others, or at least simplify their use, by creating a universal text
encoding standard.
Let’s back up for a second. The best known and most widely used character encoding standard is the
American Standard Code for Information Interchange, or ASCII for short. The first version of ASCII
was published in 1964 as a standard way of representing textual data in computer memory and
sending it over communication links between computers. ASCII is based on a seven-bit byte. Each
byte represented a character, and characters were represented by assigning them to individual bit
patterns (or, if you prefer, individual numbers). A seven-bit byte can have 128 different bit patterns.
33 of these were set aside for use as control signals of various types (start- and end-of-transmission
codes, block and record separators, etc.), leaving 95 free for representing characters.
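To make the arithmetic concrete, here’s a minimal Java sketch (an illustration of mine, not anything defined by the ASCII standard itself; Java chars hold Unicode values, but for these characters the numbers coincide):

    // ASCII assigns each character a seven-bit pattern; a seven-bit byte
    // allows 128 such patterns: 33 controls plus 95 printable characters.
    public class AsciiDemo {
        public static void main(String[] args) {
            System.out.println((int) 'A');                    // 65, the ASCII code for 'A'
            System.out.println(Integer.toBinaryString('A'));  // 1000001, its seven-bit pattern
            System.out.println(1 << 7);                       // 128 possible patterns
        }
    }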
Perhaps the main deficiency in ASCII comes from the A in its name: American. ASCII is an
American standard, and was designed for the storage and transmission of English text. 95 characters
are sufficient for representing English text, barely, but that’s it. On early teletype machines, ASCII
could also be used to represent the accented letters found in many European languages, but this
capability disappeared in the transition from teletypes to CRT terminals.
So, as computer use became more and more widespread in different parts of the world, alternative
methods of representing characters in computers arose for other languages, leading to the situation we have today, where there are generally three or four different encoding schemes for every language and writing system in use.
Unicode is the latest of several attempts to solve this Tower of Babel problem by creating a universal
character encoding. Its main way of doing this is to increase the size of the possible encoding space
by increasing the number of bits used to encode each character. Most other character encodings are
based upon an eight-bit byte, which provides enough space to encode a maximum of 256 characters
(in practice, most encodings reserve some of these values for control signals and encode fewer than
256 characters). For languages, such as Japanese, that have more than 256 characters, most
encodings are still based on the eight-bit byte, but use sequences of several bytes to represent most of
the characters, using relatively complicated schemes to manage the variable numbers of bytes used to
encode the characters.
Unicode uses a 16-bit word to encode characters, allowing up to 65,536 characters to be encoded
without resorting to more complicated schemes involving multiple machine words per character.
65,000 characters, with careful management, is enough to allow encoding of the vast majority of
characters in the vast majority of written languages in use today. The current version of Unicode,
version 3.2, actually encodes 95,156 different characters—it actually does use a scheme to represent
the less-common characters using two 16-bit units, but with 50,212 characters actually encoded using
only a single unit, you rarely encounter the two-unit characters. In fact, these 50,212 characters
include all of the characters representable with all of the other character encoding methods that are in
reasonably widespread use.
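You can see this two-unit scheme at work in any Java string, since Java’s char is exactly the 16-bit unit described here. A small sketch of mine, assuming a Java 5 or later runtime:

    // A common Han character occupies one 16-bit unit; a rarer character,
    // such as U+10400 (DESERET CAPITAL LETTER LONG I), occupies two.
    public class SurrogateDemo {
        public static void main(String[] args) {
            String common = "\u4E2D";                              // one 16-bit unit
            String rare = new String(Character.toChars(0x10400));  // two 16-bit units
            System.out.println(common.length());                   // 1
            System.out.println(rare.length());                     // 2
            System.out.println(rare.codePointCount(0, rare.length())); // still 1 character
        }
    }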
This provides two main benefits: First, a system that wants to allow textual data (either user data or
things like messages and labels that may need to be localized for different user communities) to be in
any language would, without Unicode, have to keep track of not just the text itself, but also of which
character encoding method was being used. In fact, mixtures of languages might require mixtures of
character encodings. This extra bookkeeping means you couldn’t look directly at the text and know,
for example, that the value 65 was a capital letter A. Depending on the encoding scheme used for that
particular piece of text, it might represent some other character, or even be simply part of a character
(i.e., it might have to be considered along with an adjacent byte in order to be interpreted as a
character). This might also mean you’d need different logic to perform certain processes on text
depending on which encoding scheme they happened to use, or convert pieces of text between
different encodings.
Unicode does away with this. It allows all of the same languages and characters to be represented
using only one encoding scheme. Every character has its own unique, unambiguous value. The value
65, for example, always represents the capital letter A. You don’t need to rely on extra information
about the text in order to interpret it, you don’t need different algorithms to perform certain processes
on the text depending on the encoding or language, and you don’t (with some relatively rare
exceptions) need to consider context to correctly interpret any given 16-bit unit of text.
The other thing Unicode gives you is a pivot point for converting between other character encoding
schemes. Because it’s a superset of all of the other common character encoding systems, you can
convert between any other two encodings by converting from one of them to Unicode, and then from
Unicode to the other. Thus, if you have to provide a system that can convert text between any
arbitrary pair of encodings, the number of converters you have to provide can be dramatically
smaller. If you support n different encoding schemes, you only need 2n different converters, not n²
different converters. It also means that when you have to write a system that interacts with the
outside world using several different non-Unicode character representations, it can do its internal
processing in Unicode and convert at the boundaries, rather than potentially having to have alternate
code to do the same things for text in the different outside encodings.
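As a sketch of the pivot idea (my example; it assumes a Java 6 or later JDK, and the IBM437 charset isn’t guaranteed to be installed in every runtime):

    import java.nio.charset.Charset;

    // Convert between two non-Unicode encodings by decoding to Unicode and
    // re-encoding: one decoder plus one encoder per encoding, rather than
    // one converter per pair of encodings.
    public class PivotDemo {
        static byte[] convert(byte[] data, String from, String to) {
            String pivot = new String(data, Charset.forName(from)); // legacy -> Unicode
            return pivot.getBytes(Charset.forName(to));             // Unicode -> legacy
        }

        public static void main(String[] args) {
            byte[] latin1 = { (byte) 0xE0 };                 // "à" in ISO 8859-1
            // "IBM437" may be absent from minimal JDK installs.
            byte[] cp437 = convert(latin1, "ISO-8859-1", "IBM437");
            System.out.printf("0x%02X%n", cp437[0] & 0xFF);  // 0x85, "à" in code page 437
        }
    }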
It can be somewhat tough to draw a line between what qualifies as plain text, and therefore should be
encoded in Unicode, and what’s really rich text. In fact, debates on this very subject flare up from
time to time in the various Unicode discussion forums. The basic rule is that plain text contains all of
the information necessary to carry the semantic meaning of the text—the letters, spaces, digits,
punctuation, and so forth. If removing it would make the text unintelligible, then it’s plain text.
This is still a slippery definition. After all, italics and boldface carry semantic information, and losing
them may lose some of the meaning of a sentence that uses them. On the other hand, it’s perfectly
possible to write intelligible, grammatical English without using italics and boldface, where it’d be
impossible, or at least extremely difficult, to write intelligible, grammatical English without the letter
“m”, or the comma, or the digit “3.” Some of it may also come down to user expectation—you can
write intelligible English with only capital letters, but it’s generally not considered grammatical or
acceptable nowadays.
There’s also a certain amount of document structure you need to be able to convey in plain text, even
though document structure is generally considered the province of rich text. The classic example is
the paragraph separator. You can’t really get by without a way to represent a paragraph break in plain
text without compromising legibility, even though it’s technically something that indicates document
structure. But many higher-level protocols that deal with document structure have their own ways of
marking the beginnings and endings of paragraphs. The paragraph separator, thus, is one of a
number of characters in Unicode that are explicitly disallowed (or ignored) in rich-text
representations that are based on Unicode. HTML, for example, allows paragraph marks, but they’re
not recognized by HTML parsers as paragraph marks. Instead, HTML uses the <P> and </P> tags to
mark paragraph boundaries.
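For instance (a sketch of mine in Java; U+2029 is Unicode’s paragraph separator character):

    // The same two paragraphs represented two ways: plain text marks the
    // break with U+2029 PARAGRAPH SEPARATOR; HTML, a higher-level protocol,
    // marks it with tags instead.
    String plain = "First paragraph.\u2029Second paragraph.";
    String html = "<p>First paragraph.</p><p>Second paragraph.</p>";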
When it comes down to it, the distinction between plain text and rich text is a judgment call. It’s kind
of like Potter Stewart’s famous remark on obscenity—“I may not be able to give you a definition, but
I know it when I see it.” Still, the principle is that Unicode encodes only plain text. Unicode may be
used as the basis of a scheme for representing rich text, but isn’t intended as a complete solution to
this problem on its own.
Rich text is an example of a “higher-level protocol,” a phrase you’ll run across a number of times in
the Unicode standard. A higher-level protocol is anything that starts with the Unicode standard as its
basis and then adds additional rules or processes to it. XML, for example, is a higher-level protocol
that uses Unicode as its base and adds rules that define how plain Unicode text can be used to
represent structured information through the use of various kinds of markup tags. To Unicode, the
markup tags are just Unicode text like everything else, but to XML, they delineate the structure of the
document. You can, in fact, have multiple layers of protocols: XHTML is a higher-level protocol for
representing rich text that uses XML as its base.
Markup languages such as HTML and XML are one example of how a higher-level protocol may be
used to represent rich text. The other main class of higher-level protocols involves the use of multiple
data structures, one or more of which contain plain Unicode text, and which are supplemented by
other data structures that contain the information on the document’s structure, the text’s visual
presentation, and any other non-text items that are included with the text. Most word processing
programs use schemes like this.
Another thing Unicode isn’t is a complete solution for software internationalization. Software
internationalization is a set of design practices that lead to software that can be adapted for various
international markets (“localized”) without having to modify the executable code. The Unicode
standard in all of its details includes a lot of stuff, but doesn’t include everything necessary to
produce internationalized software. In fact, it’s perfectly possible to write internationalized software
without using Unicode at all, and also perfectly possible to write completely non-internationalized
software that uses Unicode.
• Presenting a different user interface to the user depending on what language he speaks. This may
involve not only translating any text in the user interface into the user’s language, but also altering
screen layouts to accommodate the size or writing direction of the translated text, changing icons
and other pictorial elements to be meaningful (or not to be offensive) to the target audience,
changing color schemes for the same reasons, and so forth.
• Altering the ways in which binary values such as numbers, dates, and times are presented to the user, or the ways in which the user enters these values into the system (see the sketch after this list). This involves not only relatively small things, like changing the character that’s used for a decimal point (it’s a comma, not a period, in most of Europe) or the order of the various pieces of a date (day-month-year is
common in Europe), but possibly larger-scale changes (Chinese uses a completely different
system for writing numbers, for example, and Israel uses a completely different calendar system).
• Altering various aspects of your program’s behavior. For example, sorting a list into alphabetical
order may produce different orders for the same list depending on language because “alphabetical
order” is a language-specific concept. Accounting software might need to work differently in
different places because of differences in accounting rules.
• And the list goes on…
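Here’s the sketch referred to in the list above: a short Java illustration of mine using the JDK’s locale-sensitive classes, not a complete treatment of internationalization:

    import java.text.Collator;
    import java.text.NumberFormat;
    import java.util.Locale;

    // The same number formats differently, and the same strings sort
    // differently, depending on the user's locale.
    public class LocaleDemo {
        public static void main(String[] args) {
            double n = 1234.56;
            System.out.println(NumberFormat.getInstance(Locale.US).format(n));      // 1,234.56
            System.out.println(NumberFormat.getInstance(Locale.GERMANY).format(n)); // 1.234,56

            // "Alphabetical order" is language-specific: in Swedish, ä sorts after z.
            Collator swedish = Collator.getInstance(new Locale("sv", "SE"));
            System.out.println(swedish.compare("z", "ä") < 0);                      // true
        }
    }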
To those experienced in software internationalization, this is all obvious, of course, but those who
aren’t often seem to use the words “Unicode” and “internationalization” interchangeably. If you’re in
this camp, be careful: if you’re writing in C++, storing all your character strings as arrays of
wchar_t doesn’t make your software internationalized. Likewise, if you’re writing in Java, the fact
that it’s in Java and the Java String class uses Unicode doesn’t automatically make your software
internationalized. If you’re unclear on the internationalization issues you might run into that Unicode
doesn’t solve, you can find an excellent introduction to the subject at http://www.xerox-emea.com/globaldesign/paper/paper1.htm, along with a wealth of other useful papers and other goodies.
Finally, another thing Unicode isn’t is a glyph registry. We’ll get into the Unicode character-glyph
model in Chapter 3, but it’s worth a quick synopsis here. Unicode draws a strong, important
distinction between a character, which is an abstract linguistic concept such as “the Latin letter A” or
“the Chinese character for ‘sun,’” and a glyph, which is a concrete visual presentation of a character,
such as A or 日. There isn’t a one-to-one correspondence between these two concepts: a single glyph may represent more than one character (such a glyph is often called a ligature), such as the fi ligature, a single mark that represents the letters f and i together. Or a single character might be represented
by two or more glyphs: The vowel sound au in the Tamil language is represented by two marks: one that goes to the left of a consonant character, and another on the right, but it’s still
thought of as a single character. A character may also be represented using different glyphs in
different contexts: The Arabic letter heh has one shape when it stands alone and another when it occurs in the middle of a word.
You’ll also see what we might consider typeface distinctions between different languages using the
same writing system. For instance, both Arabic and Urdu use the Arabic alphabet, but Urdu is
generally written in the more ornate Nastaliq style, while Arabic frequently isn’t. Japanese and
Chinese are both written using Chinese characters, but some characters have a different shape in Japanese than they do in Chinese.

Unicode, as a rule, doesn’t care about any of these distinctions. It encodes underlying semantic
concepts, not visual presentations (characters, not glyphs) and relies on intelligent rendering software
(or the user’s choice of fonts) to draw the correct glyphs in the correct places. Unicode does
sometimes encode glyphic distinctions, but only when necessary to preserve interoperability with
some preexisting standard or to preserve legibility (i.e., if smart rendering software can’t pick the
right glyph for a particular character in a particular spot without clues in the encoded text itself).
Despite these exceptions, Unicode by design does not attempt to catalogue every possible variant
shape for a particular character. It encodes the character and leaves the shape to higher-level
protocols.
Let’s take a look at this for a few minutes. The basic principle at work here is simple: If you want to
be able to represent textual information in a computer, you make a list of all the characters you want
to represent and assign each one a number.² Now you can represent a sequence of characters with a sequence of numbers.

² Actually, you assign each one a bit pattern, but numbers are useful surrogates for bit patterns, since there’s a generally-agreed-upon mapping from numbers to bit patterns. In fact, in some character encoding standards, including Unicode, there are several alternate ways to represent each character in bits, all based on the same numbers, and so the numbers you assign to them become useful as an intermediate stage between characters and bits. We’ll look at this more closely in Chapters 2 and 6.

Consider the simple number code we all learned as kids, where A is 1, B is 2, C is 3, and so on. Using this scheme, the word…
food
…would be represented as 6-15-15-4.
…although you need a lot more numbers (which introduces its own set of problems, which we’ll get
to in a minute).
In real life, you may choose the numbers fairly judiciously, to facilitate things like sorting (it’s useful,
for example, to make the numeric order of the character codes follow the alphabetical order of the
letters) or character-type tests (it makes sense, for example, to put all the digits in one contiguous
group of codes and the letters in another, or even to position them in the encoding space such that
you can check whether a character is a letter or digit with simple bit-masking). But the basic principle
is still the same: Just assign each character a number.
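A quick Java sketch of mine showing these two ideas: the toy A-is-1 code from above, plus the kind of layout trick ASCII actually uses, where upper- and lowercase letters differ in a single bit:

    // Encode "food" with the A=1, B=2, ... number code, then show how a
    // well-chosen encoding layout makes case conversion a one-bit operation.
    public class NumberCodeDemo {
        public static void main(String[] args) {
            StringBuilder codes = new StringBuilder();
            for (char c : "food".toCharArray()) {
                codes.append(c - 'a' + 1).append('-');
            }
            System.out.println(codes);                // 6-15-15-4-

            // In ASCII, 'Q' is 0x51 and 'q' is 0x71: they differ only in bit 0x20.
            System.out.println((char) ('q' & ~0x20)); // Q
        }
    }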
It starts to get harder, or at least less clear-cut, as you move to other languages. Consider this phrase:
à bientôt
What do you do with the accented letters? You have two basic choices: You can either just assign a
different number to each accented version of a given letter, or you can treat the accent marks as independent characters and give them their own numbers.
If you take the first approach, a process examining the text (comparing two strings, perhaps) can lose
sight of the fact that a and à are the same letter, possibly causing it to do the wrong thing, without
extra code that knows from extra information that à is just an accented version of a. If you take the
second approach, a keeps its identity, but you then have to make decisions about where the code for the accent goes in the sequence relative to the code for the a, and what tells a system that the accent
belongs on top of the a and not some other letter.
For European languages, though, the first approach (just assigning a new number to the accented
version of a letter) is generally considered to be simpler. But there are other situations…
[Hebrew example not reproduced]
…such as this Hebrew example, where that approach breaks down. Here, most of the letters have
marks on them, and the same marks can appear on any letter. Assigning a unique code to every
letter-mark combination quickly becomes unwieldy, and you have to go to giving the marks their own
codes.
In fact, Unicode prefers the give-the-marks-their-own-codes approach, but in many cases also
provides unique codes for the more common letter-mark combinations. This means that many
combinations of characters can be represented more than one way. The “à” and “ô” in “à bientôt,” for
example, can be represented either with single character codes, or with pairs of character codes, but
you want “à bientôt” to be treated the same no matter which set of codes is used to represent it, so
this requires a whole bunch of equivalence rules.
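Here’s what those equivalence rules look like from a programmer’s point of view, in a small Java sketch of mine (the Normalizer class has shipped with the JDK since Java 6; normalization is covered in detail later in the book):

    import java.text.Normalizer;

    // "à bientôt" spelled with precomposed characters and with base letters
    // plus combining marks: different code sequences, same text.
    public class EquivalenceDemo {
        public static void main(String[] args) {
            String composed = "\u00E0 bient\u00F4t";     // à and ô as single codes
            String decomposed = "a\u0300 biento\u0302t"; // a + grave, o + circumflex
            System.out.println(composed.equals(decomposed)); // false: raw codes differ
            String c = Normalizer.normalize(composed, Normalizer.Form.NFD);
            String d = Normalizer.normalize(decomposed, Normalizer.Form.NFD);
            System.out.println(c.equals(d));                 // true: equivalent text
        }
    }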
The whole idea that you number the characters in a line as they appear from left to right doesn’t just
break down when you add accent marks and the like into the picture. Sometimes, it’s not even
straightforward when all you’re dealing with are letters. This sentence…

[Devanagari example not reproduced]

…is in Hindi. The letters in Hindi knot together into clusters representing syllables. The syllables
run from left to right across the page like English text does, but the arrangement of the marks within a
syllable can be complicated and doesn’t necessarily follow the order in which the sounds they
correspond to are actually spoken (or the characters themselves are typed). There are six characters
in this word, arranged like this:

[diagram not reproduced]
Many writing systems have complicated ordering or shaping behavior, and each presents unique
challenges in determining how to represent the characters as a linear sequence of bits.
You also run into interesting decisions as to just what you mean by “the same character” or “different
characters.” For example, in this Greek word…

[Greek example not reproduced]

…but can take some very different forms (including the hook) depending on the characters
surrounding it.
You also have the reverse problem of the same shape meaning different things. Chinese characters
often have more than one meaning or pronunciation. Does each different meaning get its own
character code? The letter Å can either be a letter in some Scandinavian languages or the symbol for
the Angstrom unit. Do these two uses get different codes, or is it the same character in both places?
What about this character:

[character not reproduced]
Is this the number 3 or the Russian letter z? Do these share the same character code just because they
happen to look a lot like each other?
For all of these reasons and many others like them, Unicode is more than just a collection of marks
on paper with numbers assigned to them. Every character has a story, and for every character or
group of characters, someone had to sit down and decide whether it was the same as or different from
the other characters in Unicode, whether several related marks got assigned a single number or
several, just what a series of numbers in computer memory would look like when you draw them on
the screen, just how a series of marks on a page would translate into a series of numbers in computer
memory when neither of these mappings was straightforward, how a computer performing various
type of processes on a series of Unicode character codes would do its job, and so on.
So for every character code in the Unicode standard, there are rules about what it means, how it
should look in various situations, how it gets arranged on a line of text with other characters, what
other characters are similar but different, how various text-processing operations should treat it, and
so on. Multiply all these decisions by 94,140 unique character codes, and you begin both to get an
idea of why the standard is so big, and of just how much labor, how much energy, and how much
heartache, on the part of so many people, went into this thing. Unicode is the largest, most
comprehensive, and most carefully designed standard of its type, and the toil of hundreds of people
made it that way.
For a long time, the Unicode standard was not only the definitive source on Unicode, it was the
only source. The problem with this is that the Unicode standard is just that: a standard. Standards
documents are written with people who will implement the standard as their audience. They
assume extensive domain knowledge and are designed to define as precisely as possible every
aspect of the thing being standardized. This makes sense: the whole purpose of a standard is to
ensure that a diverse group of corporations and institutions all do some particular thing in the same
way so that things produced by these different organizations can work together properly. If there
are holes in the definition of the standard, or passages that are open to interpretation, you could
wind up with implementations that conform to the standard, but still don’t work together properly.
Because of this, and because they’re generally written by committees whose members have
different, and often conflicting, agendas, standards tend by their very nature to be dry, turgid,
legalistic, and highly technical documents. They also tend to be organized in a way that
presupposes considerable domain knowledge—if you’re coming to the topic fresh, you’ll often
find that to understand any particular chapter of a standard, you have to read every other chapter first.
The Unicode standard is better written than most, but it’s still good bedtime reading—at least if
you don’t mind having nightmares about canonical reordering or the bi-di algorithm. That’s where
this book comes in. It’s intended to act as a companion to the Unicode standard and supplement it
by doing the following:
• Provide a more approachable, and more pedagogically organized, introduction to the salient features of the Unicode standard.
• Capture in book form changes and additions to the standard since it was last published in book form, and additions and adjuncts to the standard that haven’t been published in book form.
• Fill in background information about the various features of the standard that is beyond the scope of the standard itself.
• Provide an introduction to each of the various writing systems Unicode represents and the encoding and implementation challenges presented by each.
• Provide useful information on implementing various aspects of the standard, or using existing implementations.
My hope is to provide a good enough introduction to “the big picture” and the main components of
the technology that you can easily make sense of the more detailed descriptions in the standard
itself—or know you don’t have to.
This book is for you if you’re a programmer using any technology that depends on the Unicode
standard for something. It will give you a good introduction to the main concepts of Unicode,
helping you to understand what’s relevant to you and what things to look for in the libraries or
APIs you depend on.
This book is also for you if you’re doing programming work that actually involves implementing
part of the Unicode standard and you’re still relatively new either to Unicode itself or to software
internationalization in general. It will give you most of what you need and enough of a foundation
to be able to find complete and definitive answers in the Unicode standard and its technical
reports.
Chapter 1, the chapter you’re reading, is the book’s introduction. It gives a very high-level
account of the problem Unicode is trying to solve, the goals and non-goals behind the standard,
and the complexity of the problem. It also sets forth the goals and organization of this book.
Chapter 2 puts Unicode in historical context and relates it to the various other character encoding
standards out there. It discusses ISO 10646, Unicode’s sister standard, and its relationship to the
Unicode standard.
Chapter 3 provides a more complete architectural overview. It outlines the structure of the
standard, Unicode’s guiding design principles, and what it means to conform to the Unicode
standard.
Often, it takes two or more Unicode character codes to get a particular effect, and some effects can
be achieved with two or more different sequences of codes. Chapter 4 talks more about this
concept, the combining character sequence, and the extra rules that specify how to deal with
combining character sequences that are equivalent.
Every character in Unicode has a large set of properties that define its semantics and how it should
be treated by various processes. These are all set forth in the Unicode Character Database, and
Chapter 5 introduces the database and all of the various character properties it defines.
The Unicode standard is actually in two layers: A layer that defines a transformation between
written text and a series of abstract numeric codes, and a layer that defines a transformation
between those abstract numeric codes and patterns of bits in memory or in persistent storage. The
lower layer, from abstract numbers to bits, comprises several different mappings, each optimized
for different situations. Chapter 6 introduces and discusses these mappings.
For example, in Chapter 7 we look at the scripts used to write various European languages. These
scripts generally don’t pose any interesting ordering or shaping problems, but are the only scripts
that have special upper- and lower-case forms. They’re all descended from (or otherwise related
to) the ancient Greek alphabet. This group includes the Latin, Greek, Cyrillic, Armenian, and
Georgian alphabets, as well as various collections of diacritical marks and the International
Phonetic Alphabet.
Chapter 8 looks at the scripts of the Middle East. The biggest feature shared by these scripts is
that they’re written from right to left rather than left to right. They also tend to use letters only for
consonant sounds, using separate marks around the basic letters to represent the vowels. Two
scripts in this group are cursively connected, even in printed text, which poses interesting
representational problems. These scripts are all descended from the ancient Aramaic alphabet.
This group includes the Hebrew, Arabic, Syriac, and Thaana alphabets.
Chapter 9 looks at the scripts of India and Southeast Asia. The letters in these scripts knot
together into clusters that represent whole syllables. The scripts in this group all descend from the
ancient Brahmi script. This group includes the Devanagari script used to write Hindi and Sanskrit,
plus eighteen other scripts, including such things as Thai and Tibetan.
In Chapter 10, we look at the scripts of East Asia. The interesting thing here is that these
scripts comprise tens of thousands of unique, and often complicated, characters (the exact number
is impossible to determine, and new characters are coined all the time). These characters are
generally all the same size, don’t combine with each other, and can be written either from left to
right or vertically. This group includes the Chinese characters and various other writing systems
that either are used with Chinese characters or arose under their influence.
While most of the written languages of the world are written using a writing system that falls into
one of the above groups, not all of them do. Chapter 11 discusses the other scripts, including
Mongolian, Ethiopic, Cherokee, and the Unified Canadian Aboriginal Syllabics, a set of characters
used for writing a variety of Native American languages. In addition to the modern scripts,
Unicode also encodes a growing number of scripts that are not used anymore but are of scholarly
interest. The current version of Unicode includes four of these, which are also discussed in
Chapter 11.
But of course, you can’t write only with the characters that represent the sounds or words of
spoken language. You also need things like punctuation marks, numbers, symbols, and various
other non-letter characters. These are covered in Chapter 12, along with various special
formatting and document-structure characters.
Chapter 13 opens the book’s implementation-oriented material, discussing a group of generic data structures
and techniques that are useful for various types of processes that operate on Unicode text.
Chapter 14 goes into detail on how to perform various types of transformations and conversions
on Unicode text. This includes converting between the various Unicode serialization formats,
performing Unicode compression and decompression, performing Unicode normalization,
converting between Unicode and other encoding standards, and performing case mapping and case
folding.
Chapter 15 zeros in on two of the most important text-analysis processes: searching and sorting. It talks
about both language-sensitive and language-insensitive string comparison and how searching and
sorting algorithms build on language-sensitive string comparison.
Chapter 16 discusses the most important operations performed on text: drawing it on the screen
(or other output devices) and accepting it as input, otherwise known as rendering and editing. It
talks about dividing text up into lines, arranging characters on a line, figuring out what shape to use
for a particular character or sequence of characters, and various special considerations one must
deal with when writing text-editing software.
Finally, in Chapter 17 we look at the place where Unicode intersects with other computer
technologies. It discusses Unicode and the Internet, Unicode and various programming languages,
Unicode and various operating systems, and Unicode and database technology.
CHAPTER 2 A Brief History of Character Encoding
To understand Unicode fully, it’s helpful to have a good sense of where we came from, and what this
whole business of character encoding is all about. Unicode didn’t just spring fully-grown from the
forehead of Zeus; it’s the latest step in a history that actually predates the digital computer, having its
roots in telecommunications. Unicode is not the first attempt to solve the problem it solves, and
Unicode is also in its third major revision. To understand the design decisions that led to Unicode
3.0, it’s useful to understand what worked and what didn’t work in Unicode’s many predecessors.
This chapter is entirely background—if you want to jump right in and start looking at the features and
design of Unicode itself, feel free to skip this chapter.
Prehistory
Fortunately, unlike, say, written language itself, the history of electronic (or electrical)
representations of written language doesn’t go back very far. This is mainly, of course, because the
history of the devices using these representations doesn’t go back very far.
The modern age of information technology does, however, start earlier than one might think at first—
a good century or so before the advent of the modern digital computer. We can usefully date the
beginning of modern information technology from Samuel Morse’s invention of the telegraph in
1837.3
3 My main source for this section is Tom Jennings, “Annotated History of Character Codes,” found at http://www.wps.com/texts/codes.
The telegraph, of course, is more than just an interesting historical curiosity. Telegraphic
communication has never really gone away, although it’s morphed a few times. Even long after the
invention and popularization of the telephone, the successors of the telegraph continued to be used to
send written communication over a wire. Telegraphic communication was used to send large
volumes of text, especially when it needed ultimately to be in written form (news stories, for
example), or when human contact wasn’t especially important and saving money on bandwidth was
very important (especially for routine business communications such as orders and invoices). These
days, email and EDI are more or less the logical descendants of the telegraph.
Morse’s original approach, in which messages were sent as numeric codes that were looked up in a
codebook on the receiving end, had died out by the time of his famous “WHAT HATH GOD WROUGHT”
demonstration in 1844. By this time, the device was being used to send the early version of what we
now know as “Morse code,” which was probably actually devised by Morse’s assistant Alfred Vail.
Morse code was in no way digital in the sense we think of the term—you can’t easily turn it into a
stream of 1 and 0 bits the way you can with many of the succeeding codes. But it was “digital” in the
sense that it was based on a circuit that had only two states, on and off. This is really Morse’s big
innovation; there were telegraph systems prior to Morse, but they were based on sending varying
voltages down the line and deflecting a needle on a gauge of some kind.4 The beauty of Morse’s
scheme is a higher level of error tolerance—it’s a lot easier to tell “on” from “off” than it is to tell
“half on” from “three-fifths on.” This, of course, is also why modern computers are based on binary
numbers.
The difference is that Morse code is based not on a succession of “ons” and “offs,” but on a
succession of “ons” of different lengths, and with some amount of “off” state separating them. You
basically had two types of signal, a long “on” state, usually represented with a dash and pronounced
“dah,” and a short “on” state, usually represented by a dot and pronounced “dit.” Individual letters
were represented with varying-length sequences of dots and dashes.
The lengths of the codes were designed to correspond roughly to the relative frequencies of the
characters in a transmitted message. Letters were represented with anywhere from one to four dots
and dashes. The two one-signal letters were the two most frequent letters in English: E was
represented with a single dot and T with a single dash. The four two-signal letters were I (. .), A (. –),
N (– .), and M (– –). The least common letters were represented with the longest codes: Z (– – . .),
Y (– . – –), J (. – – –), and Q (– – . –). Digits were represented with sequences of five signals, and
punctuation, which was used sparingly, was represented with sequences of six signals.
The dots and dashes were separated by just enough space to keep everything from running together,
individual characters by longer spaces, and words by even longer spaces.
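To make the variable-length idea concrete, here’s a minimal sketch in Java. The table holds only a handful of the standard assignments, just enough to show frequent letters getting short codes; it’s an illustration, not a full Morse implementation:

    import java.util.Map;

    public class MorseSketch {
        // A partial table: common letters get short codes, rare ones long.
        static final Map<Character, String> CODES = Map.of(
            'E', ".",    'T', "-",        // most frequent: one signal each
            'I', "..",   'A', ".-",       // next tier: two signals
            'N', "-.",   'M', "--",
            'J', ".---", 'Q', "--.-"      // rare letters: four signals
        );

        public static void main(String[] args) {
            StringBuilder out = new StringBuilder();
            for (char c : "TEAM".toCharArray()) {
                out.append(CODES.get(c)).append(' ');  // short gap between letters
            }
            System.out.println(out.toString().trim()); // - . .- --
        }
    }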
4 This tidbit comes from Steven J. Searle, “A Brief History of Character Codes,” found at http://tronweb.super-nova.co.jp/characcodehist.html.
As an interesting sidelight, the telegraph was designed as a recording instrument—the signal operated
a solenoid that caused a stylus to dig shallow grooves in a moving strip of paper. (The whole thing
with interpreting Morse code by listening to beeping like we’ve all seen in World War II movies
came later, with radio, although experienced telegraph operators could interpret the signal by
listening to the clicking of the stylus.) This is a historical antecedent to the punched-tape systems
used in teletype machines and early computers.
The next major step was the printing telegraph patented by Émile Baudot in 1874. Baudot’s
system didn’t use a typewriter keyboard; it used a pianolike keyboard with five keys, each
of which controlled a separate electrical connection. The operator operated two keys with the left
hand and three with the right and sent each character by pressing down some combination of these
five keys simultaneously. (So the “chording keyboards” that are somewhat in vogue today as a way
of combating RSI aren’t a new idea—they go all the way back to 1874.)
The code for each character is thus some combination of the five keys, so you can think of it as a
five-bit code. Of course, this only gives you 32 combinations to play with, kind of a meager
allotment for encoding characters. You can’t even get all the letters and digits in 32 codes.
The solution to this problem has persisted for many years since: you have two separate sets of
characters assigned to the various key combinations, and you steal two key combinations to switch
between them. So you end up with a LTRS bank, consisting of twenty-eight letters (the twenty-six
you’d expect, plus two French letters), and a FIGS bank, consisting of twenty-eight characters: the
ten digits and various punctuation marks and symbols. The three left-hand-only combinations mean
the same thing in both banks: two of them switch back and forth between LTRS and FIGS, and one (both left-
hand keys together) was used to mean “ignore the last character” (this later evolves into the ASCII
DEL character). The thirty-second combination, no keys at all, of course didn’t mean anything—no
keys at all was what separated one character code from the next. (The FIGS and LTRS signals
doubled as spaces.)
So you’d go along in LTRS mode, sending letters. When you came to a number or punctuation mark,
you’d send FIGS, send the number or punctuation, then send LTRS and go back to sending words
again. “I have 23 children.” would thus get sent as a sequence along the lines of
I HAVE ⟨FIGS⟩ 23 ⟨LTRS⟩ CHILDREN ⟨FIGS⟩ .
It would have been possible, of course, to get the same effect by just adding a sixth key, but this was
considered too complicated mechanically.
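A minimal sketch of that bookkeeping in Java follows; the five-bit values, including the two shift codes, are placeholders rather than Baudot’s actual chart, since only the bank-switching logic is the point:

    import java.util.ArrayList;
    import java.util.List;

    public class ShiftStateSketch {
        static final int LTRS = 0x1F;  // placeholder code for the LTRS shift
        static final int FIGS = 0x1B;  // placeholder code for the FIGS shift

        static List<Integer> encode(String text) {
            List<Integer> out = new ArrayList<>();
            boolean letters = true;                    // assume we start in LTRS
            for (char c : text.toUpperCase().toCharArray()) {
                boolean needLetters = Character.isLetter(c) || c == ' ';
                if (needLetters != letters) {          // shift only when the bank changes
                    out.add(needLetters ? LTRS : FIGS);
                    letters = needLetters;
                }
                out.add(c & 0x1F);                     // fake five-bit value for the demo
            }
            return out;
        }

        public static void main(String[] args) {
            // Letters ride in the LTRS bank; "23" and "." ride in FIGS.
            System.out.println(encode("I have 23 children."));
        }
    }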
Even though Baudot’s code (actually invented by Carl Friedrich Gauss and Wilhelm Weber) can be thought
of as a series of five-bit numbers, it isn’t laid out like you might lay out a similar thing today: If you
lay out the code charts according to the binary-number order, it looks jumbled. As with Morse code,
characters were assigned to key combinations in such a way as to minimize fatigue, both to the
operator and to the machinery. More-frequent characters, for example, used fewer fingers than less-
frequent characters.
The next step was the code devised by Donald Murray around the turn of the twentieth century for a
machine with a typewriter-style keyboard. A couple of interesting developments occur first in the
Murray code: you see the debut of what later
became known as “format effectors” or “control characters”—the CR and LF codes, which, respectively,
return the typewriter carriage to the beginning of the line and advance the platen by one
line.5 Two codes from Baudot also move to the positions where they have stayed ever since (at least until the
introduction of Unicode, by which time the positions no longer mattered): the NULL or BLANK all-
bits-off code and the DEL all-bits-on code. All bits off, fairly logically, meant that the receiving
machine shouldn’t do anything; it was essentially used as an idle code for when no messages were
being sent. On real equipment, you also often had to pad codes that took a long time to execute with
NULLs: If you issued a CR, for example, it’d take a while for the carriage to return to the home
position, and any “real” characters sent during this time would be lost, so you’d send a bunch of extra
NULLs after the CR. This would put enough space after the CR so that the real characters wouldn’t
go down the line until the receiving machine could print them, and not have any effect (other than
maybe to waste time) if the carriage got there while they were still being sent.
The DEL character would also be ignored by the receiving equipment. The idea here is that if
you’re using paper tape as an intermediate storage medium (as we still do today with email, it became common
to compose a message while off line, storing it on paper tape, and then log on and send the message
from the paper tape, rather than “live” from the keyboard) and you make a mistake, the only way to
blank out the mistake is to punch out all the holes in the line with the mistake. So a row with all the
holes punched out (or a character with all the bits set, as we think of it today) was treated as a null
character.
Murray’s code forms the basis of most of the various telegraphy codes of the next fifty years or so.
Western Union picked it up and used it (with a few changes) as its encoding method all the way
through the 1950s. The CCITT (Consultative Committee for International Telephone and Telegraph,
an international standards body) picked up the Western Union code and, with a few changes, blessed it as
an international standard, International Telegraphy Alphabet #2 (“ITA2” for short).
The ITA2 code is often referred to today as “Baudot code,” although it’s significantly different from
Baudot’s code. It does, however, retain many of the most important features of Baudot’s code.
Among the interesting differences between Murray’s code and ITA2 is the addition of more
“control codes”: you see the introduction of an explicit space character, rather than using the all-bits-
off signal, or the LTRS and FIGS signals, as spaces. There’s a new BEL signal, which rings a bell or
5 I’m taking my cue from Jennings here: these code positions were apparently marked “COL” and “LINE PAGE” originally; Jennings extrapolates back from later codes that had CR and LF in the same positions and assumes that “COL” and “LINE PAGE” were alternate names for the same functions.
produces some other audible signal on the receiving end. And you see the first case of a code that
exists explicitly to control the communications process itself—the WRU, or “Who are you?” code,
which would cause the receiving machine to send some identifying stream of characters back to the
sending machine (this enabled the sender to make sure he was connected to the right receiver before
sending sensitive information down the wire, for example).
This is where inertia set in across the industry. By the time ITA2 and its national variants came into use
in the 1930s, you had a significant number of teletype machines out there, and you had the weight of
an international standard behind one encoding method. The ITA2 code would be the code used by
teletype machines right on into the 1960s. When computers started communicating with the outside
world in real time using some kind of terminal, the terminal would be a teletype machine, and the
computer would communicate with it using the teletype codes of the day. (The other main way
computers dealt with alphanumeric data was through the use of punched cards, which had their own
encoding schemes we’ll look at in a minute.)
By the late 1950s, the computer and telecommunications industries were both starting to chafe under
the limitations of the six-bit teletype and punched-card codes of the day, and a new standards effort
was begun that eventually led to the ASCII code. An important predecessor of ASCII was the
FIELDATA code used in various pieces of communications equipment designed and used by the
U.S. Army starting in 1957. (It bled out into civilian life as well; UNIVAC computers of the day
were based on a modified version of the FIELDATA code.)
FIELDATA code was a seven-bit code, but was divided into layers in such a way that you could
think of it as a four-bit code with either two or three control bits appended, similar to punched-card
codes. It’s useful7 to think of it as having a five-bit core somewhat on the ITA2 model with two
control bits. The most significant, or “tag” bit, is used similarly to the LTRS/FIGS bit discussed
before: it switches the other six bits between two banks: the “alphabetic” and “supervisory” banks.
The next-most-significant bit shifted the five core bits between two sub-banks: the alphabetic bank
was shifted between upper-case and lower-case sub-banks, and the supervisory bank between a
supervisory and a numeric/symbols sub-bank.
7 Or at least I think it’s useful, looking at the code charts—my sources don’t describe things this way.
In essence, the extra bit made it possible to include both upper- and lower-case letters for the first
time, and also made it possible to include a wide range of control, or supervisory, codes, which were
used for things like controlling the communication protocol (there were various handshake and
termination codes) and separating various units of variable-length structured data.
Also, within each five-bit bank of characters, we finally see the influence of computer technology:
the characters are ordered in binary-number order.
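In code, that layered reading amounts to a few shifts and masks. The field names below are mine (my sources don’t name them), and the sample value is arbitrary:

    public class FieldataLayersSketch {
        public static void main(String[] args) {
            int code = 0b1011010;                // an arbitrary seven-bit value
            int tag     = (code >> 6) & 0x1;     // bank: alphabetic vs. supervisory
            int subBank = (code >> 5) & 0x1;     // sub-bank: e.g., upper vs. lower case
            int core    = code & 0x1F;           // five-bit core, ITA2-style
            System.out.printf("tag=%d subBank=%d core=0x%02X%n", tag, subBank, core);
        }
    }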
FIELDATA had a pronounced effect on ASCII, which was being designed at the same time.
Committee X3.4 of the American Standards Association (now the American National Standards
Institute, or ANSI) had been convened in the late 1950s, about the time FIELDATA was deployed,
and consisted of representatives from AT&T, IBM (which, ironically, didn’t actually use ASCII until
the IBM PC came out in 1981), and various other companies from the computer and
telecommunications industries.
The first result of this committee’s efforts was what we now know as ANSI X3.4-1963, the first
version of the American Standard Code for Information Interchange, our old friend ASCII (which,
for the three readers of this book who don’t already know this, is pronounced “ASS-key”). ASCII-
1963 kept the overall structure of FIELDATA and regularized some things. For example, it’s a pure
7-bit code and is laid out in such a way as to make it possible to reasonably sort a list into
alphabetical order by comparing the numeric character codes directly. ASCII-1963 didn’t officially
have the lower-case letters in it, although there was a big undefined space where they were obviously
going to go, and it had a couple of weird placements, such as a few control codes up in the printing-
character range. These were fixed in the next version of ASCII, ASCII-1967, which is more or less
the ASCII we all know and love today. It standardized the meanings of some of the control codes
that were left open in ASCII-1963, moved all of the control codes (except for DEL, which we talked
about) into the control-code area, and added the lower-case letters and some special programming-
language symbols. It also discarded a couple of programming-language symbols from ASCII-1963
in a bow to international usage: the upward-pointing arrow used for exponentiation turned into the
caret (which did double duty as the circumflex accent), and the left-pointing arrow (used sometimes
to represent assignment) turned into the underscore character.
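That design decision is why a plain numeric sort still alphabetizes ASCII text. In Java, for instance, String.compareTo() compares raw character codes, so no locale knowledge is needed to order uniform-case ASCII words:

    import java.util.Arrays;

    public class AsciiOrderSketch {
        public static void main(String[] args) {
            System.out.println((int) 'A');  // 65
            System.out.println((int) 'B');  // 66: numeric order is alphabetical order

            String[] words = { "delta", "alpha", "charlie", "bravo" };
            Arrays.sort(words);             // compareTo() compares the char codes
            System.out.println(Arrays.toString(words));
            // [alpha, bravo, charlie, delta]
        }
    }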
Punched cards date back at least as far as 1801, when Joseph-Marie Jacquard used them to control
automatic weaving machines, and Charles Babbage had proposed adapting Jacquard’s punched cards
for use in his “analytical engine.” But punched cards weren’t actually used for data processing until
the 1880s when Herman Hollerith, a U.S. Census Bureau employee, devised a method of using
punched cards to collate and tabulate census data. His punched cards were first used on a national
scale in the 1890 census, dramatically speeding the tabulation process: The 1880 census figures,
calculated entirely by hand, had taken seven years to tabulate. The 1890 census figures took six
weeks. Flush with the success of the 1890 census, Hollerith formed the Tabulating Machine
company in 1896 to market his punched cards and the machines used to punch, sort, and count them.
This company eventually merged with two others, diversified into actual computers, and grew into
what we now know as the world’s largest computer company, International Business Machines.8
The modern IBM rectangular-hole punched card was first introduced in 1928 and has since become
the standard punched-card format. It had 80 columns, each representing a single character (it’s no
coincidence that most text-based CRT terminals had 80-column displays). Each column had twelve
punch positions. This would seem to indicate that Hollerith code, the means of mapping from punch
positions to characters, was a 12-bit code, which would give you 4,096 possible combinations.
It wasn’t. This is because you didn’t actually want to use all the possible combinations of punch
positions—doing so would put too many holes in the card and weaken its structural integrity (not a
small consideration when these things are flying through the sorting machine at a rate of a few dozen
a second). The system worked like this: You had 10 main rows of punch holes, numbered 0 through
9, and two “zone” rows, 11 and 12 (row 0 also did double duty as a zone row). The digits were
represented by single punches in rows 0 through 9, and the space was represented with no punches at
all.
The letters were represented with two punches: a punch in row 11, 12, or 0 plus a punch in one of the
rows from 1 to 9. This divides the alphabet into three nine-letter “zones.” A few special characters
were represented with the extra couple one-punch combinations and the one remaining two-punch
combination. The others were represented with three-punch combinations: row 8 would be pressed
into service as an extra zone row and the characters would be represented with a combination of a
punch in row 8, a punch in row 11, 12, or 0, and a punch in one of the rows from 1 to 7 (row 9 wasn’t
used). This gave you three banks of seven symbols each (four banks if you include a bank that used row
8 as a zone row without an additional punch in rows 11, 12, or 0). All together, you got a grand total
of 67 unique punch combinations. Early punched card systems didn’t use all of these combinations,
but later systems filled in until all of the possible combinations were being used.
In memory, the computers used a six-bit encoding system tied to the punch patterns: The four least-
significant bits would specify the main punch row (meaning 10 of the possible 16 combinations
would be used), and the two most-significant bits would identify which of the zone rows was
punched. (Actual systems weren’t always quite so straightforward, but this was the basic idea.)
This was sufficient to reproduce all of the one- and two-punch combinations in a straightforward
manner, and was known as the Binary Coded Decimal Interchange Code, BCDIC, or just BCD for
short.
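A sketch of that idealized composition (real systems, as noted, weren’t always this straightforward, and the zone numbering here is just one plausible convention for illustration):

    public class BcdSketch {
        // zone: 0 = no zone punch, 1-3 = one of the zone rows (a made-up numbering)
        static int bcd(int zone, int digitRow) {
            return (zone << 4) | digitRow;   // two zone bits + four digit-row bits
        }

        public static void main(String[] args) {
            System.out.printf("digit 5, no zone   -> 0x%02X%n", bcd(0, 5));
            System.out.printf("zone punch + row 1 -> 0x%02X%n", bcd(3, 1));
        }
    }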
IBM added two more bits to BCD to form the Extended Binary Coded Decimal Interchange Code, or
EBCDIC (pronounced “EB-suh-dick”), which first appeared with the introduction of the System/360
in 1964. It was backward compatible with the BCD system used in the punched cards, but added the
lower-case letters and a bunch of control codes borrowed from ASCII-1963 (you didn’t really need
control codes in a punched-card system, since the position of the columns on the cards, or the
division of information between cards, gave you the structure of the data and you didn’t need codes
for controlling the communication session). This code was designed both to allow a simple mapping
from character codes to punch positions on a punched card and, like ASCII, to produce a reasonable
sorting order when the numeric codes were used to sort character data (note that this doesn’t mean
you get the same order as ASCII—digits sort after letters instead of before them, and lower case sorts
before upper case instead of after).
8 Much of the information in the preceding paragraph is drawn from the Searle article; the source for the rest of this section is Douglas W. Jones, “Punched Cards: An Illustrated Technical History” and “Doug Jones’ Punched Card Codes,” both found at http://www.cs.uiowa.edu/~jones/cards.
One consequence of EBCDIC’s lineage is that the three groups of nine letters in the alphabet that you
had in the punched-card codes are numerically separated from each other in the EBCDIC encoding: I
is represented by 0xC9 and J by 0xD1, leaving an eight-space gap in the numerical sequence. The
original version of EBCDIC encoded only 50 characters in an 8-bit encoding space, leaving a large
number of gaping holes with no assigned characters. Later versions of EBCDIC filled in these holes
in various ways, but retained the backward compatibility with the old punched-card system.
Although ASCII is finally taking over in IBM’s product line, EBCDIC still survives in the current
models of IBM’s System/390, even though punched cards are long obsolete. Backward compatibility
is a powerful force.
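One practical consequence: code that assumes the letters occupy contiguous code values, as they do in ASCII, breaks on EBCDIC. A small illustration:

    public class EbcdicGapSketch {
        public static void main(String[] args) {
            int i = 0xC9, j = 0xD1;      // EBCDIC 'I' and 'J'
            System.out.println(j - i);   // prints 8, not 1
            // So an ASCII-style test like (c >= 'A' && c <= 'Z') matches
            // unassigned code points between the letter groups and is not
            // a safe "is this a letter?" test in EBCDIC.
        }
    }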
A couple interesting things happened to ASCII on its way to turning into ISO 646. First, it
formalized a system for applying the various accent and other diacritical marks used in European
languages to the letters. It did this not by introducing accented variants of all the letters—there was no
room for that—but by pressing a bunch of punctuation marks into service as diacritical marks. The
apostrophe did double duty as the acute accent, the opening quote mark as the grave accent, the
double quotation mark as the diaeresis (or umlaut), the caret as the circumflex accent, the swung dash
as the tilde, and the comma as the cedilla. To produce an accented letter, you’d follow the letter in
the code sequence with a backspace code and then the appropriate “accent mark.” On a teletype
machine, this would cause the letter to be overstruck with the punctuation mark, producing an ugly
but serviceable version of the accented letter.
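So the byte sequence for an acute-accented e under this convention is simply the letter, a backspace, and the apostrophe; a sketch:

    public class OverstrikeSketch {
        public static void main(String[] args) {
            byte[] eAcute = { (byte) 'e', 0x08, (byte) '\'' };   // e, BS, apostrophe
            for (byte b : eAcute) System.out.printf("%02X ", b); // 65 08 27
            System.out.println();
        }
    }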
The other thing is that ISO 646 leaves the definitions of twelve characters open (in ASCII, these
twelve characters are #, $, @, [, \, ], ^, `, {, |, }, and ~). These are called the “national use” code
positions. There’s an International Reference Version of ISO 646 that gives the American meanings
to the corresponding code points, but other national bodies were free to assign other characters to
these twelve code values. (Some national bodies did put accented letters in these slots.) The various
national variants have generally fallen out of use in favor of more-modern standards like ISO 8859
(see below), but vestiges of the old system still remain; for example, in many Japanese systems’
treatment of the code for \ as the code for ¥.9
9 The preceding information comes partially from the Jennings paper, and partially from Roman Czyborra, “Good Old ASCII,” found at http://www.czyborra.com/charsets/iso646.html.
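A sketch of the practical effect: the byte 0x5C carries no fixed meaning by itself, and what you display depends entirely on which variant you assume is in use:

    public class NationalUseSketch {
        public static void main(String[] args) {
            int b = 0x5C;
            System.out.println("ISO 646 IRV / ASCII: " + (char) b);  // backslash
            System.out.println("JIS X 0201 Roman:    " + '\u00A5');  // yen sign
        }
    }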
Single-byte encoding systems
ASCII gave us the eight-bit byte. The earlier computer encoding systems (except for FIELDATA, which can be
considered an embryonic version of ASCII) used a six-bit byte. ASCII extended that to seven, and
most communication protocols tacked on an eighth bit as a parity-check bit. As the parity bit became
less necessary (especially in computer memory) and 7-bit ASCII codes were stored in eight-bit
computer bytes, it was only natural that the 128 bit combinations not defined by ASCII (the ones with
the high-order bit set) would be pressed into service to represent more characters.
This was anticipated as early as 1971, when the ECMA-35 standard was first published.10 This
standard later became ISO 2022. ISO 2022 sets forth a standard method of organizing the code
space for various character encoding methods. An ISO-2022-compliant character encoding can have
up to two sets of control characters, designated C0 and C1, and up to four sets of printing (“graphic”)
characters, designated G0, G1, G2, and G3.
The encoding space can either be seven or eight bits wide. In a seven-bit encoding, the byte values
from 0x00 to 0x1F are reserved for the C0 controls and the byte values from 0x20 to 0x7F are used
for the G0, G1, G2, and G3 sets. The range defaults to the G0 set, and escape sequences can be used
to switch it to one of the other sets.
In an eight-bit encoding, the range of values from 0x00 to 0x1F (the “CL area”) is the C0 controls
and the range from 0x80 to 0x9F (the “CR area”) is the C1 controls. The range from 0x20 to 0x7F
(the “GL area”) is always the G0 characters, and the range from 0xA0 to 0xFF (the “GR area”) can
be switched between the G1, G2, and G3 characters.
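Classifying a byte into these four areas is a simple range test; a sketch of the eight-bit layout just described:

    public class Iso2022AreasSketch {
        static String area(int b) {
            if (b <= 0x1F) return "CL: C0 controls";
            if (b <= 0x7F) return "GL: G0 graphic characters";
            if (b <= 0x9F) return "CR: C1 controls";
            return "GR: G1/G2/G3 graphic characters";
        }

        public static void main(String[] args) {
            for (int b : new int[] { 0x1B, 0x41, 0x85, 0xE9 }) {
                System.out.printf("0x%02X -> %s%n", b, area(b));
            }
        }
    }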
Control functions (i.e., signals to the receiving process, as opposed to printing characters) can be
represented either as single control characters or as sequences of characters usually beginning with a
control character (usually that control character is ESC and the multi-character control function is an
“escape sequence”). ISO 2022 uses escape sequences to represent C1 controls in the 7-bit systems,
and to switch the GR and GL areas between the various sets of printing characters in both the 7- and
8-bit versions. It also specifies a method of using escape sequences to associate the various areas of
the encoding space (C0, C1, G0, G1, G2, and G3) with actual sets of characters.
ISO 2022 doesn’t actually apply semantics to most of the code positions—the big exceptions are ESC,
DEL, and the space, which are given the positions they have in ASCII. Other than these, character
semantics are taken from other standards, and escape sequences can be used to switch an area’s
interpretation from one standard to another (there’s a registry of auxiliary standards and the escape
sequences used to switch between them).
An ISO 2022-based standard can impose whatever semantics it wants on the various encoding areas
and can choose to use or not use the escape sequences for switching things. As a practical matter,
most ISO 2022-derived standards put the ISO 646 IRV (i.e., US ASCII) printing characters in the G0
area and the C0 control functions from ISO 6429 (an ISO standard that defines a whole mess of
control functions, along with standardized C0 and C1 sets—the C0 set is the same as the ASCII
control characters) in the C0 area.
10 The information in this section is taken from Peter K. Edberg, “Survey of Character Encodings,” Proceedings of the 13th International Unicode Conference, session TA4, September 9, 1998, and from a quick glance at the ECMA 35 standard itself.
ISO 8859
By far the most important of the ISO 2022-derived encoding schemes is the ISO 8859 family.11 The
ISO 8859 standard comprises fourteen separate ISO 2022-compliant encoding standards, each
covering a different set of characters for a different set of languages. Each of these counts as a
separate standard: ISO 8859-1, for example, is the near-ubiquitous Latin-1 character set you see on
the Web.
Work on the ISO 8859 family began in 1982 as a joint project of ANSI and ECMA. The first part
was originally published in 1985 as ECMA-94. This was adopted as ISO 8859-1, and a later
edition of ECMA-94 became the first four parts of ISO 8859. The other parts of ISO 8859 likewise
originated in various ECMA standards.
The ISO 8859 series is oriented, as one might expect, toward European languages and European
usage, as well as certain languages around the periphery of Europe that get used a lot there. It aimed
to do a few things: 1) Do away with the use of backspace sequences as the way to represent accented
characters (the backspace thing is workable on a teletype, but doesn’t work at all on a CRT terminal
without some pretty fancy rendering hardware), 2) do away with the varying meanings of the
“national use” characters from ISO 646, replacing them with a set of code values that would have the
same meaning everywhere and still include everyone’s characters, and 3) unify various other national
and vendor standards that were attempting to do the same thing.
All of the parts of ISO 8859 are based on the ISO 2022 structure, and all have a lot in common.
Each of them assigns the ISO 646 printing characters to the G0 range, and the ISO 6429 C0 and C1
control characters to the C0 and C1 ranges. This means that each of them, whatever else they
include, includes the basic Latin alphabet and is downward compatible with ASCII. (That is, pure 7-
bit ASCII text, when represented with 8-bit bytes, conforms to any of the 8859 standards.) Where
they differ is in their treatment of the G1 range (none of them defines anything in the G2 or G3 areas
or uses escape sequences to switch interpretations of any of the code points, although you can use the
registered ISO 2022 escape sequences to assign the G1 repertoire from any of these standards to the
various GR ranges in a generalized ISO 2022 implementation).
ISO 8859-1   Latin-1    Western European languages (French, German, Spanish, Italian, the Scandinavian languages, etc.)
ISO 8859-2   Latin-2    Eastern European languages (Czech, Hungarian, Polish, Romanian, etc.)
ISO 8859-3   Latin-3    Southern European languages (Maltese and Turkish, plus Esperanto)
ISO 8859-4   Latin-4    Northern European languages (Estonian, Latvian, Lithuanian, etc.)
ISO 8859-5   Cyrillic   Russian, Bulgarian, Ukrainian, Belarusian, Serbian, and Macedonian
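This shared structure is easy to check with any modern charset library. In Java, which ships decoders for most of the 8859 parts, the same ASCII bytes decode identically under every part, while bytes in the GR range diverge:

    import java.nio.charset.Charset;

    public class Iso8859Sketch {
        public static void main(String[] args) {
            byte[] bytes = { 0x48, 0x69, 0x21, (byte) 0xF8 };  // "Hi!" plus 0xF8
            System.out.println(new String(bytes, Charset.forName("ISO-8859-1")));
            System.out.println(new String(bytes, Charset.forName("ISO-8859-2")));
            // The three ASCII bytes decode the same both times; 0xF8 comes
            // out as ø (o-slash) in 8859-1 but as ř (r-caron) in 8859-2.
        }
    }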
11 Much of the information in this section is drawn from Roman Czyborra, “ISO 8859 Alphabet Soup,” found at http://www.czyborra.com/charsets/iso8859.html, supplemented with info from the ISO Web site, the ECMA 94 standard, and the Edberg article.