Unicode
Demystified

A Practical Programmer’s
Guide to the Encoding Standard

by Richard Gillam
Copyright ©2000–2002 by Richard T. Gillam. All rights reserved.

Pre-publication draft number 0.3.1

Tuesday, January 15, 2002


To Mark, Kathleen, Laura, Helena, Doug,
John F, John R, Markus, Bertrand, Alan, Eric,
and the rest of the old JTCSV Unicode crew,
without whom this book would not have been possible

and

To Ted and Joyce Gillam,


without whom the author would not have been possible
Table of Contents

Table of Contents v
Preface xv
About this book xvi
How this book was produced xviii
The author’s journey xviii
Acknowledgements xix
A personal appeal xx

Unicode in Essence: An Architectural Overview of the Unicode Standard 1
CHAPTER 1 Language, Computers, and Unicode 3
What Unicode Is 6
What Unicode Isn’t 8
The challenge of representing text in computers 10
What This Book Does 14
How this book is organized 15
Section I: Unicode in Essence 15
Section II: Unicode in Depth 16
Section III: Unicode in Action 17

CHAPTER 2 A Brief History of Character Encoding 19


Prehistory 19
The telegraph and Morse code 20


The teletypewriter and Baudot code 21


Other teletype and telegraphy codes 22
FIELDATA and ASCII 23
Hollerith and EBCDIC 24
Single-byte encoding systems 26
Eight-bit encoding schemes and the ISO 2022 model 27
ISO 8859 28
Other 8-bit encoding schemes 29
Character encoding terminology 30
Multiple-byte encoding systems 32
East Asian coded character sets 32
Character encoding schemes for East Asian coded character sets 33
Other East Asian encoding systems 36
ISO 10646 and Unicode 36
How the Unicode standard is maintained 41

CHAPTER 3 Architecture: Not Just a Pile of Code Charts 43


The Unicode Character-Glyph Model 44
Character positioning 47
The Principle of Unification 50
Alternate-glyph selection 53
Multiple Representations 54
Flavors of Unicode 56
Character Semantics 58
Unicode Versions and Unicode Technical Reports 60
Unicode Standard Annexes 60
Unicode Technical Standards 61
Unicode Technical Reports 61
Draft and Proposed Draft Technical Reports 61
Superseded Technical Reports 62
Unicode Versions 62
Unicode stability policies 63
Arrangement of the encoding space 64
Organization of the planes 64
The Basic Multilingual Plane 66
The Supplementary Planes 69
Non-Character code point values 72
Conforming to the standard 73
General 74
Producing text as output 75
Interpreting text from the outside world 75
Passing text through 76
Drawing text on the screen or other output devices 76
Comparing character strings 77
Summary 77

CHAPTER 4 Combining character sequences and Unicode normalization 79
How Unicode non-spacing marks work 81

Dealing properly with combining character sequences 83
Canonical decompositions 84
Canonical accent ordering 85
Double diacritics 87
Compatibility decompositions 88
Singleton decompositions 90
Hangul 91
Unicode normalization forms 93
Grapheme clusters 94

CHAPTER 5 Character Properties and the Unicode Character Database 99
Where to get the Unicode Character Database 99
The UNIDATA directory 100
UnicodeData.txt 103
PropList.txt 105
General character properties 107
Standard character names 107
Algorithmically-derived names 108
Control-character names 109
ISO 10646 comment 109
Block and Script 110
General Category 110
Letters 110
Marks 112
Numbers 112
Punctuation 113
Symbols 114
Separators 114
Miscellaneous 114
Other categories 115
Properties of letters 117
SpecialCasing.txt 117
CaseFolding.txt 119
Properties of digits, numerals, and mathematical symbols 119
Layout-related properties 120
Bidirectional layout 120
Mirroring 121
Arabic contextual shaping 122
East Asian width 122
Line breaking property 123
Normalization-related properties 124
Decomposition 124
Decomposition type 124
Combining class 126
Composition exclusion list 127
Normalization test file 127
Derived normalization properties 128
Grapheme-cluster-related properties 128
Unihan.txt 129


CHAPTER 6 Unicode Storage and Serialization Formats 131


A historical note 132
UTF-32 133
UTF-16 and the surrogate mechanism 134
Endian-ness and the Byte Order Mark 136
UTF-8 138
CESU-8 141
UTF-EBCDIC 141
UTF-7 143
Standard Compression Scheme for Unicode 143
BOCU 146
Detecting Unicode storage formats 147

Unicode in Depth: A Guided Tour of the Character Repertoire 149
CHAPTER 7 Scripts of Europe 151
The Western alphabetic scripts 151
The Latin alphabet 153
The Latin-1 characters 155
The Latin Extended A block 155
The Latin Extended B block 157
The Latin Extended Additional block 158
The International Phonetic Alphabet 159
Diacritical marks 160
Isolated combining marks 164
Spacing modifier letters 165
The Greek alphabet 166
The Greek block 168
The Greek Extended block 169
The Coptic alphabet 169
The Cyrillic alphabet 170
The Cyrillic block 173
The Cyrillic Supplementary block 173
The Armenian alphabet 174
The Georgian alphabet 175

CHAPTER 8 Scripts of The Middle East 177


Bidirectional Text Layout 178
The Unicode Bidirectional Layout Algorithm 181
Inherent directionality 181
Neutrals 184
Numbers 185
The Left-to-Right and Right-to-Left Marks 186
The Explicit Embedding Characters 187



Mirroring characters 188
Line and Paragraph Boundaries 188
Bidirectional Text in a Text-Editing Environment 189
The Hebrew Alphabet 192
The Hebrew block 194
The Arabic Alphabet 194
The Arabic block 199
Joiners and non-joiners 199
The Arabic Presentation Forms B block 201
The Arabic Presentation Forms A block 202
The Syriac Alphabet 202
The Syriac block 204
The Thaana Script 205
The Thaana block 207

CHAPTER 9 Scripts of India and Southeast Asia 209


Devanagari 212
The Devanagari block 217
Bengali 221
The Bengali block 223
Gurmukhi 223
The Gurmukhi block 225
Gujarati 225
The Gujarati block 226
Oriya 226
The Oriya block 227
Tamil 227
The Tamil block 230
Telugu 230
The Telugu block 232
Kannada 232
The Kannada block 233
Malayalam 234
The Malayalam block 235
Sinhala 235
The Sinhala block 236
Thai 237
The Thai block 238
Lao 239
The Lao block 240
Khmer 241
The Khmer block 243
Myanmar 243
The Myanmar block 244
Tibetan 245
The Tibetan block 247
The Philippine Scripts 247

CHAPTER 10 Scripts of East Asia 251


The Han characters 252


Variant forms of Han characters 261


Han characters in Unicode 263
The CJK Unified Ideographs area 267
The CJK Unified Ideographs Extension A area 267
The CJK Unified Ideographs Extension B area 267
The CJK Compatibility Ideographs block 268
The CJK Compatibility Ideographs Supplement block 268
The Kangxi Radicals block 268
The CJK Radicals Supplement block 269
Ideographic description sequences 269
Bopomofo 274
The Bopomofo block 275
The Bopomofo Extended block 275
Japanese 275
The Hiragana block 281
The Katakana block 281
The Katakana Phonetic Extensions block 281
The Kanbun block 281
Korean 282
The Hangul Jamo block 284
The Hangul Compatibility Jamo block 285
The Hangul Syllables area 285
Halfwidth and fullwidth characters 286
The Halfwidth and Fullwidth Forms block 288
Vertical text layout 288
Ruby 292
The Interlinear Annotation characters 293
Yi 294
The Yi Syllables block 295
The Yi Radicals block 295

CHAPTER 11 Scripts from Other Parts of the World 297


Mongolian 298
The Mongolian block 300
Ethiopic 301
The Ethiopic block 303
Cherokee 303
The Cherokee block 304
Canadian Aboriginal Syllables 304
The Unified Canadian Aboriginal Syllabics block 305
Historical scripts 305
Runic 306
Ogham 307
Old Italic 307
Gothic 308
Deseret 309

CHAPTER 12 Numbers, Punctuation, Symbols, and Specials 311


Numbers 311

Western positional notation 312
Alphabetic numerals 313
Roman numerals 313
Han characters as numerals 314
Other numeration systems 317
Numeric presentation forms 319
National and nominal digit shapes 319
Punctuation 320
Script-specific punctuation 320
The General Punctuation block 322
The CJK Symbols and Punctuation block 323
Spaces 323
Dashes and hyphens 325
Quotation marks, apostrophes, and similar-looking characters 326
Paired punctuation 331
Dot leaders 332
Bullets and dots 332
Special characters 333
Line and paragraph separators 333
Segment and page separators 335
Control characters 336
Characters that control word wrapping 336
Characters that control glyph selection 339
The grapheme joiner 345
Bidirectional formatting characters 346
Deprecated characters 346
Interlinear annotation 347
The object-replacement character 348
The general substitution character 348
Tagging characters 349
Non-characters 351
Symbols used with numbers 351
Numeric punctuation 351
Currency symbols 352
Unit markers 353
Math symbols 353
Mathematical alphanumeric symbols 356
Other symbols and miscellaneous characters 357
Musical notation 357
Braille 359
Other symbols 359
Presentation forms 360
Miscellaneous characters 361

Unicode in Action: Implementing and Using the Unicode Standard 363
CHAPTER 13 Techniques and Data Structures for Handling Unicode Text 365
Useful data structures 366


Testing for membership in a class 366


The inversion list 369
Performing set operations on inversion lists 370
Mapping single characters to other values 374
Inversion maps 375
The compact array 376
Two-level compact arrays 379
Mapping single characters to multiple values 380
Exception tables 381
Mapping multiple characters to other values 382
Exception tables and key closure 382
Tries as exception tables 385
Tries as the main lookup table 388
Single versus multiple tables 390

CHAPTER 14 Conversions and Transformations 393


Converting between Unicode encoding forms 394
Converting between UTF-16 and UTF-32 395
Converting between UTF-8 and UTF-32 397
Converting between UTF-8 and UTF-16 401
Implementing Unicode compression 401
Unicode normalization 408
Canonical decomposition 409
Compatibility decomposition 413
Canonical composition 414
Optimizing Unicode normalization 420
Testing Unicode normalization 420
Converting between Unicode and other standards 421
Getting conversion information 421
Converting between Unicode and single-byte encodings 422
Converting between Unicode and multi-byte encodings 422
Other types of conversion 422
Handling exceptional conditions 423
Dealing with differences in encoding philosophy 424
Choosing a converter 425
Line-break conversion 425
Case mapping and case folding 426
Case mapping on a single character 426
Case mapping on a string 427
Case folding 427
Transliteration 428

CHAPTER 15 Searching and Sorting 433


The basics of language-sensitive string comparison 433
Multi-level comparisons 436
Ignorable characters 438
French accent sorting 439
Contracting character sequences 440
Expanding characters 440
Context-sensitive weighting 441



Putting it all together 441
Other processes and equivalences 442
Language-sensitive comparison on Unicode text 443
Unicode normalization 443
Reordering 444
A general implementation strategy 445
The Unicode Collation Algorithm 447
The default UCA sort order 449
Alternate weighting 451
Optimizations and enhancements 453
Language-insensitive string comparison 455
Sorting 457
Collation strength and secondary keys 457
Exposing sort keys 459
Minimizing sort key length 460
Searching 461
The Boyer-Moore algorithm 462
Using Boyer-Moore with Unicode 465
“Whole words” searches 466
Using Unicode with regular expressions 466

CHAPTER 16 Rendering and Editing 469


Line breaking 470
Line breaking properties 471
Implementing boundary analysis with pair tables 473
Implementing boundary analysis with state machines 474
Performing boundary analysis using a dictionary 476
A couple more thoughts about boundary analysis 477
Performing line breaking 477
Line layout 479
Glyph selection and positioning 483
Font technologies 483
Poor man’s glyph selection 485
Glyph selection and placement in AAT 487
Glyph selection and placement in OpenType 489
Special-purpose rendering technology 491
Compound and virtual fonts 491
Special text-editing considerations 492
Optimizing for editing performance 492
Accepting text input 496
Handling arrow keys 497
Handling discontiguous selection 499
Handling multiple-click selection 500

CHAPTER 17 Unicode and Other Technologies 503


Unicode and the Internet 503
The W3C character model 504
XML 506
HTML and HTTP 507
URLs and domain names 508
Mail and Usenet 509
Unicode and programming languages 512


The Unicode identifier guidelines 512


Java 512
C and C++ 513
Javascript and JScript 513
Visual Basic 513
Perl 514
Unicode and operating systems 514
Microsoft Windows 514
MacOS 515
Varieties of Unix 516
Unicode and databases 517
Conclusion 517

Glossary 519
Bibliography 591
The Unicode Standard 591
Other Standards Documents 593
Books and Magazine Articles 593
Unicode Conference papers 594
Other papers 594
Online resources 595



Preface

As the economies of the world become more and more connected, and as the American
computer market becomes more and more saturated, computer-related businesses are looking more
and more to markets outside the United States to grow their businesses. At the same time, companies
in other industries are not only beginning to do the same thing (or, in fact, have been for a long time),
but are increasingly turning to computer technology, especially the Internet, to grow their businesses
and streamline their operations.

The convergence of these two trends means that it’s no longer just an English-only market for
computer software. More and more, computer software is being used not only by people outside the
United States or by people whose first language isn’t English, but by people who don’t speak English
at all. As a result, interest in software internationalization is growing in the software development
community.

A lot of things are involved in software internationalization: displaying text in the user’s native
language (and in different languages depending on the user), accepting input in the user’s native
language, altering window layouts to accommodate expansion or contraction of text or differences in
writing direction, displaying numeric values according to local customs, indicating events in time
according to the local calendar systems, and so on.

This book isn’t about any of these things. It’s about something more basic, and which underlies most
of the issues listed above: representing written language in a computer. There are many different
ways to do this; in fact, there are several for just about every language that’s been represented in
computers. In fact, that’s the whole problem. Designing software that’s flexible enough to handle
data in multiple languages (at least multiple languages that use different writing systems) has
traditionally meant not just keeping track of the text, but also keeping track of which encoding
scheme is being used to represent it. And if you want to mix text in multiple writing systems, this
bookkeeping becomes more and more cumbersome.


The Unicode standard was designed specifically to solve this problem. It aims to be the universal
character encoding standard, providing unique, unambiguous representations for every character in
virtually every writing system and language in the world. The most recent version of Unicode
provides representations for over 90,000 characters.

Unicode has been around for twelve years now and is in its third major revision, adding support for
more languages with each revision. It has gained widespread support in the software community and
is now supported in a wide variety of operating systems, programming languages, and application
programs. Each of the semiannual International Unicode Conferences is better-attended than the
previous one, and the number of presenters and sessions at the Conferences grows correspondingly.

Representing text isn’t as straightforward as it appears at first glance: it’s not merely as simple as
picking out a bunch of characters and assigning numbers to them. First you have to decide what a
“character” is, which isn’t as obvious in many writing systems as it is in English. You have to
contend with things such as how to represent characters with diacritical marks applied to them, how to
represent clusters of marks that represent syllables, when differently-shaped marks on the page are
different “characters” and when they’re just different ways of writing the same “character,” what
order to store the characters in when they don’t proceed in a straightforward manner from one side of
the page to the other (for example, some characters stack on top of each other, or you have two
parallel lines of characters, or the reading order of the text on the page zigzags around the line
because of differences in natural reading direction), and many similar issues.

The decisions you make on each of these issues for every character affect how various processes,
such as comparing strings or analyzing a string for word boundaries, are performed, making them
more complicated. In addition, the sheer number of different characters representable using the
Unicode standard makes many processes on text more complicated.

For all of these reasons, the Unicode standard is a large, complicated affair. Unicode 3.0, the last
version published as a book, is 1,040 pages long. Even at this length, many of the explanations are
fairly concise and assume the reader already has some degree of familiarity with the problems to be
solved. It can be kind of intimidating.

The aim of this book is to provide an easier entrée into the world of Unicode. It arranges things in a
more pedagogical manner, takes more time to explain the various issues and how they’re solved, fills
in various pieces of background information, and adds implementation information and information
on what Unicode support is already out there. It is this author’s hope that this book will be a worthy
companion to the standard itself, and will provide the average programmer and the
internationalization specialist alike with all the information they need to effectively handle Unicode
in their software.

About this book


There are a few things you should keep in mind as you go through this book:

x This book assumes the reader either is a professional computer programmer or is familiar with
most computer-programming concepts and terms. Most general computer-science jargon isn’t
defined or explained here.



x It’s helpful, but not essential, if the reader has some understanding of the basic concepts of
software internationalization. Many of those concepts are explained here, but if they’re not
central to one of the book’s topics, they’re not given a lot of time.
x This book covers a lot of ground, and it isn’t intended as a comprehensive and definitive
reference for every single topic it discusses. In particular, I’m not repeating the entire text of the
Unicode standard here; the idea is to complement the standard, not replace it. In many cases, this
book will summarize a topic or attempt to explain it at a high level, leaving it to other documents
(typically the Unicode standard or one of its technical reports) to fill in all the details.
x The Unicode standard changes rapidly. New versions come out yearly, and small changes, new
technical reports, and other things happen more quickly. In Unicode’s history, terminology has
changed, and this will probably continue to happen from time to time. In addition, there are a lot
of other technologies that use or depend on Unicode, and they are also constantly changing, and
I’m certainly not an expert on every single topic I discuss here. (In my darker moments, I’m not
sure I’m an expert on any of them!) I have made every effort I could to see to it that this book is
complete, accurate, and up to date, but I can’t guarantee I’ve succeeded in every detail. In fact, I
can almost guarantee you that there is information in here that is either outdated or just plain
wrong. But I have made every effort to make the proportion of such information in this book as
small as possible, and I pledge to continue, with each future version, to try to bring it closer to
being fully accurate.
x At the time of this writing (January 2002), the newest version of Unicode, Unicode 3.2, was in
beta, and thus still in flux. The Unicode 3.2 spec is scheduled to be finalized in March 2002, well
before this book actually hits the streets. With a few exceptions, I don’t expect major changes
between now and March, but they’re always possible, and therefore, the Unicode 3.2 information
in this book may wind up wrong in some details. I’ve tried to flag all the Unicode 3.2-specific
information here as being from Unicode 3.2, and I’ve tried to indicate the areas that I think are
still in the greatest amount of flux.
x Sample code in this book is almost always in Java. This is partially because Java is the language I
personally use in my regular job, and thus the programming language I think in these days. But I
also chose Java because of its increasing importance and popularity in the programming world in
general and because Java code tends to be somewhat easier to understand than, say, C (or at least
no more difficult). Because of Java’s syntactic similarity to C and C++, I also hope the examples
will be reasonably accessible to C and C++ programmers who don’t also program in Java.
x The sample code is provided for illustrative purposes only. I’ve gone to the trouble, at least with
the examples that can stand alone, to make sure the examples all compile, and I’ve tested them to
make sure I didn’t make any obvious stupid mistakes, but they haven’t been tested
comprehensively. They were also written with far more of an eye toward explaining a concept
than being directly usable in any particular context. Incorporate them into your code at your own
risk!
x I’ve tried to define all the jargon the first time I use it or to indicate a full explanation is coming
later, but there’s also a glossary at the back you can refer to if you come across an unfamiliar term
that isn’t defined.
x Numeric constants, especially numbers representing characters, are pretty much always shown in
hexadecimal notation. Hexadecimal numbers in the text are always written using the 0x notation
familiar to C and Java programmers.
x Unicode code point values are shown using the standard Unicode notation, U+1234, where
“1234” is a hexadecimal number of four to six digits. In many cases, a character is referred
to by both its Unicode code point value and its Unicode name: for example, “U+0041 LATIN
CAPITAL LETTER A.” Code unit values in one of the Unicode transformation formats are
shown using the 0x notation.
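To tie this last convention to Java, here is a tiny fragment of my own (an illustration, not something taken from the Unicode standard) showing how the U+0041 notation used in this book lines up with Java’s own escape and hexadecimal notations:

    public class NotationExample {
        public static void main(String[] args) {
            char a1 = 'A';              // the literal character
            char a2 = '\u0041';         // Java's Unicode escape, echoing the U+0041 notation
            char a3 = (char) 0x41;      // the same value written as a hexadecimal number
            System.out.println(a1 == a2 && a2 == a3);   // prints "true"
        }
    }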


How this book was produced


All of the examples of text in non-Latin writing systems posed quite a challenge for the production
process. The bulk of this manuscript was written on a Power Macintosh G4/450 using Adobe
FrameMaker 6 running on MacOS 9. I did the original versions of the various diagrams in Microsoft
PowerPoint 98 on the Macintosh. But I discovered early on that FrameMaker on the Macintosh
couldn’t handle a lot of the characters I needed to be able to show in the book. I wound up writing
the whole thing with little placeholder notes to myself throughout describing what the examples were
supposed to look like.
FrameMaker was somewhat compatible with Apple’s WorldScript technology, enabling me to do
some of the examples, but I quickly discovered that Acrobat 3, which I was using at the time, wasn’t. It
crashed when I tried to create PDFs of chapters that included the non-Latin characters. Switching to
Windows didn’t prove much more fruitful: On both platforms, FrameMaker 6, Adobe Illustrator 9,
and Acrobat 3 and 4 were not Unicode compatible. The non-Latin characters would either turn into
garbage characters, not show up at all, or show up with very compromised quality.
Late in the process, I decided to switch to the one combination of software and platform I knew
would work: Microsoft Office 2000 on Windows 2000, which handles (with varying degrees of
difficulty) everything I needed to do. I converted the whole project from FrameMaker to Word and
spent a couple of months restoring all the missing examples to the text. (In a few cases where I
didn’t have suitable fonts at my disposal, or where Word didn’t produce quite the results I wanted, I
either scanned pictures out of books or just left the placeholders in.) The last rounds of revisions
were done in Word on either the Mac or on a Windows machine, depending on where I was
physically at the time, and all the example text was done on the Windows machine.
This produced a manuscript of high-enough quality to get reviews from people, but didn’t really
produce anything we could typeset from. The work of converting the whole mess back to
FrameMaker, redrawing my diagrams in Illustrator, and coming up with another way to do the non-
Latin text examples fell to [name], who [finish the story after we’re far enough into production
to know how these problems were solved].
[Note to Ross and the AW production people: Is it customary to have a colophon on these
things? Because of the difficulty and importance of typesetting this particular material, I’m
thinking the normal information in a colophon should be included. Including it in the preface
might be difficult, however, as I can’t finish this section until at least some of the chapters have
been typeset. Should I leave this info here, move it to a real colophon at the back of the book,
or toss it?]

The author’s journey


Like many in the field, I fell into software internationalization by happenstance. I’ve always been
interested in language—written language in particular—and (of course) in computers. But my job
had never really dealt with this directly.

In the spring of 1995, that changed when I went to work for Taligent. Taligent, you may remember,
was the ill-fated joint venture between Apple Computer and IBM (later joined by Hewlett-Packard)
that was originally formed to create a new operating system for personal computers using state-of-
the-art object-oriented technology. The fruit of our labors, CommonPoint, turned out to be too little
too late, but it spawned a lot of technologies that found their places in other products.


For a while there, Taligent enjoyed a cachet in the industry as the place where Apple and IBM had
sent many of their best and brightest. If you managed to get a job at Taligent, you had “made it.”

I almost didn’t “make it.” I had wanted to work at Taligent for some time and eventually got the
chance, but turned in a rather unimpressive interview performance (a couple coworkers kidded me
about that for years afterward) and wasn’t offered the job. About that same time, a friend of mine did
get a job there, and after the person who had been offered the job I interviewed for turned it down for family
reasons, my friend put in a good word for me and I got a second chance.

I probably would have taken almost any job there, but the specific opening was in the text and
internationalization group, and thus began my long association with Unicode.

One thing pretty much everybody who ever worked at Taligent will agree on is that working there
was a wonderful learning experience: an opportunity, as it were, to “sit at the feet of the masters.”
Personally, the Taligent experience made me into the programmer I am today. My C++ and OOD
skills improved dramatically, I became proficient in Java, and I went from knowing virtually nothing
about written language and software internationalization to… well, I’ll let you be the judge.

My team was eventually absorbed into IBM, and I enjoyed a little over two years as an IBMer before
deciding to move on in early 2000. During my time at Taligent/IBM, I worked on four different sets
of Unicode-related text handling routines: the text-editing frameworks in CommonPoint, various text-
storage and internationalization frameworks in IBM’s Open Class Library, various
internationalization facilities in Sun’s Java Class Library (which IBM wrote under contract to Sun),
and the libraries that eventually came to be known as the International Components for Unicode.

International Components for Unicode, or ICU, began life as an IBM developer-tools package based
on the Java internationalization libraries, but has since morphed into an open-source project and
taken on a life of its own. It’s gaining increasing popularity and showing up in more operating
systems and software packages, and it’s acquiring a reputation as a great demonstration of how to
implement the various features of the Unicode standard. I had the twin privileges of contributing
frameworks to Java and ICU and of working alongside those who developed the other frameworks
and learning from them. I got to watch the Unicode standard develop, work with some of those who
were developing it, occasionally rub shoulders with the others, and occasionally contribute a tidbit or
two to the effort myself. It was a fantastic experience, and I hope that at least some of their expertise
rubbed off on me.

Acknowledgements
It’s been said that it takes a village to raise a child. Well, I don’t really know about that, but it
definitely takes a village to write a book like this. The person whose name is on the cover gets to
take the credit, but there’s an army of people who contribute to the content.

Acknowledgements sections have a bad tendency to sound like acceptance speeches at the Oscars
as the authors go on forever thanking everyone in sight. This set of acknowledgments will be no
different. If you’re bored or annoyed by that sort of thing, I recommend you skip to the next section
now. You have been warned.


Here goes: First and foremost, I’m indebted to the various wonderful and brilliant people I worked
with on the internationalization teams at Taligent and IBM: Mark Davis, Kathleen Wilson, Laura
Werner, Doug Felt, Helena Shih, John Fitzpatrick, Alan Liu, John Raley, Markus Scherer, Eric
Mader, Bertrand Damiba, Stephen Booth, Steven Loomis, Vladimir Weinstein, Judy Lin, Thomas
Scott, Brian Beck, John Jenkins, Deborah Goldsmith, Clayton Lewis, Chen-Lieh Huang, and
Norbert Schatz. Whatever I know about either Unicode or software internationalization I learned
from these people. I’d also like to thank the crew at Sun that we worked with: Brian Beck, Norbert
Lindenberg, Stuart Gill, and John O’Conner.

I’d also like to thank my management and coworkers at Trilogy, particularly Robin Williamson,
Chris Hyams, Doug Spencer, Marc Surplus, Dave Griffith, John DeRegnaucourt, Bryon Jacob, and
Zach Roadhouse, for their understanding and support as I worked on this book, and especially for
letting me continue to participate in various Unicode-related activities, especially the conferences,
on company time and with company money.

Numerous people have helped me directly in my efforts to put this book together, by reviewing
parts of it and offering advice and corrections, by answering questions and giving advice, by
helping me put together examples or letting me use examples they’d already put together, or simply
by offering an encouraging word or two. I’m tremendously indebted to all who helped out in these
ways: Jim Agenbroad, Matitiahu Allouche, Christian Cooke, John Cowan, Simon Cozens, Mark
Davis, Roy Daya, Andy Deitsch, Martin Dürst, Tom Emerson, Franklin Friedmann, David
Gallardo, Tim Greenwood, Jarkko Hietaniemi, Richard Ishida, John Jenkins, Kent Karlsson, Koji
Kodama, Alain LaBonté, Ken Lunde, Rick McGowan, John O’Conner, Chris Pratley, John Raley,
Jonathan Rosenne, Yair Sarig, Dave Thomas, and Garret Wilson. [Be sure to add the names of
anyone who sends me feedback between 12/31/01 and RTP.]

And of course I’d like to acknowledge all the people at Addison-Wesley who’ve had a hand in
putting this thing together: Ross Venables, my current editor; Julie DiNicola, my former editor;
John Fuller, the production manager; [name], who copy-edited the manuscript, [name], who had
the unenviable task of designing and typesetting the manuscript and cleaning up all my diagrams
and examples; [name], who put together the index; [name], who designed the cover; and Mike
Hendrickson, who oversaw the whole thing. I greatly appreciate their professionalism and their
patience in dealing with this first-time author.

And last but not least, I’d like to thank the family members and friends who’ve had to sit and listen
to me talk about this project for the last couple years: especially my parents, Ted and Joyce
Gillam; Leigh Anne and Ken Boynton, Ken Cancelosi, Kelly Bowers, Bruce Rittenhouse, and
many of the people listed above.

As always with these things, I couldn’t have done it without all these people. If this book is good,
they deserve the lion’s share of the credit. If it isn’t, I deserve the blame and owe them my
apologies.

A personal appeal
For a year and a half, I wrote a column on Java for C++ Report magazine, and for much of that
time, I wondered if anyone was actually reading the thing and what they thought of it. I would
occasionally get an email from a reader commenting on something I’d written, and I was always


grateful, whether the feedback was good or bad, because it meant someone was reading the thing
and took it seriously enough to let me know what they thought.

I’m hoping there will be more than one edition of this book, and I really want it to be as good as
possible. If you read it and find it less than helpful, I hope you won’t just throw it on a shelf
somewhere and grumble about the money you threw away on it. Please, if this book fails to
adequately answer your questions about Unicode, or if it wastes too much time answering
questions you don’t care about, I want to know. The more specific you can be about just what isn’t
doing it for you, the better. Please write me at [email protected] with your comments
and criticisms.

For that matter, if you like what you see here, I wouldn’t mind hearing from you either. God
knows, I can use the encouragement.

—R. T. G.
Austin, Texas
January 2002



SECTION I

Unicode in Essence
An Architectural Overview of the
Unicode Standard
CHAPTER 1 Language, Computers, and
Unicode

Words are amazing, marvelous things. They have the power both to build up and to tear down great
civilizations. Words make us who we are. Indeed, many have observed that our use of language is
what separates humankind from the rest of the animal kingdom. Humans have the capacity for
symbolic thought, the ability to think about and discuss things we cannot immediately see or touch.
Language is our chief symbolic system for doing this. Consider even a fairly simple concept such as
“There is water over the next hill.” Without language, this would be an extraordinarily difficult thing
to convey.

There’s been a lot of talk in recent years about “information.” We face an “information explosion”
and live in an “information age.” Maybe this is true, but when it comes right down to it, we are
creatures of information. And language is one of our main ways both of sharing information with one
another and, often, one of our main ways of processing information.

We often hear that we are in the midst of an “information revolution,” surrounded by new forms of
“information technology,” a phrase that didn’t even exist a generation or two ago. These days, the
term “information technology” is generally used to refer to technology that helps us perform one or
more of three basic processes: the storage and retrieval of information, the extraction of higher levels
of meaning from a collection of information, and the transmission of information over large
distances. The telegraph, and later the telephone, was a quantum leap in the last of these three
processes, and the digital computer in the first two, and these two technologies form the cornerstone
of the modern “information age.”

Yet by far the most important advance in information technology occurred many thousands of years
ago and can’t be credited to a single inventor. That advance (like so many technological revolutions,
really a series of smaller advances) was written language. Think about it: before written language,
storing and retrieving information over a long period of time relied mostly on human memory, or on


precursors of writing, such as making notches in sticks. Human memory is unreliable, and storage
and retrieval (or storage over a time longer than a single person’s lifetime) involved direct oral
contact between people. Notches in sticks and the like avoid this problem, but don’t allow for a
lot of nuance or depth in the information being stored. Likewise, transmission of information over a
long distance either required memorization or relied on things like drums or smoke signals that
also had limited range and bandwidth.

Writing made both of these processes vastly more powerful and reliable. It enabled storage of
information in dramatically greater concentrations and over dramatically longer time spans than was
ever thought possible, and made possible transmission of information over dramatically greater
distances, with greater nuance and fidelity, than had ever been thought possible.

In fact, most of today’s data processing and telecommunications technologies have written language
as their basis. Much of what we do with computers today is use them to store, retrieve, transmit,
produce, analyze, and print written language.

Information technology didn’t begin with the computer, and it didn’t begin with the telephone or
telegraph. It began with written language.

This is a book about how computers are used to store, retrieve, transmit, manipulate, and analyze
written language.

***

Language makes us who we are. The words we choose speak volumes about who we are and about
how we see the world and the people around us. They tell others who “our people” are: where we’re
from, what social class we belong to, possibly even who we’re related to.

The world is made up of hundreds of ethnic groups, who constantly do business with, create alliances
with, commingle with, and go to war with each other. And the whole concept of “ethnic group” is
rooted in language. Who “your people” are is rooted not only in which language you speak, but in
which language or languages your ancestors spoke. As a group’s subgroups become separated, the
languages the two subgroups speak will begin to diverge, eventually giving rise to multiple languages
that share a common heritage (for example, classical Latin diverging into modern Spanish and
French), and as different groups come into contact with each other, their respective languages will
change under each other’s influence (for example, much of modern English vocabulary was borrowed
from French). Much about a group’s history is encoded in its language.

We live in a world of languages. There are some 6,000 to 7,000 different languages spoken in the
world today, each with countless dialects and regional variations1. We may be united by language,
but we’re divided by our languages.

And yet the world of computing is strangely homogeneous. For decades now, the language of
computing has been English. Specifically, American English. Thankfully, this is changing. One of the
things that information technology is increasingly making possible is contact between people in
different parts of the world. In particular, information technology is making it more and more

1
SIL International’s Ethnologue Web site (www.ethnologue.com) lists 6,800 “main” languages
and 41,000 variants and dialects.


possible to do business in different parts of the world. And as information technology invades more
and more of the world, people are increasingly becoming unwilling to speak the language of
information technology—they want information technology to speak their language.

This is a book about how all, or at least most, written languages—not just English or Western
European languages—can be used with information technology.

***

Language makes us who we are. Almost every human activity has its own language, a specialized
version of a normal human language adapted to the demands of discussing a particular activity. So
the language you speak also says much about what you do.

Every profession has its jargon, which is used both to provide a more precise method of discussing
the various aspects of the profession and to help tell “insiders” from “outsiders.” The information
technology industry has always had a reputation as one of the most jargon-laden professional groups
there is. This probably isn’t true, but it looks that way because of the way computers have come to
permeate our lives: now that non-computer people have to deal with computers, non-computer people
are having to deal with the language that computer people use.

In fact, it’s interesting to watch as the language of information technology starts to infect the
vernacular: “I don’t have the bandwidth to deal with that right now.” “Joe, can you spare some cycles
to talk to me?” And my personal favorite, from a TV commercial from a few years back: “No
shampoo helps download dandruff better.”

What’s interesting is that subspecialties within information technology each have their own jargon as
well that isn’t shared by computer people outside their subspecialty. In the same way that there’s
been a bit of culture clash as the language of computers enters the language of everyday life, there’s
been a bit of culture clash as the language of software internationalization enters the language of
general computing.

This is a good development, because it shows the increasing interest of the computing community in
developing computers, software, and other products that can deal with people in their native
languages. We’re slowly moving from (apologies to Henry Ford) “your native language, as long as
it’s English” to “your native language.” The challenge of writing one piece of software that can deal
with users in multiple human languages involves many different problems that need to be solved, and
each of those problems has its own terminology.

This is a book about some of that terminology.

***

One of the biggest problems to be dealt with in software internationalization is that the ways human
language has traditionally been represented inside computers often don’t lend themselves to many
human languages, and they lend themselves especially badly to multiple human languages at the same
time.

Over time, systems have been developed for representing quite a few different written languages in
computers, but each scheme is generally designed for only a single language, or at best a small


collection of related languages, and these systems are mutually incompatible. Interpreting a series of
bits encoded with one standard using the rules of another yields gibberish, so software that handles
multiple languages has traditionally had to do a lot of extra bookkeeping to keep track of the various
different systems used to encode the characters of those languages. This is difficult to do well, and
few pieces of software attempt it, leading to a Balkanization of computer software and the data it
manipulates.

Unicode solves this problem by providing a unified representation for the characters in the various
written languages. By providing a unique bit pattern for every single character, you eliminate the
problem of having to keep track of which of many different characters this particular instance of a
particular bit pattern is supposed to represent.

Of course, each language has its own peculiarities, and presents its own challenges for computerized
representation. Dealing with Unicode doesn’t necessarily mean dealing with all the peculiarities of
the various languages, but it can—it depends on how many languages you actually want to support in
your software, and how much you can rely on other software (such as the operating system) to do that
work for you.

In addition, because of the sheer number of characters it encodes, there are challenges to dealing with
Unicode-encoded text in software that go beyond those of dealing with the various languages it
allows you to represent. The aim of this book is to help the average programmer find his way
through the jargon and understand what goes into dealing with Unicode.

This is a book about Unicode.

What Unicode Is

Unicode is a standard method for representing written language in computers. So why do we need
this? After all, there are probably dozens, if not hundreds, of ways of doing this already. Well, this is
exactly the point. Unicode isn’t just another in the endless parade of text-encoding standards; it’s an
attempt to do away with all the others, or at least simplify their use, by creating a universal text
encoding standard.

Let’s back up for a second. The best known and most widely used character encoding standard is the
American Standard Code for Information Interchange, or ASCII for short. The first version of ASCII
was published in 1964 as a standard way of representing textual data in computer memory and
sending it over communication links between computers. ASCII is based on a seven-bit byte. Each
byte represented a character, and characters were represented by assigning them to individual bit
patterns (or, if you prefer, individual numbers). A seven-bit byte can have 128 different bit patterns.
33 of these were set aside for use as control signals of various types (start- and end-of-transmission
codes, block and record separators, etc.), leaving 95 free for representing characters.
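The arithmetic is easy to check. Here is a throwaway Java loop, purely my own illustration, that counts the control and printable ranges of the seven-bit code:

    public class AsciiCounts {
        public static void main(String[] args) {
            int total = 1 << 7;                  // a seven-bit byte: 128 possible values
            int controls = 0, printable = 0;
            for (int b = 0; b < total; b++) {
                if (b < 0x20 || b == 0x7F) {
                    controls++;                  // the C0 control range plus DEL
                } else {
                    printable++;                 // space (0x20) through tilde (0x7E)
                }
            }
            // prints "128 total, 33 controls, 95 printable"
            System.out.println(total + " total, " + controls
                    + " controls, " + printable + " printable");
        }
    }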

Perhaps the main deficiency in ASCII comes from the A in its name: American. ASCII is an
American standard, and was designed for the storage and transmission of English text. 95 characters
are sufficient for representing English text, barely, but that’s it. On early teletype machines, ASCII
could also be used to represent the accented letters found in many European languages, but this
capability disappeared in the transition from teletypes to CRT terminals.


So, as computer use became more and more widespread in different parts of the world, alternative
methods of representing characters in computers arose for other languages, leading to
the situation we have today, where there are generally three or four different encoding schemes for
every language and writing system in use.

Unicode is the latest of several attempts to solve this Tower of Babel problem by creating a universal
character encoding. Its main way of doing this is to increase the size of the possible encoding space
by increasing the number of bits used to encode each character. Most other character encodings are
based upon an eight-bit byte, which provides enough space to encode a maximum of 256 characters
(in practice, most encodings reserve some of these values for control signals and encode fewer than
256 characters). For languages, such as Japanese, that have more than 256 characters, most
encodings are still based on the eight-bit byte, but use sequences of several bytes to represent most of
the characters, using relatively complicated schemes to manage the variable numbers of bytes used to
encode the characters.

Unicode uses a 16-bit word to encode characters, allowing up to 65,536 characters to be encoded
without resorting to more complicated schemes involving multiple machine words per character.
65,000 characters, with careful management, is enough to allow encoding of the vast majority of
characters in the vast majority of written languages in use today. The current version of Unicode,
version 3.2, actually encodes 95,156 different characters. It does use a scheme that represents the
less-common characters with two 16-bit units, but with 50,212 characters encoded using
only a single unit, you rarely encounter the two-unit characters. In fact, these 50,212 characters
include all of the characters representable with all of the other character encoding methods that are in
reasonably widespread use.
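Here is a quick Java sketch of my own that shows the difference. Java’s char type is exactly one of these 16-bit units, so a character outside the 65,536-value range shows up as two of them (the surrogate mechanism covered in Chapter 6), and the pair can be turned back into a single value with a little arithmetic:

    public class SurrogateExample {
        public static void main(String[] args) {
            String latin = "A";                  // U+0041: one 16-bit unit
            String rareHan = "\uD840\uDC00";     // U+20000, a rare Han character: two 16-bit units

            System.out.println(latin.length());      // prints 1
            System.out.println(rareHan.length());    // prints 2

            // Reassemble the single code point value from the two surrogate units by hand:
            char high = rareHan.charAt(0);           // 0xD840
            char low  = rareHan.charAt(1);           // 0xDC00
            int codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
            System.out.println(Integer.toHexString(codePoint));   // prints "20000"
        }
    }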

This provides two main benefits: First, a system that wants to allow textual data (either user data or
things like messages and labels that may need to be localized for different user communities) to be in
any language would, without Unicode, have to keep track of not just the text itself, but also of which
character encoding method was being used. In fact, mixtures of languages might require mixtures of
character encodings. This extra bookkeeping means you couldn’t look directly at the text and know,
for example, that the value 65 was a capital letter A. Depending on the encoding scheme used for that
particular piece of text, it might represent some other character, or even be simply part of a character
(i.e., it might have to be considered along with an adjacent byte in order to be interpreted as a
character). This might also mean you’d need different logic to perform certain processes on text
depending on which encoding scheme it happened to use, or need to convert pieces of text between
different encodings.

Unicode does away with this. It allows all of the same languages and characters to be represented
using only one encoding scheme. Every character has its own unique, unambiguous value. The value
65, for example, always represents the capital letter A. You don’t need to rely on extra information
about the text in order to interpret it, you don’t need different algorithms to perform certain processes
on the text depending on the encoding or language, and you don’t (with some relatively rare
exceptions) need to consider context to correctly interpret any given 16-bit unit of text.
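A small Java example of my own makes the contrast concrete. It assumes a JVM whose charset repertoire includes the legacy encodings named below; most desktop JDKs ship them, but availability isn’t guaranteed everywhere:

    public class EncodingAmbiguity {
        public static void main(String[] args) throws Exception {
            byte[] data = { (byte) 0xE1 };   // one byte, with no label saying which encoding it's in

            // The same byte decodes to three different characters under three legacy encodings:
            System.out.println(new String(data, "ISO-8859-1"));   // a-acute (U+00E1)
            System.out.println(new String(data, "ISO-8859-7"));   // Greek small alpha (U+03B1)
            System.out.println(new String(data, "KOI8-R"));       // Cyrillic capital A (U+0410)

            // A Unicode value needs no outside label: 65 is always LATIN CAPITAL LETTER A.
            System.out.println((char) 65);                        // A
        }
    }

Whether the right glyphs actually appear on your screen depends on your console’s own encoding, which is exactly the kind of bookkeeping problem being described.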

The other thing Unicode gives you is a pivot point for converting between other character encoding
schemes. Because it’s a superset of all of the other common character encoding systems, you can
convert between any other two encodings by converting from one of them to Unicode, and then from
Unicode to the other. Thus, if you have to provide a system that can convert text between any
arbitrary pair of encodings, the number of converters you have to provide can be dramatically
smaller. If you support n different encoding schemes, you only need 2n different converters, not n²
different converters. It also means that when you have to write a system that interacts with the
outside world using several different non-Unicode character representations, it can do its internal


processing in Unicode and convert at the boundaries, rather than potentially having to have alternate
code to do the same things for text in the different outside encodings.
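In code, the pivot looks something like the following Java sketch, again an illustration of my own, and again assuming the named legacy charsets are available on the JVM:

    public class PivotConversion {
        public static void main(String[] args) throws Exception {
            // Simulate bytes arriving in one legacy encoding (KOI8-R, picked arbitrarily):
            byte[] incoming = "\u0414\u0430".getBytes("KOI8-R");     // the Russian word "Da"

            // Step 1: legacy encoding to Unicode (one decoder per supported encoding).
            String pivot = new String(incoming, "KOI8-R");

            // Step 2: Unicode to some other legacy encoding (one encoder per supported encoding).
            byte[] outgoing = pivot.getBytes("windows-1251");

            // With n supported encodings, that's 2n converters in all; writing a direct
            // converter for every pair of encodings would take roughly n * n of them.
            System.out.println(pivot);
            System.out.println(incoming[0] != outgoing[0]);    // true: the byte values differ
        }
    }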

What Unicode Isn’t


It’s also important to keep in mind what Unicode isn’t. First, Unicode is a standard scheme for
representing plain text in computers and data communication. It is not a scheme for representing rich
text (sometimes called “fancy text” or “styled text”). This is an important distinction. Plain text is the
words, sentences, numbers, and so forth themselves. Rich text is plain text plus information about the
text, especially information on the text’s visual presentation (e.g., the fact that a given word is in
italics), the structure of a document (e.g., the fact that a piece of text is a section header or footnote),
or the language (e.g., the fact that a particular sentence is in Spanish). Rich text may also include
non-text items that travel with the text, such as pictures.

It can be somewhat tough to draw a line between what qualifies as plain text, and therefore should be
encoded in Unicode, and what’s really rich text. In fact, debates on this very subject flare up from
time to time in the various Unicode discussion forums. The basic rule is that plain text contains all of
the information necessary to carry the semantic meaning of the text—the letters, spaces, digits,
punctuation, and so forth. If removing it would make the text unintelligible, then it’s plain text.

This is still a slippery definition. After all, italics and boldface carry semantic information, and losing
them may lose some of the meaning of a sentence that uses them. On the other hand, it’s perfectly
possible to write intelligible, grammatical English without using italics and boldface, where it’d be
impossible, or at least extremely difficult, to write intelligible, grammatical English without the letter
“m”, or the comma, or the digit “3.” Some of it may also come down to user expectation—you can
write intelligible English with only capital letters, but it’s generally not considered grammatical or
acceptable nowadays.

There’s also a certain amount of document structure you need to be able to convey in plain text, even
though document structure is generally considered the province of rich text. The classic example is
the paragraph separator. Without a way to represent a paragraph break in plain text, legibility suffers,
even though a paragraph break is technically something that indicates document
structure. But many higher-level protocols that deal with document structure have their own ways of
marking the beginnings and endings of paragraphs. The paragraph separator, thus, is one of a
number of characters in Unicode that are explicitly disallowed (or ignored) in rich-text
representations that are based on Unicode. HTML, for example, allows the paragraph separator character
to appear in a document, but HTML parsers don’t treat it as a paragraph break. Instead, HTML uses the
<P> and </P> tags to mark paragraph boundaries.

When it comes down to it, the distinction between plain text and rich text is a judgment call. It’s kind
of like Potter Stewart’s famous remark on obscenity—“I may not be able to give you a definition, but
I know it when I see it.” Still, the principle is that Unicode encodes only plain text. Unicode may be
used as the basis of a scheme for representing rich text, but isn’t intended as a complete solution to
this problem on its own.

Rich text is an example of a “higher-level protocol,” a phrase you’ll run across a number of times in
the Unicode standard. A higher-level protocol is anything that starts with the Unicode standard as its
basis and then adds additional rules or processes to it. XML, for example, is a higher-level protocol
that uses Unicode as its base and adds rules that define how plain Unicode text can be used to
represent structured information through the use of various kinds of markup tags. To Unicode, the
markup tags are just Unicode text like everything else, but to XML, they delineate the structure of the
document. You can, in fact, have multiple layers of protocols: XHTML is a higher-level protocol for
representing rich text that uses XML as its base.

Markup languages such as HTML and XML are one example of how a higher-level protocol may be
used to represent rich text. The other main class of higher-level protocols involves the use of multiple
data structures, one or more of which contain plain Unicode text, and which are supplemented by
other data structures that contain the information on the document’s structure, the text’s visual
presentation, and any other non-text items that are included with the text. Most word processing
programs use schemes like this.

Another thing Unicode isn’t is a complete solution for software internationalization. Software
internationalization is a set of design practices that lead to software that can be adapted for various
international markets (“localized”) without having to modify the executable code. The Unicode
standard in all of its details includes a lot of stuff, but doesn’t include everything necessary to
produce internationalized software. In fact, it’s perfectly possible to write internationalized software
without using Unicode at all, and also perfectly possible to write completely non-internationalized
software that uses Unicode.

Unicode is a solution to one particular problem in writing internationalized software: representing
text in different languages without getting tripped up dealing with the multiplicity of encoding
standards out there. This is an important problem, but it’s not the only problem that needs to be
solved when developing internationalized software. Among the other things an internationalized
piece of software might have to worry about are:

• Presenting a different user interface to the user depending on what language he speaks. This may
involve not only translating any text in the user interface into the user’s language, but also altering
screen layouts to accommodate the size or writing direction of the translated text, changing icons
and other pictorial elements to be meaningful (or not to be offensive) to the target audience,
changing color schemes for the same reasons, and so forth.
• Altering the ways in which binary values such as numbers, dates, and times are presented to the user, or
the ways in which the user enters these values into the system. This involves not only relatively
small things, like being able to change the character that’s used for a decimal point (it’s a comma,
not a period, in most of Europe) or the order of the various pieces of a date (day-month-year is
common in Europe), but possibly larger-scale changes (Chinese uses a completely different
system for writing numbers, for example, and Israel uses a completely different calendar system).
(See the sketch after this list.)
• Altering various aspects of your program’s behavior. For example, sorting a list into alphabetical
order may produce different orders for the same list depending on language because “alphabetical
order” is a language-specific concept. Accounting software might need to work differently in
different places because of differences in accounting rules.
• And the list goes on…
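
As an aside, here’s a small Java sketch (my own illustration, not something the book prescribes) of the second point above: the same number and date formatted under different locale conventions using java.text.NumberFormat and java.text.DateFormat:

import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

public class LocaleFormatting {
    public static void main(String[] args) {
        double price = 1234.56;
        Date now = new Date();

        // The same values come out "1,234.56", "1 234,56", and "1.234,56",
        // with the date pieces ordered differently as well.
        for (Locale locale : new Locale[] { Locale.US, Locale.FRANCE, Locale.GERMANY }) {
            String number = NumberFormat.getNumberInstance(locale).format(price);
            String date = DateFormat.getDateInstance(DateFormat.MEDIUM, locale).format(now);
            System.out.println(locale + ": " + number + "  " + date);
        }
    }
}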

To those experienced in software internationalization, this is all obvious, of course, but those who
aren’t often seem to use the words “Unicode” and “internationalization” interchangeably. If you’re in
this camp, be careful: if you’re writing in C++, storing all your character strings as arrays of
wchar_t doesn’t make your software internationalized. Likewise, if you’re writing in Java, the fact
that it’s in Java and the Java String class uses Unicode doesn’t automatically make your software
internationalized. If you’re unclear on the internationalization issues you might run into that Unicode
doesn’t solve, you can find an excellent introduction to the subject at
http://www.xerox-emea.com/globaldesign/paper/paper1.htm, along with a wealth of other useful papers and
other goodies.


Finally, another thing Unicode isn’t is a glyph registry. We’ll get into the Unicode character-glyph
model in Chapter 3, but it’s worth a quick synopsis here. Unicode draws a strong, important
distinction between a character, which is an abstract linguistic concept such as “the Latin letter A” or
“the Chinese character for ‘sun,’” and a glyph, which is a concrete visual presentation of a character,
such as A or 日. There isn’t a one-to-one correspondence between these two concepts: a single glyph
may represent more than one character (such a glyph is often called a ligature), such as the “fi” ligature,
a single mark that represents the letters f and i together. Or a single character might be represented
by two or more glyphs: The vowel sound au in the Tamil language (• ) is represented by two
marks: one that goes to the left of a consonant character, and another on the right, but it’s still
thought of as a single character. A character may also be represented using different glyphs in
different contexts: The Arabic letter heh has one shape when it stands alone ( ) and another when it
occurs in the middle of a word ( ).

You’ll also see what we might consider typeface distinctions between different languages using the
same writing system. For instance, both Arabic and Urdu use the Arabic alphabet, but Urdu is
generally written in the more ornate Nastaliq style, while Arabic frequently isn’t. Japanese and
Chinese are both written using Chinese characters, but some characters are drawn with a different
shape in Chinese typography than in Japanese.

Unicode, as a rule, doesn’t care about any of these distinctions. It encodes underlying semantic
concepts, not visual presentations (characters, not glyphs) and relies on intelligent rendering software
(or the user’s choice of fonts) to draw the correct glyphs in the correct places. Unicode does
sometimes encode glyphic distinctions, but only when necessary to preserve interoperability with
some preexisting standard or to preserve legibility (i.e., if smart rendering software can’t pick the
right glyph for a particular character in a particular spot without clues in the encoded text itself).
Despite these exceptions, Unicode by design does not attempt to catalogue every possible variant
shape for a particular character. It encodes the character and leaves the shape to higher-level
protocols.

The challenge of representing text in computers


The main body of the Unicode standard is 1,040 pages long, counting indexes and appendices, and
there’s a bunch of supplemental information—addenda, data tables, related substandards,
implementation notes, etc.—on the Unicode Consortium’s Web site. That’s an awful lot of verbiage.
And now here I come with another 500 pages on the subject. Why? After all, Unicode’s just a
character encoding. Sure, it includes a lot of characters, but how hard can it be?

Let’s take a look at this for a few minutes. The basic principle at work here is simple: If you want to
be able to represent textual information in a computer, you make a list of all the characters you want
to represent and assign each one a number.2 Now you can represent a sequence of characters with a
sequence of numbers.

2
Actually, you assign each one a bit pattern, but numbers are useful surrogates for bit patterns, since
there’s a generally-agreed-upon mapping from numbers to bit patterns. In fact, in some character
encoding standards, including Unicode, there are several alternate ways to represent each character in
bits, all based on the same numbers, and so the numbers you assign to them become useful as an
intermediate stage between characters and bits. We’ll look at this more closely in Chapters 2 and 6.

Consider the simple number code we all learned as kids, where A is 1, B is 2,
C is 3, and so on. Using this scheme, the word…

food
…would be represented as 6-15-15-4.

Piece of cake. This also works well for Japanese…

…although you need a lot more numbers (which introduces its own set of problems, which we’ll get
to in a minute).

In real life, you may choose the numbers fairly judiciously, to facilitate things like sorting (it’s useful,
for example, to make the numeric order of the character codes follow the alphabetical order of the
letters) or character-type tests (it makes sense, for example, to put all the digits in one contiguous
group of codes and the letters in another, or even to position them in the encoding space such that
you can check whether a character is a letter or digit with simple bit-masking). But the basic principle
is still the same: Just assign each character a number.
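
As a small illustration of the kind of layout tricks this paragraph alludes to (again, a sketch of my own, not anything normative), the familiar ASCII arrangement puts the digits in one contiguous block and separates an upper-case letter from its lower-case partner by a single bit:

public class AsciiTricks {
    // In ASCII, the digits 0-9 occupy the contiguous range 0x30-0x39, and an
    // upper-case letter differs from its lower-case partner only in the 0x20
    // bit, so simple comparisons and masking answer "what kind of character
    // is this?" without any lookup tables.
    static boolean isAsciiDigit(char c) { return c >= 0x30 && c <= 0x39; }

    static char toAsciiLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c | 0x20) : c;
    }

    public static void main(String[] args) {
        System.out.println(isAsciiDigit('7'));   // true
        System.out.println(toAsciiLower('G'));   // g
    }
}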

It starts to get harder, or at least less clear-cut, as you move to other languages. Consider this phrase:

à bientôt
What do you do with the accented letters? You have two basic choices: You can either just assign a
different number to each accented version of a given letter, or you can treat the accent marks as
independent characters and give them their own numbers.

If you take the first approach, a process examining the text (comparing two strings, perhaps) can lose
sight of the fact that a and à are the same letter, possibly causing it to do the wrong thing, without
extra code that knows from extra information that à is just an accented version of a. If you take the
second approach, a keeps its identity, but you then have to make decisions about where the code for the
accent goes in the sequence relative to the code for the a, and what tells a system that the accent
belongs on top of the a and not some other letter.

For European languages, though, the first approach (just assigning a new number to the accented
version of a letter) is generally considered to be simpler. But there are other situations…

[a line of pointed Hebrew text]

…such as this Hebrew example, where that approach breaks down. Here, most of the letters have
marks on them, and the same marks can appear on any letter. Assigning a unique code to every
letter-mark combination quickly becomes unwieldy, and you have to go to giving the marks their own
codes.

In fact, Unicode prefers the give-the-marks-their-own-codes approach, but in many cases also
provides unique codes for the more common letter-mark combinations. This means that many
combinations of characters can be represented more than one way. The “à” and “ô” in “à bientôt,” for
example, can be represented either with single character codes, or with pairs of character codes, but
you want “à bientôt” to be treated the same no matter which set of codes is used to represent it, so
this requires a whole bunch of equivalence rules.
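
Here’s a quick Java sketch (mine, not the book’s official example) showing the two representations of “à bientôt” and how a normalization step, in this case java.text.Normalizer, is what makes them compare as equivalent:

import java.text.Normalizer;

public class EquivalentForms {
    public static void main(String[] args) {
        String precomposed = "\u00E0 bient\u00F4t";    // "à" and "ô" as single character codes
        String decomposed  = "a\u0300 biento\u0302t";  // base letters plus combining accents

        // The raw code sequences differ...
        System.out.println(precomposed.equals(decomposed));                 // false

        // ...but after normalizing both to the same form, they compare equal.
        System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFD)));  // true
    }
}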

The whole idea that you number the characters in a line as they appear from left to right doesn’t just
break down when you add accent marks and the like into the picture. Sometimes, it’s not even
straightforward when all you’re dealing with are letters. In this sentence…

Avram said [Hebrew phrase] and smiled.


…which is in English with some Hebrew words embedded, you have an ordering quandary: The
Hebrew letters don’t run from left to right; they run from right to left: the first letter in the Hebrew
phrase is the one furthest to the right. This poses a problem for representation order. You can’t
really store the characters in the order they appear on the line (in effect, storing either the English or
the Hebrew “backward”), because that messes up the determination of which characters go on which
line when you break text across lines. But if you store the characters in the order they’re read or typed,
you need to specify just how they are to be arranged when the text is displayed or printed.
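
To get a concrete feel for the store-in-logical-order approach, here’s a hedged sketch using Java’s java.text.Bidi class, which implements the Unicode bidirectional algorithm and reports the directional runs a rendering system would have to rearrange for display (the embedded Hebrew word is just an illustrative choice):

import java.text.Bidi;

public class LogicalVsDisplayOrder {
    public static void main(String[] args) {
        // English with an embedded Hebrew word, stored in logical
        // (typing/reading) order, not in left-to-right display order.
        String text = "Avram said \u05E9\u05DC\u05D5\u05DD and smiled.";

        Bidi bidi = new Bidi(text, Bidi.DIRECTION_LEFT_TO_RIGHT);
        System.out.println("mixed directions? " + bidi.isMixed());
        for (int i = 0; i < bidi.getRunCount(); i++) {
            // Odd embedding levels are right-to-left runs that a renderer
            // would lay out from right to left on the line.
            System.out.println("run " + i + ": chars " + bidi.getRunStart(i)
                    + "-" + bidi.getRunLimit(i)
                    + " at embedding level " + bidi.getRunLevel(i));
        }
    }
}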
The ordering thing can go to even wilder extremes. This example…

…is in Hindi. The letters in Hindi knot together into clusters representing syllables. The syllables
run from left to right across the page like English text does, but the arrangement of the marks within a
syllable can be complicated and doesn’t necessarily follow the order in which the sounds they
correspond to are actually spoken (or the characters themselves are typed). There are six characters
in this word, arranged like this:


Many writing systems have complicated ordering or shaping behavior, and each presents unique
challenges in determining how to represent the characters as a linear sequence of bits.
You also run into interesting decisions as to just what you mean by “the same character” or “different
characters.” For example, in this Greek word…

…the letter σ and the letter ς are really both the letter sigma—it’s written one way when it occurs
at the end of a word and a different way when it occurs at the beginning or in the middle. Two
distinct shapes representing the same letter. Do they get one character code or two? This issue
comes up over and over again, as many writing systems have letters that change shape depending on
context. In our Hindi example, the hook in the upper right-hand corner normally looks like this…


…but can take some very different forms (including the hook) depending on the characters
surrounding it.
You also have the reverse problem of the same shape meaning different things. Chinese characters
often have more than one meaning or pronunciation. Does each different meaning get its own
character code? The letter Å can either be a letter in some Scandinavian languages or the symbol for
the Angstrom unit. Do these two uses get different codes, or is it the same character in both places?
What about this character:

Is this the number 3 or the Russian letter z? Do these share the same character code just because they
happen to look a lot like each other?
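
For what it’s worth, a couple of these particular questions have concrete answers you can poke at in code. This little Java sketch (mine, not the book’s) relies on two facts: the final and non-final forms of sigma got separate character codes, while the angstrom sign got its own code that is canonically equivalent to Å:

import java.text.Normalizer;

public class SameOrDifferent {
    public static void main(String[] args) {
        // Medial sigma (U+03C3) and final sigma (U+03C2) are two codes...
        System.out.println("\u03C3".equals("\u03C2"));                     // false

        // ...while the angstrom sign (U+212B) is a separate code that
        // normalizes to the letter Å (U+00C5).
        System.out.println(Normalizer.normalize("\u212B", Normalizer.Form.NFC)
                .equals("\u00C5"));                                        // true
    }
}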
For all of these reasons and many others like them, Unicode is more than just a collection of marks
on paper with numbers assigned to them. Every character has a story, and for every character or
group of characters, someone had to sit down and decide whether it was the same as or different from
the other characters in Unicode, whether several related marks got assigned a single number or
several, just what a series of numbers in computer memory would look like when you draw them on
the screen, just how a series of marks on a page would translate into a series of numbers in computer
memory when neither of these mappings was straightforward, how a computer performing various
types of processes on a series of Unicode character codes would do its job, and so on.
So for every character code in the Unicode standard, there are rules about what it means, how it
should look in various situations, how it gets arranged on a line of text with other characters, what
other characters are similar but different, how various text-processing operations should treat it, and
so on. Multiply all these decisions by 94,140 unique character codes, and you begin both to get an
idea of why the standard is so big, and of just how much labor, how much energy, and how much
heartache, on the part of so many people, went into this thing. Unicode is the largest, most
comprehensive, and most carefully designed standard of its type, and the toil of hundreds of people
made it that way.

What This Book Does


The definitive source on Unicode is, not surprisingly, the Unicode standard itself. The main body
of the standard is available in book form as The Unicode Standard, Version 3.0, published by
Addison-Wesley and available wherever you bought this book. This book is supplemented by
various tables of character properties covering the exact semantic details of each individual
character, and by various technical reports that clarify, supplement, or extend the standard in
various ways. A snapshot of this supplemental material is on the CD glued to the inside back cover
of the book. The most current version of this supplemental material, and indeed the definitive
source for all the most up-to-date material on the Unicode standard, is the Unicode Web site, at
http://www.unicode.org.

For a long time, the Unicode standard was not only the definitive source on Unicode, it was the
only source. The problem with this is that the Unicode standard is just that: a standard. Standards
documents are written with people who will implement the standard as their audience. They
assume extensive domain knowledge and are designed to define as precisely as possible every
aspect of the thing being standardized. This makes sense: the whole purpose of a standard is to
ensure that a diverse group of corporations and institutions all do some particular thing in the same
way so that things produced by these different organizations can work together properly. If there
are holes in the definition of the standard, or passages that are open to interpretation, you could
wind up with implementations that conform to the standard, but still don’t work together properly.

Because of this, and because they’re generally written by committees whose members have
different, and often conflicting, agendas, standards tend by their very nature to be dry, turgid,
legalistic, and highly technical documents. They also tend to be organized in a way that
presupposes considerable domain knowledge—if you’re coming to the topic fresh, you’ll often
find that to understand any particular chapter of a standard, you have to read every other chapter first.

The Unicode standard is better written than most, but it’s still good bedtime reading—at least if
you don’t mind having nightmares about canonical reordering or the bi-di algorithm. That’s where
this book comes in. It’s intended to act as a companion to the Unicode standard and supplement it
by doing the following:

• Provide a more approachable, and more pedagogically organized, introduction to the salient
features of the Unicode standard.
• Capture in book form changes and additions to the standard since it was last published in book
form, and additions and adjuncts to the standard that haven’t been published in book form.
• Fill in background information about the various features of the standard that are beyond the
scope of the standard itself.
• Provide an introduction to each of the various writing systems Unicode represents and the
encoding and implementation challenges presented by each.
• Provide useful information on implementing various aspects of the standard, or using existing
implementations.


My hope is to provide a good enough introduction to “the big picture” and the main components of
the technology that you can easily make sense of the more detailed descriptions in the standard
itself—or know you don’t have to.

This book is for you if you’re a programmer using any technology that depends on the Unicode
standard for something. It will give you a good introduction to the main concepts of Unicode,
helping you to understand what’s relevant to you and what things to look for in the libraries or
APIs you depend on.

This book is also for you if you’re doing programming work that actually involves implementing
part of the Unicode standard and you’re still relatively new either to Unicode itself or to software
internationalization in general. It will give you most of what you need and enough of a foundation
to be able to find complete and definitive answers in the Unicode standard and its technical
reports.

How this book is organized


This book is organized into three sections: Section I, Unicode in Essence, provides an architectural
overview of the Unicode standard, explaining the most important concepts in the standard and the
motivations behind them. Section II, Unicode in Depth, goes deeper, taking a close look at each
of the writing systems representable using Unicode and the unique encoding and implementation
problems they pose. Section III, Unicode in Action, takes an in-depth look at what goes into
implementing various aspects of the standard, writing code that manipulates Unicode text, and how
Unicode interacts with other standards and technologies.

Section I: Unicode in Essence


Section I provides an introduction to the main structure and most important concepts of the
standard, the things you need to know to deal properly with Unicode whatever you’re doing with it.

Chapter 1, the chapter you’re reading, is the book’s introduction. It gives a very high-level
account of the problem Unicode is trying to solve, the goals and non-goals behind the standard,
and the complexity of the problem. It also sets forth the goals and organization of this book.

Chapter 2 puts Unicode in historical context and relates it to the various other character encoding
standards out there. It discusses ISO 10646, Unicode’s sister standard, and its relationship to the
Unicode standard.

Chapter 3 provides a more complete architectural overview. It outlines the structure of the
standard, Unicode’s guiding design principles, and what it means to conform to the Unicode
standard.

Often, it takes two or more Unicode character codes to get a particular effect, and some effects can
be achieved with two or more different sequences of codes. Chapter 4 talks more about this
concept, the combining character sequence, and the extra rules that specify how to deal with
combining character sequences that are equivalent.

Every character in Unicode has a large set of properties that define its semantics and how it should
be treated by various processes. These are all set forth in the Unicode Character Database, and
Chapter 5 introduces the database and all of the various character properties it defines.

The Unicode standard is actually in two layers: A layer that defines a transformation between
written text and a series of abstract numeric codes, and a layer that defines a transformation
between those abstract numeric codes and patterns of bits in memory or in persistent storage. The
lower layer, from abstract numbers to bits, comprises several different mappings, each optimized
for different situations. Chapter 6 introduces and discusses these mappings.

Section II: Unicode in Depth


Unicode doesn’t specifically deal in languages; instead it deals in scripts, or writing systems. A
script is a collection of characters used to represent a group of related languages. Generally, no
language uses all the characters in a script. For example, English is written using the Latin
alphabet. Unicode encodes 819 Latin letters, but English only uses 52 (26 upper- and lower-case
letters). Section II goes through the standard script by script, looking at the features of each script,
the languages that are written with it, and how it’s represented in Unicode. It groups scripts into
families according to their common characteristics.

For example, in Chapter 7 we look at the scripts used to write various European languages. These
scripts generally don’t pose any interesting ordering or shaping problems, but are the only scripts
that have special upper- and lower-case forms. They’re all descended from (or otherwise related
to) the ancient Greek alphabet. This group includes the Latin, Greek, Cyrillic, Armenian, and
Georgian alphabets, as well as various collections of diacritical marks and the International
Phonetic Alphabet.

Chapter 8 looks at the scripts of the Middle East. The biggest feature shared by these scripts is
that they’re written from right to left rather than left to right. They also tend to use letters only for
consonant sounds, using separate marks around the basic letters to represent the vowels. Two
scripts in this group are cursively connected, even in printed text, which poses interesting
representational problems. These scripts are all descended from the ancient Aramaic alphabet.
This group includes the Hebrew, Arabic, Syriac, and Thaana alphabets.

Chapter 9 looks at the scripts of India and Southeast Asia. The letters in these scripts knot
together into clusters that represent whole syllables. The scripts in this group all descend from the
ancient Brahmi script. This group includes the Devanagari script used to write Hindi and Sanskrit,
plus eighteen other scripts, including such things as Thai and Tibetan.

In Chapter 10, we look at the scripts of East Asia. The interesting things here are that these
scripts comprise tens of thousands of unique, and often complicated, characters (the exact number
is impossible to determine, and new characters are coined all the time). These characters are
generally all the same size, don’t combine with each other, and can be written either from left to
right or vertically. This group includes the Chinese characters and various other writing systems
that either are used with Chinese characters or arose under their influence.

While most of the written languages of the world are written using a writing system that falls into
one of the above groups, not all of them do. Chapter 11 discusses the other scripts, including
Mongolian, Ethiopic, Cherokee, and the Unified Canadian Aboriginal Syllabics, a set of characters
used for writing a variety of Native American languages. In addition to the modern scripts,
Unicode also encodes a growing number of scripts that are not used anymore but are of scholarly
interest. The current version of Unicode includes four of these, which are also discussed in
Chapter 11.

But of course, you can’t write only with the characters that represent the sounds or words of
spoken language. You also need things like punctuation marks, numbers, symbols, and various
other non-letter characters. These are covered in Chapter 12, along with various special
formatting and document-structure characters.

Section III: Unicode in Action


This section goes into depth on various techniques that can be used in code to implement or make
use of the Unicode standard.

Chapter 13 provides an introduction to the subject, discussing a group of generic data structures
and techniques that are useful for various types of processes that operate on Unicode text.

Chapter 14 goes into detail on how to perform various types of transformations and conversions
on Unicode text. This includes converting between the various Unicode serialization formats,
performing Unicode compression and decompression, performing Unicode normalization,
converting between Unicode and other encoding standards, and performing case mapping and case
folding.

Chapter 15 zeros in on two of the most important text-analysis processes: searching and sorting. It talks
about both language-sensitive and language-insensitive string comparison and how searching and
sorting algorithms build on language-sensitive string comparison.

Chapter 16 discusses the most important operations performed on text: drawing it on the screen
(or other output devices) and accepting it as input, otherwise known as rendering and editing. It
talks about dividing text up into lines, arranging characters on a line, figuring out what shape to use
for a particular character or sequence of characters, and various special considerations one must
deal with when writing text-editing software.

Finally, in Chapter 17 we look at the place where Unicode intersects with other computer
technologies. It discusses Unicode and the Internet, Unicode and various programming languages,
Unicode and various operating systems, and Unicode and database technology.



CHAPTER 2 A Brief History of Character Encoding

To understand Unicode fully, it’s helpful to have a good sense of where we came from, and what this
whole business of character encoding is all about. Unicode didn’t just spring fully-grown from the
forehead of Zeus; it’s the latest step in a history that actually predates the digital computer, having its
roots in telecommunications. Unicode is not the first attempt to solve the problem it solves, and
Unicode is also in its third major revision. To understand the design decisions that led to Unicode
3.0, it’s useful to understand what worked and what didn’t work in Unicode’s many predecessors.

This chapter is entirely background—if you want to jump right in and start looking at the features and
design of Unicode itself, feel free to skip this chapter.

Prehistory
Fortunately, unlike, say, written language itself, the history of electronic (or electrical)
representations of written language doesn’t go back very far. This is mainly, of course, because the
history of the devices using these representations doesn’t go back very far.

The modern age of information technology does, however, start earlier than one might think at first—
a good century or so before the advent of the modern digital computer. We can usefully date the
beginning of modern information technology from Samuel Morse’s invention of the telegraph in
1837.3

3
My main source for this section is Tom Jennings, “Annotated History of Character Codes,” found at
http://www.wps.com/texts/codes.


The telegraph, of course, is more than just an interesting historical curiosity. Telegraphic
communication has never really gone away, although it’s morphed a few times. Even long after the
invention and popularization of the telephone, the successors of the telegraph continued to be used to
send written communication over a wire. Telegraphic communication was used to send large
volumes of text, especially when it needed ultimately to be in written form (news stories, for
example), or when human contact wasn’t especially important and saving money on bandwidth was
very important (especially for routine business communications such as orders and invoices). These
days, email and EDI are more or less the logical descendants of the telegraph.

The telegraph and Morse code


So our story starts with the telegraph. Morse’s original telegraph code actually worked on numeric
codes, not the alphanumeric code that we’re familiar with today. The idea was that the operators on
either end of the line would have a dictionary that assigned a unique number to each word, not each
letter, in English (or a useful subset). The sender would look up each word in the dictionary, and
send the number corresponding to that word; the receiver would do the opposite. (The idea was
probably to automate this process in some way, perhaps with some kind of mechanical device that
would point to each word in a list or something.)

This approach had died out by the time of Morse’s famous “WHAT HATH GOD WROUGHT”
demonstration in 1844. By this time, the device was being used to send the early version of what we
now know as “Morse code,” which was probably actually devised by Morse’s assistant Alfred Vail.

Morse code was in no way digital in the sense we think of the term—you can’t easily turn it into a
stream of 1 and 0 bits the way you can with many of the succeeding codes. But it was “digital” in the
sense that it was based on a circuit that had only two states, on and off. This is really Morse’s big
innovation; there were telegraph systems prior to Morse, but they were based on sending varying
voltages down the line and deflecting a needle on a gauge of some kind.4 The beauty of Morse’s
scheme is a higher level of error tolerance—it’s a lot easier to tell “on” from “off” than it is to tell
“half on” from “three-fifths” on. This, of course, is also why modern computers are based on binary
numbers.

The difference is that Morse code is based not on a succession of “ons” and “offs,” but on a
succession of “ons” of different lengths, and with some amount of “off” state separating them. You
basically had two types of signal, a long “on” state, usually represented with a dash and pronounced
“dah,” and a short “on” state, usually represented by a dot and pronounced “dit.” Individual letters
were represented with varying-length sequences of dots and dashes.

The lengths of the codes were designed to correspond roughly to the relatively frequencies of the
characters in a transmitted message. Letters were represented with anywhere from one to four dots
and dashes. The two one-signal letters were the two most frequent letters in English: E was
represented with a single dot and T with a single dash. The four two-signal letters were I (. .), A (. –
), N (– .), and M (– –). The least common letters were represented with the longest codes: Z (– – . .),
Y (– . – –), J (. – – –), and Q (– – . –). Digits were represented with sequences of five signals, and
punctuation, which was used sparingly, was represented with sequences of six signals.

The dots and dashes were separated by just enough space to keep everything from running together,
individual characters by longer spaces, and words by even longer spaces.

4
This tidbit comes from Steven J. Searle, “A Brief History of Character Codes,” found at
http://www.tronweb.super-nova-co-jp/characcodehist.html.


As an interesting sidelight, the telegraph was designed as a recording instrument—the signal operated
a solenoid that caused a stylus to dig shallow grooves in a moving strip of paper. (The whole thing
with interpreting Morse code by listening to beeping like we’ve all seen in World War II movies
came later, with radio, although experienced telegraph operators could interpret the signal by
listening to the clicking of the stylus.) This is a historical antecedent to the punched-tape systems
used in teletype machines and early computers.

The teletypewriter and Baudot code


Of course, what you really want isn’t grooves on paper, but actual writing on paper, and one of the
problems with Morse’s telegraph is that the varying lengths of the signals didn’t lend itself well to
driving a mechanical device that could put actual letters on paper. The first big step in this direction
was Emile Baudot’s “printing telegraph,” invented in 1874.

Baudot’s system didn’t use a typewriter keyboard; it used a pianolike keyboard with five keys, each
of which controlled a separate electrical connection. The operator operated two keys with the left
hand and three with the right and sent each character by pressing down some combination of these
five keys simultaneously. (So the “chording keyboards” that are somewhat in vogue today as a way
of combating RSI aren’t a new idea—they go all the way back to 1874.)

[should I include a picture of the Baudot keyboard?]

The code for each character is thus some combination of the five keys, so you can think of it as a
five-bit code. Of course, this only gives you 32 combinations to play with, kind of a meager
allotment for encoding characters. You can’t even get all the letters and digits in 32 codes.

The solution to this problem has persisted for many years since: you have two separate sets of
characters assigned to the various key combinations, and you steal two key combinations to switch
between them. So you end up with a LTRS bank, consisting of twenty-eight letters (the twenty-six
you’d expect, plus two French letters), and a FIGS bank, consisting of twenty-eight characters: the
ten digits and various punctuation marks and symbols. The three left-hand-only combinations don’t
switch functions: two of them switch back and forth between LTRS and FIGS, and one (both left-
hand keys together) was used to mean “ignore the last character” (this later evolves into the ASCII
DEL character). The thirty-second combination, no keys at all, of course didn’t mean anything—no
keys at all was what separated one character code from the next. (The FIGS and LTRS signals
doubled as spaces.)

So you’d go along in LTRS mode, sending letters. When you came to a number or punctuation mark,
you’d send FIGS, send the number or punctuation, then send LTRS and go back to sending words
again. “I have 23 children.” would thus get sent as

I HAVE [FIGS] 23 [LTRS] CHILDREN [FIGS] .

It would have been possible, of course, to get the same effect by just adding a sixth key, but this was
considered too complicated mechanically.
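
The stateful flavor of this kind of encoding is easy to sketch in code. The following Java fragment is purely illustrative (it emits made-up “[LTRS]” and “[FIGS]” markers rather than real five-bit codes), but it shows the bank-switching logic behind the “I HAVE 23 CHILDREN.” example above:

public class ShiftStateSketch {
    // Emit a shift marker whenever the "bank" of the next character changes,
    // the way a Baudot-style encoder would. The real code tables aren't
    // reproduced here; this only shows the stateful structure of the encoding.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        boolean inLetters = true;                        // transmission starts in LTRS mode
        for (char c : text.toCharArray()) {
            if (c == ' ') { out.append(' '); continue; } // space is valid in either bank
            boolean needsLetters = Character.isLetter(c);
            if (needsLetters != inLetters) {
                out.append(needsLetters ? "[LTRS]" : "[FIGS]");
                inLetters = needsLetters;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("I HAVE 23 CHILDREN ."));
        // -> I HAVE [FIGS]23 [LTRS]CHILDREN [FIGS].
    }
}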

Even though Baudot’s code (actually invented by Johann Gauss and Wilhelm Weber) can be thought
of as a series of five-bit numbers, it isn’t laid out like you might lay out a similar thing today: If you
lay out the code charts according to the binary-number order, it looks jumbled. As with Morse code,
characters were assigned to key combinations in such a way as to minimize fatigue, both to the

operator and to the machinery. More-frequent characters, for example, used fewer fingers than less-
frequent characters.

Other teletype and telegraphy codes


The logical next step from Baudot’s apparatus would, of course, be a system that uses a normal
typewriterlike keyboard instead of the pianolike keyboard used by Baudot. Such a device was
invented by Donald Murray sometime between 1899 and 1901. Murray kept the five-bit two-state
system Baudot had used, but rearranged the characters. Since there was no longer a direct correlation
between the operator’s hand movements and the bits being sent over the wire, there was no need to
worry about arranging the code to minimize operator fatigue; instead, Murray designed his code
entirely to minimize wear and tear on the machinery.

A couple interesting developments occur first in the Murray code: You see the debut of what later
became known as “format effectors” or “control characters”—the CR and LF codes, which,
respectively return the typewriter carriage to the beginning of the line and advance the platen by one
line.5 Two codes from Baudot also move to the positions where they stayed since (at least until the
introduction of Unicode, by which time the positions no longer mattered): the NULL or BLANK all-
bits-off code and the DEL all-bits-on code. All bits off, fairly logically, meant that the receiving
machine shouldn’t do anything; it was essentially used as an idle code for when no messages were
being sent. On real equipment, you also often had to pad codes that took a long time to execute with
NULLs: If you issued a CR, for example, it’d take a while for the carriage to return to the home
position, and any “real” characters sent during this time would be lost, so you’d send a bunch of extra
NULLs after the CR. This would put enough space after the CR so that the real characters wouldn’t
go down the line until the receiving machine could print them, and not have any effect (other than
maybe to waste time) if the carriage got there while they were still being sent.

The DEL character would also be ignored by the receiving equipment. The idea here is that if
you’re using paper tape as an intermediate storage medium (as we still see today, it became common
to compose a message while off line, storing it on paper tape, and then log on and send the message
from the paper tape, rather than “live” from the keyboard) and you make a mistake, the only way to
blank out the mistake is to punch out all the holes in the line with the mistake. So a row with all the
holes punched out (or a character with all the bits set, as we think of it today) was treated as a null
character.

Murray’s code forms the basis of most of the various telegraphy codes of the next fifty years or so.
Western Union picked it up and used it (with a few changes) as its encoding method all the way
through the 1950s. The CCITT (Consultative Committee for International Telephone and Telegraph,
a European standards body) picked up the Western Union code and, with a few changes, blessed it as
an international standard, International Telegraphy Alphabet #2 (“ITA2” for short).

The ITA2 code is often referred to today as “Baudot code,” although it’s significantly different from
Baudot’s code. It does, however, retain many of the most important features of Baudot’s code.

Among the interesting differences between Murray’s code and ITA2 are the addition of more
“control codes”: You see the introduction of an explicit space character, rather than using the all-bits-
off signal, or the LTRS and FIGS signals, as spaces. There’s a new BEL signal, which rings a bell or
produces some other audible signal on the receiving end. And you see the first case of a code that
exists explicitly to control the communications process itself—the WRU, or “Who are you?” code,
which would cause the receiving machine to send some identifying stream of characters back to the
sending machine (this enabled the sender to make sure he was connected to the right receiver before
sending sensitive information down the wire, for example).

5
I’m taking my cue from Jennings here: These code positions were apparently marked “COL” and “LINE
PAGE” originally; Jennings extrapolates back from later codes that had CR and LF in the same positions and
assumes that “COL” and “LINE PAGE” were alternate names for the same functions.

This is where inertia sets in in the industry. By the time ITA2 and its national variants came into use
in the 1930s, you had a significant number of teletype machines out there, and you had the weight of
an international standard behind one encoding method. The ITA2 code would be the code used by
teletype machines right on into the 1960s. When computers started communicating with the outside
world in real time using some kind of terminal, the terminal would be a teletype machine, and the
computer would communicate with it using the teletype codes of the day. (The other main way
computers dealt with alphanumeric data was through the use of punched cards, which had their own
encoding schemes we’ll look at in a minute.)

FIELDATA and ASCII


The teletype codes we’ve looked at are all five-bit codes with two separate banks of characters. This
means the codes all include the concept of “state”: the interpretation of a particular code depends on
the last bank-change signal you got (some operations also included an implicit bank change—a
carriage return, for example, often switched the system back to LTRS mode). This makes sense as
long as you’re dealing with a completely serial medium where you don’t have random access into the
middle of a character stream and as long as you’re basically dealing with a mechanical system (LTRS
and FIGS would just shift the printhead or the platen to a different position). In computer memory,
where you might want random access to a character and not have to scan backwards an arbitrary
distance for a LTRS or FIGS character to find out what you’re looking at, it generally made more
sense to just use an extra bit to store the LTRS or FIGS state. This gives you a six-bit code, and for a
long time, characters were thought of as six-bit units. (Punched-card codes of the day were also six
bits in length, so you’ve got that, too.) This is one reason why many computers of the time had word
lengths that were multiples of 6: As late as 1978, Kernighan and Ritchie mention that the int and
short data types were 36 bits wide on the Honeywell 6000, implying that it had a 36-bit word
length.6

By the late 1950s, the computer and telecommunications industries were both starting to chafe under
the limitations of the six-bit teletype and punched-card codes of the day, and a new standards effort
was begun that eventually led to the ASCII code. An important predecessor of ASCII was the
FIELDATA code used in various pieces of communications equipment designed and used by the
U.S. Army starting in 1957. (It bled out into civilian life as well; UNIVAC computers of the day
were based on a modified version of the FIELDATA code.)

FIELDATA code was a seven-bit code, but was divided into layers in such a way that you could
think of it as a four-bit code with either two or three control bits appended, similar to punched-card
codes. It’s useful7 to think of it as having a five-bit core somewhat on the ITA2 model with two
control bits. The most significant, or “tag” bit, is used similarly to the LTRS/FIGS bit discussed
before: it switches the other six bits between two banks: the “alphabetic” and “supervisory” banks.
The next-most-significant bit shifted the five core bits between two sub-banks: the alphabetic bank
was shifted between upper-case and lower-case sub-banks, and the supervisory bank between a
supervisory and a numeric/symbols sub-bank.

6
See Brian W. Kernighan and Dennis M. Ritchie, The C Programming Language, first edition (Prentice-Hall,
1978), p. 34.
7
Or at least I think it’s useful, looking at the code charts—my sources don’t describe things this way.


In essence, the extra bit made it possible to include both upper- and lower-case letters for the first
time, and also made it possible to include a wide range of control, or supervisory, codes, which were
used for things like controlling the communication protocol (there were various handshake and
termination codes) and separating various units of variable-length structured data.

Also, within each five-bit bank of characters, we finally see the influence of computer technology:
the characters are ordered in binary-number order.

FIELDATA had a pronounced effect on ASCII, which was being designed at the same time.
Committee X3.4 of the American Standards Association (now the American National Standards
Institute, or ANSI) had been convened in the late 1950s, about the time FIELDATA was deployed,
and consisted of representatives from AT&T, IBM (which, ironically, didn’t actually use ASCII until
the IBM PC came out in 1981), and various other companies from the computer and
telecommunications industries.

The first result of this committee’s efforts was what we now know as ANSI X3.4-1963, the first
version of the American Standard Code for Information Interchange, our old friend ASCII (which,
for the three readers of this book who don’t already know this, is pronounced “ASS-key”). ASCII-
1963 kept the overall structure of FIELDATA and regularized some things. For example, it’s a pure
7-bit code and is laid out in such a way as to make it possible to reasonably sort a list into
alphabetical order by comparing the numeric character codes directly. ASCII-1963 didn’t officially
have the lower-case letters in it, although there was a big undefined space where they were obviously
going to go, and had a couple weird placements, such as a few control codes up in the printing-
character range. These were fixed in the next version of ASCII, ASCII-1967, which is more or less
the ASCII we all know and love today. It standardized the meanings of some of the control codes
that were left open in ASCII-1963, moved all of the control codes (except for DEL, which we talked
about) into the control-code area, and added the lower-case letters and some special programming-
language symbols. It also discarded a couple of programming-language symbols from ASCII-1963
in a bow to international usage: the upward-pointing arrow used for exponentiation turned into the
caret (which did double duty as the circumflex accent), and the left-pointing arrow (used sometimes
to represent assignment) turned into the underscore character.

Hollerith and EBCDIC


So ASCII has its roots in telegraphy codes that go all the way back to the nineteenth century, and in
fact was designed in part as telegraphy code itself. The other important category of character codes
is the punched-card codes. Punched cards were for many years the main way computers dealt with
alphanumeric data. In fact, the use of the punched card for data processing predates the modern
computer, although not by as long as the telegraph.

Punched cards date back at least as far as 1810, when Joseph-Marie Jacquard used them to control
automatic weaving machines, and Charles Babbage had proposed adapting Jacquard’s punched cards
for use in his “analytical engine.” But punched cards weren’t actually used for data processing until
the 1880s when Herman Hollerith, a U.S. Census Bureau employee, devised a method of using
punched cards to collate and tabulate census data. His punched cards were first used on a national
scale in the 1890 census, dramatically speeding the tabulation process: The 1880 census figures,
calculated entirely by hand, had taken seven years to tabulate. The 1890 census figures took six
weeks. Flush with the success of the 1890 census, Hollerith formed the Tabulating Machine
company in 1896 to market his punched cards and the machines used to punch, sort, and count them.


This company eventually merged with two others, diversified into actual computers, and grew into
what we now know as the world’s largest computer company, International Business Machines.8

The modern IBM rectangular-hole punched card was first introduced in 1928 and has since become
the standard punched-card format. It had 80 columns, each representing a single character (it’s no
coincidence that most text-based CRT terminals had 80-column displays). Each column had twelve
punch positions. This would seem to indicate that Hollerith code, the means of mapping from punch
positions to characters, was a 12-bit code, which would give you 4,096 possible combinations.

It wasn’t. This is because you didn’t actually want to use all the possible combinations of punch
positions—doing so would put too many holes in the card and weaken its structural integrity (not a
small consideration when these things are flying through the sorting machine at a rate of a few dozen
a second). The system worked like this: You had 10 main rows of punch holes, numbered 0 through
9, and two “zone” rows, 11 and 12 (row 0 also did double duty as a zone row). The digits were
represented by single punches in rows 0 through 9, and the space was represented with no punches at
all.

The letters were represented with two punches: a punch in row 11, 12, or 0 plus a punch in one of the
rows from 1 to 9. This divides the alphabet into three nine-letter “zones.” A few special characters
were represented with the extra couple one-punch combinations and the one remaining two-punch
combination. The others were represented with three-punch combinations: row 8 would be pressed
into service as an extra zone row and the characters would be represented with a combination of a
punch in row 8, a punch in row 11, 12, or 0, and a punch in one of the rows from 1 to 7 (row 9 wasn’t
used). This gave you three banks of seven symbols each (four banks if you include a bank that used row
8 as a zone row without an additional punch in rows 11, 12, or 0). Altogether, you got a grand total
of 67 unique punch combinations. Early punched card systems didn’t use all of these combinations,
but later systems filled in until all of the possible combinations were being used.

[would a picture be helpful here?]

In memory, the computers used a six-bit encoding system tied to the punch patterns: The four least-
significant bits would specify the main punch row (meaning 10 of the possible 16 combinations
would be used), and the two most-significant bits would identify which of the zone rows was
punched. (Actual systems weren’t always quite so straightforward, but this was the basic idea.)
This was sufficient to reproduce all of the one- and two-punch combinations in a straightforward
manner, and was known as the Binary Coded Decimal Information Code, BCDIC, or just BCD for
short.
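
Here’s a rough sketch of the letter layout and the zone-plus-digit idea just described. It isn’t any particular machine’s actual code chart (real BCD variants differed in the details), but it captures the structure:

public class PunchedCardSketch {
    // Classic Hollerith letter layout: A-I pair zone row 12 with digit rows
    // 1-9, J-R pair zone row 11 with 1-9, and S-Z pair zone row 0 with 2-9.
    static int zoneOf(char c) {
        if (c >= 'A' && c <= 'I') return 12;
        if (c >= 'J' && c <= 'R') return 11;
        if (c >= 'S' && c <= 'Z') return 0;
        return -1;                                    // digits and space: no zone punch
    }

    static int digitRowOf(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'I') return c - 'A' + 1;
        if (c >= 'J' && c <= 'R') return c - 'J' + 1;
        if (c >= 'S' && c <= 'Z') return c - 'S' + 2;
        return -1;                                    // space: no punches at all
    }

    // Six-bit "BCD-style" code: the low four bits name the digit row and the
    // high two bits name the zone (0 = none, 1 = row 0, 2 = row 11, 3 = row 12).
    // Again, this is the structural idea, not a real machine's code chart.
    static int sixBitCode(char c) {
        int zoneBits;
        switch (zoneOf(c)) {
            case 12: zoneBits = 3; break;
            case 11: zoneBits = 2; break;
            case 0:  zoneBits = 1; break;
            default: zoneBits = 0; break;
        }
        return (zoneBits << 4) | Math.max(digitRowOf(c), 0);
    }

    public static void main(String[] args) {
        for (char c : "IBM 7".toCharArray()) {
            System.out.printf("%c -> zone %d, digit row %d, six-bit 0x%02X%n",
                    c, zoneOf(c), digitRowOf(c), sixBitCode(c));
        }
    }
}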

IBM added two more bits to BCD to form the Extended Binary Coded Decimal Information Code, or
EBCDIC (pronounced “EB-suh-dick”), which first appeared with the introduction of the System/360
in 1964. It was backward compatible with the BCD system used in the punched cards, but added the
lower-case letters and a bunch of control codes borrowed from ASCII-1963 (you didn’t really need
control codes in a punched-card system, since the position of the columns on the cards, or the
division of information between cards, gave you the structure of the data and you didn’t need codes
for controlling the communication session). This code was designed both to allow a simple mapping
from character codes to punch positions on a punched card and, like ASCII, to produce a reasonable
sorting order when the numeric codes were used to sort character data (note that this doesn’t mean
you get the same order as ASCII—digits sort after letters instead of before them, and lower case sorts
before upper case instead of after).

8
Much of the information in the preceding paragraph is drawn from the Searle article; the source for the rest of
this section is Douglas W. Jones, “Punched Cards: An Illustrated Technical History” and “Doug Jones’
Punched Card Codes,” both found at http://www.cs.uiowa.edu/~jones/cards.

One consequence of EBCDIC’s lineage is that the three groups of nine letters in the alphabet that you
had in the punched-card codes are numerically separated from each other in the EBCDIC encoding: I
is represented by 0xC9 and J by 0xD1, leaving an eight-space gap in the numerical sequence. The
original version of EBCDIC encoded only 50 characters in an 8-bit encoding space, leaving a large
number of gaping holes with no assigned characters. Later versions of EBCDIC filled in these holes
in various ways, but retained the backward compatibility with the old punched-card system.
Although ASCII is finally taking over in IBM’s product line, EBCDIC still survives in the current
models of IBM’s System/390, even though punched cards are long obsolete. Backward compatibility
is a powerful force.
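
The practical consequence of those gaps shows up whenever you test character classes numerically.
Here's a small sketch in Python (the three ranges are the standard EBCDIC assignments for the
upper-case letters; the helper itself is just for illustration): a test that needs one contiguous range
in ASCII needs three of them in EBCDIC.

    # EBCDIC scatters the upper-case letters across three runs that mirror the
    # three punched-card zones: A-I, J-R, and S-Z.
    EBCDIC_UPPER_RANGES = [(0xC1, 0xC9),   # A-I
                           (0xD1, 0xD9),   # J-R
                           (0xE2, 0xE9)]   # S-Z

    def is_ebcdic_upper(byte):
        """True if the byte value is an upper-case letter in EBCDIC."""
        return any(lo <= byte <= hi for lo, hi in EBCDIC_UPPER_RANGES)

    # 0xCA falls in the hole between I (0xC9) and J (0xD1).
    print(is_ebcdic_upper(0xC9), is_ebcdic_upper(0xCA), is_ebcdic_upper(0xD1))
    # -> True False True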

Single-byte encoding systems


ANSI X3.4-1967 (ASCII-1967) went on to be adopted as an international standard, first by the
European Computer Manufacturers Association as ECMA-6 (which actually came out in 1965, two
years before the updated version of ANSI X3.4), and then by the International Organization for
Standardization (“ISO” for short) as ISO 646 in 1972. [This is generally the way it works with ISO
standards: ISO serves as an umbrella organization for a bunch of national standards bodies and
generally creates international standards by taking various national standards, modifying them to be
more palatable to an international audience, and republishing them as international standards.]

A couple of interesting things happened to ASCII on its way to turning into ISO 646. First, ISO 646
formalized a system for applying the various accent and other diacritical marks used in European
languages to the letters. It did this not by introducing accented variants of all the letters (there was no
room for that) but by pressing a bunch of punctuation marks into service as diacritical marks. The
apostrophe did double duty as the acute accent, the opening quote mark as the grave accent, the
double quotation mark as the diaeresis (or umlaut), the caret as the circumflex accent, the swung dash
as the tilde, and the comma as the cedilla. To produce an accented letter, you’d follow the letter in
the code sequence with a backspace code and then the appropriate “accent mark.” On a teletype
machine, this would cause the letter to be overstruck with the punctuation mark, producing an ugly
but serviceable version of the accented letter.
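
As a concrete example, here's what the byte sequence for an overstruck é would look like. This is
just a sketch; it assumes the ASCII/IRV code positions for the letter, the backspace control, and the
apostrophe doing duty as the acute accent.

    # ISO 646 accent composition: letter, then backspace, then the "accent mark."
    BS = 0x08                              # backspace control code
    e_acute = bytes([ord("e"), BS, ord("'")])
    print(e_acute)                         # b"e\x08'"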

The other thing is that ISO 646 leaves the definitions of twelve characters open (in ASCII, these
twelve characters are #, $, @, [, \, ], ^, `, {, |, }, and ~). These are called the “national use” code
positions. There’s an International Reference Version of ISO 646 that gives the American meanings
to the corresponding code points, but other national bodies were free to assign other characters to
these twelve code values. (Some national bodies did put accented letters in these slots.) The various
national variants have generally fallen out of use in favor of more-modern standards like ISO 8859
(see below), but vestiges of the old system still remain; for example, in many Japanese systems’
treatment of the code for \ as the code for ¥.9

9
The preceding information comes partially from the Jennings paper, and partially from Roman Czyborra,
“Good Old ASCII,” found at http://www.czyborra.com/charsets/iso646.html.
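
The Japanese situation is a handy illustration of how the same code value read different ways under
different national variants. The JIS X 0201 values below are the two positions commonly cited as
differing from US ASCII; this is a sketch, not a reproduction of the standard's code chart.

    # The same 7-bit code value, read under two different ISO 646 variants.
    US_ASCII_READINGS   = {0x5C: "\\",       0x7E: "~"}
    JIS_X_0201_READINGS = {0x5C: "\u00a5",   # YEN SIGN
                           0x7E: "\u203e"}   # OVERLINE

    for code in (0x5C, 0x7E):
        print(hex(code), US_ASCII_READINGS[code], JIS_X_0201_READINGS[code])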


Eight-bit encoding schemes and the ISO 2022 model

ASCII gave us the eight-bit byte. All earlier encoding systems (except for FIELDATA, which can be
considered an embryonic version of ASCII) used a six-bit byte. ASCII extended that to seven, and
most communication protocols tacked on an eighth bit as a parity-check bit. As the parity bit became
less necessary (especially in computer memory) and 7-bit ASCII codes were stored in eight-bit
computer bytes, it was only natural that the 128 bit combinations not defined by ASCII (the ones with
the high-order bit set) would be pressed into service to represent more characters.

This was anticipated as early as 1971, when the ECMA-35 standard was first published.10 This
standard later became ISO 2022. ISO 2022 sets forth a standard method of organizing the code
space for various character encoding methods. An ISO-2022-compliant character encoding can have
up to two sets of control characters, designated C0 and C1, and up to four sets of printing (“graphic”)
characters, designated G0, G1, G2, and G3.

The encoding space can be either seven or eight bits wide. In a seven-bit encoding, the byte values
from 0x00 to 0x1F are reserved for the C0 controls and the byte values from 0x20 to 0x7F are used
for the G0, G1, G2, and G3 sets. The range defaults to the G0 set, and escape sequences can be used
to switch it to one of the other sets.

In an eight-bit encoding, the range of values from 0x00 to 0x1F (the “CL area”) is the C0 controls
and the range from 0x80 to 0x9F (the “CR area”) is the C1 controls. The range from 0x20 to 0x7F
(the “GL area”) is always the G0 characters, and the range from 0xA0 to 0xFF (the “GR area”) can
be switched between the G1, G2, and G3 characters.
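
Here's that eight-bit layout expressed as code, a purely illustrative sketch using the area boundaries
just described:

    def iso2022_area(byte):
        """Name the ISO 2022 code area an eight-bit byte value falls into."""
        if byte <= 0x1F:
            return "CL (C0 controls)"
        if byte <= 0x7F:
            return "GL (G0 graphics; 0x20 and 0x7F are fixed as space and DEL)"
        if byte <= 0x9F:
            return "CR (C1 controls)"
        return "GR (G1, G2, or G3 graphics, depending on what's designated)"

    for b in (0x1B, 0x41, 0x85, 0xE9):
        print(hex(b), "->", iso2022_area(b))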

Control functions (i.e., signals to the receiving process, as opposed to printing characters) can be
represented either as single control characters or as sequences of characters usually beginning with a
control character (usually that control character is ESC and the multi-character control function is an
“escape sequence”). ISO 2022 uses escape sequences to represent C1 controls in the 7-bit systems,
and to switch the GR and GL areas between the various sets of printing characters in both the 7- and
8-bit versions. It also specifies a method of using escape sequences to associate the various areas of
the encoding space (C0, C1, G0, G1, G2, and G3) with actual sets of characters.

ISO 2022 doesn’t actually apply semantics to most of the code positions; the big exceptions are ESC,
DEL and the space, which are given the positions they have in ASCII. Other than these, character
semantics are taken from other standards, and escape sequences can be used to switch an area’s
interpretation from one standard to another (there’s a registry of auxiliary standards and the escape
sequences used to switch between them).
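
You can watch this switching machinery in action with ISO-2022-JP, a widely used seven-bit profile
built on this framework (it's defined in RFC 1468 rather than in ISO 2022 itself). The sketch below
leans on Python's built-in iso2022_jp codec; the escape sequences noted in the comments are the
standard designations that codec emits.

    # Encoding mixed ASCII and Japanese text produces visible designation
    # escape sequences in the byte stream:
    #   ESC $ B  designates JIS X 0208 into G0
    #   ESC ( B  designates ASCII back into G0
    data = "abc \u3042 xyz".encode("iso2022_jp")   # U+3042 is hiragana "a"
    print(data)
    # Something like: b'abc \x1b$B$"\x1b(B xyz'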

An ISO 2022-based standard can impose whatever semantics it wants on the various encoding areas
and can choose to use or not use the escape sequences for switching things. As a practical matter,
most ISO 2022-derived standards put the ISO 646 IRV (i.e., US ASCII) printing characters in the G0
area and, in the C0 area, the C0 control functions from ISO 6429 (an ISO standard that defines a
whole mess of control functions, along with standardized C0 and C1 sets; its C0 set is the same as
the ASCII control characters).

10
The information in this section is taken from Peter K. Edberg, “Survey of Character Encodings,” Proceedings
of the 13th International Unicode Conference, session TA4, September 9, 1998, and from a quick glance at
the ECMA 35 standard itself.


ISO 8859
By far the most important of the ISO 2022-derived encoding schemes is the ISO 8859 family.11 The
ISO 8859 standard comprises fourteen separate ISO 2022-compliant encoding standards, each
covering a different set of characters for a different set of languages. Each of these counts as a
separate standard: ISO 8859-1, for example, is the near-ubiquitous Latin-1 character set you see on
the Web.

Work on the ISO 8859 family began in 1982 as a joint project of ANSI and ECMA. The first part
was originally published in 1985 as ECMA-94. This was adopted as ISO 8859-1, and a later
edition of ECMA-94 became the first four parts of ISO 8859. The other parts of ISO 8859 likewise
originated in various ECMA standards.

The ISO 8859 series is oriented, as one might expect, toward European languages and European
usage, as well as certain languages around the periphery of Europe that get used a lot there. It aimed
to do a few things: 1) Do away with the use of backspace sequences as the way to represent accented
characters (the backspace thing is workable on a teletype, but doesn’t work at all on a CRT terminal
without some pretty fancy rendering hardware), 2) do away with the varying meanings of the
“national use” characters from ISO 646, replacing them with a set of code values that would have the
same meaning everywhere and still include everyone’s characters, and 3) unify various other national
and vendor standards that were attempting to do the same thing.

All of the parts of ISO 8859 are based on the ISO 2022 structure, and all have a lot in common.
Each of them assigns the ISO 646 printing characters to the G0 range, and the ISO 6429 C0 and C1
control characters to the C0 and C1 ranges. This means that each of them, whatever else they
include, includes the basic Latin alphabet and is downward compatible with ASCII. (That is, pure 7-
bit ASCII text, when represented with 8-bit bytes, conforms to any of the 8859 standards.) Where
they differ is in their treatment of the G1 range (none of them defines anything in the G2 or G3 areas
or uses escape sequences to switch interpretations of any of the code points, although you can use the
registered ISO 2022 escape sequences to assign the G1 repertoire from any of these standards to the
various GR ranges in a generalized ISO 2022 implementation).
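
Both points are easy to check with Python's built-in ISO 8859 codecs (the codec names below are
the standard Python spellings; this is just a sketch): pure ASCII bytes decode identically under every
part, while a byte in the G1 range comes out as a different character depending on which part you
pick.

    ascii_bytes = b"Hello, world!"
    g1_byte = bytes([0xE9])            # a code value in the GR/G1 range

    for part in ("iso8859-1", "iso8859-2", "iso8859-5"):
        # The ASCII range decodes the same way in every part...
        assert ascii_bytes.decode(part) == "Hello, world!"
        # ...but the G1 range is where the parts differ (in 8859-5, for
        # instance, 0xE9 comes out as a Cyrillic letter rather than é).
        print(part, "0xE9 ->", g1_byte.decode(part))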

The fourteen parts of ISO 8859 are as follows:

ISO 8859-1   Latin-1    Western European languages (French, German, Spanish, Italian, the
                        Scandinavian languages, etc.)

ISO 8859-2   Latin-2    Eastern European languages (Czech, Hungarian, Polish, Romanian, etc.)

ISO 8859-3   Latin-3    Southern European languages (Maltese and Turkish, plus Esperanto)

ISO 8859-4   Latin-4    Northern European languages (Latvian, Lithuanian, Estonian, Greenlandic,
                        and Sami)

ISO 8859-5   Cyrillic   Russian, Bulgarian, Ukrainian, Belarusian, Serbian, and Macedonian

11
Much of the information in this section is drawn from Roman Czyborra, “ISO 8859 Alphabet Soup”, found at
http://www.czyborra.com/charsets/iso8859.html, supplemented with info from the ISO
Web site, the ECMA 94 standard, and the Edberg article.
