UnixReview.com
November 2006
Book Review: Unicode Explained
Reviewed by Cameron Laird
Unicode Explained
Jukka K. Korpela
O'Reilly, 2006
0-596-10121-X
678 pages, $59.99

Why is Unicode so hard?
For good reasons: its complications have complications, and it's hard to isolate any part small enough to understand that isn't deeply coupled to much else. Three broad themes that illustrate this difficulty are:
- Like rocket science or networking, Unicode has lots of pieces. Those of us who "learn by doing" typically have to pull together a Unicode-savvy application, a useful font, some sort of "input method", knowledge of a human language other than our native one, and perhaps a reconfiguration of the operating system and/or the keyboard, before we can see a working example of Unicode doing something useful. Imagine how different Little League baseball would be if all the players had to be competent in all skills before their first practice.
- While there is a single "Unicode standard", it depends on dozens of other specifications, standards, and definitions, all linked in complicated ways. The standards rarely make good tutorials, and occasionally are impenetrable even as references, so entry in this domain involves navigation through a maze of primary texts and commentaries on them, with occasional inconsistencies across dimensions of time, treatment, and author, despite the best efforts of very smart and hard-working people. In some cases, simply understanding how to read a particular document — Is it advisory? Does it still apply? Is it intended to be specialized? — is a challenge.
- Unicode exhibits politics in the vernacular sense — the kinds of disputes that motivate people who command armies. While 1s and 0s usually excite only computing insiders, Unicode codifies decisions that inspire passions among "civilians": the correct way to write the Tibetan language, whether English in India is the same language as English in Australia, and which reformations of Chinese characters are implicitly valid are the kinds of questions that simply cannot be answered on a purely technical basis.
There's good news, though: Jukka Korpela's Unicode Explained makes Unicode comprehensible. I've been working occasionally with Unicode for almost a decade, but I find I understand parts of it much better now that I've read his book.
Unicode Explained isn't unique in its values; several introductions to Unicode have been assembled by passionate, deeply informed authors who handle the topic's difficulties fairly and with insight. Among these, Unicode Explained deserves attention as the most recent and the one that exhibits the most scholarly refinements. Over and over, Korpela "goes the extra mile" for readers by introducing specific details and concepts crucial to understanding. Rather than a glib syllogism about how typographic unification can go to excess, he presents specific examples from Scandinavian languages, possessive punctuation, and speech synthesis (is it obvious that "Charles I ..." is about the first in a sequence of kings, and that "I" is neither a pronoun nor an initial?) to make his point. He is careful to keep HTML, CSS, and XML explicitly separate in all their manifestations. The entire book is dense with this sort of illuminating substance.
An introduction to Unicode is different from one on SQLite, say, or even a topic as broad as cryptography, because the subject of Unicode is so unavoidably incoherent. Unicode deals with human languages and their typographic representations and must expand to all the messiness we humans achieve. A good author on Unicode can't be just a formal prodigy in a bounded subject like chess, for instance. Instead, he must be experienced in all sorts of esoterica. Korpela appears to have devoted himself to the subject, with Unicode Explained the helping hand he generously offers those of us who merely use Unicode.
Conclusion
My recommendation, then, if you work at all outside the ASCII table or standard Latin alphabet, is to keep a copy of Unicode Explained at your desk. It's a wonderful reference for such common questions as:
- When should I use UTF-8, UCS-4, ISO-8859-1, UTF-16, and so on? You can read his answers for yourself, in the free online sample (Chapter 3).
- How do I encode mathematical subscripts?
- What keystrokes tell Emacs I want a '\xe5' in my text?
- Who uses IPA?
- Where are free fonts available?
- Why does acceptance of Unicode in Web applications constitute such a security hazard?
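To make the encoding question concrete, here is a minimal Python sketch (an illustration, not taken from the book) of how the single character '\xe5' (å, U+00E5) comes out as bytes under several of the encodings named above. The variable names are hypothetical, and it assumes Python 3.8 or later. UTF-32 stands in for UCS-4 here, since for this character the two produce the same four bytes.

```python
# Illustrative sketch: one character, several byte representations.
ch = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE, the '\xe5' mentioned above

# Big-endian variants are chosen so no byte-order mark is prepended.
encodings = ["utf-8", "utf-16-be", "utf-32-be", "iso-8859-1"]

# Map each encoding name to the bytes it produces for the character.
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    # bytes.hex(" ") separates the bytes with spaces (Python 3.8+).
    print(f"{name:>10}: {data.hex(' ')} ({len(data)} byte(s))")
```

The same character occupies one byte in ISO-8859-1, two in UTF-8 and UTF-16, and four in UTF-32, which is one reason the choice of encoding matters for storage and interchange.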
There are a very few places where Unicode Explained is confusing or misleading. Korpela, for instance, doesn't distinguish mathematicians from physicists, which leads to error in explaining the symbols the former use.
These missteps are minor, though. If you read, write, or program with human languages other than English (or perhaps Hawaiian or a very few others), you'll do well to keep Unicode Explained at hand.
Cameron is vice president of the Phaseit, Inc., consultancy, specializing in high-reliability and high-performance applications managed by high-level languages. He has reviewed more than 50 books for UnixReview.com, and has had a life-long interest in and involvement with several human languages apart from English.