Sidebar
ASCII, Unicode, and UTF-8
How do your programs store characters? For years, the answer was ASCII, but that's no longer true. Internally, Java uses Unicode, a standard 16-bit character set that represents the characters of nearly every written language along with a number of extra symbols.
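As a minimal illustration of this internal representation (the class and variable names here are just for demonstration), every Java char holds a 16-bit Unicode value, whether or not the character happens to be ASCII:

public class UnicodeCharDemo {
    public static void main(String[] args) {
        char latin = 'A';         // Unicode value 65, same as ASCII
        char accented = '\u00E9'; // é, Unicode value 233
        char greek = '\u03A9';    // Ω, Unicode value 937

        // Casting to int reveals the 16-bit numeric value Java stores.
        System.out.println((int) latin);    // 65
        System.out.println((int) accented); // 233
        System.out.println((int) greek);    // 937
    }
}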
For external data, Java uses an encoding scheme known as UTF-8. This is a particular way of storing characters in which the leading bit pattern of a character's first byte determines how many bytes that character occupies. (Remember that a byte contains eight bits, numbered 0 through 7.) This variable-length scheme lets you store data efficiently yet flexibly. There are three types of characters possible in the UTF-8 scheme: one-byte, two-byte, and three-byte. (A short encoding sketch follows the list.)
One-byte characters: If bit 7 of the first byte is 0, the character consists of just that one byte, and the remaining seven bits hold its value.
Two-byte characters: If the first three bits (bits 7, 6, and 5) are 110 (binary), the character consists of two bytes. In this case, the second byte must begin with 10 (binary), which leaves 11 bits (5 in the first byte and 6 in the second) to define the character.
Three-byte characters: If the character requires more than 11 significant bits, the UTF-8 scheme says that Java must use three bytes to store it. In this case, the first byte must start with 1110 (binary), and the next two bytes each start with 10 (binary). That leaves 4 + 6 + 6 = 16 bits, enough to store the full 16-bit character.
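Here is a sketch of those three rules in Java, just to make the bit manipulation concrete. The class and method names are hypothetical; in real programs you would rely on Java's built-in charset support (for example, String.getBytes with a UTF-8 charset) rather than hand-rolling an encoder:

public class Utf8Sketch {
    // Encode one 16-bit char as one, two, or three UTF-8 bytes,
    // following the rules described in the list above.
    static byte[] encodeChar(char c) {
        if (c <= 0x007F) {
            // One byte: bit 7 is 0, the low seven bits hold the character.
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // Two bytes: 110xxxxx 10xxxxxx -- 5 + 6 = 11 significant bits.
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx -- 4 + 6 + 6 = 16 bits.
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        for (char c : new char[] { 'A', '\u00E9', '\u20AC' }) {
            System.out.printf("U+%04X -> %d byte(s)%n",
                              (int) c, encodeChar(c).length);
        }
    }
}

Running this prints one byte for 'A', two for é, and three for the euro sign, matching the three cases in the list.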
The UTF-8 scheme has several advantages. All ASCII files are already valid UTF-8 files, so you don't have to convert any existing data. In addition, you can tell from a byte's leading bits alone whether it starts a multi-byte sequence, continues one, or stands on its own: any byte that's part of a sequence starts with 10 (binary), and any byte that does not start with 10 (binary) either begins a sequence or is a single-byte character.
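To show how that recognition works in practice, here is a small classifier that looks only at a byte's leading bits (again, the class and method names are made up for this sketch):

public class Utf8ByteClassifier {
    static String classify(byte b) {
        int value = b & 0xFF;                        // treat the byte as unsigned
        if ((value & 0x80) == 0) {
            return "single-byte character";          // 0xxxxxxx
        } else if ((value & 0xC0) == 0x80) {
            return "continuation of a sequence";     // 10xxxxxx
        } else {
            return "start of a multi-byte sequence"; // 110xxxxx or 1110xxxx
        }
    }

    public static void main(String[] args) {
        // The three bytes of the euro sign followed by a plain ASCII 'A'.
        byte[] data = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC, (byte) 'A' };
        for (byte b : data) {
            System.out.printf("0x%02X -> %s%n", b & 0xFF, classify(b));
        }
    }
}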