Sidebar
ASCII, Unicode, and UTF-8
How do your programs store characters? For years, the answer was ASCII, but that's no longer true. Internally, Java uses Unicode, a standard 16-bit character set that represents the characters of nearly every written language along with a number of extra symbols.
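As a minimal illustration of this internal representation (the class and variable names here are just for demonstration), every Java char holds a 16-bit Unicode value, whether or not the character happens to be ASCII:

public class UnicodeCharDemo {
    public static void main(String[] args) {
        char latin = 'A';         // Unicode value 65, same as ASCII
        char accented = '\u00E9'; // é, Unicode value 233
        char greek = '\u03A9';    // Ω, Unicode value 937

        // Casting to int reveals the 16-bit numeric value Java stores.
        System.out.println((int) latin);    // 65
        System.out.println((int) accented); // 233
        System.out.println((int) greek);    // 937
    }
}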
For external data, Java uses an encoding scheme known as UTF-8. This is a particular way of storing characters in which the leading bit pattern of a character's first byte determines how many bytes that character occupies. (Remember that a byte contains eight bits, numbered 0 through 7.) This variable-length scheme lets you store data efficiently yet flexibly. There are three types of characters possible in the UTF-8 scheme: one-byte, two-byte, and three-byte. (A short encoding sketch follows the list.)
One-byte characters: If bit 7 of the first byte is 0, the character consists of just that one byte, and the remaining seven bits hold its value.
Two-byte characters: If the first three bits (bits 7, 6, and 5) are 110 (binary), the character consists of two bytes. In this case, the second byte must begin with 10 (binary), which leaves 11 bits (5 in the first byte and 6 in the second) to define the character.
Three-byte characters: If the character requires more than 11 significant bits, the UTF-8 scheme says that Java must use three bytes to store it. In this case, the first byte must start with 1110 (binary), and the next two bytes each start with 10 (binary). That leaves 4 + 6 + 6 = 16 bits, enough to store the full 16-bit character.
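Here is a sketch of those three rules in Java, just to make the bit manipulation concrete. The class and method names are hypothetical; in real programs you would rely on Java's built-in charset support (for example, String.getBytes with a UTF-8 charset) rather than hand-rolling an encoder:

public class Utf8Sketch {
    // Encode one 16-bit char as one, two, or three UTF-8 bytes,
    // following the rules described in the list above.
    static byte[] encodeChar(char c) {
        if (c <= 0x007F) {
            // One byte: bit 7 is 0, the low seven bits hold the character.
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // Two bytes: 110xxxxx 10xxxxxx -- 5 + 6 = 11 significant bits.
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx -- 4 + 6 + 6 = 16 bits.
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        for (char c : new char[] { 'A', '\u00E9', '\u20AC' }) {
            System.out.printf("U+%04X -> %d byte(s)%n",
                              (int) c, encodeChar(c).length);
        }
    }
}

Running this prints one byte for 'A', two for é, and three for the euro sign, matching the three cases in the list.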
The UTF-8 scheme has several advantages. All ASCII files are already valid UTF-8 files, so you don't have to convert any existing data. In addition, you can tell from a byte's leading bits alone whether it starts a multi-byte sequence, continues one, or stands on its own: any byte that's part of a sequence starts with 10 (binary), and any byte that does not start with 10 (binary) either begins a sequence or is a single-byte character.
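To show how that recognition works in practice, here is a small classifier that looks only at a byte's leading bits (again, the class and method names are made up for this sketch):

public class Utf8ByteClassifier {
    static String classify(byte b) {
        int value = b & 0xFF;                        // treat the byte as unsigned
        if ((value & 0x80) == 0) {
            return "single-byte character";          // 0xxxxxxx
        } else if ((value & 0xC0) == 0x80) {
            return "continuation of a sequence";     // 10xxxxxx
        } else {
            return "start of a multi-byte sequence"; // 110xxxxx or 1110xxxx
        }
    }

    public static void main(String[] args) {
        // The three bytes of the euro sign followed by a plain ASCII 'A'.
        byte[] data = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC, (byte) 'A' };
        for (byte b : data) {
            System.out.printf("0x%02X -> %s%n", b & 0xFF, classify(b));
        }
    }
}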