2.5 CHARACTER CODE
Unlike real numbers, which have an infinite range, characters are finite in number, so an entire character set can be represented with a small number of bits per character. Three of the most common character representations, ASCII, EBCDIC, and Unicode, are described here.
2.5.1 THE ASCII CHARACTER SET
The American Standard Code for Information Interchange (ASCII) is summarized in Figure 2-13, using hexadecimal indices.
The representation for each character consists of 7 bits, and all $2^7 = 128$ possible bit patterns represent valid characters. The characters in positions 00 – 1F and position 7F are special control characters that are used for transmission, printing control, and other non-textual purposes.
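The split between control and printable positions can be checked with a few lines of Python. This is only a sketch of the ranges stated above; the function name `is_control` is a hypothetical helper chosen for this illustration.

```python
def is_control(code: int) -> bool:
    """Return True if a 7-bit ASCII code is a control character.

    Control characters occupy positions 0x00-0x1F and position 0x7F.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    return code <= 0x1F or code == 0x7F


# Count how the 2**7 = 128 bit patterns divide up.
control_count = sum(is_control(code) for code in range(2**7))
printable_count = 2**7 - control_count

print(control_count, printable_count)
```

Under these ranges, 33 of the 128 positions are control characters and the remaining 95 are printable (including the space).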
The remaining characters are all printable, and include letters, digits, punctuation, and a space. The digits 0-9 appear in sequence, as do the upper and lower case letters (see note 1). This organization simplifies character manipulation. To change the character representation of a digit into its numerical value, we subtract $(30)_{16}$ from it. For example, to convert the ASCII character ‘5’, which is in position $(35)_{16}$, into the number 5, we compute $(35 - 30 = 5)_{16}$.
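The digit conversion can be sketched in Python as follows. The names `ASCII_ZERO`, `digit_value`, and `digit_char` are assumptions made for this example, not names from the text.

```python
ASCII_ZERO = 0x30  # position of the character '0' in the ASCII table


def digit_value(ch: str) -> int:
    """Convert an ASCII digit character into its numerical value."""
    code = ord(ch)
    if not 0x30 <= code <= 0x39:
        raise ValueError(f"{ch!r} is not an ASCII digit")
    return code - ASCII_ZERO


def digit_char(value: int) -> str:
    """Convert a number in the range 0-9 back into its ASCII digit character."""
    if not 0 <= value <= 9:
        raise ValueError("value must be in the range 0-9")
    return chr(value + ASCII_ZERO)


print(hex(ord("5")), digit_value("5"), digit_char(5))
```

The range check matters: subtracting the offset from a non-digit character would yield a number with no meaning as a digit value.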
(1. As an aside, the character ‘a’ and the character ‘A’ are different, and have different codes in the ASCII table. The small letters like ‘a’ are called lower case, and the capital letters like ‘A’ are called upper case. The naming comes from the positions of the characters in a printer’s typecase. The capital letters appear above the small letters, which resulted in the upper case / lower case naming. These days, typesetting is almost always performed electronically, but the traditional naming is still used.)
To convert an upper case letter into a lower case letter, we add $(20)_{16}$. For example, to convert the letter ‘H’, which is at position $(48)_{16}$ in the ASCII table, into the letter ‘h’, which is at position $(68)_{16}$, we compute $(48 + 20 = 68)_{16}$.
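A minimal Python sketch of the case conversion follows. The names `CASE_OFFSET`, `to_lower`, and `to_upper` are hypothetical helpers for this illustration, and characters that are not letters are returned unchanged, which is a design choice of the sketch rather than something the text prescribes.

```python
CASE_OFFSET = 0x20  # distance between an upper case letter and its lower case form


def to_lower(ch: str) -> str:
    """Convert an ASCII upper case letter to lower case by adding the offset."""
    code = ord(ch)
    if 0x41 <= code <= 0x5A:  # 'A' through 'Z'
        return chr(code + CASE_OFFSET)
    return ch


def to_upper(ch: str) -> str:
    """Convert an ASCII lower case letter to upper case by subtracting the offset."""
    code = ord(ch)
    if 0x61 <= code <= 0x7A:  # 'a' through 'z'
        return chr(code - CASE_OFFSET)
    return ch


print(hex(ord("H")), to_lower("H"), hex(ord(to_lower("H"))))
```

Because the offset is a single bit, and that bit is 0 in every upper case letter and 1 in every lower case letter, the same conversions can also be done by setting or clearing that bit.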
2.5.2 THE EBCDIC CHARACTER SET
A problem with the ASCII code is that only 128 characters can be represented, which is a limitation for many keyboards that have a lot of special characters in addition to upper and lower case letters. The Extended Binary Coded Decimal Interchange Code (EBCDIC) is an eight-bit code that is used extensively in IBM mainframe computers. Since seven-bit ASCII characters are frequently represented in an eight-bit modified form (one character per byte), in which a 0 or a 1 is appended to the left of the seven-bit pattern, the use of EBCDIC does not place a greater demand on the storage of characters in a computer. For serial transmission (see Chapter 8), however, an eight-bit code takes more time to transmit than a seven-bit code, so in this case the wider code does make a difference.
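The eight-bit modified form of a seven-bit ASCII code can be sketched as below. The text does not say how the extra bit is chosen; the even-parity rule shown here is only one possible choice and is an assumption of this example, as are the names `to_byte` and `even_parity_bit`.

```python
def to_byte(code7: int, high_bit: int = 0) -> int:
    """Place a 0 or a 1 to the left of a 7-bit code, giving an 8-bit value."""
    if not 0 <= code7 <= 0x7F:
        raise ValueError("not a 7-bit code")
    if high_bit not in (0, 1):
        raise ValueError("high_bit must be 0 or 1")
    return (high_bit << 7) | code7


def even_parity_bit(code7: int) -> int:
    """Assumed rule: choose the extra bit so the byte has an even number of 1s."""
    return bin(code7).count("1") % 2


for ch in "HC":
    code7 = ord(ch)
    byte = to_byte(code7, even_parity_bit(code7))
    print(ch, format(code7, "07b"), format(byte, "08b"))
```

Either way, each character occupies one byte in storage, which is the same amount an EBCDIC character needs.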
The EBCDIC code is summarized in Figure 2-14. There are gaps in the table, which can be used for application-specific characters. The gaps in the upper and lower case sequences are not a major disadvantage, because character manipulations can still be done as for ASCII, but using different offsets.
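The idea of reusing the ASCII manipulations with different offsets can be sketched in Python. This example assumes the `cp037` codec from Python's standard library as a representative EBCDIC code page; its letter and digit positions are used here, and other positions may differ from Figure 2-14. The helper names are hypothetical.

```python
EBCDIC = "cp037"  # assumed EBCDIC variant; one of several EBCDIC code pages


def ebcdic_code(ch: str) -> int:
    """Return the EBCDIC code of a single character."""
    return ch.encode(EBCDIC)[0]


def ebcdic_digit_value(code: int) -> int:
    """Digits occupy F0-F9, so the offset is F0 rather than ASCII's 30."""
    if not 0xF0 <= code <= 0xF9:
        raise ValueError("not an EBCDIC digit")
    return code - 0xF0


def ebcdic_to_lower(code: int) -> int:
    """Upper case letters sit in three runs (C1-C9, D1-D9, E2-E9);
    each lower case letter is 40 (hex) below its upper case form."""
    if 0xC1 <= code <= 0xC9 or 0xD1 <= code <= 0xD9 or 0xE2 <= code <= 0xE9:
        return code - 0x40
    return code


print(hex(ebcdic_code("H")), hex(ebcdic_to_lower(ebcdic_code("H"))))
print(hex(ebcdic_code("I")), hex(ebcdic_code("J")))  # a gap between I and J
```

Note that the direction of the case offset is reversed relative to ASCII: in this code page the lower case letters have the smaller codes, so conversion to lower case subtracts rather than adds.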
2.5.3 THE UNICODE CHARACTER SET
The ASCII and EBCDIC codes support the historically dominant (Latin) character sets used in computers. There are many more character sets in the world, however, and a simple ASCII-to-language-X mapping does not work for the general case. A new universal character standard, called Unicode, was therefore developed to support a great breadth of the world’s character sets.
Unicode is an evolving standard. It changes as new character sets are introduced into it, and as existing character sets evolve and their representations are refined.
In version 2.0 of the Unicode standard, there are 38,885 distinct coded characters that cover the principal written languages of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica.
The Unicode Standard uses a 16-bit code set in which there is a one-to-one correspondence between 16-bit codes and characters. Like ASCII, there are no complex modes or escape codes. While Unicode supports many more characters than ASCII or EBCDIC, it is not the end-all standard. In fact, the 16-bit Unicode standard is a subset of the 32-bit ISO 10646 Universal Character Set (UCS-4).
Glyphs for the first 256 Unicode characters are shown in Figure 2-15, according to Unicode version 2.1. Note that the first 128 characters are the same as for ASCII.
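The overlap between ASCII and the first 128 Unicode characters can be checked in Python, whose strings are sequences of Unicode code points. In this sketch, the big-endian UTF-16 encoding stands in for the 16-bit code described above; that choice, and the helper name `code16`, are assumptions of the example.

```python
def code16(ch: str) -> bytes:
    """Return the 16-bit code for one character as two bytes, high byte first."""
    encoded = ch.encode("utf-16-be")
    if len(encoded) != 2:
        raise ValueError("character does not fit in a single 16-bit code")
    return encoded


# Each of the first 128 Unicode code points encodes to the same value in ASCII.
first_128_match = all(
    chr(code).encode("ascii") == bytes([code]) for code in range(128)
)

print(first_128_match)
print(code16("A").hex())  # ASCII 'A' is 41; the 16-bit code is 0041
print(code16("é").hex())  # a character from the first 256, outside ASCII
```

For an ASCII character, the 16-bit code is simply the 7-bit ASCII code padded on the left with zeros.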