2.3.3 REPRESENTING FLOATING POINT NUMBERS IN THE COMPUTER—PRELIMINARIES
Let us design a simple floating point format to illustrate the important factors in representing floating point numbers on the computer. Our format may at first seem to be unnecessarily complex. We will represent the significand in signed magnitude format, with a single bit for the sign and three hexadecimal digits for the magnitude. The exponent will be a 3-bit excess-4 number, with a radix of 16.
The normalized form of the number has the hexadecimal point to the left of the three hexadecimal digits.
The bits will be packed together as follows: The sign bit is on the left, followed by the 3-bit exponent, followed by the three hexadecimal digits of the significand. Neither the radix nor the hexadecimal point will be stored in the packed form.
The reason for these rather odd-seeming choices is that numbers in this format can be compared for =, ≠, ≤, and ≥ in their “packed” format, which has the following layout, from left to right:

Sign bit (1 bit) | Exponent, excess 4 (3 bits) | Three hexadecimal digits (12 bits)
Consider representing $(358)_{10}$ in this format.
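The first step is to convert the fixed point number from its original base into a fixed point number in the target base. Using the method described in Section 2.1.3, we convert the base 10 number into a base 16 number by repeated division by 16:

358 ÷ 16 = 22, remainder 6
22 ÷ 16 = 1, remainder 6
1 ÷ 16 = 0, remainder 1

Reading the remainders from last to first gives the hexadecimal digits 1, 6, 6.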
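The conversion step can be sketched in Python. This is only an illustration, and it assumes that the method of Section 2.1.3 is repeated division by the target base with the remainders collected as digits; the function name `to_base` is hypothetical.

```python
def to_base(n: int, base: int = 16) -> str:
    """Convert a nonnegative integer to a digit string in `base` (2..16).

    Assumed method: divide repeatedly by the base and collect the remainders,
    which come out least significant digit first.
    """
    digits = "0123456789ABCDEF"
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)
        out.append(digits[remainder])
    # The remainders were produced in reverse order, so flip them.
    return "".join(reversed(out))


print(to_base(358))  # 166
```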
Thus $(358)_{10} = (166)_{16}$. The next step is to convert the fixed point number into a floating point number:
$(166)_{16} = (166.)_{16} \times 16^0$
Note that the form $16^0$ reflects a base of 16 with an exponent of 0, and that the number 16 as it appears on the page uses a base 10 form. That is, $(16^0)_{10} = (10^0)_{16}$. This is simply a notational convenience used in describing a floating point number.
The next step is to normalize the number:
$(166.)_{16} \times 16^0 = (.166)_{16} \times 16^3$
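A minimal sketch of the normalization step follows, restricted to positive integers for simplicity. The function name `normalize` and the choice to pad or chop to exactly three digits are assumptions made for illustration. Each position the hexadecimal point moves to the left adds 1 to the exponent, so the exponent equals the number of hexadecimal digits in the integer.

```python
from typing import Tuple


def normalize(n: int) -> Tuple[str, int]:
    """Return (fraction_digits, exponent) with n = (.ddd)_16 x 16^exponent.

    Sketch only: handles positive integers whose exponent fits the
    3-bit excess 4 field (exponents -4..+3).
    """
    if n <= 0:
        raise ValueError("this sketch handles positive integers only")
    hex_digits = format(n, "X")        # e.g. 358 -> "166"
    exponent = len(hex_digits)         # point moves left once per digit
    if exponent > 3:
        raise ValueError("exponent does not fit in the excess 4 field")
    # Pad on the right with zeros so the fraction always has three digits.
    fraction_digits = hex_digits.ljust(3, "0")
    return fraction_digits, exponent


print(normalize(358))  # ('166', 3)
```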
Finally, we fill in the bit fields of the number. The number is positive, and so we place a 0 in the sign bit position. The exponent is 3, but we represent it in excess 4, so the bit pattern for the exponent is computed by adding the excess in binary:

$(+3)_{10} = (011)_2$
Excess 4: $+ (100)_2$
Excess 4 exponent: $(111)_2$

Alternatively, we could have simply computed 3 + 4 = 7 in base 10, and then made the equivalent conversion $(7)_{10} = (111)_2$.
Finally, each of the base 16 digits is represented in binary as 1 = 0001, 6 = 0110, and 6 = 0110. The final bit pattern, written as sign, exponent, and the three hexadecimal digits of the fraction, is:

0 111 0001 0110 0110
Notice again that the radix point is not explicitly represented in the bit pattern, but its presence is implied. The spaces between digits are for clarity only, and do not suggest that the bits are stored with spaces between them. The bit pattern as stored in a computer’s memory would look like this:

0111000101100110
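The packing of the three fields into a 16-bit word can be sketched as follows. The function name `pack` and its argument conventions are hypothetical; the field positions follow the layout described above (sign bit leftmost, then the 3-bit excess 4 exponent, then the 12-bit fraction).

```python
EXCESS = 4  # bias added to the true exponent before storing it


def pack(sign: int, exponent: int, fraction_digits: str) -> int:
    """Pack sign, true exponent, and three hex fraction digits into 16 bits."""
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not -4 <= exponent <= 3:
        raise ValueError("exponent must be in the range -4..+3")
    if len(fraction_digits) != 3:
        raise ValueError("fraction must have exactly three hex digits")
    exponent_field = exponent + EXCESS        # 3 -> 7 -> binary 111
    fraction_field = int(fraction_digits, 16)  # "166" -> 0x166
    return (sign << 15) | (exponent_field << 12) | fraction_field


word = pack(0, 3, "166")
print(format(word, "016b"))  # 0111000101100110
```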
The use of an excess 4 exponent instead of a two’s complement or a signed magnitude exponent simplifies addition and subtraction of floating point numbers (which we will cover in detail in Chapter 3). In order to add or subtract two normalized floating point numbers, the smaller exponent (smaller in degree, not magnitude) must first be increased to the larger exponent (this retains the range), which also has the effect of unnormalizing the smaller number. In order to determine which exponent is larger, we only need to treat the bit patterns as unsigned numbers and then make our comparison. That is, using an excess 4 representation, the smallest exponent is −4, which is represented as 000. The largest exponent is +3, which is represented as 111. The remaining bit patterns for −3, −2, −1, 0, +1, and +2 fall in their respective order as 001, 010, 011, 100, 101, and 110.
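The ordering property can be checked with a short sketch. The helper names `excess4` and `twos_complement3` are hypothetical; the second is included only as a contrast, to show that 3-bit two’s complement patterns do not sort in the same order as the exponents they represent.

```python
EXCESS = 4


def excess4(exponent: int) -> str:
    """Return the 3-bit excess 4 pattern for an exponent in -4..+3."""
    if not -4 <= exponent <= 3:
        raise ValueError("exponent must be in the range -4..+3")
    return format(exponent + EXCESS, "03b")


def twos_complement3(exponent: int) -> str:
    """Return the 3-bit two's complement pattern, for comparison only."""
    if not -4 <= exponent <= 3:
        raise ValueError("exponent must be in the range -4..+3")
    return format(exponent & 0b111, "03b")


exponents = range(-4, 4)
excess_patterns = [excess4(e) for e in exponents]
twos_patterns = [twos_complement3(e) for e in exponents]

print(excess_patterns)
# ['000', '001', '010', '011', '100', '101', '110', '111']

# Excess 4 patterns are already in unsigned order; two's complement are not.
print(excess_patterns == sorted(excess_patterns))  # True
print(twos_patterns == sorted(twos_patterns))      # False
```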
Now if we are given the bit pattern shown above for $(358)_{10}$ along with a description of the floating point representation, then we can easily determine the number. The sign bit is a 0, which means that the number is positive. The exponent in unsigned form is the number $(+7)_{10}$, but since we are using excess 4, we must subtract 4 from it, which results in an actual exponent of $(+7 - 4 = +3)_{10}$.
The fraction is grouped in four-bit hexadecimal digits, which gives a fraction of $(.166)_{16}$. Putting it all together results in $(+.166 \times 16^3)_{16} = (358)_{10}$.

Now suppose that only 10 bits are allowed for the fraction in the above example, instead of the 12 bits that group evenly into fours for hexadecimal digits. How does the representation change? One approach might be to round the fraction and adjust the exponent as necessary. Another approach, which we use here, is to simply truncate the least significant bits by chopping and avoid making adjustments to the exponent, so that the number we actually represent is:

0 111 0001 0110 01
If we treat the missing bits as 0’s, then this bit pattern represents $(.164 \times 16^3)_{16}$. This method of truncation produces a biased error, since values of 00, 01, 10, and 11 in the missing bits are all treated as 0, and so the error in the fraction is in the range from 0 to $(.003)_{16}$. The bias comes about because the error is not symmetric about 0. We will not explore the bias problem further here, but a more thorough discussion can be found in (Hamacher et al., 1990).
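Decoding and chopping can be sketched together. The names `decode` and `chop_fraction` are hypothetical, and the sketch models the 10-bit fraction by clearing the two low-order bits of the 16-bit word, which matches treating the missing bits as 0’s. `Fraction` is used so that negative exponents would also decode exactly.

```python
from fractions import Fraction


def decode(word: int) -> Fraction:
    """Decode a 16-bit word: sign, 3-bit excess 4 exponent, 12-bit fraction."""
    sign = (word >> 15) & 1
    exponent = ((word >> 12) & 0b111) - 4       # remove the excess
    fraction = Fraction(word & 0xFFF, 16 ** 3)  # (.ddd)_16 as an exact value
    value = fraction * Fraction(16) ** exponent
    return -value if sign else value


def chop_fraction(word: int, kept_bits: int = 10) -> int:
    """Zero the low-order fraction bits that a shorter fraction would drop."""
    dropped = 12 - kept_bits
    return word & ~((1 << dropped) - 1) & 0xFFFF


full = 0b0111000101100110        # the pattern for 358
chopped = chop_fraction(full)    # low two fraction bits (10) become 00

print(decode(full))                    # 358
print(decode(chopped))                 # 356
print(decode(full) - decode(chopped))  # 2
```

Here the dropped bits were 10, so the fraction loses $(.002)_{16}$, which is 2 after scaling by $16^3$. The error is never negative, which is the one-sided bias described above.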
We again stress that whatever the floating point format is, it must be known to all parties that intend to store or retrieve numbers in that format. The Institute of Electrical and Electronics Engineers (IEEE) has taken the lead in standardizing floating point formats. The IEEE 754 floating point format, which is in nearly universal usage, is discussed in Section 2.3.5.