3.4.1 FLOATING POINT ADDITION AND SUBTRACTION
Floating point arithmetic differs from integer arithmetic in that exponents must be handled as well as the magnitudes of the operands. As in ordinary base 10 arithmetic using scientific notation, the exponents of the operands must be made equal for addition and subtraction. The fractions are then added or subtracted as appropriate, and the result is normalized.
This process of adjusting the fractional part, together with rounding the result, can lead to a loss of precision in the result. Consider the unsigned floating point addition (.101 × 2³ + .111 × 2⁴), in which the fractions have three significant digits. We start by adjusting the smaller exponent to be equal to the larger exponent, and adjusting the fraction accordingly.
Thus we have .101 × 2³ = .010 × 2⁴, losing .001 × 2³ of precision in the process. The resulting sum is
(.010 + .111) × 2⁴ = 1.001 × 2⁴ = .1001 × 2⁵,
and rounding to three significant digits gives .100 × 2⁵; we have lost another .001 × 2⁴ in the rounding process.
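The alignment, addition, and normalization steps above can be sketched in Python. The function name and the integer encoding of the fractions are illustrative (a 3-bit fraction .101 is held as the integer 0b101); discarded bits model the truncation described in the example.

```python
def add_fp(frac_a, exp_a, frac_b, exp_b, bits=3):
    """Add two unsigned floats of the form .frac x 2**exp.

    Each fraction is a bits-wide integer; returns (frac, exp).
    Bits shifted off the right end are simply discarded (truncation).
    """
    # Step 1: align exponents by shifting the fraction with the
    # smaller exponent to the right, losing low-order bits.
    if exp_a < exp_b:
        frac_a >>= exp_b - exp_a
        exp_a = exp_b
    elif exp_b < exp_a:
        frac_b >>= exp_a - exp_b
        exp_b = exp_a
    # Step 2: add the aligned fractions.
    total = frac_a + frac_b
    # Step 3: normalize. A carry out of the fraction field means the
    # result must be shifted right and the exponent incremented; the
    # shifted-out bit is lost, as in the worked example.
    while total >= (1 << bits):
        total >>= 1
        exp_a += 1
    return total, exp_a

# .101 x 2^3  +  .111 x 2^4  ->  .100 x 2^5
frac, exp = add_fp(0b101, 3, 0b111, 4)
print(bin(frac), exp)   # 0b100 5
```

Running this reproduces the result above: the fraction .101 is truncated to .010 during alignment, and the carry out of the 3-bit sum costs another low-order bit during normalization.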
Why do floating point numbers have such complicated formats?
We may wonder why floating point numbers have such a complicated structure, with the mantissa being stored in signed magnitude representation, the exponent stored in excess notation, and the sign bit separated from the rest of the magnitude by the intervening exponent field. There is a simple explanation for this structure. Consider the complexity of performing floating point arithmetic in a computer. Before any arithmetic can be done, the number must be unpacked from the form it takes in storage. (See Chapter 2 for a description of the IEEE
754 floating point format.) The exponent and mantissa must be extracted from the packed bit pattern before an arithmetic operation can be performed; after the arithmetic operation(s) are performed, the result must be renormalized and rounded, and then the bit patterns are re-packed into the requisite format.
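The unpacking step can be illustrated for the IEEE 754 single precision format: the sign, the excess-127 exponent, and the fraction are extracted from the packed 32-bit pattern with shifts and masks. The function name below is illustrative.

```python
import struct

def unpack_ieee754(x: float):
    """Extract (sign, exponent, fraction) from an IEEE 754
    single precision bit pattern."""
    # Reinterpret the float's 32 bits as an unsigned integer.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31             # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, in excess-127 notation
    fraction = bits & 0x7FFFFF        # 23 bits (the hidden 1 is not stored)
    return sign, exponent, fraction

# -6.5 = -1.101 x 2^2, so sign = 1 and exponent = 2 + 127 = 129
sign, exp, frac = unpack_ieee754(-6.5)
print(sign, exp, hex(frac))   # 1 129 0x500000
```

After the arithmetic is done, the inverse operation (shifting the fields back into place and OR-ing them together) re-packs the result.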
The virtue of a floating point format that contains a sign bit followed by an exponent in excess notation, followed by the magnitude of the mantissa, is that two floating point numbers can be compared for >, <, and = without unpacking.
The sign bit is most important in such a comparison, and it appropriately is the MSB in the floating point format. Next most important in comparing two numbers is the exponent, since a change of ±1 in the exponent changes the value by a factor of 2 (for a base 2 format), whereas a change in even the MSB of the fractional part will change the value of the floating point number by less than that.
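This comparison property can be checked in Python for IEEE 754 single precision: for non-negative numbers, the packed bit patterns, treated as plain unsigned integers, order the same way as the values themselves, because the excess-notation exponent occupies the more significant bits. (Negative numbers need the sign handled separately, since signed magnitude reverses their bit-pattern ordering.)

```python
import struct

def bits_of(x: float) -> int:
    """Return the IEEE 754 single precision bit pattern of x
    as an unsigned 32-bit integer."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

# For non-negative floats, comparing raw bit patterns gives the
# same ordering as comparing the values, without any unpacking.
for a, b in [(1.5, 2.5), (0.1, 100.0), (3.0, 3.0), (0.0, 0.5)]:
    bit_order = (bits_of(a) > bits_of(b)) - (bits_of(a) < bits_of(b))
    val_order = (a > b) - (a < b)
    assert bit_order == val_order

print(bits_of(1.5) < bits_of(2.5))   # True
```

This is why the format places the sign bit first and the exponent, in excess notation, ahead of the mantissa: an ordinary integer comparator can do the job.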
To account for the sign bit, the signed magnitude fractions are treated as integers and converted into two's complement form. After the addition or subtraction takes place in two's complement, the result may need to be normalized and the sign bit adjusted. The result is then converted back to signed magnitude form.
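The conversion into and out of two's complement can be sketched as follows, using a 4-bit magnitude plus a sign bit. The field width and function names are illustrative, not part of any particular format.

```python
BITS = 5  # 1 sign bit + 4 magnitude bits (illustrative width)
MASK = (1 << BITS) - 1

def sm_to_twos(sign, mag):
    """Signed magnitude (sign, magnitude) -> two's complement integer."""
    return (-mag if sign else mag) & MASK

def twos_to_sm(t):
    """Two's complement integer -> signed magnitude (sign, magnitude)."""
    if t & (1 << (BITS - 1)):         # MSB set: the result is negative
        return 1, ((1 << BITS) - t) & MASK
    return 0, t

# (+0101) + (-0111): convert, add modulo 2**BITS, convert back.
a = sm_to_twos(0, 0b0101)
b = sm_to_twos(1, 0b0111)
total = (a + b) & MASK
print(twos_to_sm(total))   # (1, 2), i.e. -0010
```

The addition itself is an ordinary two's complement integer addition; only the conversions at the boundaries deal with the signed magnitude representation the format uses.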