**2.3.5 THE IEEE 754 FLOATING POINT STANDARD**

There are many ways to represent floating point numbers, a few of which we have already explored. Each representation has its own characteristics in terms of range, precision, and the number of representable numbers. In an effort to improve software portability and ensure uniform accuracy of floating point calculations, the IEEE 754 floating point standard for binary numbers was developed (IEEE, 1985).

A few entrenched product lines that predate the standard do not use it, such as the IBM/370, the DEC VAX, and the Cray line, but virtually all new architectures provide some level of IEEE 754 support.

The IEEE 754 standard as described below must be supported by the computer system as a whole, but not necessarily by the hardware alone. That is, a mixture of hardware and software can be used while still conforming to the standard.

**2.3.5.1 Formats**

There are two primary formats in the IEEE 754 standard: single precision and double precision. Figure 2-10 summarizes the layouts of the two formats. The single precision format occupies 32 bits, whereas the double precision format occupies 64 bits. The double precision format is simply a wider version of the single precision format.

Figure 2-10 Single precision and double precision IEEE 754 floating point formats.

The sign bit is in the leftmost position and indicates a positive or negative number for a 0 or a 1, respectively. The 8-bit excess 127 (not 128) exponent follows, in which the bit patterns 00000000 and 11111111 are reserved for special cases, as described below. For double precision, the 11-bit exponent is represented in excess 1023, with 00000000000 and 11111111111 reserved. The base 2 fraction follows: 23 bits for single precision and 52 bits for double precision. There is a hidden bit to the left of the binary point, which when taken together with the single precision fraction forms a 23 + 1 = 24-bit significand of the form 1.fff...f, where the fff...f pattern represents the 23-bit fractional part that is stored. The double precision format also uses a hidden bit to the left of the binary point, which supports a 52 + 1 = 53-bit significand. For both formats, the number is normalized unless denormalized numbers are supported, as described later.
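The field layout just described can be checked with a short Python sketch. The helper names `fields_single` and `value_normalized` are our own, chosen for illustration, and the sketch assumes the standard `struct` module packs `"f"` as an IEEE 754 single precision value, which is the case on common platforms.

```python
import struct

def fields_single(x):
    """Split x, rounded to single precision, into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored in excess 127
    fraction = bits & 0x7FFFFF        # 23 bits; the hidden 1 is not stored
    return sign, exponent, fraction

def value_normalized(sign, exponent, fraction):
    """Rebuild a normalized value: (-1)^s * 1.f * 2^(e - 127)."""
    significand = 1 + fraction / 2**23    # restore the hidden bit
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

# -6.5 is -1.101 x 2^2 in binary, so the stored exponent is 2 + 127 = 129.
sign, exponent, fraction = fields_single(-6.5)
print(sign, exponent, format(fraction, "023b"))
print(value_normalized(sign, exponent, fraction))
```

Only the fraction bits 101 followed by zeros are stored; the leading 1 of the significand 1.101 reappears only when the value is rebuilt.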

There are five basic types of numbers that can be represented. Nonzero normalized numbers take the form described above. A so-called “clean zero” is represented by the reserved bit pattern 00000000 in the exponent and all 0’s in the fraction. The sign bit can be 0 or 1, and so there are two representations for zero: +0 and −0.

Infinity has a representation in which the exponent contains the reserved bit pattern 11111111, the fraction contains all 0’s, and the sign bit is 0 or 1. Infinity is useful in handling overflow situations or in giving a valid representation to a number (other than zero) divided by zero. If zero is divided by zero or infinity is divided by infinity, then the result is undefined. This is represented by the NaN (not a number) format in which the exponent contains the reserved bit pattern 11111111, the fraction is nonzero and the sign bit is 0 or 1. A NaN can also be produced by attempting to take the square root of −1.
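The reserved patterns for zero, infinity, and NaN can be inspected with a small sketch such as the following. It assumes Python floats follow IEEE 754 and that `struct` packs `"f"` as single precision; `bits_single` is a helper name of our own. Note that Python itself raises `ZeroDivisionError` for a float division by zero and `ValueError` for `math.sqrt(-1)` instead of returning infinity or NaN, so the NaN here is produced by dividing infinity by infinity.

```python
import math
import struct

def bits_single(x):
    """Return the 32-bit single precision pattern of x as an integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

inf = float("inf")
print(hex(bits_single(0.0)))     # +0: all bits are 0
print(hex(bits_single(-0.0)))    # -0: only the sign bit is set
print(hex(bits_single(inf)))     # +infinity: exponent all 1's, fraction 0
print(hex(bits_single(-inf)))    # -infinity: same, with the sign bit set

nan = inf / inf                  # infinity divided by infinity is undefined
b = bits_single(nan)
# A NaN has an all-1's exponent and a nonzero fraction; which fraction
# bits are set is left to the implementation, so only the fields are tested.
print((b >> 23) & 0xFF == 0xFF, b & 0x7FFFFF != 0)
print(math.isnan(inf - inf))     # another undefined operation
```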

As with all normalized representations, there is a large gap between zero and the first representable number. The denormalized, “dirty zero” representation allows numbers in this gap to be represented. The sign bit can be 0 or 1, the exponent contains the reserved bit pattern 00000000 which represents −126 for single precision (−1022 for double precision), and the fraction contains the actual bit pattern for the magnitude of the number. Thus, there is no hidden 1 for this format.

Note that the denormalized representation is not an unnormalized representation. The key difference is that there is only one representation for each denormalized number, whereas an unnormalized format allows many different representations of the same number.
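The five types of numbers can be told apart from the exponent and fraction fields alone. The following is a minimal sketch for single precision; the function name `classify_single` and the returned labels are our own.

```python
def classify_single(bits):
    """Classify a 32-bit single precision pattern by its reserved fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                 # reserved pattern 11111111
        return "infinity" if fraction == 0 else "NaN"
    if exponent == 0x00:                 # reserved pattern 00000000
        return "zero" if fraction == 0 else "denormalized"
    return "normalized"

for pattern in (0x3F800000, 0x80000000, 0x00000001, 0xFF800000, 0x7FC00000):
    print(format(pattern, "08x"), classify_single(pattern))
```

The sign bit plays no part in the classification, which is why each of zero, infinity, and NaN comes in a positive and a negative form.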

Figure 2-11 illustrates some examples of IEEE 754 floating point numbers.

Figure 2-11 Examples of IEEE 754 floating point numbers in single precision format (a–h) and double precision format (i). Spaces are shown for clarity only: they are not part of the representation.

Examples (a) through (h) are in single precision format and example (i) is in double precision format. Example (a) shows an ordinary single precision number. Notice that the significand is 1.101, but that only the fraction (101) is explicitly represented. Example (b) uses the smallest single precision exponent (−126) and example (c) uses the largest single precision exponent (127).

Examples (d) and (e) illustrate the two representations for zero. Example (f) illustrates the bit pattern for +∞. There is also a corresponding bit pattern for −∞.

Example (g) shows a denormalized number. Notice that although the number itself is $2^{-128}$, the smallest representable exponent is still −126. The exponent for single precision denormalized numbers is always −126, which is represented by the bit pattern 00000000 and a nonzero fraction. The fraction represents the magnitude of the number, rather than a significand. Thus we have $+2^{-128} = +0.01_2 \times 2^{-126}$, which is represented by the bit pattern shown in Figure 2-11g.
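The denormalized encoding of 2^-128 can be reproduced with a short sketch. As before, it assumes `struct` packs `"f"` as IEEE 754 single precision with denormalized numbers supported; the helper names are ours.

```python
import struct

def fields_single(x):
    """Split x, rounded to single precision, into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def denormalized_value(sign, fraction):
    """No hidden 1 here: the value is (-1)^s * 0.f * 2^-126."""
    return (-1) ** sign * (fraction / 2**23) * 2.0 ** -126

sign, exponent, fraction = fields_single(2.0 ** -128)
print(sign, format(exponent, "08b"), format(fraction, "023b"))
print(denormalized_value(sign, fraction) == 2.0 ** -128)
```

The exponent field prints as 00000000 and the fraction field begins with 01, matching the magnitude .01 × 2^-126 worked out above.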

Example (h) shows a single precision NaN. A NaN can be positive or negative.

Finally, example (i) revisits the representation of $2^{-128}$ but now using double precision. The representation is for an ordinary double precision number and so there are no special considerations here. Notice that $2^{-128}$ has a significand of 1.0, which is why the fraction field is all 0’s.
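For comparison, the double precision fields of the same number can be extracted as follows. This is a sketch under the assumption that `struct` packs `"d"` as IEEE 754 double precision; `fields_double` is a name of our own.

```python
import struct

def fields_double(x):
    """Split a double into (sign, exponent, fraction): 1, 11, and 52 bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # stored in excess 1023
    fraction = bits & ((1 << 52) - 1)      # hidden 1 not stored
    return sign, exponent, fraction

sign, exponent, fraction = fields_double(2.0 ** -128)
print(sign, exponent, exponent - 1023, fraction)   # 0 895 -128 0
```

The stored exponent 895 is −128 + 1023, well inside the normalized range, so the number needs no denormalized encoding in double precision.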

In addition to the single precision and double precision formats, there are also single extended and double extended formats. The extended formats are not visible to the user, but they are used to retain a greater amount of internal precision during calculations to reduce the effects of roundoff errors. The extended formats increase the widths of the exponents and fractions by a number of bits that can vary depending on the implementation. For instance, the single extended format adds at least three bits to the exponent and eight bits to the fraction. The double extended format is typically 80 bits wide, with a sign bit, a 15-bit exponent, and a 64-bit significand whose leading bit is stored explicitly rather than hidden.

**2.3.5.2 Rounding**

An implementation of IEEE 754 must provide at least single precision, whereas the remaining formats are optional. Further, the result of any single operation on floating point numbers must be accurate to within half a unit in the least significant bit of the fraction when the default round-to-nearest mode is in effect. This means that some additional bits of precision (referred to as guard bits) may need to be retained during computation, and there must be an appropriate method of rounding the intermediate result to the number of bits in the fraction.
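The half-unit accuracy requirement can be observed for a single division. The sketch below assumes that Python floats are IEEE 754 double precision values and that the default round-to-nearest mode is in effect. The helper `ulp_double` is our own and is valid only for normalized doubles.

```python
import math
from fractions import Fraction

def ulp_double(x):
    """Weight of the least significant fraction bit of a normalized double."""
    _, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    return Fraction(2) ** (e - 53)    # 53-bit significand, hidden bit included

computed = 1.0 / 3.0                  # one floating point operation
exact = Fraction(1, 3)                # the exact rational result
error = abs(Fraction(computed) - exact)

# The error should not exceed half a unit in the last place.
print(error <= ulp_double(computed) / 2)
```

`Fraction(computed)` converts the stored double to an exact rational, so the comparison itself introduces no further rounding.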

There are four rounding modes in the IEEE 754 standard. One mode rounds toward 0, another rounds toward +∞, and another rounds toward −∞. The default mode rounds to the nearest representable number. Halfway cases round to the number whose low order digit is even. For example, when rounding to four bits after the binary point, 1.01101 rounds to 1.0110 whereas 1.01111 rounds to 1.1000.
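The four modes can be sketched on binary strings using exact rational arithmetic. This is an illustration of the rounding rules only, not of how hardware implements them. The function names and the mode labels (`"nearest"`, `"zero"`, `"up"`, `"down"`) are our own, and `places` is assumed to be at least 1.

```python
import math
from fractions import Fraction

def parse_binary(text):
    """Convert a binary string such as '1.01101' to an exact Fraction."""
    negative = text.startswith("-")
    whole, _, frac = text.lstrip("+-").partition(".")
    value = Fraction(int(whole + frac, 2), 2 ** len(frac))
    return -value if negative else value

def format_binary(kept, places):
    """Format the integer kept / 2**places as a binary string (places >= 1)."""
    sign = "-" if kept < 0 else ""
    digits = format(abs(kept), "b").rjust(places + 1, "0")
    return f"{sign}{digits[:-places]}.{digits[-places:]}"

def round_binary(text, places, mode="nearest"):
    """Round a binary string to `places` fractional bits."""
    scaled = parse_binary(text) * 2 ** places
    if mode == "nearest":
        kept = round(scaled)          # round() on a Fraction sends ties to even
    elif mode == "zero":
        kept = math.trunc(scaled)     # toward 0
    elif mode == "up":
        kept = math.ceil(scaled)      # toward +infinity
    elif mode == "down":
        kept = math.floor(scaled)     # toward -infinity
    else:
        raise ValueError(f"unknown mode: {mode}")
    return format_binary(kept, places)

print(round_binary("1.01101", 4))            # 1.0110
print(round_binary("1.01111", 4))            # 1.1000
print(round_binary("-1.01111", 4, "zero"))   # -1.0111
print(round_binary("-1.01111", 4, "down"))   # -1.1000
```

Both halfway cases from the text land on the neighbor whose last bit is 0, while the two directed modes give different results for the same negative input.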
