Floating-Point Arithmetic
Even the most powerful computer computes using a vanishingly tiny subset of all possible numbers. This follows from the fact that there are infinitely many numbers, so to specify every possible number would require an infinitely long string of bits (ones and zeroes), which no real computer can contain. Therefore some system must be adopted that allows us to compute adequately using a finite number of numbers. Two such systems--fixed-point arithmetic and floating-point arithmetic--are in common use.
The number of numbers that a register (computer circuit for remembering a string of bits) can represent depends on how many bits the register contains; N bits can represent 2^N numbers. Which 2^N numbers they represent is up to the designer. Consider a 16-bit word, which can symbolize 2^16 (or 65,536) numbers. If we wish to be able to count from 1 to 65,536, we might let each possible combination of 16 bits stand for a distinct integer: 1, 2, 3, and so on. If we wish to build a digital cash register, we might use our 16 bits to count by hundredths of a dollar: $.00, $.01, $.02, and so on, all the way to $655.35.
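The two 16-bit schemes just described can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the choice to let bit pattern 0 stand for the integer 1 (or for $0.00) is an assumption, since the designer is free to assign patterns to numbers in any order.

```python
BITS = 16
COUNT = 2 ** BITS  # number of distinct 16-bit patterns: 65,536


def pattern_to_integer(pattern: int) -> int:
    """Counting scheme (assumed mapping): patterns 0..65535 stand for 1..65536."""
    return pattern + 1


def pattern_to_dollars(pattern: int) -> str:
    """Cash-register scheme: each pattern counts hundredths of a dollar."""
    dollars, cents = divmod(pattern, 100)
    return f"${dollars}.{cents:02d}"


print(COUNT)
print(pattern_to_integer(0), pattern_to_integer(COUNT - 1))
print(pattern_to_dollars(0), pattern_to_dollars(COUNT - 1))
```

The same 65,536 bit patterns are used in both cases; only the interpretation changes.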
These two numbering schemes (counting by ones and by hundredths) are examples of fixed-point systems. The "point" referred to is the radix point, familiar in the decimal system as the "decimal" point. To the left of the radix point, a number's digits multiply increasing powers of the radix (or base) of the number system (10, in the decimal case). In the decimal number 25.39, for example, the 5 to the left of the radix point multiplies 10^0 (= 1) and the 2 multiplies 10^1 (= 10). To the right of the radix point, digits multiply increasingly negative powers of the radix: here, 3 multiplies 10^-1 (= .1) and 9 multiplies 10^-2 (= .01). With 2 substituted for 10, similar statements would apply to binary (base 2) numbers. However, a string of bits is not necessarily a base-2 number. It may be interpreted as a base-2 number--or as something else.
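The positional rule can be written out directly. The following is a minimal sketch; the function name and the split into integer-part and fraction-part digit lists are illustrative assumptions. Exact fractions are used so that the result is not itself disturbed by rounding.

```python
from fractions import Fraction


def positional_value(int_digits, frac_digits, radix):
    """Sum each digit times the power of the radix that its position implies."""
    total = Fraction(0)
    # Left of the radix point: powers 0, 1, 2, ... moving leftward.
    for power, digit in enumerate(reversed(int_digits)):
        total += digit * Fraction(radix) ** power
    # Right of the radix point: powers -1, -2, ... moving rightward.
    for position, digit in enumerate(frac_digits, start=1):
        total += digit * Fraction(radix) ** -position
    return total


print(positional_value([2, 5], [3, 9], 10))   # decimal 25.39, as 2539/100
print(positional_value([1, 0, 1], [1, 1], 2)) # binary 101.11, as 23/4 (= 5.75)
```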
The position of the point in a fixed-point number system is, as advertised, fixed. In our cash-counting decimal number system, the radix point is fixed two digits in from the right. Fixed-point arithmetic is arithmetic that utilizes such numbers. Note that a fixed-point number system specifies a finite set of numbers evenly spaced along a finite interval on the number line. In the examples already given, all the integers between 1 and 65,536 and all the hundredths between 0 and 655.35 are specified. This even-spacing property is a good thing in a context where we care just as much about the differences between large numbers as we do about those between small ones, as we do when counting cash. (The difference between $655.34 and $655.35 is worth just as much as the difference between $0.00 and $0.01.) In many contexts, however, we desire greater precision when handling small numbers than when handling large ones.
Consider voltages. In measuring a voltage of, say, 1,000,000 volts, we do not need to record its value to a millionth of a volt: even if obtainable, such information would probably be meaningless. Yet in measuring a voltage of .012 thousandths of a volt, a millionth of a volt is important. Unlike money, voltage is continuous (i.e., may take on any value at all, at least on the macro scale). When changes in such a variable are small, we desire a number system with high resolution so we can record them; when they are large, we can generally afford to sacrifice resolution for range. Many physical variables (velocity, mass, temperature, etc.) are continuous; hence the need, in computer design, for a number system that (unlike fixed-point arithmetic) resolves small values finely without wasting bits on resolving large values just as finely. Floating-point arithmetic meets this need.
Floating-point numbers are essentially a binary version of ordinary scientific notation. A number written in scientific notation consists of three parts: the significand (or "fraction," often mistakenly called the "mantissa"), the radix, and the exponent. In the number -.23345 x 10^2, for instance, the significand is -.23345, the radix is 10, and the exponent is 2. In floating-point computer arithmetic, a 16-bit, 32-bit, or longer string of bits is divided into two sub-strings, one to represent the significand and the other to represent the exponent; the radix is 2 by definition.
Consider a 16-bit word divided up into significand (M) and exponent (E) fields that are nine and seven bits long, respectively:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
M M M M M M M M M E E E E E E E
Since there are seven E bits, each of which has two possible values, there are 2^7 or 128 possible exponents. The first bit of the exponent is usually devoted to marking it positive or negative, so half the 128 representable exponents will be positive and the other half negative. Similarly, the nine M bits (including one sign bit) can represent 2^9 or 512 significands, half positive and half negative. A 16-bit floating-point word can therefore represent 128 x 512 = 65,536 = 2^16 numbers, as expected. The point is said to be "floating" in such a system because the location of the radix point depends on the value of the exponent. (Compare 1 x 10^1 = 10.00 to 1 x 10^2 = 100.0.)
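To make the 16-bit layout concrete, here is one way such a word might be decoded. The article does not specify the encoding of either field, so several choices below are assumptions made only for illustration: both fields use a leading sign bit followed by a magnitude, the eight significand magnitude bits are read as a fraction m/256, and bits 1-9 are taken to be the most significant bits of the word.

```python
def decode(word: int) -> float:
    """Decode a 16-bit word as significand * 2**exponent (assumed encoding)."""
    m_field = (word >> 7) & 0x1FF  # bits 1-9: significand field
    e_field = word & 0x7F          # bits 10-16: exponent field

    m_sign = -1 if m_field & 0x100 else 1  # first M bit is the sign
    m_magnitude = m_field & 0xFF           # remaining 8 bits: 0..255
    e_sign = -1 if e_field & 0x40 else 1   # first E bit is the sign
    e_magnitude = e_field & 0x3F           # remaining 6 bits: 0..63

    significand = m_sign * m_magnitude / 256  # fraction with magnitude below 1
    exponent = e_sign * e_magnitude
    return significand * 2.0 ** exponent


# Significand 0.5 (magnitude 128), exponent +1: 0.5 * 2 = 1.0
print(decode((0x080 << 7) | 0x01))
# Largest value under these assumptions: (255/256) * 2**63
print(decode((0x0FF << 7) | 0x3F))
```

Under these assumptions the largest representable magnitude is just below 2^63, which is roughly 9.2 x 10^18.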
The rules for floating-point arithmetic resemble those for manipulating scientific notation. In multiplying two floating-point numbers, for example, the significands are multiplied, the exponents are added, and both terms are adjusted to produce a significand with its radix point fixed in the proper place. Computers implement these rules at the register level, so the user is usually free to ignore them.
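The multiplication rule can be imitated in software. The sketch below is not how any particular processor implements it; it uses Python's standard `math.frexp` and `math.ldexp` to split ordinary floats into a significand and a base-2 exponent and to reassemble them. The function name is hypothetical.

```python
import math


def fp_multiply(a: float, b: float) -> float:
    """Multiply by the scientific-notation rule: multiply significands, add exponents."""
    sig_a, exp_a = math.frexp(a)  # a == sig_a * 2**exp_a, with 0.5 <= |sig_a| < 1
    sig_b, exp_b = math.frexp(b)

    sig = sig_a * sig_b  # multiply the significands
    exp = exp_a + exp_b  # add the exponents

    # The product of two significands may fall below 0.5, so renormalize
    # to put the radix point back in its proper place.
    sig, shift = math.frexp(sig)
    return math.ldexp(sig, exp + shift)


print(fp_multiply(6.0, 0.75))  # 4.5
print(fp_multiply(3.0, 5.0))   # 15.0
```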
Some textbooks state misleadingly that the advantage of floating-point arithmetic is that it can represent a greater "range" of numbers than fixed-point arithmetic. Yet fixed-point numbers are not range-limited by nature. An N-bit fixed-point number can specify 2^N numbers over any range: 0 to 10^1,000, or 0 to 1, or -2 to +9, or any other finite interval. What sets apart the 2^N numbers specified by a standard floating-point format is that half of them--those with negative exponents--are crowded into the space between -1 and +1, while the rest are spread out over a far wider interval (-10^19 to +10^19 for the 16-bit format shown above--and remember, 10^19 = 10,000,000,000,000,000,000). It is thus the dynamic range--the ratio between the smallest value the number system can represent and the largest it can represent--that is so much greater for floating-point arithmetic than for fixed-point arithmetic.
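The difference in dynamic range can be put in numbers. The comparison below sets the 16-bit cash-register scheme against a 16-bit floating-point word with a 9-bit significand and a 7-bit exponent. The floating-point figures rest on assumptions the article does not state: each field has a sign bit, the significand magnitude is a fraction m/256 with m from 1 to 255, and the exponent magnitude runs from 0 to 63. Zero is excluded from both ratios.

```python
from fractions import Fraction

# Fixed point: hundredths of a dollar, $0.01 up to $655.35.
fixed_smallest = Fraction(1, 100)
fixed_largest = Fraction(65535, 100)
fixed_ratio = fixed_largest / fixed_smallest

# Floating point (assumed encoding): smallest and largest positive magnitudes.
float_smallest = Fraction(1, 256) * Fraction(2) ** -63
float_largest = Fraction(255, 256) * Fraction(2) ** 63
float_ratio = float_largest / float_smallest

print(fixed_ratio)         # 65535
print(float(float_ratio))  # on the order of 10**40
```

Both schemes spend the same 16 bits; the floating-point word buys its enormous ratio by spacing its large values far apart.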
Floating-point arithmetic remains an area of active research. The nagging issue is round-off error. Any number that does not happen to be one of the numbers exactly representable by a floating-point system--and there are infinitely many such numbers, including --must be approximated by the nearest exactly representable number; this substitution destroys information. In long chains of computations, such as those carried out in simulations of physical phenomena, these errors may accumulate and become a significant problem.
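Round-off error is easy to observe. The sketch below assumes Python's built-in float, which on essentially all current platforms is a 64-bit binary floating-point number. One tenth is not exactly representable in such a format, so the stored value differs slightly from 1/10, and repeated addition lets the discrepancy surface. The function name is hypothetical; `math.fsum` is the standard-library routine that tracks the lost low-order bits.

```python
import math
from fractions import Fraction


def naive_sum(values):
    """Add values one at a time, rounding after every addition."""
    total = 0.0
    for value in values:
        total += value
    return total


tenths = [0.1] * 10

# The stored value of 0.1 is the nearest representable number, not 1/10 itself.
print(Fraction(0.1) == Fraction(1, 10))  # False

print(naive_sum(tenths) == 1.0)  # False: the rounding errors have accumulated
print(math.fsum(tenths) == 1.0)  # True: compensated summation recovers 1.0
```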