Floating-point datatypes in ASE
Contents:
The floating point datatype was designed to hold a wide
range of values and allow fairly rapid arithmetical operations on them, at
the expense of absolute accuracy. Imprecision is inherent in floating point
datatypes, but the exact nature of the imprecision is a source of much
confusion. This TechNote explains the nature of imprecision in this datatype.
Some aspects of actual floating point implementation, such as the final
two's-complement representation, have been simplified or ignored.
Note:
Sybase did not develop the floating point datatype; it
is a widely used IEEE standard. C or C++ programs on the same platform
as Adaptive Server will demonstrate similar floating point behavior (see
Question 5 in the FAQ section below).
There are two common, standard types of floating point
numbers: 4-byte reals and 8-byte doubles. reals and
doubles store values in similar format: 1 sign bit, x exponent bits,
and y mantissa bits. The only difference is that reals use smaller
exponents and mantissas.
According to the IEEE standard for floating point, real
datatypes have a 23 bit mantissa and a 9 bit exponent (total 32). double
datatypes have a 53 bit mantissa and an 11 bit exponent (total 64).
Note:
The first bit of the mantissa is implicit. It is
not actually stored as it is always on.
Some platforms use different standards. On the VAX, for instance, a
double uses a 56-bit mantissa, an 8-bit exponent, and one sign bit
for a total of 64 bits.
Mantissa and exponent
The mantissa is a binary representation of the number,
each bit representing a power of two. There is an additional implicit bit
at the beginning (left hand side) of the mantissa, which is always on.
The exponent is a power of two that multiplies (or shifts)
the mantissa to represent larger or smaller values. The first bit of the
mantissa represents the value 2^<exponent>, the second bit
2^<exponent-1>, and so on.
After the mantissa bits needed to represent the whole
number part of the number have been used, the fractional part of the number
is represented with the remaining bits (if any), which have values of negative
powers of two.
Example
For a simple demonstration, consider an even smaller
floating point format with a 12 bit mantissa (including one implicit bit),
one sign bit, and four exponent bits for a total of 16 bits.
# [1]########### ####
^ ^ ^
^
| | |
exponent
| | |
| | mantissa
| |
| virtual bit of mantissa (always
on)
|
sign bit
To represent a number:
-
determine "is the number positive or negative?". If it is negative, set
the sign bit.
-
Next determine "what is the smallest power of 2 that is larger
than the number?".
-
Subtract one from that power to find the exponent of the implicit bit in
the mantissa. Store that exponent in the exponent field.
-
Subtract the value of the implicit bit (2^exponent) from the number.
-
Compare 2^(exponent-1) to the remainder.
-
If 2^(exponent-1) is less than the remainder, set the next bit
and subtract the value of that bit from the number.
-
Compare 2^(exponent-2) to the remainder.
-
Repeat in this manner until you run out of mantissa bits.
For instance, to represent 123.3:
The number is positive, so the sign bit is set to 0:
0 [1]########### ####
The smallest power of 2 that is larger than 123.3 is
128, or 2^7, so the exponent is 7-1, or 6. 2^6 is 64, the value of the
implicit mantissa bit:
0 [1]########### 0110
123.3 - 64 is 59.3. 2^5 is 32, which is smaller than
59.3, so the next bit is set:
0 [1]1########## 0110
59.3-32 = 27.3. 2^4 is 16, which is smaller than 27.3,
so the next bit is set:
0 [1]11######### 0110
27.3 - 16 = 11.3. 2^3 is 8, which is smaller than 11.3,
so the next bit is set:
0 [1]111######## 0110
11.3 - 8 = 3.3. 2^2 is 4, which is larger than 3.3, so
the next bit is not set:
0 [1]1110####### 0110
3.3 - 0 = 3.3. 2^1 is 2, which is smaller than 3.3, so
the next bit is set:
0 [1]11101###### 0110
3.3 - 2 = 1.3. 2^0 is 1, which is smaller than 1.3, so
the next bit is set:
0 [1]111011##### 0110
1.3 - 1 = 0.3. 2^-1 is 0.5, which is larger than 0.3, so
the next bit is not set:
0 [1]1110110#### 0110
0.3 - 0 = 0.3. 2^-2 is 0.25, which is smaller than 0.3,
so the next bit is set:
0 [1]11101101### 0110
0.3 - 0.25 = 0.05. 2^-3 is 0.125, which is larger than
0.05, so the next bit is not set:
0 [1]111011010## 0110
0.05 - 0 = 0.05. 2^-4 is 0.06125, which is larger then
0.05, so the next bit is not set:
0 [1]1110110100# 0110
0.05 - 0 = 0.05. 2^-5 is 0.030625, which is smaller than
0.05, so the next bit is set:
0 [1]11101101001 0110
This represents the actual value:
64 + 32 + 16 + 8 + 2 + 1 + 0.25 + 0.030625 = 123.280625
123.3 - 123.280625 is an error of 0.019375.
It may be possible to reduce the error by rounding up to
the next larger number that can be represented (ie, add 2^-5). This works
out to:
0 [1]11101101001 0110
+
1
0 [1]11101101010 0110
64 + 32 + 16 + 8 + 2 + 1 + 0.25 + 0.06125 = 123.31125
123.3 - 123.31125 is an error of -0.01125.
This is a smaller error, so the representation is rounded to:
0 [1]11101101010 0110 (123.31125) as the final
representation.
The standard real and double floating point
formats work exactly the same, except they have wider mantissas that reduce
the magnitude of the potential error, and wider exponents that extend the
possible range of the number.
Frequently Asked Questions
1) Why doesn't round() work right with floating point? There is
garbage in the sixth decimal place when I round a floating point number,
as in:
declare @x real
select @x = 123.31
select @x = round(@x,1)
select @x
go
-------------
123.300003
The decimal rounded value of 123.31 is 123.3, but
the real datatype cannot store 123.3 exactly. The garbage is due
to the imprecision inherent in the floating point format. There is also
a display issue: by default, isql displays floats with 6
digits after the decimal point. Some front-end programs are more intelligent
about this: they know how many digits of the number will be accurate and
truncate or round the display at that point. You can use the T-SQL str()
function for better control over the display of floating point data (str()
is documented under "String Functions" in the ASE manuals.) For example,
here we rounded to one decimal place, so there is no need to display past one
decimal place:
select str(@x,8,1)
go
--------
123.3
(1 row affected)
2) So just how inaccurate are reals and doubles?
4-byte reals can store a number with a maximum
error of (the number)*(2^-23). 8-byte doubles can store
a number with a maximum error of (the number)*(2^-53).
As a rule of thumb, this means that you can expect the
first 7 digits of a real to be correct, and the first 15 digits
of a double to be correct. After that, you may start seeing signs
of inaccuracy or "garbage."
3) When I declare a column or variable to be of type float, there
is an optional [precision] specification. What effect does this have?
If precision is less than 16, the server will use a 4-byte
real.
If the precision is >= 16, the server will use an 8-byte double.
You can explicitly tell the server to use type real or type double
precision. If you do not specify a precision and use float, the server
will default to a double.
Other than this, the [precision] specification has no effect.
The syntax may seem somewhat pointless, but it allows for compatibility
with DDL developed for other systems that interpret [precision] differently.
4) So floating point only has problems storing fractions, right?
That is not true. You can see problems with whole numbers
too. For instance, reals have 23 bits in the mantissa, so they will
have problems with numbers that require more than 23 bits to represent
them correctly.
The smallest value for which we see this is 2^24+1.
reals
can store upto 2^24 with no problem (the implicit bit is on, the
exponent is set to 24, and all the other mantissa bits are zeroed); but
2^24+1
requires 22 zero bits and a final bit following the implicit bit (24 bits
total, only 23 available).
1> select power(2,24)-1, power(2,24), power(2,24)+1
2> go
----------- ----------- -----------
16777215 16777216
16777217
(1 row affected)
1> create table float_test (x real, y real, z real)
2> go
1> insert float_test values (power(2,24)-1, power(2,24),
power(2,24)+1)
2> go
(1 row affected)
1> select * from float_test
2> go
x
y
z
-------------------- -------------------- --------------------
16777215.000000
16777216.000000 16777216.000000
(1 row affected)
Note that the closest representation of 2^24+1
in a real is equal to 2^24.
5) I don't see this behavior in my C/C++ program. What's up?
You are probably not looking hard enough. In general,
printf()
in C and cout in C++ do not print out with enough precision to show
the problems. The imprecision is hidden by the rounding done by the display
process.
Try specifying a higher precision, as in these two sample
programs:
=========================================
For C:
=========================================
main()
/* Program to demonstrate floating point imprecision */
{
float r;
double d ;
r = 123.3;
d = 123.3;
printf("As a %d-byte float (real): 123.3 is %48.24f
\n",
sizeof(r),r);
printf("As a %d-byte double: 123.3 is %48.24f \n", sizeof(d),d);
}
Sample output on Solaris 2.5:
alliance1{bret}125: a.out
As a 4-byte real: 123.3 is
123.300003051757812500000000
As a 8-byte double: 123.3 is
123.299999999999997157829057
=========================================
For C++:
=========================================
#include <iostream.h>
#include <iomanip.h>
main()
-- Program to demonstrate floating point inaccuracy.
{
int precision;
float y;
y = 123.3;
cout << "123.3
as a float printed with increasing precision" <<
endl;
cout <<
"-------------------------------------------------------------"
<< endl;
for (precision = 1;
precision < 30; precision++)
{
cout.precision(precision);
cout <<precision
<< " " << y << endl;
}
double x;
x = 123.3;
cout << endl;
cout << "123.3
as a double, printed with increasing precision"
<< endl;
cout <<
"-------------------------------------------------------------"
<< endl;
for (precision = 1; precision
< 30; precision++)
{
cout.precision(precision);
cout <<precision
<< " " << x << endl;
}
}
Sample output on Solaris 2.5:
alliance1{bret}140: a.out
123.3 as a float printed with increasing precision
-------------------------------------------------------------
1 1e+02
2 1.2e+02
3 123
4 123.3
5 123.3
6 123.3
7 123.3
8 123.3
9 123.300003
10 123.3000031
11 123.30000305
12 123.300003052
13 123.3000030518
14 123.30000305176
15 123.300003051758
16 123.3000030517578
17 123.30000305175781
18 123.300003051757812
19 123.3000030517578125
20 123.3000030517578125
21 123.3000030517578125
22 123.3000030517578125
23 123.3000030517578125
24 123.3000030517578125
25 123.3000030517578125
26 123.3000030517578125
27 123.3000030517578125
28 123.3000030517578125
29 123.3000030517578125
123.3 as a double, printed with increasing precision
-------------------------------------------------------------
1 1e+02
2 1.2e+02
3 123
4 123.3
5 123.3
6 123.3
7 123.3
8 123.3
9 123.3
10 123.3
11 123.3
12 123.3
13 123.3
14 123.3
15 123.3
16 123.3
17 123.3
18 123.299999999999997
19 123.2999999999999972
20 123.29999999999999716
21 123.299999999999997158
22 123.2999999999999971578
23 123.29999999999999715783
24 123.299999999999997157829
25 123.2999999999999971578291
26 123.29999999999999715782906
27 123.299999999999997157829057
28 123.299999999999997157829057
29 123.29999999999999715782905696
6) Where can I find more information on floating points?
Many books on assembly language programming go into great detail.
A search on the World Wide Web for keywords "IEEE" and "floating" will
provide many documents of interest.