### Floating Point Arithmetic

```
Floating Point Format
What do floating-point numbers represent?
• Rational numbers with non-repeating expansions
in the given base within the specified exponent range.
• They do not represent repeating rational or irrational
numbers, or any number too small or too large.
CMPE12c
1
Gabriel Hugh Elkaim
IEEE Double Precision FP
• IEEE Double Precision is similar to SP
– 52-bit M
• 53 bits of precision with hidden bit
– 11-bit E, excess 1023, representing exponents from –1022 to 1023
– One sign bit
• Always use DP unless memory/file size is important
– SP ~ 10^-38 … 10^38
– DP ~ 10^-308 … 10^308
• Be very careful of these ranges in numeric
computation
Floating Point Arithmetic
Floating Point operations include
•Addition
•Subtraction
•Multiplication
•Division
They are complicated because…
Decimal Review
    9.997 x 10^2
  + 4.631 x 10^-1
How do we do this?
1. Align decimal points
2. Add
      9.997    x 10^2
  +   0.004631 x 10^2
     10.001631 x 10^2
3. Normalize the result
• If the result is not already normalized, move the decimal point one digit
  1.0001631 x 10^3
4. Round result
  1.000 x 10^3
Example: 0.25 + 100 in SP FP
First step: get into SP FP if not already
.25 = 0 01111101 00000000000000000000000
100 = 0 10000101 10010000000000000000000
Or with hidden bit (shown between E and F)
.25 = 0 01111101 1 00000000000000000000000
100 = 0 10000101 1 10010000000000000000000
Second step: Align radix points
– Shifting F left by 1 bit, decreasing e by 1
– Shifting F right by 1 bit, increasing e by 1
– Shift F right so least significant bits fall off
– Which of the two numbers should we shift?
Second step: Align radix points cont.
Shift the .25 to increase its exponent so it matches
that of 100.
0.25’s e: 01111101 – 01111111 (127) = –2
100’s e:  10000101 – 01111111 (127) = 6
The exponents differ by 6 – (–2) = 8, so shift .25 by 8.
Easier method: Bias cancels with subtraction, so
    10000101   (100’s E)
  - 01111101   (0.25’s E)
    00001000   (= 8)
Carefully shifting 0.25’s fraction
S  E         HB  F
0  01111101  1   00000000000000000000000  (original value)
0  01111110  0   10000000000000000000000  (shifted by 1)
0  01111111  0   01000000000000000000000  (shifted by 2)
0  10000000  0   00100000000000000000000  (shifted by 3)
0  10000001  0   00010000000000000000000  (shifted by 4)
0  10000010  0   00001000000000000000000  (shifted by 5)
0  10000011  0   00000100000000000000000  (shifted by 6)
0  10000100  0   00000010000000000000000  (shifted by 7)
0  10000101  0   00000001000000000000000  (shifted by 8)
Third Step: Add fractions with hidden bit
    0 10000101 1 10010000000000000000000 (100)
  + 0 10000101 0 00000001000000000000000 (.25)
    0 10000101 1 10010001000000000000000
Fourth Step: Normalize the result
• Get a ‘1’ back in hidden bit
• Already normalized most of the time
• Remove hidden bit and finished
Normalization example
     S  E    HB  F
     0  011   1  1100
  +  0  011   1  1011
     0  011  11  0111
Need to shift so that only a 1 in HB spot
     0  100   1  1011 1  (shifted right by 1, E increased by 1; the last bit no longer fits in the 4-bit F)
Floating Point Example
• 0xD4F80000 + 0x56B00000
Another SP FP Example
• 0xD5D00000 + 0x54600000
Floating Point Subtraction
•Mantissas are sign-magnitude
•Watch out when the numbers are close
    1.23455 x 10^2
  - 1.23456 x 10^2
•A many-digit normalization is possible
This is why FP addition is in many ways more
difficult than FP multiplication
Floating Point Subtraction
Steps to do subtraction
1. Align radix points
2. Perform sign-magnitude operand swap if needed
   • Compare magnitudes (with hidden bit)
   • Change sign bit if order of operands is changed.
3. Subtract
4. Normalize
5. Round
Floating Point Subtraction
Simple Example:
     S  E    HB  F
     0  011  1   1011   (smaller)
  -  0  011  1   1101   (bigger)
switch order and make result negative
     0  011  1   1101   (bigger)
  -  0  011  1   1011   (smaller)
     1  011  0   0010   (switched sign)
     1  000  1   0000   (normalized)
Floating Point Multiplication
Decimal example:
    3.0 x 10^1
  x 5.0 x 10^2
How do we do this?
1. Multiply mantissas
     3.0
   x 5.0
   15.00
2. Add exponents
   1 + 2 = 3
3. Combine
   15.00 x 10^3
4. Normalize if needed
   1.50 x 10^4
Floating Point Multiplication
Multiplication in binary (4-bit F)
    0 10000100 0100
  x 1 00111100 1100
Step 1: Multiply mantissas
(put hidden bit back first!!)
         1.0100
       x 1.1100
          00000
         00000
        10100
       10100
    + 10100
     1000110000
   = 10.00110000
Floating Point Multiplication
Second step: Add exponents, subtract extra bias.
    10000100
  + 00111100
    11000000
  - 01111111 (127)
    01000001
Third step: Renormalize, correcting exponent
(sign of the product: 0 XOR 1 = 1)
  1 01000001 10.00110000
Becomes
  1 01000010 1.000110000
Fourth step: Drop the hidden bit
  1 01000010 000110000
Floating Point Multiplication
Multiply these SP FP numbers together
x
0x49FC0000
0x4BE00000
Another SP FP Example
• 0xC9F4 × 0x484F
Floating Point Division
•True division
  – Unsigned, full-precision division on mantissas
  – This is much more costly (e.g. 4x) than mult.
  – Subtract exponents
•Faster division
  – Newton’s method to find reciprocal
  – Multiply dividend by reciprocal of divisor
  – May not yield exact result without some work
  – Similar speed as multiplication
```
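
The single-precision addition walkthrough in the slides (align radix points, add fractions with the hidden bit, normalize, drop the hidden bit) can be sketched in Python on raw bit patterns. This is only an illustration under stated assumptions, not the course's reference implementation: the names `unpack_sp`, `fp_add`, and `to_float` are hypothetical, inputs are assumed to be normalized, bits shifted out during alignment are truncated rather than rounded, and exponent overflow or underflow is ignored. The magnitude comparison is done before alignment here purely for convenience.

```python
import struct

BIAS = 127        # excess-127 exponent used by single precision
FRAC_BITS = 23    # stored fraction bits (the hidden bit is not stored)
FRAC_MASK = (1 << FRAC_BITS) - 1

def unpack_sp(bits):
    """Split a 32-bit pattern into (sign, biased E, 24-bit significand)."""
    sign = bits >> 31
    e = (bits >> FRAC_BITS) & 0xFF
    frac = bits & FRAC_MASK
    return sign, e, frac | (1 << FRAC_BITS)   # put the hidden bit back

def to_float(bits):
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

def fp_add(a_bits, b_bits):
    """Sign-magnitude add of two normalized SP patterns (truncating sketch)."""
    sa, ea, ma = unpack_sp(a_bits)
    sb, eb, mb = unpack_sp(b_bits)
    # Operand swap: make 'a' the operand with the larger magnitude,
    # so the result takes a's sign and the subtraction never goes negative.
    if (ea, ma) < (eb, mb):
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma
    # Align radix points: shift the smaller operand right.
    # The bias cancels in ea - eb, as in the "easier method" on the slide.
    mb >>= (ea - eb)
    # Add magnitudes if the signs agree, otherwise subtract them.
    m = ma + mb if sa == sb else ma - mb
    if m == 0:
        return 0
    e = ea
    # Normalize: get exactly one 1 back into the hidden-bit position.
    while m >= (1 << (FRAC_BITS + 1)):
        m >>= 1
        e += 1
    while m < (1 << FRAC_BITS):
        m <<= 1
        e -= 1
    # Drop the hidden bit and repack the three fields.
    return (sa << 31) | (e << FRAC_BITS) | (m & FRAC_MASK)

print(hex(fp_add(0x3E800000, 0x42C80000)))   # 0x42c88000, i.e. 0.25 + 100
print(to_float(fp_add(0x3E800000, 0x42C80000)))   # 100.25
```

Under the same assumptions, `fp_add(0xD4F80000, 0x56B00000)` and `fp_add(0xD5D00000, 0x54600000)` give `0x56A08000` and `0xD5B40000`, which can be used to check hand-worked answers to the two hexadecimal exercises. Because the operands are sign-magnitude, the same function also covers subtraction when the signs differ.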
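
The warning about subtracting numbers that are close can be illustrated with the two decimal operands from the subtraction slide. The sketch below rounds each operand to single precision and subtracts; the helper name `to_sp` is an assumption made for this example, and Python's double-precision `float` is used only as a convenient way to hold the single-precision values exactly.

```python
import struct

def to_sp(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

a = to_sp(123.456)   # 1.23456 x 10^2 as stored in SP
b = to_sp(123.455)   # 1.23455 x 10^2 as stored in SP

# Both values lie in [64, 128), so each is a multiple of 2**-17.
# Their difference is computed exactly, but it inherits the rounding
# error of the inputs, which is large relative to the tiny result.
diff = a - b
rel_error = abs(diff - 0.001) / 0.001

print(diff)        # close to 0.001, but not equal to it
print(rel_error)   # far larger than the roughly 1e-7 relative precision of SP
```

The leading digits cancel, so the result has to be renormalized by many positions, and the digits that move up into the significand carry no real information.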
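
The multiplication steps from the slides (multiply mantissas with the hidden bit restored, add exponents and subtract the extra bias, renormalize, drop the hidden bit) can be sketched the same way. As before this is a hedged illustration: `unpack_sp` and `fp_mul` are hypothetical names, inputs are assumed normalized, the product is truncated rather than rounded, and overflow or underflow of the exponent is not handled.

```python
BIAS = 127        # excess-127 exponent used by single precision
FRAC_BITS = 23    # stored fraction bits (the hidden bit is not stored)
FRAC_MASK = (1 << FRAC_BITS) - 1

def unpack_sp(bits):
    """Split a 32-bit pattern into (sign, biased E, 24-bit significand)."""
    sign = bits >> 31
    e = (bits >> FRAC_BITS) & 0xFF
    frac = bits & FRAC_MASK
    return sign, e, frac | (1 << FRAC_BITS)   # put the hidden bit back

def fp_mul(a_bits, b_bits):
    """Multiply two normalized SP patterns (truncating sketch)."""
    sa, ea, ma = unpack_sp(a_bits)
    sb, eb, mb = unpack_sp(b_bits)
    sign = sa ^ sb                 # signs combine with XOR
    # Step 1: multiply the 24-bit significands; the product has
    # 2 * FRAC_BITS fraction bits and a value in [1, 4).
    product = ma * mb
    # Step 2: add exponents and subtract the extra bias.
    e = ea + eb - BIAS
    # Step 3: renormalize; at most one right shift is ever needed.
    if product >= (1 << (2 * FRAC_BITS + 1)):
        product >>= 1
        e += 1
    # Truncate back to a 24-bit significand.
    m = product >> FRAC_BITS
    # Step 4: drop the hidden bit and repack.
    return (sign << 31) | (e << FRAC_BITS) | (m & FRAC_MASK)

# The slide's 4-bit-F example, with each F padded with zeros to 23 bits:
# 0 10000100 0100... times 1 00111100 1100...
print(hex(fp_mul(0x42200000, 0x9E600000)))   # 0xa10c0000 = 1 01000010 00011000...
print(hex(fp_mul(0x49FC0000, 0x4BE00000)))   # 0x565c8000
```

The first call reproduces the sign, exponent `01000010`, and fraction `00011` derived on the slides; the second can be used to check a hand-worked answer to the `0x49FC0000` x `0x4BE00000` exercise.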
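
The "faster division" idea (find the reciprocal of the divisor with Newton's method, then multiply) can be sketched as follows. This is an assumption-laden illustration rather than how any particular hardware does it: the function names are hypothetical, it runs on Python doubles instead of bit fields, it only handles positive divisors, and the linear starting guess `48/17 - 32/17 * mant` is one common textbook choice, not something taken from the slides.

```python
import math

def reciprocal_newton(d, iterations=5):
    """Approximate 1/d for d > 0 with the iteration x <- x * (2 - d * x)."""
    if d <= 0.0:
        raise ValueError("this sketch only handles d > 0")
    # Scale d into [0.5, 1): d = mant * 2**exp.
    mant, exp = math.frexp(d)
    # Linear initial guess for 1/mant on [0.5, 1].
    x = 48.0 / 17.0 - (32.0 / 17.0) * mant
    # Each iteration uses only multiplies and a subtract, and roughly
    # doubles the number of correct bits.
    for _ in range(iterations):
        x = x * (2.0 - mant * x)
    # Undo the scaling: 1/d = (1/mant) * 2**(-exp).
    return math.ldexp(x, -exp)

def divide(n, d):
    """Divide by multiplying the dividend by the reciprocal of the divisor."""
    return n * reciprocal_newton(d)

print(divide(1.0, 3.0))
print(divide(355.0, 113.0))
```

The quotient computed this way can differ from true division in the last bit, which is the "may not yield exact result without some work" caveat on the slide.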