Floating-point Algorithms
• addition algorithm (simplified):
(1) Compare the exponents of the two numbers. Shift the mantissa of the number with the smaller
exponent to the right until its exponent matches the larger exponent
(2) Add the mantissas
(3) Normalize the sum if needed (shift the mantissa right/left and adjust the exponent accordingly)
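The three addition steps can be sketched in software for single precision. This is a minimal illustration, not how hardware does it: it assumes both inputs are positive normal float32 values, it drops (truncates) the bits shifted out instead of rounding, and it ignores overflow. The helper names `decode`, `encode`, and `add_positive` are hypothetical.

```python
import struct

def decode(x):
    """Split a positive normal float32 into (biased exponent, 24-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = (bits & 0x7FFFFF) | (1 << 23)  # restore the implicit leading 1
    return exponent, mantissa

def encode(exponent, mantissa):
    """Pack a biased exponent and a normalized 24-bit mantissa into a float32."""
    bits = (exponent << 23) | (mantissa & 0x7FFFFF)  # drop the implicit bit again
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def add_positive(a, b):
    ea, ma = decode(a)
    eb, mb = decode(b)
    # (1) Make (ea, ma) the operand with the larger exponent, then align the other
    if ea < eb:
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= ea - eb  # assumption: shifted-out bits are truncated, not rounded
    # (2) Add the aligned mantissas
    mantissa = ma + mb
    exponent = ea
    # (3) Normalize: the sum is below 2^25, so at most one right shift is needed
    if mantissa >= (1 << 24):
        mantissa >>= 1
        exponent += 1
    return encode(exponent, mantissa)
```

For example, `add_positive(1.0, 2.0**-24)` returns `1.0`: the smaller operand is shifted out entirely during alignment, which is the absorption effect of adding numbers with very different magnitudes.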
• multiplication algorithm (simplified):
(1) Multiply the mantissas. The result has twice as many bits as the operands
(in single precision: 46 + 2 bits, where the +2 comes from the implicit leading 1 of each operand)
(2) Normalize the product if needed (shift the mantissa by one position and adjust the exponent by 1)
(3) Add the exponents of the two operands
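The multiplication steps can be sketched the same way for single precision. The assumptions are the same as for any simplified model: positive normal float32 inputs, truncation instead of rounding, no overflow or underflow handling. The names `decode`, `encode`, and `mul_positive` are hypothetical; the constant 127 is the float32 exponent bias, which has to be subtracted once because both stored exponents include it.

```python
import struct

def decode(x):
    """Split a positive normal float32 into (biased exponent, 24-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | (1 << 23)

def encode(exponent, mantissa):
    """Pack a biased exponent and a normalized 24-bit mantissa into a float32."""
    bits = (exponent << 23) | (mantissa & 0x7FFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def mul_positive(a, b):
    ea, ma = decode(a)
    eb, mb = decode(b)
    # (1) Multiply the 24-bit mantissas: the product occupies 47 or 48 bits
    product = ma * mb
    # (3) Add the exponents, removing the bias that would otherwise count twice
    exponent = ea + eb - 127
    # (2) Normalize: if the product reached 2^47, shift one extra position
    if product >= (1 << 47):
        mantissa = product >> 24  # assumption: low bits are truncated
        exponent += 1
    else:
        mantissa = product >> 23
    return encode(exponent, mantissa)
```

For example, `mul_positive(1.5, 1.5)` needs the extra normalization shift (the mantissa product is 2.25, outside [1, 2)), while `mul_positive(3.0, 0.5)` does not.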
• fused multiply-add (fma):
• Recent architectures (including GPUs) provide fma to compute a multiplication followed by an
addition in a single instruction (the compiler emits it automatically in most cases)
• The rounding error of fma(x, y, z) is no larger than that of (x ⊗ y) ⊕ z, because the result is
rounded only once
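The effect of the single rounding can be illustrated without hardware fma support. The sketch below emulates it in double precision by computing x * y + z exactly with rational arithmetic and rounding once at the end; `fma_emulated` and `mul_then_add` are hypothetical names, and the inputs are chosen so that the separate multiplication loses exactly the bits that matter.

```python
from fractions import Fraction

def fma_emulated(x, y, z):
    """Compute x * y + z exactly, then round to the nearest double once."""
    exact = Fraction(x) * Fraction(y) + Fraction(z)  # no rounding so far
    return float(exact)                              # single rounding step

def mul_then_add(x, y, z):
    """Compute (x ⊗ y) ⊕ z: the product is rounded before the addition."""
    return x * y + z

x = 1.0 + 2.0**-30
y = 1.0 - 2.0**-30
z = -1.0

# The exact product is 1 - 2^-60, which rounds to 1.0 in double precision,
# so the two-step version cancels to zero and loses the whole result.
two_roundings = mul_then_add(x, y, z)
one_rounding = fma_emulated(x, y, z)
```

Here `two_roundings` is `0.0`, while `one_rounding` is `-2.0**-60`, the exact value of x * y + z.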