Modern C++
Programming
3. Basic Concepts II
Integral and Floating-point Types
Federico Busato
2024-11-05
Table of Contents
1 Integral Data Types
Fixed Width Integers
size_t
ptrdiff_t
uintptr_t
Arithmetic Operation Semantics
Promotion, Truncation
Undefined Behavior
Saturation Arithmetic
Table of Contents
2 Floating-point Types and Arithmetic
IEEE Floating-point Standard and Other Representations
Normal/Denormal Values
Infinity (∞)
Not a Number (NaN)
Machine Epsilon
Units in the Last Place (ULP)
Cheatsheet
Limits and Useful Functions
Table of Contents
Arithmetic Properties
Special Values Behavior
Floating-Point Undefined Behavior
Detect Floating-point Errors
3 Floating-point Issues
Catastrophic Cancellation
Floating-point Comparison
Integral Data Types
A Firmware Bug
“Certain SSDs have a firmware bug causing them to irrecoverably fail after
exactly 32,768 hours of operation. SSDs that were put into service at the
same time will fail simultaneously, so RAID won’t help”
HPE SAS Solid State Drives - Critical Firmware Upgrade
Via twitter.com/martinkl/status/1202235877520482306?s=19
Overflow Implementations
Note: computing the average of two integers correctly is not trivial, see On finding the average
of two unsigned integers without overflow
Related operations: ceiling division, rounding division
ai.googleblog.com/2006/06/extra-extra-read-all-about-it-nearly.html
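A minimal sketch of two classic overflow-free averaging variants (the helper names are ours; since C++20 the standard also provides std::midpoint in <numeric>):

#include <numeric> // std::midpoint (C++20)

// naive (a + b) / 2 can overflow; these variants cannot
unsigned avg_low(unsigned a, unsigned b) {  // requires a <= b
    return a + (b - a) / 2;
}
unsigned avg_bits(unsigned a, unsigned b) { // any a, b; rounds down
    return (a & b) + ((a ^ b) / 2);         // common bits + half the differing bits
}
// unsigned m = std::midpoint(10u, 13u);    // 11, standard since C++20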
Potentially Catastrophic Failure
51 days = 51 · 24 · 60 · 60 · 1000 = 4,406,400,000 ms, which exceeds the 32-bit unsigned range (2^32 = 4,294,967,296)
Boeing 787s must be turned off and on every 51 days to prevent ‘misleading data’
being shown to pilots
C++ Data Model (Number of Bits)
Data Model   OS                    short   int   long   long long   pointer
ILP32        Windows/Unix 32-bit    16     32     32       64          32
LLP64        Windows 64-bit         16     32     32       64          64
LP64         Linux 64-bit           16     32     64       64          64
char is always 1 byte
LP32: Windows 16-bit APIs (no longer used)
C++ Fundamental types
Fixed Width Integers 1/3
int*_t <cstdint>
C++11 provides fixed width integer types.
They have the same size on any architecture:
int8_t, uint8_t
int16_t, uint16_t
int32_t, uint32_t
int64_t, uint64_t
Good practice: prefer fixed-width integers over native types. int and
unsigned can still be used directly, as they have a consistent width across common C++ data models
Fixed Width Integers 2/3
int*_t types are not "real" types, they are merely typedefs to appropriate
fundamental types
The C++ standard does not ensure a one-to-one mapping:
There are five distinct fundamental types ( char , short , int , long ,
long long )
There are four int*_t overloads ( int8_t , int16_t , int32_t , and
int64_t )
ithare.com/c-on-using-int_t-as-overload-and-template-parameters
Fixed Width Integers 3/3
Warning: I/O Stream interprets uint8_t and int8_t as char and not as integer
values
int8_t var;
cin >> var; // read '2'
cout << var; // print '2'
int a = var * 2;
cout << a; // print '100' !!
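A common workaround (a sketch, not part of the original slide) is to perform the I/O through a wider integer and narrow explicitly:

#include <cstdint>
#include <iostream>

int main() {
    int tmp;
    std::cin >> tmp;                       // read "2" as an integer
    int8_t var = static_cast<int8_t>(tmp); // then narrow explicitly
    int a = var * 2;
    std::cout << a;                        // print 4, as intended
}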
size_t
size_t <cstddef>
size_t is an alias data type capable of storing the size of the largest possible object on
the current architecture
size_t is an unsigned integer type (of at least 16-bit)
size_t is the return type of the sizeof operator and is commonly used to represent size
measures
size_t is 4 bytes on 32-bit architectures, and 8 bytes on 64-bit architectures
C++23 adds uz / UZ literals for size_t , e.g. 5uz
ptrdiff_t
ptrdiff_t <cstddef>
ptrdiff_t is an alias data type used to store the result of the difference
between pointers or iterators
ptrdiff_t is the signed version of size_t , commonly used for computing
pointer differences
ptrdiff_t is 4 bytes on 32-bit architectures, and 8 bytes on 64-bit
architectures
C++23 adds z / Z literals for ptrdiff_t , e.g. 5z
uintptr_t
uintptr_t <cstdint>
uintptr_t (C++11) is an integer type that can be converted from and to a
void pointer
uintptr_t is an unsigned type
sizeof(uintptr_t) == sizeof(void*)
uintptr_t is an optional type of the standard and compilers may not provide it
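A small round-trip sketch (ours) of the guaranteed void* conversion:

#include <cassert>
#include <cstdint>

int main() {
    int x = 0;
    void* ptr = &x;
    auto  u    = reinterpret_cast<uintptr_t>(ptr); // pointer -> integer
    auto  back = reinterpret_cast<void*>(u);       // integer -> pointer
    assert(back == ptr); // the round trip compares equal to the original pointer
}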
Arithmetic Operation Semantics
Overflow The result of an arithmetic operation exceeds the range of the word, namely its
largest positive/negative value
Wraparound The result of an arithmetic operation is reduced modulo 2^N, where N is the
number of bits of the word
Saturation The result of an arithmetic operation is constrained within a fixed range
defined by a minimum and maximum value. If the result of an operation
exceeds this range, it is "clamped" to the boundary value
Signed/Unsigned Integer Characteristics
Without undefined behavior, signed and unsigned integers use the exact same
hardware, and they are equivalent at the binary level thanks to the two's complement
representation
#include <climits>
int a1 = INT_MAX;
int b1 = a1 + 4;      // 10000000000000000000000000000011
unsigned a2 = INT_MAX;
unsigned b2 = a2 + 4; // 10000000000000000000000000000011
However, signed and unsigned integers have different semantics in C++. The
compiler can exploit undefined behavior to optimize the code even if such operations
are well-defined at the hardware level
Signed Integer
Represent positive, negative, and zero values (ℤ)
Properties: commutative, reflexive, not associative (overflow/underflow)
Represent the human intuition of numbers
All bitwise operations are well-defined, except shift
Signed Integer - Problems
More negative values (2^31) than positive (2^31 - 1)
Even multiplication, division, and modulo by -1 can fail, e.g. INT_MIN * -1
Overflow/underflow semantics: undefined behavior
Possible behavior: overflow: (2^31 - 1) + 1 -> min, underflow: -2^31 - 1 -> max
Shift could lead to undefined behavior: x << y
undefined behavior if y is larger than the number of bits of x
implementation-defined if x is negative (until C++20)
undefined behavior if y is negative
Unsigned Integer
Represent only non-negative values (ℕ)
Properties: commutative, reflexive, associative
Discontinuity between 0 and 2^32 - 1 (wraparound point)
Wraparound semantics well-defined (modulo 2^32)
Bitwise operations are well-defined, except shift
Shift could lead to undefined behavior: x << y
undefined behavior if y is larger than the number of bits of x
When Use Signed/Unsigned Integer? 1/2
Google Style Guide
"Because of historical accident, the C++ standard also uses unsigned integers to
represent the size of containers - many members of the standards body believe this
to be a mistake, but it is effectively impossible to fix at this point"
Solution: use int64_t
max value: 2^63 - 1 = 9,223,372,036,854,775,807, or:
about 9 quintillion (9 billion billions)
about 292 years in nanoseconds
about 9 million terabytes
When Use Signed/Unsigned Integer? 2/2
When to use signed integers?
if the quantity can be mixed with negative values, e.g. subtracting byte sizes
prefer expressing non-negative values with signed integers and assertions
optimization purposes, e.g. exploit undefined behavior for overflow or in loops
When to use unsigned integers?
if the quantity can never be mixed with negative values (?)
bitmask values
optimization purposes, e.g. division, modulo
safety-critical systems, where signed integer overflow could be "non-deterministic"
Subscripts and sizes should be signed, Bjarne Stroustrup
Don’t add to the signed/unsigned mess, Bjarne Stroustrup
Integer Type Selection in C++: in Safe, Secure and Correct Code, Robert C. Seacord
Arithmetic Type Limits
Query properties of arithmetic types in C++11:
# include <limits>
std::numeric_limits<int>::max();      // 2^31 - 1
std::numeric_limits<uint16_t>::max(); // 65,535
std::numeric_limits<int>::min();      // -2^31
std::numeric_limits<unsigned>::min(); // 0
* this syntax will be explained in the next lectures
Promotion and Truncation
Promotion to a larger type keeps the sign
int16_t x = -1;
int y = x; // sign extend
cout << y; // print -1
Truncation to a smaller type is implemented as a modulo operation with respect to
the number of bits of the smaller type
int x = 65537; // 2^16 + 1
int16_t y = x; // x % 2^16
cout << y;     // print 1
int z = 32769; // 2^15 + 1 (does not fit in an int16_t)
int16_t w = z; // (int16_t) (z % 2^16) = (int16_t) 32769
cout << w;     // print -32767
Mixing Signed/Unsigned Errors 1/3
unsigned a = 10;          // array is small
int b = -1;
array[10ull + a * b] = 0; // ?
Segmentation fault!
int f(int a, unsigned b, int* array) { // array is small
    if (a > b)
        return array[a - b]; // ?
    return 0;
}
Segmentation fault for a < 0 !
// v.size() returns unsigned
for (size_t i = 0; i < v.size() - 1; i++)
    array[i] = 3; // ?
Segmentation fault for v.size() == 0 !
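A possible fix (a sketch, ours) is to avoid the wrapping subtraction, either by moving the - 1 to the other side of the comparison or by switching to a signed index:

// move the "- 1" to the other side: no unsigned wraparound
for (size_t i = 0; i + 1 < v.size(); i++)
    array[i] = 3;

// or use a signed index type
for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t) v.size() - 1; i++)
    array[i] = 3;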
Mixing Signed/Unsigned Errors 2/3
Easy case:
unsigned x = 32; // x can also be a pointer
x += 2u - 4;     // 2u - 4 = 2 + (2^32 - 4)
                 //        = 2^32 - 2
                 // (32 + (2^32 - 2)) % 2^32
cout << x;       // print 30 (as expected)
What about the following code?
uint64_t x = 32; // x can also be a pointer
x += 2u - 4;
cout << x;
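For the record: 2u - 4 is computed in 32-bit unsigned arithmetic and wraps to 2^32 - 2; converting that value to uint64_t preserves it unchanged (zero extension, no sign extension), so x becomes 32 + (2^32 - 2) = 4,294,967,326 instead of 30.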
Mixing Signed/Unsigned Errors 3/3
A real-world case:
// allocate a zero rtx vector of N elements
//
// sizeof(struct rtvec_def) == 16
// sizeof(rtunion) == 8
rtvec rtvec_alloca(int n) {
    rtvec rt;
    int i;
    rt = (rtvec) obstack_alloc(
             rtl_obstack,
             sizeof(struct rtvec_def) + ((n - 1) * sizeof(rtunion)));
    // ...
    return rt;
}
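Note that for n == 0 the signed expression n - 1 evaluates to -1, which is converted to a huge unsigned value when multiplied by the unsigned sizeof(rtunion); the requested size then comes out "right" only through accidental wraparound, and code like this is fragile under compilers that reason about overflow.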
Garbage In, Garbage Out: Arguing about Undefined Behavior with Nasal Daemons,
Chandler Carruth, CppCon 2016
Undefined Behavior 1/7
The C++ standard does not prescribe any specific behavior (undefined behavior) for
several integer/unsigned arithmetic operations
Signed integer overflow/underflow
int x = std::numeric_limits<int>::max() + 20; // undefined behavior!!
More negative values than positive
int x = std::numeric_limits<int>::max() * -1; // (2^31 - 1) * -1
cout << x;                                    // -2^31 + 1, ok
int y = std::numeric_limits<int>::min() * -1; // -2^31 * -1 -> 2^31, overflow!!
cout << y;                   // hard to see in complex examples
The Usual Arithmetic Confusions
Undefined Behavior 2/7
Initializing an integer with a value larger than its range is undefined behavior
int z = 3000000000; // undefined behavior!!
Bitwise operations on signed integer types with negative values are undefined
behavior
int y = -1 << 12; // undefined behavior!!, until C++20
int z = 1 << -12; // undefined behavior!!
Shifts larger than the number of bits of the data type are undefined behavior, even for unsigned
unsigned y = 1u << 32u; // undefined behavior!!
Undefined behavior in implicit conversions
uint16_t a = 65535; // 0xFFFF
uint16_t b = 65535; // 0xFFFF   expected: 4'294'836'225
cout << (a * b);    // print '-131071'  undefined behavior!! (int overflow)
Undefined Behavior - Signed Overflow Example 1 3/7
# include <climits>
# include <cstdio>
void f(int* ptr, int pos) {
    pos++;
    if (pos < 0) // <-- the compiler could assume that signed overflow never
        return;  //     happens and "simplify" away the condition check
    ptr[pos] = 0;
}
int main() {                // the code compiled with optimizations, e.g. -O3,
    int* tmp = new int[10]; // leads to segmentation faults with clang, while
    f(tmp, INT_MAX);        // it terminates correctly with gcc
    printf("%d\n", tmp[0]);
}
Undefined Behavior - Signed Overflow Example 2 4/7
fs/open.c of the Linux kernel
int do_fallocate(..., loff_t offset, loff_t len) {
    inode *inode = ...;
    if (offset < 0 || len <= 0)
        return -EINVAL;
    /* Check for wrap through zero too */
    if ((offset + len > inode->i_sb->s_maxbytes) || (offset + len < 0))
        return -EFBIG; // the compiler is able to infer that both 'offset' and
    ...                // 'len' are non-negative and can eliminate this check,
}                      // without verifying integer overflow
Undefined Behavior - Division by Zero Example 5/7
src/backend/utils/adt/int8.c of PostgreSQL
if (arg2 == 0) {
    ereport(ERROR, (errcode(ERRCODE_DIVISION_BY_ZERO), // the compiler is not aware
                    errmsg("division by zero")));      // that this function
}                                                      // doesn't return
/* No overflow is possible */
PG_RETURN_INT32((int32) arg1 / arg2); // the compiler assumes that the divisor is
                                      // non-zero and can move this statement to
                                      // the top (always executed)
Undefined Behavior: What Happened to My Code?
Undefined Behavior - Implicit Overflow Example 6/7
Even worse example:
# include <iostream>
int main() {
    for (int i = 0; i < 4; ++i)
        std::cout << i * 1000000000 << std::endl;
}
// with optimizations, it is an infinite loop
// --> 1000000000 * i overflows INT_MAX: undefined behavior!!
// the compiler translates the multiplication by a constant into an addition
Why does this loop produce undefined behavior?
Undefined Behavior - Common Loops 7/7
Is the following loop safe?
void f(int size) {
    for (int i = 1; i < size; i += 2)
        ...
}
What happens if size is equal to INT_MAX ?
How to make the previous loop safe?
i >= 0 && i < size is not the solution because of the undefined behavior of
signed overflow
Can we generalize the solution when the increment is i += step ?
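One possible overflow-safe formulation (a sketch, not the slides' official solution) tests the remaining distance before incrementing, so that i + step is never computed when it could exceed INT_MAX:

void f(int size, int step) { // assumes step > 0
    for (int i = 1; i < size; ) {
        // ... loop body ...
        if (i > size - step) // size - step cannot overflow: both are positive here
            break;
        i += step;           // guaranteed i + step <= size, so no signed overflow
    }
}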
Overflow / Underflow
Detecting wraparound for unsigned integral types is not trivial
// some examples
bool is_add_overflow(unsigned a, unsigned b) {
    return (a + b) < a || (a + b) < b;
}
bool is_mul_overflow(unsigned a, unsigned b) {
    unsigned x = a * b;
    return a != 0 && (x / a) != b;
}
Detecting overflow/underflow for signed integral types is even harder and must be
checked before performing the operation
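A hedged sketch of such a pre-check for signed addition (the helper name is ours; GCC and Clang also provide the __builtin_add_overflow family):

# include <limits>
// check *before* computing a + b: the addition itself would be undefined behavior
bool is_signed_add_overflow(int a, int b) {
    constexpr int max_v = std::numeric_limits<int>::max();
    constexpr int min_v = std::numeric_limits<int>::min();
    return (b > 0 && a > max_v - b) || // a + b would exceed INT_MAX
           (b < 0 && a < min_v - b);   // a + b would go below INT_MIN
}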
Saturation Arithmetic
C++26 adds four arithmetic functions and a cast to perform saturation arithmetic with
integer types in the <numeric> library. In other words, the undefined behavior or the
wraparound behavior on overflow/underflow is replaced by saturation, namely clamping
to the minimum or maximum value of the type
T add_sat(T x, T y)
T sub_sat(T x, T y)
T mul_sat(T x, T y)
T div_sat(T x, T y)
R saturate_cast<R>(T x)
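A possible pre-C++26 emulation of add_sat for int (a sketch, under the same clamping semantics):

# include <limits>
int add_sat_emulated(int x, int y) {
    constexpr int max_v = std::numeric_limits<int>::max();
    constexpr int min_v = std::numeric_limits<int>::min();
    if (y > 0 && x > max_v - y) return max_v; // would overflow  -> clamp to max
    if (y < 0 && x < min_v - y) return min_v; // would underflow -> clamp to min
    return x + y;                             // safe: no undefined behavior
}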
Floating-point Types
and Arithmetic
IEEE Floating-Point Standard
IEEE754 is the technical standard for floating-point arithmetic
The standard defines the binary format, operation behavior, rounding rules, exception
handling, etc.
First Release: 1985
Second Release: 2008. Added 16-bit, 128-bit, and 256-bit floating-point types
Third Release: 2019. Specified min/max behavior
see The IEEE Standard 754: One for the History Books
IEEE754 technical document:
754-2019 - IEEE Standard for Floating-Point Arithmetic
In general, C/C++ adopts the IEEE754 floating-point standard:
en.cppreference.com/w/cpp/types/numeric_limits/is_iec559
32/64-bit Floating-Point
IEEE754 Single-precision (32-bit) float :
Sign: 1-bit | Exponent: 8-bit | Mantissa (or significand): 23-bit
IEEE754 Double-precision (64-bit) double :
Sign: 1-bit | Exponent: 11-bit | Mantissa (or significand): 52-bit
128/256-bit Floating-Point
IEEE754 Quad-precision (128-bit) std::float128_t C++23 :
Sign: 1-bit | Exponent: 15-bit | Mantissa (or significand): 112-bit
IEEE754 Octuple-precision (256-bit) (not standardized in C++):
Sign: 1-bit | Exponent: 19-bit | Mantissa (or significand): 236-bit
16-bit Floating-Point
IEEE754 16-bit floating-point ( std::float16_t , IEEE binary16) C++23 GPU, Arm7:
Sign: 1-bit | Exponent: 5-bit | Mantissa: 10-bit
Google 16-bit floating-point ( std::bfloat16_t ) C++23 TPU, GPU, Arm8:
Sign: 1-bit | Exponent: 8-bit | Mantissa: 7-bit
half-precision-arithmetic-fp16-versus-bfloat16
8-bit Floating-Point (Non-Standardized in C++/IEEE)
E4M3:
Sign: 1-bit | Exponent: 4-bit | Mantissa: 3-bit
E5M2:
Sign: 1-bit | Exponent: 5-bit | Mantissa: 2-bit
Floating Point Formats for Machine Learning, IEEE draft
FP8 Formats for Deep Learning, Intel, Nvidia, Arm
Other Real Value Representations (Non-standardized in C++/IEEE) 1/2
TensorFloat-32 (TF32) Specialized floating-point format for deep learning
applications
Posit (John Gustafson, 2017), also called unum III (universal number), represents
floating-point values with a variable-width exponent and mantissa.
It is implemented in experimental platforms
NVIDIA Hopper Architecture In-Depth
Beating Floating Point at its Own Game: Posit Arithmetic
Posits, a New Kind of Number, Improves the Math of AI
Comparing posit and IEEE-754 hardware cost
Other Real Value Representations (Non-standardized in C++/IEEE) 2/2
Microscaling Formats (MX) Specification for low-precision floating-point
formats defined by AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm.
It includes FP8, FP6, FP4, (MX)INT8
Fixed-point representation has a fixed number of digits after the radix point
(decimal point). The gaps between adjacent numbers are always equal. The range
of their values is significantly limited compared to floating-point numbers.
It is widely used on embedded systems
OCP Microscaling Formats (MX) Specification
Floating-point Representation 1/2
Floating-point number:
Radix (or base): β
Precision (number of digits): p
Exponent (magnitude): e
Mantissa: M

n = M × β^e, where M has p digits

IEEE754: n = 1.M × 2^e

float f1 = 1.3f;   // 1.3
float f2 = 1.1e2f; // 1.1 * 10^2
float f3 = 3.7E4f; // 3.7 * 10^4
float f4 = .3f;    // 0.3
double d1 = 1.3;   // without "f"
double d2 = 5E3;   // 5 * 10^3
Floating-point Representation 2/2
Exponent Bias
In IEEE754 floating-point numbers, the exponent value is offset from the actual value
by the exponent bias
The exponent is stored as an unsigned value suitable for comparison
Floating-point values are lexicographically ordered
For a single-precision number, the exponent is stored in the range [1, 254] (0 and 255
have special meanings), and is biased by subtracting 127 to get an exponent value in the
range [-126, +127]

Example: 0 10000111 11000000000000000000000
exponent: 2^(135-127) = 2^8
mantissa: 1/2^1 + 1/2^2 = 0.5 + 0.25 = 0.75 -> normal -> 1.75
value: +1.75 * 2^8 = 448.0
Floating-point - Normal/Denormal 1/2
Normal number
A normal number is a floating-point value whose exponent field has at least one
bit set (and is not all 1s, which is reserved for inf/NaN); its mantissa carries an
implicit leading 1
Denormal number
Denormal (or subnormal) numbers fill the underflow gap around zero in
floating-point arithmetic. Any non-zero number with magnitude smaller than the
smallest normal number is denormal
A denormal number is a floating-point value represented with all 0s in
the exponent and a non-zero mantissa
Floating-point - Normal/Denormal 2/2
[Figure: spacing of representable values near zero, with and without denormal numbers; denormals fill the gap between zero and the smallest normal value]
Floating-point representation, by Carl Burch
Infinity 1/2
Infinity
In the IEEE754 standard, inf (infinity value) is a numeric data type value that
exceeds the maximum (or minimum) representable value
Operations generating inf :
±∞ ⊗ ±∞
±∞ ⊗ (±finite value ≠ 0)
finite value op finite value > max value (overflow)
finite value ⊘ ±0
There is a single binary representation each for +inf and -inf
Comparison: (inf == finite value) -> false
(±inf == ±inf) -> true
Infinity 2/2
cout << 5.0 / 0.0;  // print "inf"
cout << -5.0 / 0.0; // print "-inf"
auto inf = std::numeric_limits<float>::infinity();
cout << (-0.0 == 0.0);                     // true, 0 == 0
cout << ((5.0f / inf) == (-5.0f / inf));   // true, +0 == -0
cout << (10e40f == (10e40f + 9999999.0f)); // true, inf == inf
cout << (10e40 == (10e40f + 9999999.0f));  // false, 10e40 != inf
Not a Number (NaN) 1/2
NaN
In the IEEE754 standard, NaN (not a number) is a numeric data type value
representing an undefined or non-representable value
Floating-point operations generating NaN :
operations with a NaN as at least one operand
±∞ ⊖ ±∞, 0 ⊗ ±∞
0 ⊘ 0, ∞ ⊘ ∞
√x, log(x) for x < 0
sin⁻¹(x), cos⁻¹(x) for x < -1 or x > 1
Comparison: (NaN == x) -> false, for every x
(NaN == NaN) -> false
Not a Number (NaN) 2/2
There are many binary representations for NaN (e.g. 2^24 - 2 for float )
The specific (bitwise) NaN value returned by an operation is implementation/compiler
specific
cout << 0 / 0;     // undefined behavior (integer division)
cout << 0.0 / 0.0; // print "nan" or "-nan" (quiet NaN)
Machine Epsilon
Machine epsilon
Machine epsilon ε (or machine accuracy) is defined as the smallest number that
can be added to 1.0 to give a result other than 1.0
IEEE754 single precision: ε = 2^-23 ≈ 1.19209 × 10^-7
IEEE754 double precision: ε = 2^-52 ≈ 2.22045 × 10^-16
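A small sketch (ours) that derives single-precision ε at run time, assuming strict IEEE float evaluation and round-to-nearest:

# include <iostream>
int main() {
    float eps = 1.0f;
    while (1.0f + eps / 2.0f != 1.0f) // halve while it still perturbs 1.0
        eps /= 2.0f;
    std::cout << eps; // print 1.19209e-07, i.e. 2^-23
}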
Units in the Last Place (ULP)
ULP
Units in the Last Place is the gap between two consecutive floating-point numbers
ULP(p, e) = β^(e-(p-1)), in binary: 2^(e-(p-1))
Example:
β = 10, p = 3
π = 3.1415926...  x = 3.14 × 10^0
ULP(3, 0) = 10^-2 = 0.01
Relation with ε:
ε = ULP(p, 0)
ULP_x = ε · β^(e_x)
Floating-Point Representation of a Real Number
The machine floating-point representation fl(x) of a real number x is expressed as
fl(x) = x · (1 + δ), where δ is a small constant
The approximation of a real number x has the following properties:
Absolute Error: |fl(x) - x| ≤ 1/2 · ULP_x
Relative Error: |fl(x) - x| / |x| ≤ 1/2 · ε
Floating-point - Cheatsheet 1/3
NaN (mantissa ≠ 0):                          11111111 ***********************
± infinity:                                  11111111 00000000000000000000000
Lowest/Largest (±3.40282 × 10^38):           11111110 11111111111111111111111
Minimum normal (±1.17549 × 10^-38):          00000001 00000000000000000000000
Denormal (< 2^-126, minimum ±1.4 × 10^-45):  00000000 ***********************
±0:                                          00000000 00000000000000000000000
Floating-point - Cheatsheet 2/3
                   E4M3                    E5M2                    half
Exponent           4-bit [0*-15] (no inf)  5-bit [0*-30]           5-bit [0*-30]
Bias               7                       15                      15
Mantissa           3-bit                   2-bit                   10-bit
Largest (±)        1.75 · 2^8 = 448        1.75 · 2^15 = 57,344    ≈ 2^16 = 65,504
Smallest (±)       2^-6 = 0.015625         2^-14 ≈ 0.00006         2^-14 ≈ 0.00006
Smallest denormal  2^-9 = 0.001953125      2^-16 ≈ 1.5258 · 10^-5  2^-24 ≈ 6.0 · 10^-8
Epsilon            2^-3 = 0.125            2^-2 = 0.25             2^-10 ≈ 0.00098
Floating-point - Cheatsheet 3/3
                   bfloat16                float                   double
Exponent           8-bit [0*-254]          8-bit [0*-254]          11-bit [0*-2046]
Bias               127                     127                     1023
Mantissa           7-bit                   23-bit                  52-bit
Largest (±)        ≈ 2^128 = 3.4 · 10^38   ≈ 2^128 = 3.4 · 10^38   ≈ 2^1024 = 1.8 · 10^308
Smallest (±)       2^-126 ≈ 1.2 · 10^-38   2^-126 ≈ 1.2 · 10^-38   2^-1022 ≈ 2.2 · 10^-308
Smallest denormal  /                       2^-149 ≈ 1.4 · 10^-45   2^-1074 ≈ 4.9 · 10^-324
Epsilon            2^-7 ≈ 0.0078           2^-23 ≈ 1.2 · 10^-7     2^-52 ≈ 2.2 · 10^-16
Floating-point - Limits
# include <limits>
// T: float or double
std::numeric_limits<T>::max();        // largest value
std::numeric_limits<T>::lowest();     // lowest value (C++11)
std::numeric_limits<T>::min();        // smallest positive (normal) value
std::numeric_limits<T>::denorm_min(); // smallest positive (denormal) value
std::numeric_limits<T>::epsilon();    // epsilon value
std::numeric_limits<T>::infinity();   // infinity
std::numeric_limits<T>::quiet_NaN();  // NaN
Floating-point - Useful Functions
# include <cmath> // C++11
bool std::isnan(T value);    // check if value is NaN
bool std::isinf(T value);    // check if value is ±infinity
bool std::isfinite(T value); // check if value is not NaN and not ±infinity
bool std::isnormal(T value); // check if value is normal
T    std::ldexp(T x, int p); // exponent shift: x * 2^p
int  std::ilogb(T value);    // extract the exponent of value
Floating-point Arithmetic Properties 1/3
Floating-point operations are written:
⊕ addition
⊖ subtraction
⊗ multiplication
⊘ division
while op ∈ {+, -, *, /} denotes the corresponding exact-precision operation
Floating-point Arithmetic Properties 2/3
(P1) In general, a ⊕ b ≠ a + b (the same holds for the other operations)
(P2) Not reflexive: a ≠ a in general
Reflexive without NaN
(P3) Not commutative: a ⊕ b ≠ b ⊕ a in general
Commutative without NaN (NaN ≠ NaN)
(P4) In general, not associative: (a ⊕ b) ⊕ c ≠ a ⊕ (b ⊕ c)
even excluding NaN and inf in intermediate computations
(P5) In general, not distributive: (a ⊕ b) ⊗ c ≠ (a ⊗ c) ⊕ (b ⊗ c)
even excluding NaN and inf in intermediate computations
Floating-point Arithmetic Properties 3/3
(P6) Identity on operations is not ensured
(a ⊕ b) ⊖ b ≠ a
(a ⊘ b) ⊗ b ≠ a
(P7) Overflow/Underflow: floating-point has
"saturation" values inf, -inf
as opposed to integer arithmetic with wraparound behavior
Special Values Behavior
Zero behavior
a ⊘ ±0 = ±inf, a ∈ {finite, a ≠ 0} [IEEE-754]; undefined behavior in C++
0 ⊘ 0 = NaN [IEEE-754]; undefined behavior in C++
0 ⊗ ±inf = NaN
+0 == -0, but they have a different binary representation
Inf behavior
±inf ⊗ a = ±inf, a ∈ {finite, a ≠ 0}
inf ⊕ inf = inf, inf ⊗ inf = inf
inf ⊖ inf = NaN, inf ⊘ inf = NaN
±inf ⊕ ∓inf = NaN
±inf ⊘ a = ±inf, a ∈ {finite}
NaN behavior
NaN op a = NaN, for any operation and any a
NaN ≠ a, for every a (including NaN itself)
Floating-Point Undefined Behavior
Division by zero:
e.g., 10^8 / 0.0
Conversion to a narrower floating-point type:
e.g., 0.1 double -> float
Conversion from floating-point to integer:
e.g., 10^8 float -> int
Operations on signaling NaNs: arithmetic operations that cause an "invalid
operation" exception to be signaled
e.g., inf - inf
Incorrectly assuming IEEE-754 compliance for all platforms:
e.g., some embedded Linux distributions on ARM
Detect Floating-point Errors 1/2
C++11 allows determining if a floating-point exceptional condition has occurred by
using the floating-point exception facilities provided in <cfenv>
#include <cfenv>
// MACRO
FE_DIVBYZERO  // division by zero
FE_INEXACT    // rounding error
FE_INVALID    // invalid operation, i.e. an operation producing NaN
FE_OVERFLOW   // overflow (result saturates to ±inf)
FE_UNDERFLOW  // underflow (result too small to be a normal value)
FE_ALL_EXCEPT // all exceptions
// functions
std::feclearexcept(FE_ALL_EXCEPT); // clear the exception status
std::fetestexcept(<macro>);        // returns a value != 0 if an
                                   // exception has been detected
Detect Floating-point Errors 2/2
#include <cfenv> // floating point exceptions
#include <iostream>
#pragma STDC FENV_ACCESS ON // tell the compiler to manipulate the floating-point
// environment (not supported by all compilers)
// gcc: yes, clang: no
int main() {
std::feclearexcept(FE_ALL_EXCEPT); // clear
auto x = 1.0 / 0.0; // all compilers
std::cout << (bool) std::fetestexcept(FE_DIVBYZERO); // print true
std::feclearexcept(FE_ALL_EXCEPT); // clear
auto x2 = 0.0 / 0.0; // all compilers
std::cout << (bool) std::fetestexcept(FE_INVALID); // print true
std::feclearexcept(FE_ALL_EXCEPT); // clear
auto x4 = 1e38f * 10; // gcc: ok
std::cout << (bool) std::fetestexcept(FE_OVERFLOW); // print true
}
see What is the difference between quiet NaN and signaling NaN?
Floating-point Issues
Some Examples... 1/4
Ariane 5: data conversion from a 64-bit floating-point value to a 16-bit signed
integer -> $137 million
Patriot Missile: small chopping error accumulated at each operation, over 100
hours of activity -> 28 deaths
Some Examples... 2/4
Integer types are more accurate than floating-point types for large numbers
cout << 16777217;          // print 16777217
cout << (int) 16777217.0f; // print 16777216!!
cout << (int) 16777217.0;  // print 16777217, double ok
float numbers are different from double numbers
cout << (1.1 != 1.1f);     // print true !!!
Some Examples... 3/4
The floating-point precision is finite!
cout << setprecision(20);
cout << 3.33333333f; // print 3.333333254!!
cout << 3.33333333;  // print 3.333333333
cout << (0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1); // print 0.59999999999999998
Floating-point arithmetic is not associative
cout << ((0.1 + (0.2 + 0.3)) == ((0.1 + 0.2) + 0.3)); // print false
IEEE754 floating-point computation guarantees deterministic output,
namely the exact bitwise value for each run, if and only if the order of the operations
is always the same
-> same result on any machine and for all runs
Some Examples... 4/4
"Using a double-precision floating-point value, we can represent easily the
number of atoms in the universe.
If your software ever produces a number so large that it will not fit in a
double-precision floating-point value, chances are good that you have a bug"
Daniel Lemire, Prof. at the University of Quebec
"NASA uses just 15 digits of π to calculate interplanetary travel.
With 40 digits, you could calculate the circumference of a circle the size of the
visible universe with an accuracy that would fall by less than the diameter of
a single hydrogen atom"
Latest in space, Twitter
Number of atoms in the universe versus floating-point values
Floating-point Algorithms
addition algorithm (simplified):
(1) Compare the exponents of the two numbers. Shift the smaller number to the right until its
exponent matches the larger exponent
(2) Add the mantissas
(3) Normalize the sum if needed (shift the mantissa and adjust the exponent by 1)
multiplication algorithm (simplified):
(1) Multiply the mantissas. The number of bits of the result is twice the size of the operands
(46 + 2 bits, with +2 for implicit normalization)
(2) Normalize the product if needed (shift the mantissa and adjust the exponent by 1)
(3) Add the exponents
fused multiply-add (fma):
Recent architectures (including GPUs) provide fma to compute a multiplication and an
addition in a single instruction (emitted by the compiler in most cases)
The rounding error of fma(x, y, z) (one rounding) is smaller than that of (x ⊗ y) ⊕ z (two roundings)
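A small illustration (ours, using the standard std::fma from <cmath>): with a fused operation the rounding error of a product can be recovered exactly:

# include <cmath>
# include <cstdio>
int main() {
    double a = 1.0 + 1e-8;
    double p = a * a;              // product rounded once to double
    double e = std::fma(a, a, -p); // exact a*a minus p: the rounding error of p
    std::printf("%.17g\n%.17g\n", p, e); // e is tiny but generally non-zero
}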
Catastrophic Cancellation 1/5
Catastrophic Cancellation
Catastrophic cancellation (or loss of significance) refers to the loss of relevant
information in a floating-point computation that cannot be recovered
Two cases:
(C1) a ± b, where a ≫ b or b ≫ a. The value (or part of the value) of the smaller
number is lost
(C2) a ⊖ b, where a, b are approximations of exact values and a ≈ b, namely a loss of
precision in both a and b. a ⊖ b cancels most of the relevant digits of the result
because a ≈ b. It implies a small absolute error but a large relative error
Catastrophic Cancellation (case 1) - Granularity 2/5
[Figure: granularity (gap between consecutive representable values) of float versus integer magnitude; the gap reaches 1 at the intersection 16,777,216 = 2^24]
Catastrophic Cancellation (case 1) 3/5
How many iterations does the following code perform?
while (x > 0)
    x = x - y;
How many iterations?
float:  x = 10,000,000  y = 1     -> 10,000,000
float:  x = 30,000,000  y = 1     -> does not terminate
float:  x = 200,000     y = 0.001 -> does not terminate
bfloat: x = 256         y = 1     -> does not terminate !!
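The non-terminating float cases follow from the granularity above: at 30,000,000 (beyond 2^24) the gap between consecutive float values is 2, and near 200,000 it is 2^-6 ≈ 0.016 > 0.001, so in both cases x ⊖ y rounds back to x and the loop makes no progress; bfloat16, with only 7 mantissa bits, hits the same effect at far smaller magnitudes.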
Catastrophic Cancellation (case 1) 4/5
Floating-point increment
float x = 0.0f;
for (int i = 0; i < 20000000; i++)
    x += 1.0f;
What is the value of x at the end of the loop?
Ceiling division ⌈a / b⌉
// std::ceil((float) 101 / 2.0f) -> 50.5f -> 51
float x = std::ceil((float) 20000001 / 2.0f);
What is the value of x ?
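For exact results, a hedged alternative (helper name is ours) is to stay in integer arithmetic:

// integer-only ceiling division avoids float granularity entirely
unsigned ceil_div(unsigned a, unsigned b) {
    return a / b + (a % b != 0); // no overflow, unlike (a + b - 1) / b
}
// ceil_div(20000001u, 2u) == 10000001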
Catastrophic Cancellation (case 2) 5/5
Let's solve a quadratic equation:
x_{1,2} = (-b ± √(b² - 4ac)) / (2a)
x² + 5000x + 0.25 = 0
(-5000 + std::sqrt(5000.0f * 5000.0f - 4.0f * 1.0f * 0.25f)) / 2 // x2
(-5000 + std::sqrt(25000000.0f - 1.0f)) / 2 // catastrophic cancellation (C1)
(-5000 + std::sqrt(25000000.0f)) / 2
(-5000 + 5000) / 2 = 0 // catastrophic cancellation (C2)
// correct result: -0.00005!!
relative error: |0 - (-0.00005)| / 0.00005 = 100%
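A numerically stable variant (a sketch, ours): compute the large-magnitude root first, then derive the other one from the product of the roots x1 · x2 = c/a, which avoids subtracting nearly equal values:

# include <cmath>
# include <cstdio>
int main() {
    float a = 1.0f, b = 5000.0f, c = 0.25f;
    float q  = -0.5f * (b + std::copysign(std::sqrt(b * b - 4.0f * a * c), b));
    float x1 = q / a; // root with the larger magnitude, no cancellation
    float x2 = c / q; // other root, via x1 * x2 = c / a
    std::printf("%g %g\n", x1, x2); // about -5000 and -5e-05
}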
Floating-point Comparison 1/3
The problem
cout << (0.11f + 0.11f < 0.22f); // print true!!
cout << (0.1f + 0.1f > 0.2f);    // print true!!
Do not use absolute error margins!!
bool areFloatNearlyEqual(float a, float b) {
    if (std::abs(a - b) < epsilon) // epsilon is fixed by the user
        return true;
    return false;
}
Problems:
A fixed epsilon "looks small", but it could be too large when the numbers being compared
are very small
If the compared numbers are very large, the epsilon could end up being smaller than the
smallest rounding error, so that the comparison always returns false
Floating-point Comparison 2/3
Solution: use the relative error |a - b| / |b| < ε
bool areFloatNearlyEqual(float a, float b) {
    if (std::abs(a - b) / std::abs(b) < epsilon) // epsilon is fixed
        return true;
    return false;
}
Problems:
a = 0, b = 0: the division is evaluated as 0.0 / 0.0, and the whole if statement is (NaN <
epsilon), which always returns false
b = 0: the division is evaluated as abs(a) / 0.0, and the whole if statement is (+inf <
epsilon), which always returns false
a and b very small: the result should be true, but the division by b may produce
wrong results
It is not commutative: we always divide by b
Floating-point Comparison 3/3
Possible solution: |a - b| / max(|a|, |b|) < ε
bool areFloatNearlyEqual(float a, float b) {
    constexpr float normal_min = std::numeric_limits<float>::min();
    constexpr float relative_error = <user_defined>;
    if (!std::isfinite(a) || !std::isfinite(b)) // a, b = ±∞ or NaN
        return false;
    float diff = std::abs(a - b);
    // if "a" and "b" are near to zero, the relative error is less effective
    if (diff <= normal_min) // or also: user_epsilon * normal_min
        return true;
    float abs_a = std::abs(a);
    float abs_b = std::abs(b);
    return (diff / std::max(abs_a, abs_b)) <= relative_error;
}
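Possible behavior, assuming for instance relative_error = 1e-5f:

cout << areFloatNearlyEqual(0.1f + 0.1f, 0.2f); // true: tiny relative difference
cout << areFloatNearlyEqual(1e-39f, -1e-39f);   // true: both denormal, near zero
cout << areFloatNearlyEqual(1.0f, 1.1f);        // false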
Minimize Error Propagation - Summary
Prefer multiplication/division over addition/subtraction
Try to reorganize the computation to keep numbers of similar scale close together
(e.g. by sorting them)
Consider flushing to zero very small numbers (under a threshold). Common
application: iterative algorithms
Scaling by a power of two is safe (exact)
Switch to log scale: multiplication becomes addition, and division becomes
subtraction
Use a compensation algorithm like Kahan summation (sketched below), Dekker's
FastTwoSum, or Rump's AccSum
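A minimal sketch of Kahan compensated summation (ours; note that aggressive options like -ffast-math can optimize the compensation away):

# include <vector>
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float c   = 0.0f;      // running compensation (lost low-order bits)
    for (float v : values) {
        float y = v - c;   // apply the correction from the previous step
        float t = sum + y; // low-order bits of y may be lost here
        c = (t - sum) - y; // recover the lost part (algebraically zero)
        sum = t;
    }
    return sum;
}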
References
Suggest readings:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Do Developers Understand IEEE Floating Point?
Yet another floating point tutorial
Unavoidable Errors in Computing
Floating-point Comparison readings:
The Floating-Point Guide - Comparison
Comparing Floating Point Numbers, 2012 Edition
Some comments on approximately equal FP comparisons
Comparing Floating-Point Numbers Is Tricky
Floating point tools:
IEEE754 visualization/converter
Find and fix floating-point problems
System/360 Model 44
Ken Shirriff: Want to adjust your computer’s floating point precision by turning
a knob? You could do that on the System/360 Model 44
On Floating-Point