Modern C++
Programming
22. Performance Optimization III
Non-Coding Optimizations and Benchmarking
Federico Busato
2024-03-29
Table of Contents
1 Compiler Optimizations
About the Compiler
Compiler Optimization Flags
Floating-point Optimization Flags
Linker Optimization Flags
Architecture Flags
Help the Compiler to Produce Better Code
Profile Guided Optimization (PGO)
Post-Processing Binary Optimizer
1/76
Table of Contents
2 Compiler Transformation Techniques
Basic Compiler Transformations
Loop Unswitching
Loop Fusion
Loop Fission
Loop Interchange
Loop Tiling
3 Libraries and Data Structures
External Libraries
2/76
Table of Contents
4 Performance Benchmarking
What to Test?
Workload/Dataset Quality
Cache Behavior
Stable CPU Performance
Multi-Thread Considerations
Program Memory Layout
Measurement Overhead
Compiler Optimizations
Metric Evaluation
3/76
Table of Contents
5 Profiling
gprof
uftrace
callgrind
cachegrind
perf Linux profiler
4/76
Table of Contents
6 Parallel Computing
Concurrency vs. Parallelism
Performance Scaling
Gustafson’s Law
Parallel Programming Languages
5/76
Compiler
Optimizations
About Compiler Optimizations 1/3
"I always say the purpose of optimizing compilers is not to make code
run faster, but to prevent programmers from writing utter **** in the
pursuit of making it run faster"
Rich Felker, musl-libc (libc alternative)
6/76
About Compiler Optimizations 2/3
bool isEven(int number) {
    int numberCompare = 0;
    bool even = true;
    while (number != numberCompare) {
        even = !even;
        numberCompare++;
    }
    return even;
}

bool isEven(int number) {
    return !(number & 1u);
}
Exploring Clang/LLVM optimization on programming horror
7/76
About Compiler Optimizations 3/3
On the other hand, having a good compiler does not mean that it can fully optimize
any code:
The compiler does not "understand" the code, as opposed to a human
The compiler is conservative and applies optimizations only if they are safe and do
not affect the correctness of the computation
The compiler is full of models and heuristics that may not match a specific
situation
The compiler cannot spend a large amount of time on code optimization
The compiler may consider targets other than performance, e.g. binary size
8/76
About the Compiler 1/2
Important advice: Use an updated version of the compiler
Newer compilers produce better/faster code
- Effective optimizations
- Support for newer CPU architectures
New warnings to avoid common errors and better support for existing
errors/warnings (e.g. code highlights)
Faster compilation, less memory usage
Fewer compiler bugs: compilers are very complex and have many bugs
Use an updated version of the linker: e.g. for Link Time Optimization,
gold linker or LLVM linker lld
9/76
About the Compiler 2/2
Which compiler?
Answer: It depends on the code and on the processor
example: GCC 9 vs. Clang 8
Some compilers can produce optimized code for specific architectures:
Intel Compiler (commercial): Intel processors
IBM XL Compiler (commercial): IBM processors/system
Nvidia NVC++ Compiler (free/commercial): Multi-core processors/GPUs
gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Intel Blog: gcc-x86-performance-hints
Advanced Optimization and New Capabilities of GCC 10
10/76
Compiler Optimization Flags 1/2
-O0 , /Od Disables any optimization
default behavior
fast compile time
-O1 , /O1 Enables basic optimizations
-O2 , /O2 Enables advanced optimizations
some optimization steps are expensive
can increase the binary size
-O3 Enable aggressive optimizations. Turns on all optimizations specified by
-O2, plus some more
-O3 does not guarantee to produce faster code than -O2
it could break floating-point IEEE754 rules in some non-traditional
compilers (nvc++, IBM xlc)
11/76
Compiler Optimization Flags 2/2
-O4 / -O5 It is an alias of -O3 in some compilers, or it can refer to -O3 +
inter-procedural optimizations (basic, full) and high-order
transformation (HOT) optimizer for specialized loop transformations
-Ofast Provides other aggressive optimizations that may violate strict
compliance with language standards. It includes -O3 -ffast-math
-Os , /Os Optimize for size. It enables all -O2 optimizations except those that
typically increase code size (e.g. loop unrolling)
-Oz Aggressively optimize for size
-funroll-loops Enables loop unrolling (not included in -O3 )
-fopt-info Describes optimization passes and missed optimizations
-fopt-info-missed
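A minimal usage sketch of the flags above (file and program names are placeholders):

$ g++ -O3 -funroll-loops my_prog.cpp -o my_prog
$ g++ -O2 -fopt-info-missed my_prog.cpp -o my_prog # report missed optimizations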
12/76
Floating-point Optimization Flags 1/2
In general, enabling the following flags implies less floating-point accuracy and breaks
the IEEE754 standard; the behavior is implementation dependent (not included in -O3 )
-fno-signaling-nans
-fno-trapping-math Disable floating-point exceptions
-mfma -ffp-contract=fast Force floating-point expression contraction such as
forming of fused multiply-add operations
-ffinite-math-only Disable special conditions for handling inf and NaN
-fassociative-math Assume floating-point associative behavior
13/76
Floating-point Optimization Flags 2/2
-funsafe-math-optimizations Allows breaking floating-point associativity and
enables reciprocal optimization
-ffast-math Enables aggressive floating-point optimizations. All
the previous, flush-to-zero denormal number, plus
others
Beware of fast-math
Semantics of Floating Point Math in GCC
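A minimal C++ sketch of why these flags change results: with -ffast-math (or
-fassociative-math ) the compiler may re-associate the sum below, e.g. to vectorize it,
so the result can differ from the strict left-to-right IEEE754 evaluation

#include <cstdio>

int main() {
    float data[] = {1e8f, 1.0f, -1e8f, 1.0f};
    float sum    = 0.0f;
    for (float v : data) // strict order:  ((1e8 + 1) - 1e8) + 1 == 1
        sum += v;        // re-associated: (1e8 - 1e8) + (1 + 1) == 2
    std::printf("%f\n", sum);
}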
14/76
Linker Optimization Flags
-flto Enables Link Time Optimizations (Interprocedural Optimization). The
linker merges all modules into a single combined module for
optimization
the linker must support this feature: GNU ld v2.21 or later, or the gold linker;
check with ld --version
it can significantly improve the performance
in general, it is a very expensive step, potentially even longer than compiling the
object files
-fwhole-program Assume that the current compilation unit represents the whole
program being compiled: all non-extern functions and variables are
assumed to belong only to their compilation unit
Ubuntu 21.04 To Turn On LTO Optimizations For Its Packages
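A minimal usage sketch (file names are placeholders): -flto must be passed both when
compiling and when linking

$ g++ -O2 -flto -c a.cpp
$ g++ -O2 -flto -c b.cpp
$ g++ -O2 -flto a.o b.o -o my_prog # cross-module optimization happens at link time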
15/76
Architecture Flags - 32-bits or 64-bits? 1/3
Architecture-oriented optimizations are not included in other flags ( -O3 )
-m64 In 64-bit mode the number of available registers increases from 6 to 14 general
and from 8 to 16 XMM. Also, all 64-bits x86 architectures have SSE2 extension by
default. 64-bit applications can use more than 4GB address space
-m32 32-bit mode. It should be combined with -mfpmath=sse to enable using of XMM
registers in floating point instructions (instead of stack in x87 mode). 32-bit
applications can use less than 4GB address space
It is recommended to use 64-bit mode for High-Performance Computing applications and
32-bit mode for phone and tablet applications
16/76
Architecture Flags 2/3
-march=<arch> Generates instructions for a specific processor to exploit exclusive
hardware features. <arch> represents the minimum hardware
supported by the binaries (not portable)
-mtune=<tune arch> Specifies the target microarchitecture. Generates optimized code
for a class of processors without exploiting specific hardware
features. Binaries are still compatibles with other processors, e.g.
earlier CPUs in the architecture family (maybe slower than
-march )
-mcpu=<tune arch> On x86-64, a deprecated synonym for -mtune ; on Arm, it optimizes
for both a particular architecture and microarchitecture
-mfpu=<fp hw> (Arm) Optimize for a specific floating-point hardware unit
-m<instr set> (x86-64) Optimize for a specific instruction set
17/76
Architecture Flags 3/3
<arch> armv9-a , armv7-a+neon-vfpv4 , znver4 , core2 , skylake
<tune arch> cortex-a9 , neoverse-n2 , generic , intel
<instr set> sse2 , avx512
<fp hw> neon , neon-fp-armv8
<tune arch> should always be greater than or equal to <arch>
In general, -mtune is set to generic if not specified
-march=native , -mtune=native , -mcpu=native : Allows the compiler to
determine the processor type (not always accurate)
Especially with new compilers, prefer auto-vectorization to explicit vector
intrinsics
GCC Arm options, GCC X86-64 options
Compiler flags across architectures: -march, -mtune, and -mcpu
NVIDIA Grace CPU Benchmarking Guide, Arm Vector Instructions: SVE and NEON
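A minimal usage sketch with the values listed above (file names are placeholders):

$ g++ -O3 -march=native my_prog.cpp -o my_prog # exploit the build machine ISA (not portable)
$ g++ -O3 -march=core2 -mtune=skylake my_prog.cpp -o my_prog # runs on Core2 or later, tuned for Skylake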
18/76
Help the Compiler to Produce Better Code
Group variables and functions related to each other in the same translation unit
Define global variables and functions in the translation unit in which they are
used most often
Global variables and functions that are not used by other translation units should
have internal linkage (anonymous namespace/ static functions), as shown in the sketch below
Static library linking helps the linker to optimize the code across different
modules (link-time optimizations). Dynamic linking prevents these kinds of
optimizations
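A minimal sketch of the internal-linkage advice above (names are illustrative):

// helper.cpp - one translation unit
namespace { // anonymous namespace: internal linkage
    int squared(int x) { return x * x; }
}
static int call_count = 0; // 'static' also gives internal linkage

int publicApi(int x) { // external linkage: visible to other translation units
    ++call_count;
    return squared(x);
}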
19/76
Profile Guided Optimization (PGO) 1/2
Profile Guided Optimization (PGO) is a compiler technique that aims at improving
application performance by reducing instruction-cache problems, reducing branch
mispredictions, etc. PGO provides information to the compiler about the areas of an
application that are most frequently executed
It consists of the following steps:
(1) Compile and instrument the code
(2) Run the program by exercising the most used/critical paths
(3) Compile the code again, exploiting the information produced in the previous step
The particular options to instrument and compile the code are compiler specific
20/76
Profile Guided Optimization (PGO) 2/2
GCC
$ gcc -fprofile-generate my_prog.c -o my_prog # program instrumentation
$ ./my_prog # run the program (most critical/common paths)
$ gcc -fprofile-use -O3 my_prog.c -o my_prog # use instrumentation info
Clang
$ clang++ -fprofile-instr-generate my_prog.c -o my_prog
$ ./my_prog
$ xcrun llvm-profdata merge -output=default.profdata default.profraw
$ clang++ -fprofile-instr-use=default.profdata -O3 my_prog.c -o my_prog
e.g. Firefox and Google Chrome support PGO building
21/76
PGO, LTO Performance
SPEC 2017 built with GCC 10.2 and -O2
22/76
Post-Processing Binary Optimizer 1/2
The code layout in the final binary can be further optimized with a post-link binary
optimizer and layout optimizer such as BOLT or Propeller (driven by a sampling or
instrumentation profile)
BOLT: A Practical Binary Optimizer for Data Centers and Beyond
BOLT optimization technology could bring obvious performance uplift on arm server
23/76
Post-Processing Binary Optimizer 2/2
The many faces of LLVM PGO and FDO
24/76
Polyhedral Optimizations
Polyhedral optimization is a compilation technique that
relies on the representation of programs, especially those involving
nested loops and arrays, in parametric polyhedra. Thanks to
combinatorial and geometrical optimizations on these objects, the
compiler is able to analyze and optimize the programs, including automatic
parallelization, data locality, memory management, SIMD instructions, and code
generation for hardware accelerators
Polly is a high-level loop and data-locality optimizer and optimization infrastructure
for LLVM
PLUTO is an automatic parallelization tool based on the polyhedral model
see also Using Polly with Clang
25/76
Compiler
Transformation
Techniques
Help the Compiler to Produce Better Code
Overview on compiler code generation and transformation:
Optimizations in C++ Compilers
Matt Godbolt, ACM Queue
Compiler Optimizations
26/76
Basic Compiler Transformations 1/3
Constant folding. Direct evaluation of constant expressions at compile-time
const int K = 100 * 1234 / 2;
Constant propagation. Substituting the values of known constants in
expressions at compile-time
const int K = 100 * 1234 / 2;
const int J = K * 25;
Common subexpression elimination. Avoid computing identical and redundant
expressions
int x = y * z + v;
int w = y * z + k; // "y * z" is computed only once
27/76
Basic Compiler Transformations 2/3
Induction variable elimination. Eliminate variables whose values are dependent
(induction)
for (int i = 0; i < 10; i++)
x = i * 8;
// "x" can be derived by knowing the value of "i"
Dead code elimination. Elimination of code which is executed but whose result
is never used, e.g. dead store
int a = b * c;
... // "a" is never used, "b * c" is not computed
Unreachable code elimination instead involves removing code that is never
executed
28/76
Basic Compiler Transformations 3/3
Use-define chain. Remove computations whose result is overwritten before it is ever
used (the first store below is dead)
x = i * k + l;
x = 32; // "i * k + l" is not needed
Peephole optimization. Replace a small set of low-level instructions with an
equivalent but faster sequence of instructions. The optimization can involve
pattern matching
imul eax, eax, 8 // a * 8
sal eax, 3 // replaced with a << 3 (shift)
29/76
Loop Unswitching
Loop Unswitching. Move a loop-invariant conditional outside the loop (duplicating the
loop body) to reduce loop instructions (especially branches), improve data locality,
and allow additional optimizations
for (i = 0; i < N; i++) {
if (x)
a[i] = 0;
else
b[i] = 0;
}
if (x) {
for (i = 0; i < N; i++)
a[i] = 0; // use memset
}
else {
for (i = 0; i < N; i++)
b[i] = 0; // use memset
}
30/76
Loop Fusion
Loop Fusion (jamming). Merge multiple loops to improve data locality and
perform additional optimizations
for (i = 0; i < 300; i++)
a[i] = a[i] + sqrt(i);
for (i = 0; i < 300; i++)
b[i] = b[i] + sqrt(i);
for (i = 0; i < 300; i++) {
a[i] = a[i] + sqrt(i); // sqrt(i) is computed only
b[i] = b[i] + sqrt(i); // one time
}
31/76
Loop Fission
Loop Fission (distribution). Split a loop into multiple loops to improve cache
utilization and enable additional optimizations
for (i = 0; i < size; i++) {
a[i] = b[rand()]; // cache pollution
c[i] = d[rand()];
}
for (i = 0; i < size; i++)
a[i] = b[rand()]; // better cache utilization
for (i = 0; i < size; i++)
c[i] = d[rand()];
32/76
Loop Interchange
Loop Interchange. Exchange the order of loop iterations to improve data locality
and perform additional optimizations (e.g. vectorization)
for (i = 0; i < 1000000; i++) {
for (j = 0; j < 100; j++)
a[j * x + i] = ...; // low locality
}
for (j = 0; j < 100; j++) {
for (i = 0; i < 1000000; i++)
a[j * x + i] = ...; // high locality
}
33/76
Loop Tiling
Loop Tiling (blocking, nest optimization). Partition the iterations of multiple
loops to exploit data locality
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++)
        a[j * N + i] = ...; // low locality
}

for (i = 0; i < N; i += TILE_SIZE) {
    for (j = 0; j < M; j += TILE_SIZE) {
        for (k = i; k < i + TILE_SIZE; k++) {
            for (l = j; l < j + TILE_SIZE; l++)
                a[l * N + k] = ...; // high locality
        }
    }
}
34/76
Libraries and Data
Structures
External Libraries 1/3
Consider using optimized external libraries for critical program operations
Compressed Bitmask: set algebraic operations
BitMagic Library
Roaring Bitmaps
Ordered Map/Set: B+Tree as replacement for red-black tree
STX B+Tree
Abseil B-Tree
Hash Table: replacement for std::unordered_set/map
Google Sparse/Dense Hash Table
bytell hashmap
Facebook F14 memory efficient hash table
Abseil Hashmap (2x-3x faster)
Robin Hood Hashing
Comprehensive C++ Hashmap Benchmarks 2022
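A minimal usage sketch of one of these replacements, Abseil's flat_hash_map, a drop-in
alternative to std::unordered_map (the header path follows the Abseil documentation):

#include <cstdio>
#include <string>
#include "absl/container/flat_hash_map.h" // open-addressing hash map

int main() {
    absl::flat_hash_map<std::string, int> word_count; // same interface style as std::unordered_map
    word_count["performance"] += 1;
    word_count["benchmark"]   += 2;
    for (const auto& [word, count] : word_count)
        std::printf("%s: %d\n", word.c_str(), count);
}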
35/76
External Libraries 2/3
Probabilistic Set Query: Bloom filter, XOR filter, Facebook’s Ribbon
Filter, Binary Fuse filter
Scan, print, and formatting: fmt library, scn library instead of iostream
or printf/scanf
Random generator: PCG random generator instead of Mersenne Twister or
Linear Congruential generators
Non-cryptographic hash algorithm: xxHash instead of CRC
Cryptographic hash algorithm: BLAKE3 instead of MD5 or SHA
36/76
External Libraries 3/3
Linear Algebra: Eigen, Armadillo, Blaze
Sort:
Beating Up on Qsort. Radix-sort for non-comparative elements (e.g. int ,
float )
Vectorized and performance-portable Quicksort
malloc replacement:
tcmalloc (Google)
mimalloc (Microsoft)
37/76
Libraries and Std replacements
Folly: Performance-oriented std library (Facebook)
Abseil: Open source collection of C++ libraries drawn from the most
fundamental pieces of Google’s internal codebase
Frozen: Zero-cost initialization for immutable containers, fixed-size containers,
and various algorithms.
A curated list of awesome header-only
C++ libraries
38/76
Performance
Benchmarking
Performance Benchmarking
Performance benchmarking is a non-functional test focused on measuring the efficiency
of a given task or program under a particular load
Performance benchmarking is hard!!
Main reasons:
What to test?
Workload/Dataset quality
Cache behavior
Stable CPU performance
Program memory layout
Measurement overhead
Compiler optimizations
Metric evaluation
39/76
What to Test?
1. Identify performance metrics: The metric(s) should be strongly related to the
specific problem and should allow a comparison across different systems, e.g.
elapsed time is not, in general, a good metric for measuring throughput
- Matrix multiplication: FLoating-point Operation Per Second (FLOP/S)
- Graph traversing: Edge per Second (EPS)
2. Plan performance tests: Determine which parts of the problem are relevant for
the measurement, e.g. excluding the initialization process
- Suppose a routine requires several steps and requests a memory buffer for each of
them. Memory allocations should be excluded, as a user could rely on a memory pool
40/76
Workload/Dataset Quality
1. Stress the most important cases: Rare or edge cases that are not used in
real-world applications or are far from common usage are less important, e.g. a graph
problem where no vertices are connected
2. Use datasets that are well-known in the literature and reproducible. Don't
use "self-made" datasets and, if possible, use publicly available resources
3. Use a reproducible test methodology. Try to remove sources of "noise",
e.g. if the procedure is randomized, the test should be run with the same seed. This
is not always possible, e.g. OS scheduler, atomic operations in parallel computing,
etc.
see also Reproducibility in artificial intelligence
41/76
Cache Behavior 1/2
Cache behavior is not deterministic. Different executions lead to different hit rates
After data is loaded from main memory, it remains in the cache until it
expires or is evicted to make room for new content
When the same routine is executed multiple times, the first run is much slower than
the following ones due to cache effects (warm-up run)
42/76
Cache Behavior 2/2
There is no systematic way to flush the cache. Some techniques to get more
reliable performance results are:
overwrite all data involved in the computation between runs
read/write between two buffers at least as large as the largest cache
some processors, such as Arm, provide specific instructions to invalidate the
cache ( __builtin___clear_cache() , __clear_cache() )
Note: manual cache invalidation must consider cache locality (e.g. L1 is per CPU core) and
compiler optimizations that can remove useless code (solution: use global variables and
volatile )
see: Is there a way to flush the entire CPU cache related to a program?
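A minimal C++ sketch of the buffer-overwrite technique above (the cache size constant is
an assumption; set it to at least the size of the largest cache of the target CPU):

#include <cstddef>
#include <vector>

volatile char sink; // global + volatile: keeps the writes observable

void flush_cache(std::size_t llc_size = 64 * 1024 * 1024) {
    std::vector<char> buffer(llc_size, 0);
    for (std::size_t i = 0; i < buffer.size(); i += 64) // one store per cache line
        buffer[i] = 1;
    sink = buffer[buffer.size() / 2];
}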
43/76
Stable CPU Performance 1/4
One of the main sources of fluctuation in performance measurements is unstable
CPU frequency
Dynamic frequency scaling, also known as CPU throttling, automatically decreases
the CPU frequency for:
Power saving, extending battery life
Decreasing fan noise and chip heat
Preventing damage from running at high frequency
Modern processors also comprise advanced technologies to automatically raise CPU
operating frequency when demanding tasks are running (e.g. Intel® Turbo
Boost). Such technologies allow processors to run at the highest possible frequency
for a limited amount of time, depending on different factors like the type of workload,
the number of active cores, power consumption, temperature, etc.
44/76
Stable CPU Performance 2/4
Get CPU info:
CPU characteristics:
lscpu
Monitor CPU clocks in real-time:
cpupower monitor -m Mperf
Get CPU clocks info:
cpupower frequency-info
see cpufreq governors
45/76
Stable CPU Performance 3/4
Disable Turbo Boost
echo 1 >> /sys/devices/system/cpu/intel_pstate/no_turbo
Disable hyper threading
echo 0 > /sys/devices/system/cpu/cpuX/online
or through BIOS
Use performance scaling governor
sudo cpupower frequency-set -g performance
Set CPU affinity (CPU-Program binding) taskset -c <cpu id> <program>
Set process priority sudo nice -n -5 taskset -c <cpu id> <process>
46/76
Stable CPU Performance 4/4
Disable address space randomization
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
Drop file system cache (if the benchmark involves IO ops)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
CPU isolation
don't schedule processes and don't run kernel code on the selected CPUs. GRUB
options: isolcpus=<cpu_ids> rcu_nocbs=<cpu_ids>
How to get consistent results when benchmarking on Linux?
How to run stable benchmarks
Best Practices When Benchmarking CUDA Applications
47/76
Multi-Thread Considerations
numactl --interleave=all
NUMA: Non-Uniform Memory Access (e.g. multi-socket system)
The default behavior is to allocate memory in the same node the thread is
scheduled to run on, and this works well for small amounts of memory. However,
when you want to allocate more memory than a single node provides, it is no longer
possible. This option interleaves memory allocations among the NUMA nodes
export OMP_NUM_THREADS=96 Set the number of threads in an OpenMP
program
48/76
Program Memory Layout
A small code change can modify the program memory layout
large impact on caches (up to 40%)
Linking
- link order changes function addresses
- upgrade a library
Environment Variable Size: moves the program stack
- run in a new directory
- change username
Performance Matters, E. Berger, CppCon20
Producing Wrong Data Without Doing Anything Obviously Wrong!, Mytkowicz et al.,
ASPLOS’09
49/76
Measurement Overhead
Time-measuring functions could introduce significant overhead for small
computations
std::chrono::high_resolution_clock::now() /
std::chrono::system_clock::now() rely on library/OS-provided functions to
retrieve timestamps (e.g. clock_gettime ) and their execution can take several clock
cycles
Consider using a benchmarking framework, such as Google Benchmark or
nanobench ( std::chrono based), to retrieve hardware counters and get basic
profiling info
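A minimal std::chrono sketch that amortizes the timer overhead by measuring many
iterations of a small kernel and reporting the average time per call ( kernel() is a
placeholder for the code under test):

#include <chrono>

void kernel();

double average_seconds_per_call(int iterations = 10000) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; i++)
        kernel();
    const auto stop = clock::now();
    const std::chrono::duration<double> elapsed = stop - start; // in seconds
    return elapsed.count() / iterations;
}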
50/76
Compiler Optimizations
Compiler optimizations could distort the actual benchmark
Dead code elimination: the compiler discards code that does not perform “useful”
computation
Constant propagation/Loop optimizations: the compiler is able to pre-compute the
result of simple code
Instruction reordering: the compiler can even move the time-measuring calls
Microbenchmarking Is Tricky
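A common GCC/Clang trick to keep the benchmarked computation alive, similar in spirit
to Google Benchmark's DoNotOptimize (a sketch, not the library's actual implementation):
an empty asm block that "reads" the value prevents dead-code elimination without
emitting any instruction

#include <cstdint>

template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(value) : "memory"); // pretend to read/clobber 'value'
}

int64_t benchmarked_sum() {
    int64_t sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += i;
    do_not_optimize(sum); // the result is "used": the loop is not eliminated
    return sum;
}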
51/76
Other Considerations
The actual input values of a benchmark can significantly affect the results. For
instance, a GEMM operation can show a 2x performance difference between matrices filled
with zeros and matrices filled with random values, due to the effect on power consumption
52/76
Metric Evaluation 1/6
After extracting and collecting performance results, it is fundamental to
report/summarize them in a way that allows fully understanding the experiment,
providing interpretable insights, ensuring reliability, and comparing different
observations, e.g. codes, algorithms, systems, etc.
53/76
Metric Evaluation 2/6
Metric, formula, and usage:

Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
For summarizing costs, e.g. exec. times, floating-point ops, etc.

Harmonic mean: $\frac{n}{\sum_{i=1}^{n} 1/x_i}$
For summarizing rates, e.g. flop/s

Geometric mean: $\sqrt[n]{\prod_{i=1}^{n} x_i}$
For summarizing rates. The harmonic mean should be preferred.
Commonly used for comparing speedups

Standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Measure of the spread of normally distributed samples

Coefficient of variation: $\frac{\sigma}{\bar{x}}$ (std. dev. / arith. mean)
Represents the stability of a set of normally distributed measurement results.
Normalized standard deviation
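A small worked example with made-up numbers: two runs of the same kernel (same amount
of work) reach 100 flop/s and 300 flop/s. The harmonic mean matches the overall rate,
while the arithmetic mean overestimates it:

$\bar{x}_{arith} = \frac{100 + 300}{2} = 200 \text{ flop/s} \qquad \bar{x}_{harm} = \frac{2}{\frac{1}{100} + \frac{1}{300}} = 150 \text{ flop/s}$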
54/76
Metric Evaluation 3/6
Metric, formula, and usage:

Confidence interval of the mean: $z = t_{n-1,\,\alpha/2}$, $CI = \left[\bar{x} - \frac{z\,\sigma}{\sqrt{n}},\ \bar{x} + \frac{z\,\sigma}{\sqrt{n}}\right]$
Measure of the reliability of the experiment. The concept is interpreted as the
probability (e.g. $\alpha$ = 95%) that the observed confidence interval contains the
true mean

Median: the value at position n/2 after sorting all data
Rank measures are more robust with regard to outliers but do not consider all
measured values

Quantile (Percentile/Quartile): the value at a given position after sorting all data
The percentiles/quartiles provide information about the spread of the data and the
skew. They indicate the value below which a given percentage of data falls

Minimum/Maximum: $\min_{i=1}^{n}(x_i)$ / $\max_{i=1}^{n}(x_i)$
Provide the lower/upper bounds of the data, namely the range of the values
55/76
Metric Evaluation 4/6
Confidence Interval Z
80% 1.282
85% 1.440
90% 1.645
95% 1.960
99% 2.576
99.5% 2.807
99.9% 3.291
Some metrics assume a normal distribution, in which the arithmetic mean, median, and
mode are all equal. A quick check is the relative difference between mean and median:

$\frac{|\bar{x} - \text{median}|}{\max(\bar{x},\ \text{median})}$

If the relative difference between the mean and median is larger than 1%, the values
are probably not normally distributed
56/76
Metric Evaluation 5/6
Minimum/Maximum vs. Arithmetic mean. The minimum/maximum could be used
to get the best outcome of an experiment, namely the measure with the least noise.
On the other hand, the arithmetic mean considers all values and could better represent
the behavior of the experiment.
If the distribution is symmetrical (e.g. normal, binomial), then the
arithmetic mean is a superior statistic, while the minimum/maximum could be useful
in the opposite case (e.g. a log-normal distribution)
57/76
Metric Evaluation 6/6
Benchmarking: minimum vs average
Scientific Benchmarking of Parallel Computing Systems
Benchmarking C++ Code
58/76
Profiling
Overview
A code profiler is a form of dynamic program analysis which aims at investigating the
program behavior to find performance bottlenecks. A profiler is crucial for saving time
and effort during the development and optimization process of an application
Code profilers are generally based on the following methodologies:
Instrumentation Instrumenting profilers insert special code at the beginning and
end of each routine to record when the routine starts and when it exits. With this
information, the profiler aims to measure the actual time taken by the routine on
each call.
Problem: The timer calls take some time themselves
Sampling The operating system interrupts the CPU at regular intervals (time slices)
to execute process switches. At that point, a sampling profiler will record the
currently-executed instruction
59/76
gprof
gprof is a profiling program which collects and arranges timing statistics for a given
program. It uses a hybrid of instrumentation and sampling to monitor
function calls
Website: sourceware.org/binutils/docs/gprof/
Usage:
Code Instrumentation
$ g++ -pg [flags] <source_files>
Important: -pg is also required for linking; it is not supported by Clang
Run the program (it produces the file gmon.out)
Run gprof on gmon.out
$ gprof <executable> gmon.out
Inspect gprof output
60/76
gprof 2/2
gprof output
gprof can also be used for showing the call graph statistics
$ gprof -q <executable> gmon.out
61/76
uftrace
The uftrace tool traces and analyzes the execution of a program written in C/C++
Website: github.com/namhyung/uftrace
$ gcc -pg <program>.cpp
$ uftrace record <executable>
$ uftrace replay
Flame graph output in html and svg
62/76
callgrind
callgrind is a profiling tool that records the call history among functions in a
program’s run as a call-graph. By default, the collected data consists of the number of
instructions executed
Website: valgrind.org/docs/manual/cl-manual.html
Usage:
Profile the application with callgrind
$ valgrind --tool=callgrind <executable> <args>
Inspect callgrind.out.XXX file, where XXX will be the process identifier
63/76
cachegrind
cachegrind simulates how your program interacts with a machine’s cache hierarchy
and (optionally) branch predictor
Website: valgrind.org/docs/manual/cg-manual.html
Usage:
Profile the application with cachegrind
$ valgrind --tool=cachegrind --branch-sim=yes <executable> <args>
Inspect the output (cache misses and rate)
- I1 L1 instruction cache
- D1 L1 data cache
- LL Last level cache
64/76
kcachegrind and qcachegrindwin (View)
KCachegrind (Linux) and QCachegrind (Windows) provide a graphical interface for
browsing the performance results of callgrind
kcachegrind.sourceforge.net/html/Home.html
sourceforge.net/projects/qcachegrindwin
65/76
gprof2dot (View)
gprof2dot is a Python script to convert the output from many profilers into a dot
graph
Website: github.com/jrfonseca/gprof2dot
66/76
perf Linux profiler 1/2
Perf is a performance monitoring and analysis tool for Linux. It uses statistical
profiling (sampling): it periodically interrupts the program and records which
function is executing
Website: perf.wiki.kernel.org/index.php/Main_Page
$ perf record -g <executable> <args> // or
$ perf record --call-graph dwarf <executable>
$ perf report // or
$ perf report -g graph --no-children
Linux perf for Qt developers
67/76
perf Linux profiler 2/2
Data collected by perf can be visualized by using flame graphs, see:
Speedscope: visualize what your program is doing and where it is
spending time
68/76
Other Profilers
Free profiler:
Hotspot
Proprietary profiler:
Intel VTune
AMD CodeAnalyst
69/76
Parallel Computing
Concurrency vs. Parallelism
Concurrency
A system is said to be concurrent if it can support two or more actions in progress
at the same time. Multiple processing units work on different tasks independently
Parallelism
A system is said to be parallel if it can support two or more actions executing
simultaneously. Multiple processing units work on the same problem and their
interaction can affect the final result
Note: parallel computation requires rethinking original sequential algorithms (e.g.
avoid race conditions)
70/76
Performance Scaling
Strong Scaling
Strong scaling describes how the compute time decreases as the number of processors
increases, for a fixed total problem size
Weak Scaling
Weak scaling describes how the compute time varies as the number of processors
increases, for a fixed problem size per processor
Strong scaling is hard to achieve because of the communication between computational
units. Strong scaling is limited by Amdahl's Law
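In formula form (the standard statement of Amdahl's Law, with $s$ the serial fraction
of the program and $N$ the number of processors), the achievable strong-scaling
speedup is bounded by:

$S(N) = \frac{1}{s + \frac{1 - s}{N}} \le \frac{1}{s}$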
71/76
Gustafson’s Law
Gustafson’s Law
Increasing the number of processing units allows solving larger problems in the same
time (the computation time is constant)
Multiple problem instances can run concurrently with more computational resources
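In formula form (the standard statement of Gustafson's Law, with $s$ the serial
fraction of the scaled workload and $N$ the number of processors), the scaled
speedup is:

$S(N) = s + (1 - s)\,N$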
72/76
Parallel Programming Platforms and APIs 1/3
C++11 Threads (+ Parallel STL) free, multi-core CPUs
OpenMP free, directive-based, multi-core CPUs and GPUs (last versions)
OpenACC free, directive-based, multi-core CPUs and GPUs
Khronos OpenCL free, multi-core CPUs, GPUs, FPGA
Nvidia CUDA free, Nvidia GPUs
AMD ROCm free, AMD GPUs
HIP free, Heterogeneous-compute Interface for Portability, for AMD/Nvidia GPUs
73/76
Parallel Programming Platforms and APIs 2/3
Khronos SYCL free, abstraction layer for OpenCL, OpenMP, C/C++ libraries,
multi-core CPUs and GPUs
Kokkos (Sandia) free, abstraction layer for multi-core CPUs and GPUs
Raja (LLNL) free, abstraction layer for multi-core CPUs and GPUs
Intel TBB commercial, multi-core CPUs
OneAPI free, Data Parallel C++ (DPC++) built upon C++ and SYCL,
CPUs, GPUs, FPGA, accelerators
MPI free, de-facto standard for distributed systems
74/76
Parallel Programming Platforms and APIs 3/3
75/76
A Nice Example
Accelerating computational chemistry simulations from 14 hours to 47 seconds with
OpenACC on GPUs (~1,000x speedup)
Accelerating Prediction of Chemical Shift of Protein Structures on GPUs
76/76