Write a benchmark (part 2)
(still no SIMD yet)
Back in the benchmarks directory:
Compiler optimization barriers are hard to get right. For the rest of the exercises today, please use vir::fake_read(...) and vir::fake_modify(...); see “vir-simd | benchmark support functions”. (I’ll explain the purpose of “vir-simd” later.)
In preparation for further simplifications, let’s also switch to the "benchmark.h" header, which defines a few helper functions and already implements main().
Your peakflop.cpp should look like this now:
#include "benchmark.h"
#include <vir/simd_benchmarking.h>

void peak(benchmark::State &state)
{
  float x = 1;
  // make the compiler treat x as an unknown value
  vir::fake_modify(x);
  for (auto _ : state) {
    x = x * 3 + 1;
    // make the compiler treat x as used, so the work is not optimized away
    vir::fake_read(x);
  }
  // compute FLOP/s and FLOP/cycle
  add_flop_counters(state, 2);
}

// Register the function as a benchmark
BENCHMARK(peak);
On an Intel CPU you will likely see 0.5 FLOP/cycle now. That’s quite far away from the 4 FLOP/cycle my slides showed for scalars. Where’s the factor of 8 hiding?
TIP
addss and mulss require more than 1 clock cycle to produce a result; it is rather on the order of 3–5 cycles.
Intel Optimization Reference Manual (pages 52–54)
uops.info on mulss
(Lat: latency, the time it takes from the start of instruction execution until the result is ready. TP: throughput, the time it takes before another instruction can be executed. Times are given in clock cycles, so TP 0.5 means the CPU can execute two of these instructions per clock cycle.)
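As a rough worked example of how the two numbers combine: to keep a pipelined execution unit busy, about latency divided by TP independent instructions need to be in flight at once. The helper name chains_to_saturate and the sample numbers below are assumptions for illustration, not measurements; look up the real values for your CPU.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of benchmark.h or vir-simd):
// estimate how many independent instructions must be in flight to hide
// the latency. Both arguments are in clock cycles, as listed on uops.info.
inline int chains_to_saturate(double latency, double throughput)
{
    assert(latency > 0 && throughput > 0);
    return static_cast<int>(std::ceil(latency / throughput));
}
```

With an assumed latency of 4 cycles and a TP of 0.5, chains_to_saturate(4, 0.5) evaluates to 8. A single dependent chain leaves most of that capacity unused.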
Edit peak to issue more independent addss and mulss instructions per loop iteration.
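A minimal sketch of what “more independent instructions” can mean, written as plain C++ so that it compiles without the workshop headers. The name step_chains and the chain count N are assumptions for illustration. In peakflop.cpp the inner loop would sit inside for (auto _ : state), every element would still have to pass through vir::fake_modify and vir::fake_read, and the FLOP count given to add_flop_counters would presumably have to grow with N.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: advance N independent chains of x = x * 3 + 1.
// Chain k only depends on its own previous value, so the CPU is free to
// overlap the multiplications and additions of different chains.
template <std::size_t N>
std::array<float, N> step_chains(std::array<float, N> x, int iterations)
{
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        for (std::size_t k = 0; k < N; ++k) {
            x[k] = x[k] * 3 + 1; // one multiply and one add per chain
        }
    }
    return x;
}
```

An optimizing compiler may unroll or even vectorize the inner loop, so it is worth checking the generated assembly for addss and mulss before trusting the numbers.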
You should be able to reach 2 FLOP/cycle now. What is the remaining factor of 2?
TIP
Our example computes a multiplication with subsequent addition. So why doesn’t the compiler emit a fused multiply-add (FMA) instruction?
TIP
GCC Optimize Options: -ffp-contract. Note the defaults.
In GCC’s GNU dialects, -ffp-contract defaults to fast, so the compiler is already allowed to produce FMAs. However, the compiler assumes your code should be able to run on the original x86_64 CPUs, such as the AMD Athlon 64. Those CPUs didn’t have FMA instructions.
Compile with -march=native. Modify CMakeLists.txt and change target_compile_options from "-std=gnu++2b" to "-std=gnu++2b;-march=native". Then run:
make run_peakflop
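One way to check that the new flags reached the compiler, sketched under the assumption of GCC or Clang on x86: both predefine the macro __FMA__ when FMA instructions are enabled for the target. The helper names below are hypothetical and not part of benchmark.h.

```cpp
#include <cassert>
#include <cmath>

// True when the compiler was allowed to use hardware FMA instructions,
// for example with -march=native on a CPU that supports FMA.
constexpr bool fma_enabled()
{
#ifdef __FMA__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: the benchmark's update step written as an explicit
// fused multiply-add. std::fma rounds only once, after the addition.
inline float step_fused(float x)
{
    return std::fma(x, 3.0f, 1.0f);
}

// Sanity check: for small integers the result is exact either way.
inline void check_step_fused()
{
    assert(step_fused(1.0f) == 4.0f);
    assert(step_fused(4.0f) == 13.0f);
}
```

If fma_enabled() is still false after rebuilding, the -march=native option probably did not reach the compiler, or the CPU has no FMA support.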
Outlook
We will come back to this benchmark after we have covered vectorization and simd<T>.