Write a benchmark (part 2)
(still no SIMD yet)
Back in the benchmarks directory:
Compiler optimization barriers are hard to get right. For the rest of the exercises today, please use
vir::fake_read(...) and vir::fake_modify(...) (see: vir-simd | benchmark support functions). (I’ll explain the purpose of “vir-simd” later.)
In preparation for further simplifications, let’s also switch to the
"benchmark.h" header, which defines a few helper functions and already implements main().
Your peakflop.cpp should look like this now:
#include "benchmark.h"
#include <vir/simd_benchmarking.h>

void peak(benchmark::State &state)
{
  float x = 1;
  vir::fake_modify(x);
  for (auto _ : state) {
    x = x * 3 + 1;
    vir::fake_read(x);
  }
  // compute FLOP/s and FLOP/cycle
  add_flop_counters(state, 2);
}

// Register the function as a benchmark
BENCHMARK(peak);
On an Intel CPU you will likely see 0.5 FLOP/cycle now. That’s quite far away from the 4 FLOP/cycle my slides showed for scalars. Where’s the factor of 8 hiding?
TIP
addss and mulss require more than 1 clock cycle to produce a result; rather on the order of 3–5 cycles.
Intel Optimization Reference Manual (pages 52–54)
uops.info on mulss (Lat: latency, the time from the start of instruction execution until the result is ready. TP: throughput, the time before another instruction of the same kind can start executing. Both are measured in clock cycles; TP 0.5 means the CPU can execute two of these instructions per clock cycle.)
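To make the relationship between latency and throughput concrete, here is a minimal sketch of the arithmetic. The helper name chains_needed is hypothetical (it is not part of benchmark.h or vir-simd), and the numbers in the usage note are illustrative, not measurements for a specific CPU.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, for illustration only.
// Given an instruction's latency and its reciprocal throughput (the "TP"
// value, both in clock cycles), return how many independent instructions
// must be in flight to keep the execution units busy.
inline int chains_needed(double latency_cycles, double tp_cycles)
{
  assert(tp_cycles > 0);
  return static_cast<int>(std::ceil(latency_cycles / tp_cycles));
}
```

For example, an instruction with a latency of 4 cycles and a TP of 0.5 would need chains_needed(4, 0.5), i.e. 8, independent instructions in flight before the CPU stops waiting on results.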
Edit peak to issue more independent addss and mulss instructions per loop iteration.
You should be able to reach 2 FLOP/cycle now. What is the remaining factor of 2?
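One possible shape for such a change is sketched below as a standalone function, so it compiles without benchmark.h or vir-simd. The name run_chains and the template parameter N are assumptions made for illustration. The comments mark where the optimization barriers from peakflop.cpp would go, and the sketch relies on the compiler unrolling the inner loop for a fixed N.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Sketch: advance N independent dependency chains of x = x * 3 + 1.
// No chain reads another chain's value, so the CPU can overlap their
// multiplications and additions instead of waiting on a single result.
template <std::size_t N>
std::array<float, N> run_chains(int iterations)
{
  assert(iterations >= 0);
  std::array<float, N> x;
  x.fill(1.f);
  // In peakflop.cpp: pass every x[i] through vir::fake_modify here.
  for (int it = 0; it < iterations; ++it) {
    for (std::size_t i = 0; i < N; ++i) {
      x[i] = x[i] * 3 + 1;
    }
    // In peakflop.cpp: pass every x[i] through vir::fake_read here.
  }
  return x;
}
```

Without the barriers, the compiler is free to precompute the result, so this standalone version only illustrates the data dependencies and is not a benchmark by itself. Each loop iteration now performs 2 * N floating-point operations, so the FLOP count passed to add_flop_counters would presumably need to change accordingly (assuming its second argument is the number of FLOPs per iteration).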
TIP
Our example computes a multiplication with subsequent addition. So why doesn’t the compiler emit a fused multiply-add (FMA) instruction?
TIP
GCC Optimize Options:
-ffp-contract. Note the defaults.
-ffp-contract defaults to producing FMAs already. However, the compiler assumes your code should be able to run on the original x86_64 CPUs, such as the AMD Athlon 64. Those CPUs didn’t have FMA instructions.
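To check whether a given set of compiler flags enables FMA at all, one option is to look at the predefined macro __FMA__, which GCC and Clang define when the target supports FMA instructions. The sketch below also contrasts an explicit std::fma call with the plain expression from the benchmark; the names target_has_fma, fused_step, and plain_step are made up for this example.

```cpp
#include <cmath>

// GCC and Clang predefine __FMA__ when the target supports FMA
// instructions (for example with -mfma or a suitable -march).
#ifdef __FMA__
constexpr bool target_has_fma = true;
#else
constexpr bool target_has_fma = false;
#endif

// Explicitly fused: computes x * 3 + 1 with a single rounding step,
// independent of the -ffp-contract setting.
inline float fused_step(float x)
{
  return std::fma(x, 3.f, 1.f);
}

// Plain expression: the compiler may contract this into an FMA,
// but only if the target has FMA instructions.
inline float plain_step(float x)
{
  return x * 3 + 1;
}
```

When target_has_fma is false, std::fma typically falls back to a much slower library routine, so for this benchmark it makes more sense to keep the plain expression and adjust the compiler flags.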
Compile with -march=native. Modify CMakeLists.txt and change target_compile_options from "-std=gnu++2b" to "-std=gnu++2b;-march=native". Then run:
make run_peakflop
Outlook
We will come back to this benchmark after we have covered vectorization and simd<T>.