Write a benchmark (part 2)

(still no SIMD)

Back in the benchmarks directory:

Compiler optimization barriers are hard to get right. For the rest of today’s exercises, please use vir::fake_read(...) and vir::fake_modify(...); see “benchmark support functions” in the vir-simd documentation. (I’ll explain the purpose of “vir-simd” later.)

In preparation for further simplifications, let’s also switch to the "benchmark.h" header, which defines a few helper functions and already implements main().

Your peakflop.cpp should look like this now:

#include "benchmark.h"
#include <vir/simd_benchmarking.h>

void peak(benchmark::State &state)
{
  float x = 1;
  vir::fake_modify(x);
  for (auto _ : state) {
    x = x * 3 + 1;
    vir::fake_read(x);
  }

  // compute FLOP/s and FLOP/cycle
  add_flop_counters(state, 2);
}

// Register the function as a benchmark
BENCHMARK(peak);

On an Intel CPU you will likely see 0.5 FLOP/cycle now. That’s quite far away from the 4 FLOP/cycle my slides showed for scalars. Where’s the factor of 8 hiding?

TIP

addss and mulss need more than one clock cycle to produce a result: typically on the order of 3–5 cycles.

Intel Optimization Reference Manual (pages 52–54)

uops.info on mulss (Lat: Latency — the time it takes from start of instruction execution until the result is ready. TP: Throughput — the time it takes before another instruction can be executed. time = clock cycles. TP 0.5 means the CPU can execute two of these instructions per clock cycle.)
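To make the difference between latency and throughput concrete, here is a small sketch that computes how many independent instructions must be in flight to keep an execution unit busy, and what instruction rate a given number of independent dependency chains can sustain. It is a simplified model, not a measurement. The function names are made up for this illustration, and the numbers in the checks (latency 4, TP 0.5) are assumed example values; look up the real ones for your CPU on uops.info.

```cpp
// Simplified model of a pipelined execution unit (illustration only).
// latency: cycles until a result is ready.
// tp:      reciprocal throughput in cycles per instruction, as listed on uops.info.

// Number of independent instructions that must be in flight to hide the latency.
constexpr double independent_chains_needed(double latency, double tp)
{
  return latency / tp;
}

// Instructions per cycle sustained by `chains` independent dependency chains.
constexpr double instructions_per_cycle(int chains, double latency, double tp)
{
  const double latency_bound = chains / latency; // each chain issues once per `latency` cycles
  const double throughput_bound = 1 / tp;        // what the hardware can issue at best
  return latency_bound < throughput_bound ? latency_bound : throughput_bound;
}

// Assumed example values: latency 4 cycles, TP 0.5 cycles.
static_assert(independent_chains_needed(4, 0.5) == 8, "");
static_assert(instructions_per_cycle(1, 4, 0.5) == 0.25, ""); // one chain: latency-bound
static_assert(instructions_per_cycle(8, 4, 0.5) == 2, "");    // eight chains: throughput-bound
```

In this model a single dependency chain leaves most of the available throughput unused, which is why the number of independent instructions per loop iteration matters.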

Edit peak to issue more independent addss and mulss instructions per loop iteration.
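One way to get more independent instructions is to carry several accumulators through the loop, so that the multiply and add of one accumulator do not have to wait for the result of another. The following is a minimal sketch of that pattern only, not the finished peakflop.cpp: the helper name update_chains and the choice of 4 accumulators are made up for this illustration. In the real benchmark every accumulator still has to pass through vir::fake_modify and vir::fake_read, otherwise the compiler may optimize the work away.

```cpp
#include <array>
#include <cstddef>

// One loop iteration over N independent accumulators (hypothetical helper).
// x[i] depends only on its own previous value, so the CPU can overlap the
// N multiply/add chains instead of waiting for one result at a time.
template <std::size_t N>
constexpr std::array<float, N> update_chains(std::array<float, N> x)
{
  for (std::size_t i = 0; i < N; ++i)
    x[i] = x[i] * 3 + 1;
  return x;
}

// Compile-time check: each accumulator evolves independently of the others.
constexpr std::array<float, 4> after_two_steps =
  update_chains(update_chains(std::array<float, 4>{1, 2, 3, 4}));
static_assert(after_two_steps[0] == 13, ""); // 1 -> 4 -> 13
static_assert(after_two_steps[3] == 40, ""); // 4 -> 13 -> 40
```

Each call performs 2 × N floating-point operations, so the FLOP count passed to add_flop_counters would presumably have to grow with the number of accumulators.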

You should be able to reach 2 FLOP/cycle now. What is the remaining factor of 2?

TIP

uops.info on vfmadd132ss

Our example computes a multiplication with subsequent addition. So why doesn’t the compiler emit a fused multiply-add (FMA) instruction?

TIP

GCC x86 Options

GCC Optimize Options: -ffp-contract

Note the defaults. -ffp-contract defaults to producing FMAs already. However, the compiler assumes your code should be able to run on any x86_64 CPU, all the way back to the original AMD Opteron and Athlon 64. Those CPUs didn’t have FMA instructions.

Compile with -march=native: modify CMakeLists.txt and change target_compile_options from "-std=gnu++2b" to "-std=gnu++2b;-march=native". Then rebuild and run the benchmark:
make run_peakflop
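To check that the new flag took effect, you can look at the compiler's predefined macros: GCC and Clang define __FMA__ when the target supports FMA instructions (for example via -march=native on a CPU with FMA, or via -mfma). The following sketch turns that into a constant you can print or assert on. The names fma_codegen_enabled and multiply_add are made up for this illustration.

```cpp
// True if the compiler is allowed to emit FMA instructions for this
// translation unit. GCC and Clang define __FMA__ in that case.
#ifdef __FMA__
constexpr bool fma_codegen_enabled = true;
#else
constexpr bool fma_codegen_enabled = false;
#endif

// The scalar multiply-add from peak() as a standalone function
// (hypothetical helper). With FMA enabled and GCC's default -ffp-contract
// setting in GNU mode, the compiler may turn this into a single FMA instruction.
constexpr float multiply_add(float x)
{
  return x * 3 + 1;
}

static_assert(multiply_add(1.0f) == 4.0f, "");
```

Adding static_assert(fma_codegen_enabled, "FMA not enabled"); to a source file would then make the build fail if -march=native is missing or the CPU has no FMA support.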

Outlook

We will come back to this benchmark after we have covered vectorization and simd<T>.
