Write a benchmark (part 2)

(still no SIMD)

Back in the benchmarks directory:

:pencil:

Compiler optimization barriers are hard to get right. For the rest of the exercises today, please use vir::fake_read(...) and vir::fake_modify(...): :green_book: vir-simd | benchmark support functions. (I’ll explain the purpose of “vir-simd” later.)
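
To make the purpose of these barriers concrete, here is a minimal sketch of how such helpers are commonly written with GCC/Clang inline assembly. The names sketch_fake_read and sketch_fake_modify are hypothetical, and this is not necessarily how vir-simd implements its functions; for the exercises, keep using vir::fake_read and vir::fake_modify.

```cpp
// Hypothetical stand-ins for illustration only (GCC/Clang inline asm).
// The "m" constraint routes the value through memory: simple and portable,
// but not the cheapest possible barrier.

// Tells the compiler "x is read here", so the computation of x
// cannot be removed as dead code.
template <typename T>
inline void sketch_fake_read(const T& x)
{
  asm volatile("" : : "m"(x));
}

// Tells the compiler "x may have been changed here", so it cannot
// constant-fold computations that start from x.
template <typename T>
inline void sketch_fake_modify(T& x)
{
  asm volatile("" : "+m"(x));
}

// Same shape as the benchmark loop, without the timing.
inline float three_steps()
{
  float x = 1;
  sketch_fake_modify(x);   // x is now "unknown" to the optimizer
  for (int i = 0; i < 3; ++i) {
    x = x * 3 + 1;
    sketch_fake_read(x);   // every intermediate x counts as "used"
  }
  return x;                // 1 -> 4 -> 13 -> 40
}
```

The empty asm statements emit no instructions, so the values are unchanged at run time; only the optimizer's view of them changes.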

:pencil:

In preparation for further simplifications, let’s also switch to the "benchmark.h" header, which defines a few helper functions and already implements main().

Your peakflop.cpp should look like this now:

#include "benchmark.h"
#include <vir/simd_benchmarking.h>

void peak(benchmark::State &state)
{
  float x = 1;
  vir::fake_modify(x);
  for (auto _ : state) {
    x = x * 3 + 1;
    vir::fake_read(x);
  }

  // compute FLOP/s and FLOP/cycle
  add_flop_counters(state, 2);
}

// Register the function as a benchmark
BENCHMARK(peak);

:question:

On an Intel CPU you will likely see 0.5 FLOP/cycle now. That’s quite far away from the 4 FLOP/cycle my slides showed for scalars. Where’s the factor of 8 hiding?

TIP

addss and mulss need more than one clock cycle to produce a result, typically on the order of 3–5 cycles.

:green_book: Intel Optimization Reference Manual (pages 52–54)

:green_book: uops.info on mulss (Lat: latency, the number of clock cycles from the start of instruction execution until the result is ready. TP: throughput, given as the average number of clock cycles before another independent instruction of the same kind can start. TP 0.5 means the CPU can execute two of these instructions per clock cycle.)
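
The relation between the two numbers can be written down as a rough model: a dependent instruction can only start once its input is ready (after Lat cycles), whereas independent instructions can start every TP cycles. The ratio Lat / TP therefore estimates how many independent instructions must be in flight to keep the execution units busy. The function name below is hypothetical and the numbers are illustrative; look up Lat and TP for your own CPU.

```cpp
// Rough model: number of independent instructions that must be in flight
// to hide the latency of a pipelined execution unit (rounded up).
constexpr int independent_ops_needed(double latency_cycles,
                                     double tp_cycles)
{
  const double ratio = latency_cycles / tp_cycles;
  const int whole = static_cast<int>(ratio);
  return (whole < ratio) ? whole + 1 : whole;
}

// Illustrative values only, not measurements.
static_assert(independent_ops_needed(4, 0.5) == 8, "Lat 4, TP 0.5");
static_assert(independent_ops_needed(4, 1.0) == 4, "Lat 4, TP 1");
static_assert(independent_ops_needed(3, 2.0) == 2, "1.5 rounds up to 2");
```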

:pencil:

Edit peak to issue more independent addss and mulss instructions per loop iteration.
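
One possible shape for this, shown as a standalone sketch outside the benchmark harness: keep several accumulators that never depend on each other, so the CPU can overlap their multiplies and adds. The function run_chains and its parameters are hypothetical names for illustration. In peak, each accumulator would additionally need the vir::fake_modify / vir::fake_read barriers, and the FLOP count passed to add_flop_counters presumably has to grow with the number of chains.

```cpp
#include <array>
#include <cstddef>

// N independent dependency chains, each computing x = x * 3 + 1.
// Without optimization barriers the compiler is free to transform this
// loop, so treat it as a sketch of the data flow, not as a benchmark.
template <std::size_t N>
std::array<float, N> run_chains(float start, int iterations)
{
  std::array<float, N> x;
  x.fill(start);
  for (int it = 0; it < iterations; ++it) {
    // No x[i] depends on any other x[j], so these N multiply-add
    // pairs can execute overlapped instead of one after another.
    for (std::size_t i = 0; i < N; ++i) {
      x[i] = x[i] * 3 + 1;
    }
  }
  return x;
}
```

For example, run_chains<8>(1.0f, 10) performs 8 × 2 floating-point operations per outer iteration, while each individual chain is still limited by the instruction latencies.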

:question:

You should be able to reach 2 FLOP/cycle now. What is the remaining factor of 2?

TIP

:green_book: uops.info on vfmadd132ss

:pencil:

Our example computes a multiplication with subsequent addition. So why doesn’t the compiler emit a fused multiply-add (FMA) instruction?

TIP

:green_book: GCC x86 Options

:green_book: GCC Optimize Options: -ffp-contract

Note the defaults. -ffp-contract defaults to producing FMAs already. However, the compiler assumes your code should be able to run on the original x86_64 CPUs, such as the AMD Athlon 64. :flushed: Those CPUs didn’t have FMA instructions.
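
A small sketch for checking what the compiler targets: GCC and Clang define the macro __FMA__ on x86 when FMA instructions are enabled for the build. The names targets_fma and fused_step are hypothetical. std::fma requests a fused multiply-add explicitly, independent of the -ffp-contract setting.

```cpp
#include <cmath>

// True if this translation unit is compiled with x86 FMA instructions
// enabled (GCC/Clang-specific macro; other compilers may differ).
constexpr bool targets_fma =
#ifdef __FMA__
    true;
#else
    false;
#endif

// Explicit fused multiply-add: computes x * 3 + 1 with a single rounding.
// Without hardware FMA support this may fall back to a slower library
// routine, so it is not automatically a speedup.
inline float fused_step(float x)
{
  return std::fma(x, 3.0f, 1.0f);
}
```

Printing targets_fma once with and once without an FMA-capable target setting shows whether the compiler is allowed to emit FMA instructions for x * 3 + 1.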

:pencil:

Compile with -march=native. Modify CMakeLists.txt and change target_compile_options from "-std=gnu++2b" to "-std=gnu++2b;-march=native".

make run_peakflop

Outlook

We will come back to this benchmark after we have covered vectorization and simd<T>.
