Compiler Explorer

  • Compiler Explorer and similar inspection tools are invaluable for optimization and micro-benchmarking.

  • An alternative you might find interesting is to use local tools to the same end. If you’re not familiar with the necessary tools, you can use two of the scripts that I created for my own use: vir_inspect.sh (requires zsh; install it with sudo apt install zsh) and vir_dump_asm.sh.

    • vir_inspect.sh /path/to/executable

      shows a filtered list of functions in the executable. Call

      vir_inspect.sh /path/to/executable <pattern>

      and it will filter the list of functions using the last argument. If a single function remains, it skips the next step.

    • Enter the number of the function you want to inspect.

    • The tool will show a disassembly of the function. If debug information is available (compiled with -g), source code annotation will be shown.

    • After the disassembly, llvm-mca will interpret the complete function. This is often not very useful, unless the function was carefully crafted to be interpreted by llvm-mca. But feel free to extend the script to insert # LLVM-MCA-BEGIN name0 and # LLVM-MCA-END name0 markers before feeding into llvm-mca.

    • vir_dump_asm.sh <source file> will compile and dump asm.
  • Another alternative for Vim users: I hacked up a Compiler Explorer-like vim plugin for myself. It’s available at vim-compilerexplorer.

  • When looking at x86 asm, I recommend using Intel syntax instead of AT&T assembler syntax. (This makes it easier to consult Intel documentation.)

  • Quick x86 asm Introduction (by Matt Godbolt):
    • Registers
      • rax, rbx, rcx, rdx, rsi, rdi, rbp, rip, rsp, r8–r15
      • xmm0–xmm15
      • rdi, rsi, rdx, … as function arguments
      • rax is the return value
    • op (often implicit src/dest)
    • op dest (often in/out and implicit src)
    • op dest, src (often in/out dest)
    • op dest, src1, src2
    • mov eax, edi “move” (eax = edi)
    • mov eax, DWORD PTR[rdi+rsi*4] “load from memory” (eax = *(int*)(rdi + rsi * 4))
    • lea eax, [rdi+rsi] “load effective address” (eax = rdi + rsi)
  • Interesting floating-point instructions:
    • All of these instructions may have a v prefix (e.g. vmovss instead of movss), which you can ignore. It’s only a different instruction encoding.
    • movss: move scalar single-precision (op1 = op2)
    • addss: add scalar single-precision (op1 += op2 or op1 = op2 + op3)
    • vfmadd132ss: fused multiply-add 132, scalar single-precision (argument order: op1 = op1 * op3 + op2). Unlike the other instructions in this list, the FMA instructions exist only with the v prefix.
    • movd: move doubleword (32 bits) (op1 = op2)
    • movsd: move scalar double-precision
    • addsd: add scalar double-precision
  • Later we will also see instructions that use packed instead of scalar in their mnemonic. E.g. addps instead of addss. “packed” means SIMD.
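To connect the notes above to real output, here is a minimal sketch of three functions you could paste into Compiler Explorer. The function names are made up for illustration. The instructions in the comments are what GCC typically emits for x86-64 Linux at -O2 with -masm=intel; treat them as an expectation to check, since the exact output depends on compiler, version, and flags.

```cpp
// Integer arguments arrive in edi and esi, the result is returned in eax.
// Typically: lea eax, [rdi+rsi]
int add_ints(int a, int b) {
    return a + b;
}

// The pointer arrives in rdi, the index in rsi; each int is 4 bytes wide.
// Typically: mov eax, DWORD PTR [rdi+rsi*4]
int load_int(const int* p, long i) {
    return p[i];
}

// Floating-point arguments arrive in xmm0 and xmm1, the result is returned in xmm0.
// Typically: addss xmm0, xmm1
float add_floats(float a, float b) {
    return a + b;
}
```

Changing float to double in the last function should turn addss into addsd.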

Exercise

:pencil:

Inspect the example we benchmarked using Compiler Explorer. Remove FLOP/s computation. Short link

TIP

Use e.g. std::vector<int> in place of benchmark::State.
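A minimal sketch of what this tip could look like. The function name and the loop body are hypothetical stand-ins, not the example from the benchmark. The only point is that a range-based for loop over a std::vector<int> has the same shape as for (auto _ : state), so the code compiles without the benchmark library.

```cpp
#include <vector>

// Hypothetical stand-in for a benchmark function: `state` is a plain
// std::vector<int>, so the loop below has the same shape as
// `for (auto _ : state)` with a benchmark::State.
int run_loop(const std::vector<int>& state) {
    int iterations = 0;
    for (auto _ : state) {
        (void)_;       // silence the unused-variable warning
        float x = 1.f;
        x = x + x;     // result is never used: check in the asm whether it survives
        ++iterations;
    }
    return iterations;
}
```

In the disassembly, look for an addss instruction inside the loop. If there is none, the compiler has removed the work you meant to measure.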

:pencil:

Modify the benchmark to produce believable results.

TIP

:green_book: benchmark::DoNotOptimize(x)

:pencil:

Or invoke some magic :magic_wand: yourself:

asm volatile("" : "+x"(x));

It is different from benchmark::DoNotOptimize. Is it better? more correct? Discuss.
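As a starting point for the discussion, here is a minimal sketch of how the asm statement can be wrapped and used. The names pretend_modified and add_loop are hypothetical. The "+x" constraint asks for an SSE register, so that variant works only on x86 with GCC or Clang; the "+m" fallback is an assumption added here so the sketch also compiles elsewhere.

```cpp
// Empty asm statement that claims to read and write x.
// The compiler must have a real value of x ready before it and
// must assume that x holds an unknown value after it.
inline void pretend_modified(float& x) {
#if defined(__SSE__)
    asm volatile("" : "+x"(x));  // x86 only: keep x in an SSE register
#else
    asm volatile("" : "+m"(x));  // assumption: fallback that forces x through memory
#endif
}

float add_loop(int n) {
    float x = 1.f;
    for (int i = 0; i < n; ++i) {
        pretend_modified(x);  // input is "unknown": no constant folding
        x = x + 1.f;
        pretend_modified(x);  // output is "used": the addition cannot be dropped
    }
    return x;
}
```

Compare the loop body in the disassembly with and without the two calls.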

TIP Local “Compiler Explorer”

Of course you can achieve a very similar result locally, using e.g. the following command. Compiler Explorer has the added feature of better annotation of the assembler output and easy testing of different compilers and compiler flags.

CXXFLAGS="-std=gnu++2b -O2 -DNDEBUG"
watch "ccache g++ $CXXFLAGS -c -S -o - -masm=intel myfile.cpp|grep -vE '^\s+\.'|c++filt"

Drop ccache if you don’t have it available. But since watch recompiles every 2s, caching recompiles of unchanged code is not such a bad idea.
