HPC Benchmarks

HPC
Published

February 20, 2023

Modified

October 26, 2023

Micro-benchmarks for performance…

Fabric

  • MPI-GRAPH
    • …measures the bandwidth between all possible MPI process pairs
    • …includes a Perl script to parse the output …generate an HTML report
  • b_eff
    • …creates a ring of nodes
    • …each node sends messages of different sizes to its neighbors

STREAM

STREAM benchmark …https://www.cs.virginia.edu/stream

Usage

Source code…

# get the source code
git clone https://github.com/jeffhammond/STREAM && cd STREAM

# compile with OpenMP for multi-core support
gcc -fopenmp stream.c -o stream

# execute benchmark
export OMP_NUM_THREADS=2 ; ./stream

References…

Measurements

Uses synthetic vector-style kernels…

  • …only measures execution time …everything else is derived
  • …reports “bandwidth” values for each of the kernels
# example output
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10917.2     0.014719     0.014656     0.014961
Scale:          10629.1     0.015092     0.015053     0.015121
Add:            14149.2     0.017029     0.016962     0.017103
Triad:          13763.1     0.017509     0.017438     0.017655
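Because STREAM only measures execution time, the reported bandwidth can be re-derived by hand: bytes moved per iteration times the array size, divided by the best (minimum) time. A minimal sketch, assuming the default STREAM_ARRAY_SIZE of 10,000,000 elements and the Copy figures from the sample output above:

```shell
# Copy moves 16 bytes per array element per iteration (read b, write a).
ARRAY_SIZE=10000000        # assumed default STREAM_ARRAY_SIZE
MIN_TIME=0.014656          # Copy "Min time" from the sample output
awk -v n="$ARRAY_SIZE" -v t="$MIN_TIME" \
    'BEGIN { printf "%d MB/s\n", (16 * n) / t / 1e6 }'
# -> 10917 MB/s, matching the reported Copy "Best Rate"
```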
Name    Kernel                 Bytes/Iteration   FLOPS/Iteration
COPY    a(i) = b(i)            16                0
SCALE   a(i) = q*b(i)          16                1
SUM     a(i) = b(i) + c(i)     24                1
TRIAD   a(i) = b(i) + q*c(i)   24                2
  • copy …measures transfer rate in the absence of arithmetic
  • scale …adds a simple arithmetic operation
  • sum …adds a third operand
  • triad …allows chained/overlapped multiply-add operations

Adjust the value of STREAM_ARRAY_SIZE

  • …number of array elements used to run the benchmarks
  • …depends on…
    • …system cache size(s)
    • …granularity of the system timer
  • …adjust value…
      1. …array…4 times the size of the available cache
      2. …large enough for ‘timing calibration’ of at least 20 clock-ticks

Use lstopo to identify L3 cache size… (multiply by 4)…
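The rule of thumb can be turned into a quick calculation. A minimal sketch, assuming a 32 MiB L3 cache (substitute the value lstopo reports on your machine) and 8-byte double-precision array elements:

```shell
# Hypothetical L3 cache size; replace with the value reported by lstopo.
CACHE_BYTES=$((32 * 1024 * 1024))
# Each array should be at least 4x the cache; elements are 8-byte doubles.
STREAM_ARRAY_SIZE=$((4 * CACHE_BYTES / 8))
echo "$STREAM_ARRAY_SIZE"    # 16777216
```

The resulting value is what gets passed via -DSTREAM_ARRAY_SIZE= at compile time.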

# set at compile time
gcc -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M

IOR

IOR (Interleaved or Random) is a file system benchmarking application

http://wiki.lustre.org/IOR
https://github.com/LLNL/ior (deprecated)
https://github.com/IOR-LANL/ior
https://github.com/glennklockwood/ior-apex

  • Tests performance of parallel file-systems (like Lustre)
  • Uses MPI for process synchronisation
  • Configurable to operate in multiple modes:
    • File-per-process: One file per task (measures peak throughput).
    • Single-shared-file: A single file shared by all tasks.
    • Buffered: Takes advantage of I/O caches on the client.
    • DirectIO: Bypasses the I/O cache by writing directly to the file-system.
>>> git clone https://github.com/LLNL/ior.git && cd ior
>>> ./bootstrap
>>> ./configure
>>> make clean && make

Deploy the ior binary on all nodes used for benchmarking.

# 20 parallel tasks, each writing its own 100MB file
mpirun -np 20 ior -a POSIX -vwk -t100m -b100m -i 10 -F -o ior.dat

Options

File size (1.5x the total main memory of a node, so client-side caches cannot absorb the I/O):

filesize = segmentCount * blocksize * number_of_processes

  • transfersize: Size (in bytes) of a single data buffer transferred in a single I/O call.
  • blocksize: Size (in bytes) of a contiguous chunk of data accessed by a single client.
  • segmentCount: Number of segments in the file. (A segment is a contiguous chunk of data accessed by multiple clients, each writing/reading its own contiguous block; a segment is comprised of one block per client.)
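Applying the formula to the mpirun example above (20 tasks, 100 MiB block size, one segment) gives the aggregate data size. A minimal sketch with those assumed values:

```shell
SEGMENT_COUNT=1                      # default, no -s flag given
BLOCK_SIZE=$((100 * 1024 * 1024))    # -b100m
NUM_PROCS=20                         # mpirun -np 20
FILESIZE=$((SEGMENT_COUNT * BLOCK_SIZE * NUM_PROCS))
echo "$FILESIZE bytes"               # 2097152000 bytes (~2 GiB aggregate)
```

With -F (file-per-process) each task writes its own 100 MiB file; the formula gives the total across all tasks.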

Configuration Files

>>> cat ior.conf    
IOR START
  api=MPIIO
  testFile=ior.dat
  repetitions=1
  readFile=1
  writeFile=1
  filePerProc=0
  keepFile=0
  blockSize=1024M
  transferSize=2M
  verbose=0
  numTasks=0
  collective=1
IOR STOP
>>> ior -f ior.conf
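The relationship between blockSize and transferSize in the configuration above determines how many I/O calls each block is split into. A minimal sketch, assuming IOR's M suffix denotes MiB:

```shell
BLOCK_SIZE=$((1024 * 1024 * 1024))   # blockSize=1024M
TRANSFER_SIZE=$((2 * 1024 * 1024))   # transferSize=2M
# Each 1 GiB block is issued as a series of 2 MiB I/O calls.
echo "$((BLOCK_SIZE / TRANSFER_SIZE)) transfers per block"   # 512 transfers per block
```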

HEPScore

HEPScore23 …replaces HEP-SPEC06

References…

References

Regression tests and benchmarks for HPC systems…