HPC Benchmarks

HPC

Published

February 20, 2023

Modified

October 26, 2023

Micro-benchmarks for performance…

…mini-applications to heavily test a specific function
…tries to reach performance limitations

Fabric

MPI-GRAPH (https://github.com/LLNL/mpiGraph)
- …explores the bandwidth between possible MPI process pairs
- …Perl script to parse application …generate HTML report
b_eff (https://fs.hlrs.de/projects/par/mpi/b_eff/)
- …creates a ring of nodes
- …each node sends traffic of different sizes to neighbors

STREAM

STREAM benchmark …https://www.cs.virginia.edu/stream

…measures sustainable memory bandwidth…
…works with datasets larger than the available cache
List of results…
- https://openbenchmarking.org/test/pts/stream

Usage

Source code…
- https://www.cs.virginia.edu/stream/FTP/Code/
- https://github.com/jeffhammond/STREAM

# get the source code
git clone https://github.com/jeffhammond/STREAM && cd STREAM

# compile with OpenMP for multi-core support
gcc -fopenmp stream.c -o stream

# execute benchmark
export OMP_NUM_THREADS=2 ; ./stream
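
Memory bandwidth usually depends on how many cores are active, so it can be useful to repeat the run with different thread counts. A minimal sketch, assuming the `stream` binary built above sits in the current directory; the function names `triad_rate` and `sweep_threads` and the thread counts are illustrative and not part of STREAM:

```sh
# extract the "Best Rate MB/s" column of the Triad kernel from STREAM output
triad_rate() {
    awk '$1 == "Triad:" { print $2 }'
}

# run the benchmark with an increasing number of OpenMP threads
# (thread counts are an assumption, adjust to the core count of the node)
sweep_threads() {
    for threads in 1 2 4 8; do
        rate=$(OMP_NUM_THREADS=$threads ./stream | triad_rate)
        echo "$threads $rate"
    done
}
```

Calling `sweep_threads` prints one line per thread count with the Triad rate next to it.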

References…

Intel …ICC compiler
- https://github.com/intel/memory-bandwidth-benchmarks
AMD …AOCC compiler
- https://developer.amd.com/spack/stream-benchmark/

Measurements

Uses synthetic vector style applications…

…only measures execution time …everything else is derived
…reports “bandwidth” values for each of the kernels…

# example output
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10917.2     0.014719     0.014656     0.014961
Scale:          10629.1     0.015092     0.015053     0.015121
Add:            14149.2     0.017029     0.016962     0.017103
Triad:          13763.1     0.017509     0.017438     0.017655

Name	Kernel	Bytes/Iteration	FLOPS/Iteration
COPY	`a(i) = b(i)`	16	0
SCALE	`a(i) = q*b(i)`	16	1
SUM	`a(i) = b(i) + c(i)`	24	1
TRIAD	`a(i) = b(i) + q*c(i)`	24	2

copy …measures transfer rate in the absence of arithmetic
scale …adds a simple arithmetic operation
sum …adds a third operand (reported as “Add” in the output)
triad …allows overlapped multiply/add operations
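
Since only the execution time is measured, the reported rate can be recomputed from the bytes per iteration in the table, the number of array elements and the "Min time" column. A small sketch of that arithmetic; `stream_rate` is a hypothetical helper, and the array size of 10000000 elements used in the example below is an assumption (the default `STREAM_ARRAY_SIZE` in `stream.c`), since the example output does not state it:

```sh
# rate in MB/s = bytes per iteration * array elements / min time / 10^6
stream_rate() {
    bytes_per_iter=$1   # 16 for copy/scale, 24 for add/triad
    elements=$2         # value of STREAM_ARRAY_SIZE
    min_time=$3         # "Min time" column in seconds
    awk -v b="$bytes_per_iter" -v n="$elements" -v t="$min_time" \
        'BEGIN { printf "%.1f\n", b * n / t / 1e6 }'
}
```

For the Copy line above, `stream_rate 16 10000000 0.014656` gives a value close to the reported 10917.2 MB/s; the small difference comes from the rounded time in the output.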

Adjust the value of STREAM_ARRAY_SIZE…

…number of array elements used to run the benchmarks
…depends on…
- …system cache size(s)
- …granularity of the system timer
…adjust value…
- (a) …array…4 times the size of the available cache
- (b) …large enough for ‘timing calibration’ of at least 20 clock-ticks

Use lstopo to identify L3 cache size… (multiply by 4)…

# set at compile time
gcc -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M
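
The "multiply by 4" rule can be turned into an element count with a line of arithmetic. A sketch, assuming the default 8-byte `double` element type; `stream_array_size` is a hypothetical helper and the cache size has to be supplied by hand (for example read from the `lstopo` output):

```sh
# minimum number of array elements so that one array is 4x the cache size
# (assumes 8 bytes per element, i.e. the default double precision)
stream_array_size() {
    cache_bytes=$1
    echo $(( 4 * cache_bytes / 8 ))
}
```

For a 32 MiB L3 cache, `stream_array_size 33554432` yields the value to pass as `-DSTREAM_ARRAY_SIZE=…`. This covers only the cache criterion; the timer granularity criterion still has to be checked in the benchmark output.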

IOR

IOR (Interleaved or Random) file system benchmarking application

http://wiki.lustre.org/IOR
https://github.com/LLNL/ior (deprecated)
https://github.com/IOR-LANL/ior
https://github.com/glennklockwood/ior-apex

Tests performance of parallel file-systems (like Lustre)
Uses MPI for process synchronisation
Configurable to operate in multiple modes:
- File-per-process: One file per task (measures peak throughput).
- Single-shared-file: Single shared file for all tasks.
- Buffered: Takes advantage of I/O caches on the client.
- DirectIO: Bypasses the I/O cache by writing directly to the file-system.

>>> git clone https://github.com/LLNL/ior.git && cd ior
>>> ./bootstrap
>>> ./configure
>>> make clean && make

Deploy the ior binary on all nodes used for benchmarking.

# 20 parallel tasks, each writing one file of 100 MB
mpirun -np 20 ior -a POSIX -vwk -t100m -b100m -i 10 -F -o ior.dat

Options

File size (should be at least 1.5x the total main memory of a node, to limit caching effects):

filesize = segmentCount * blocksize * number_of_processes

transfersize: Size (in bytes) of a single data buffer to be transferred in a single I/O call.
blocksize: Size (in bytes) of a contiguous chunk of data accessed by a single client; comprised of one or more transfers.
segmentCount: Number of segments in file. (A segment is a contiguous chunk of data accessed by multiple clients each writing/reading their own contiguous data; comprised of blocks accessed by multiple clients.)
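
The file size formula and the 1.5x memory rule can be combined to pick a block size. A rough sketch for a single node, with all sizes in MiB; `ior_total_mib` and `ior_block_mib` are hypothetical helpers and use integer arithmetic, so results are rounded down:

```sh
# aggregate size in MiB = segmentCount * blocksize * number_of_processes
ior_total_mib() {
    segments=$1
    block_mib=$2
    procs=$3
    echo $(( segments * block_mib * procs ))
}

# block size in MiB per process so the aggregate is about 1.5x node memory
# (segment count defaults to 1 when the third argument is omitted)
ior_block_mib() {
    mem_mib=$1
    procs=$2
    segments=${3:-1}
    echo $(( mem_mib * 3 / 2 / (procs * segments) ))
}
```

For the `mpirun` example above (20 tasks, 100 MiB blocks, one segment), `ior_total_mib 1 100 20` gives the aggregate amount of data written. For a node with 64 GiB of memory and 20 tasks, `ior_block_mib 65536 20` suggests the block size to pass with `-b`.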

Configuration Files

>>> cat ior.conf    
IOR START
  api=MPIIO
  testFile=ior.dat
  repetitions=1
  readFile=1
  writeFile=1
  filePerProc=0
  keepFile=0
  blockSize=1024M
  transferSize=2M
  verbose=0
  numTasks=0
  collective=1
IOR STOP
>>> ior -f ior.conf
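
To compare several transfer sizes, the configuration above can be generated instead of edited by hand. A sketch that only reuses the keys shown above; `ior_conf` is a hypothetical helper, not part of IOR:

```sh
# print an IOR configuration for the given block and transfer size
# (ior_conf is a hypothetical helper; keys are copied from the file above)
ior_conf() {
    block=$1
    transfer=$2
    cat <<EOF
IOR START
  api=MPIIO
  testFile=ior.dat
  repetitions=1
  readFile=1
  writeFile=1
  filePerProc=0
  keepFile=0
  blockSize=$block
  transferSize=$transfer
  verbose=0
  numTasks=0
  collective=1
IOR STOP
EOF
}
```

For example, `ior_conf 1024M 4M > ior-4M.conf` writes a variant with a 4 MiB transfer size, which can then be run with `ior -f ior-4M.conf`.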

HEPScore

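HEPScore23 (https://w3.hepix.org/benchmarking/how_to_run_HS23.html) …replaces HEPSPEC06 (https://w3.hepix.org/benchmarking/HS06.html)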

…WLCG community in favour of an open-source benchmark…
- …over a SPEC-CPU 2006 based benchmark requiring a licence
- …supports benchmarking on other processors (ARM and GPUs)
…provided to the HEPiX Benchmark Working Group
- …in the HEP Benchmark Suite repository
- …results collected in a central scores table

References…

Power Efficiency in HEP (a case between ARM and x86), ACAT 2022
HEPiX Benchmarking Working Group Report, HEPiX Fall 2023

References

Regression tests and benchmarks for HPC systems…

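PCVS (Parallel Computing Validation System)
- https://pcvs.hpcframework.com/
- https://pcvs.readthedocs.io/
- https://github.com/cea-hpc/pcvs
- https://github.com/cea-hpc/pcvs-benchmarks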
ReFrame
- https://reframe-hpc.readthedocs.io
- https://github.com/reframe-hpc/reframe
JuBE
- http://www.fz-juelich.de/jsc/jube
- https://github.com/edf-hpc/jube
Pavilion2
- https://pavilion2.readthedocs.io
- https://github.com/hpc/pavilion2
