HPC GPU Accelerators

Hardware
Linux
HPC
Published

January 16, 2023

Modified

February 13, 2023

GPUs (graphics processing units) in HPC…

Motivation

GPU performance improves faster than CPU performance…

  • …driven by demand from the video game market
  • …specialised architectures simplify the scaling of transistors
  • Modern GPUs…
    • …highly programmable
    • …mature high-level language support
    • …support for 32/64 bit floating point arithmetic

GPUs vs CPUs

CPU advantages…

  • …large main memory (RAM)
    • …latency optimized by large caches
    • …designed with random access in mind
  • …a small number of threads can run very quickly
  • …features for fast synchronisation on multi-core
  • Disadvantages…
    • …relatively low memory bandwidth
    • …limited number of cores (in comparison)
    • …low performance per watt (compared to GPUs)

GPU advantages…

  • …high throughput on structured data
  • …high bandwidth main memory (HBM)
  • …SIMT execution …scalar per-thread instructions run inherently parallel
  • …significantly more compute resources
  • Disadvantages…
    • …data movement is explicit
    • …smaller memory capacity (compared to main RAM)
    • …low per-thread performance

Applications

Typical computations not suitable for GPUs…

  • …highly serial algorithms …no inherent parallelism
  • …strongly memory bound computations
    • …large data set …small number of operations per data element
    • …consider memory access costs
    • …unless CPU/GPU have shared memory
  • …highly unstructured data
    • …complex flow of the computation
    • …high frequency of data access barriers
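A rough way to judge whether a computation is memory bound is its arithmetic intensity (FLOPs per byte moved between memory and compute units). The sketch below uses illustrative numbers for a float32 vector addition; the function name and threshold interpretation are assumptions for this example.

```python
# Arithmetic intensity: floating point operations per byte moved.
# Low values indicate a memory-bound workload that gains little from
# a GPU unless the data already resides in device memory.
def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

# Example: vector addition c = a + b over 1M float32 elements...
# one addition per element, three 4-byte transfers per element
# (read a, read b, write c).
n = 1_000_000
ai = arithmetic_intensity(n, 3 * n * 4)
print(round(ai, 3))  # 0.083 FLOPs/byte ...strongly memory bound
```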

Terminology

GPGPUs (General Purpose Graphics Processing Units)…

  • …many-core (very many simple cores)
  • …massive parallelism …lots of concurrent threads
  • …a different programming model emerged to use GPUs for data processing
    • …allows software to use GPUs for general purpose processing
  • …contributes to higher energy efficiency

Compute kernel (aka GPU kernel)…

  • …not to be confused with an OS kernel
  • …code compiled for high throughput accelerators
  • …runs on the same execution units as vertex and pixel shaders on GPUs
  • Keywords…
    • thread …single computational task on the GPU kernel
    • thread block …group of threads in the same location on the GPU
    • grid …collection of thread blocks
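The grid/block/thread hierarchy determines which data element each thread works on: a kernel typically derives a global index from its block and thread coordinates. A minimal host-side Python sketch of the CUDA/HIP-style index arithmetic (illustration only, not actual GPU code):

```python
# Global thread index as computed inside a 1D GPU kernel:
# index = blockIdx.x * blockDim.x + threadIdx.x
def global_index(block_idx, block_dim, thread_idx):
    return block_idx * block_dim + thread_idx

# A grid of 4 thread blocks with 256 threads each covers elements 0..1023
indices = [global_index(b, 256, t) for b in range(4) for t in range(256)]
print(indices[0], indices[-1], len(indices))  # 0 1023 1024
```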

HSA (Heterogeneous System Architecture)…

  • …cross-vendor (mostly AMD) …specifications by the HSA Foundation
  • …integrate CPUs & GPUs on single bus (shared memory)
  • …reduce communication latency between CPUs & GPUs
  • …relieves the programmer from moving data between devices

Hardware

Basic GPU architecture

  • …100+ cores on a single GPU chip
  • …each core runs multiple threads of instructions
  • …possible to run 1000+ threads in parallel
  • Memory…
    • …per-thread (private) local memory
    • …per thread-block shared memory
    • …grids of thread-blocks share a global memory (per application context)

Identify a GPU device on a host…

>>> lspci | grep -i display
63:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 #...
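For automated inventory the `lspci` line can be split into bus address and device description; a minimal sketch using the sample line above (the regex covers only the controller classes named in it, an assumption):

```python
import re

# Sample line from `lspci | grep -i display` (see above)
line = ("63:00.0 Display controller: "
        "Advanced Micro Devices, Inc. [AMD/ATI] Vega 20")

def parse_display_device(line):
    """Extract PCI bus address and device description from an lspci line."""
    m = re.match(r"(\S+) (?:Display controller|VGA compatible controller|"
                 r"3D controller): (.+)", line)
    return {"address": m.group(1), "device": m.group(2)} if m else None

dev = parse_display_device(line)
print(dev["address"])  # 63:00.0
```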

Form Factor

  • Dual-slot PCIe card
    • …high-end cards typically full-height, full-length (FHFL)
    • …with axial active-cooling …pushes air to the backside of the chassis
  • OAM (OCP Accelerator Module)…
    • …defines form factor & specifications for a compute accelerator module
    • …contrast with a PCIe add-in card form factor
    • …simplifying interconnecting high-speed communication links among modules
  • SXM …Nvidia proprietary socket module form factor
    • …SXM-modules sit directly on top of a motherboard …use a dedicated connector
    • …higher memory capacity and bandwidth …less latency

Server chassis typically accommodate 2x, 4x, or 8x GPUs

  • …most common are 4x GPU systems
    • …due to thermal and power limits of the infrastructure
    • …limited by the number of required PCIe lanes (typically 16x per GPU)
  • …multi-GPU nodes do not mix different GPU models

AMD

CDNA …GPU architecture for data center

Microarchitecture  Release  Accelerator  VRAM   Form Factor  FP64 FLOPS  TDP
CDNA3              2024     MI300        128GB
CDNA2              2021     MI250        128GB  OAM          45.3T       560W
CDNA2              2022     MI210        64GB   OAM/PCIe     22.6T       300W
CDNA               2020     MI100        32GB   PCIe         11.5T       300W
GCN5               2018     MI60         32GB   PCIe         7.4T        300W
GCN5               2017     MI50         16GB   PCIe         6.6T        300W

Infinity Fabric… coherent memory space shared by CPU & GPU

  • …unified memory …eliminate redundant memory copies
  • …no additional main memory required
  • …dynamic memory allocation between CPU & GPU

Drivers

AMDGPU stack repository

  • …supports RHEL, SLES, Ubuntu
  • amdgpu-* packages include the kernel-mode driver.
  • …tools are available from the ROCm stack repository
RHEL Releases    ROCm     Driver
7.9 8.6 8.7 9.0  2022/11  22.20.5
7.9 8.6 9.0      2022/06  22.10.4
7.9 8.4 8.5      2022/02  21.50.2

# ...install dependencies
dnf install -y \
      kernel-headers-`uname -r` kernel-devel-`uname -r` dkms \
      autoconf automake m4 perl-Thread-Queue

# install kernel modules and hardware tools
dnf install -y \
      amdgpu-dkms-firmware amdgpu-dkms-headers \
      rocm-core rocm-smi-lib rocminfo hsa-rocr \
      comgr rocm-opencl rocm-ocl-icd

Use lsmod to look for the loaded kernel modules…

lsmod | grep amdgpu
amdgpu               9789440  0
amddrm_ttm_helper      16384  1 amdgpu
amdttm                 81920  2 amdgpu,amddrm_ttm_helper
iommu_v2               20480  1 amdgpu
amd_sched              40960  1 amdgpu
amdkcl                 28672  3 amd_sched,amdttm,amdgpu
i2c_algo_bit           16384  3 igb,ast,amdgpu
drm_kms_helper        270336  5 drm_vram_helper,ast,amdgpu
drm                   589824  10 drm_kms_helper,amd_sched,amdttm,drm_vram_helper #...

rocm-smi

SMI (system management interface)…

  • …documented at ROCm deployment tools
  • …clock and temperature management
  • …exposed by the rocm-smi command
# ...no flags/options
>>> /opt/rocm/bin/rocm-smi 
#...
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    39.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
1    40.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
2    41.0c           36.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
3    40.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
4    38.0c           35.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
5    38.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
6    37.0c           39.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
7    39.0c           39.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
#...
  • Utilization per package
    • -u …current GPU use (in percent)
    • --showmemuse …memory used (in percent)
    • --showvoltage …show voltage
  • -P …current average graphics package power consumption in Watts
  • -f …fan speed
  • --showtemp …insight into the system health
    • edge temperature …most recently measured temperature
    • junction hot-spot temperature …highest temperature value of all sensors
    • memory temperature …hottest HBM stack
>>> rocm-smi --showtemp
#....
GPU[0]          : Temperature (Sensor edge) (C): 39.0
GPU[0]          : Temperature (Sensor junction) (C): 41.0
GPU[0]          : Temperature (Sensor memory) (C): 38.0
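For monitoring scripts, the `--showtemp` output can be parsed into a structure; an illustrative parser for the line format shown above (format assumed stable across GPUs, based only on this sample):

```python
import re

# Sample lines from `rocm-smi --showtemp` (see above)
sample = """GPU[0]          : Temperature (Sensor edge) (C): 39.0
GPU[0]          : Temperature (Sensor junction) (C): 41.0
GPU[0]          : Temperature (Sensor memory) (C): 38.0"""

def parse_temps(text):
    """Map GPU id -> {sensor: temperature in C} from rocm-smi --showtemp."""
    temps = {}
    for line in text.splitlines():
        m = re.match(r"GPU\[(\d+)\]\s*: Temperature \(Sensor (\w+)\) "
                     r"\(C\): ([\d.]+)", line)
        if m:
            temps.setdefault(int(m.group(1)), {})[m.group(2)] = float(m.group(3))
    return temps

print(parse_temps(sample))
# {0: {'edge': 39.0, 'junction': 41.0, 'memory': 38.0}}
```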

rvs

RVS (ROCm Validation Suite)…

  • …documentation rocmdocs.amd.com
    • source code on GitHub
    • …install the rocm-validation-suite package from the ROCm repository
  • …detect/troubleshoot common problems affecting AMD GPUs
  • …tests, benchmarks, and qualification tools
    • …each test is implemented in a module
    • …each module has a dedicated set of options and a configuration file

Option …-g lists GPU devices

# list GPUs
>>> rvs -g
#...
Supported GPUs available:
0000:63:00.0 - GPU[ 2 - 44650]  (Device 29580)
0000:43:00.0 - GPU[ 3 - 59802]  (Device 29580)
0000:03:00.0 - GPU[ 4 - 23480]  (Device 29580)
0000:27:00.0 - GPU[ 5 - 39789]  (Device 29580)
0000:E3:00.0 - GPU[ 6 - 51758]  (Device 29580)
0000:C3:00.0 - GPU[ 7 -  1375]  (Device 29580)
0000:83:00.0 - GPU[ 8 - 30589]  (Device 29580)
0000:A3:00.0 - GPU[ 9 - 15436]  (Device 29580)
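The device listing is useful when mapping RVS GPU indexes to PCIe addresses; a sketch matching the line format shown above (field layout assumed from this sample):

```python
import re

# One line from the `rvs -g` listing above
sample = "0000:63:00.0 - GPU[ 2 - 44650]  (Device 29580)"

def parse_rvs_gpu(line):
    """Extract PCIe address, RVS index, GPU id, and device id."""
    m = re.match(r"([0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.\d) - "
                 r"GPU\[\s*(\d+) -\s*(\d+)\]\s*\(Device (\d+)\)", line)
    if not m:
        return None
    return {"pci": m.group(1), "index": int(m.group(2)),
            "gpu_id": int(m.group(3)), "device": int(m.group(4))}

print(parse_rvs_gpu(sample)["pci"])  # 0000:63:00.0
```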

Option …-t lists test modules

  • gpup …queries the configuration of a target device
  • gm …GPU monitoring tool
  • gst …GPU stress test
  • pesm …PCIe state monitor
  • pbqt …list of all GPUs that support peer-2-peer
  • peqt …qualify the PCIe bus on which the GPU is connected
  • pebb …PCIe bandwidth benchmark

rdc

ROCm™ Data Center Tool™ (RDC)…

  • …documented at ROCm deployment tools
  • rdc package in the ROCm stack repository
  • …telemetry and diagnostics
    • …provide Python bindings
    • …includes Prometheus and Grafana plugins
  • Two operation modes…
    • …standalone …rdcd (daemon) runs on each compute node
    • …embedded …interface for user monitoring agents

Intel

Intel Xeon Phi (Knights Landing) …product line discontinued in 2018

Intel Xe data-center GPUs…

Microarchitecture  Release  Accelerator  VRAM   Form Factor  TDP
Rialto Bridge      ?        ?            ?      ?            ?
Ponte Vecchio      2023     Max 1100     48GB   PCIe/OAM     300W
Ponte Vecchio      2023     Max 1350     96GB   OAM          450W
Ponte Vecchio      2023     Max 1550     128GB  OAM          600W

Nvidia

…data-center GPUs (formerly Tesla)

Microarchitecture  Release  Accelerator  VRAM         Form Factor  GPU/Tensor Cores  TDP
Hopper             2023     H100         80GB (HBM2)  PCIe         14592/456         350W
Ampere             2020     A100         80GB (HBM2)  PCIe         6912/512          400W
Ampere             2020     A100         40GB (HBM2)  PCIe         6912/512          400W
Volta              2017     V100         32GB (HBM2)  PCIe         5120/640          350W
Volta              2017     V100         16GB (HBM2)  PCIe         5120/640          300W
Pascal             2016     P100         16GB         PCIe                           300W
Kepler             2014     K40          12GB         PCIe                           235W
Kepler             2012     K20          5GB          PCIe                           235W

Notable features…

  • RDMA via GPUDirect
    • …allows other devices (e.g. InfiniBand adapters) direct access to GPU memory
    • …improves MPI latency for send/receive to GPU memory
  • NVLink …high speed interconnect
    • …connects GPUs with higher bandwidth than PCIe
    • …supports shared memory across GPUs
    • …integrates with NVLink-enabled CPUs

Platforms

CUDA (Nvidia)

CUDA

  • …used from C/C++, Fortran, Python, Matlab, Julia, and others
  • …large ecosystem of GPU computing libraries that are built on CUDA
  • …depends on proprietary drivers …restricted to NVIDIA hardware

Access to GPU computing from Python…

ROCm (AMD)

ROCm (Radeon Open Compute) …software development platform for HPC GPU computing

Add the ROCm repository …adjust baseurl accordingly…

cat > /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/rocm/rhel8/5.4.3/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

  • ROCm stack repository provides binary packages
  • …meta packages…
    • rocm-hip-runtime …for applications implementing AMD HIP
    • rocm-hip-sdk …HIP development environment
    • rocm-opencl-runtime …run OpenCL based applications
    • rocm-opencl-sdk …OpenCL development environment
  • …installs to /opt/rocm

Containers

Apptainer definition for a container with ROCm SDK…

# vim: ft=bash
BootStrap: docker
From: quay.io/rockylinux/rockylinux:8

%labels
Author Victor Penso

%post
cat > /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/rocm/rhel8/5.4.3/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

dnf install -y wget gawk curl epel-release dnf-plugins-core
dnf config-manager --set-enabled powertools
dnf install -y rocm-hip-sdk rocm-opencl-sdk rocm-validation-suite
dnf clean all

echo 'export PATH=/opt/rocm/bin:$PATH' > /etc/profile.d/rocm.sh

%runscript
if ! [ $# -gt 0 ]
then
      /bin/bash --rcfile /etc/profile -l
else
      /bin/bash --rcfile /etc/profile -l -c "$*"
fi
# build the container definition above
export APPTAINER_CONTAINER=$LUSTRE_HOME/containers/rocm-5.4.3.sif
apptainer build $APPTAINER_CONTAINER apptainer.def
# request the allocation of GPU
salloc --partition gpu --gres=gpu:1 
# start an interactive container with GPU support...
srun --pty -- apptainer run --rocm $APPTAINER_CONTAINER

Usage

  • rocminfo …enumerate GPU agents available on a working ROCm stack

Frameworks

OpenCL

OpenCL

SYCL

SYCL (pronounced ‘sickle’)

  • …modern heterogeneous compute standard from the Khronos Group
  • …supports simultaneous use of CPUs, GPUs, and FPGAs
  • …compiler optimizes code across different architectures
  • …similar to the programming models of CUDA & ROCm HIP
  • References…

OpenACC

OpenACC

  • …standard for compiler pragmas that support offloading to accelerator devices
  • …used from C, C++ and Fortran

HIP

HIP (Heterogeneous-Computing Interface for Portability)

  • …AMD GPU programming environment …designing high performance kernels on GPUs
  • …C++ runtime API …portable code to run on AMD and NVIDIA GPUs
  • …layer (or wrapper) that uses the underlying ROCm or CUDA platform
  • …HIP similar to CUDA …virtually no performance overhead on Nvidia hardware
  • References…
    • HIP repository, ROCm developer tools
    • HIPIFY to translate CUDA source code
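Conceptually the translation is source-to-source, largely renaming CUDA API calls to their HIP counterparts. A toy Python sketch of that idea (the mapping entries are a small illustrative subset, not the tool's actual table):

```python
# Toy CUDA-to-HIP renaming, illustrating what a hipify-style translation
# does at its simplest; the real tool also handles headers, kernel launch
# syntax, and many more API entry points.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def hipify(source):
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

print(hipify("cudaMalloc(&ptr, n); cudaFree(ptr);"))
# hipMalloc(&ptr, n); hipFree(ptr);
```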

Reference