HPC GPU Accelerators
GPUs (graphics processing units) in HPC…
- …offer a parallel architecture to speed up certain computing processes…
- …large number of independent operations on highly structured data
- …especially those related to artificial intelligence (AI) and machine learning (ML) models
- …enable processing of applications…
- …with higher efficiency
- …with less power consumption
- …therefore at a lower cost
Motivation
GPU performance improves faster than CPU performance…
- …driven by demand from the video game market
- …specialised architectures simplify scaling of transistors
- Modern GPUs…
- …highly programmable
- …mature high-level language support
- …support for 32/64 bit floating point arithmetic
GPUs vs CPUs
CPU advantages…
- …large main memory (RAM)
- …latency optimized by large caches
- …designed with random access in mind
- …small number of threads can run very quickly
- …features for fast synchronisation on multi-core
- Disadvantages…
- …relatively low memory bandwidth
- …limited number of cores (in comparison)
- …low performance per watt (compared to GPUs)
GPU advantages…
- …high throughput on structured data
- …high bandwidth main memory (HBM)
- …scalar instructions (inherently parallel)
- …significantly more compute resources
- Disadvantages…
- …data movement explicit
- …smaller memory capacity (compared to main RAM)
- …low per-thread performance
Applications
Typical computations not suitable for GPUs…
- …highly serial algorithms …no inherent parallelism
- …strongly memory bound computations
- …large data set …small number of operations per data set
- …consider memory access costs
- …unless CPU/GPU have shared memory
- …highly unstructured data
- …complex flow of the computation
- …high frequency of data access barriers
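To see why memory access costs matter, a back-of-envelope sketch; the ~16 GB/s figure is an assumed round number for a PCIe Gen3 x16 link, not a measured value:

```shell
# assumed: ~16 GB/s usable bandwidth on a PCIe Gen3 x16 link
bytes=$(( 1024 * 1024 * 1024 ))           # 1 GiB of input data
bandwidth=$(( 16 * 1024 * 1024 * 1024 ))  # bytes per second
# transfer time in milliseconds (integer arithmetic)
transfer_ms=$(( bytes * 1000 / bandwidth ))
echo "${transfer_ms} ms to move 1 GiB to the GPU"
```

If the kernel performs only a few operations per byte, this transfer time dominates the total runtime …unless CPU and GPU share memory.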
Terminology
GPGPUs (General Purpose Graphical Processing Units)
- …many-core (very many simple cores)
- …massive parallelism …lots of concurrent threads
- …a different programming model emerged to use GPUs for data processing
- …allows software to use GPUs for general purpose processing
- …contribute to more energy efficiency
Compute kernel (aka GPU kernel)…
- …not to be confused with an OS kernel
- …code compiled for high-throughput accelerators
- …runs on the same GPU execution units as vertex and pixel shaders
- Keywords…
- …thread …a single execution instance of the GPU kernel
- …thread block …group of threads executing on the same compute unit of the GPU
- …grid …collection of thread blocks
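The flat global index of a thread follows from these keywords; a minimal sketch of the arithmetic with illustrative values (CUDA/HIP call these variables `blockIdx`, `blockDim`, and `threadIdx`):

```shell
# illustrative values, not queried from a real device
block_dim=256    # threads per block (blockDim)
block_idx=2      # index of the block within the grid (blockIdx)
thread_idx=17    # index of the thread within its block (threadIdx)
# flat global index, as computed inside a kernel
global_idx=$(( block_idx * block_dim + thread_idx ))
echo $global_idx
```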
HSA (Heterogeneous System Architecture)…
- …cross-vendor (mostly AMD) …specifications by the HSA Foundation
- …integrate CPUs & GPUs on single bus (shared memory)
- …reduce communication latency between CPUs & GPUs
- …relieves the programmer from moving data between devices
Hardware
Basic GPU architecture
- …100+ cores on a single GPU chip
- …each core executes multiple threads
- …possible to run 1000+ threads in parallel
- Memory…
- …per-thread (private) local memory
- …per thread-block shared memory
- …grids of thread-blocks share a global memory (per application context)
Identify a GPU device on a host…
>>> lspci | grep -i display
63:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 #...
Form Factor
- Dual-slot PCIe card
- …high-end cards typically full-height, full-length (FHFL)
- …with axial active-cooling …pushes air to the backside of the chassis
- OAM (OCP Accelerator Module)…
- …defines form factor & specifications for a compute accelerator module
- …contrast with a PCIe add-in card form factor
- …simplifies high-speed interconnects between modules
- SXM …Nvidia's proprietary socketed module form factor
- …SXM modules sit directly on top of a motherboard …use a dedicated connector
- …higher memory capacity and bandwidth …lower latency
Server chassis typically accommodate 2x, 4x, or 8x GPUs
- …most common are 4x GPU systems
- …due to thermal and power limits of the infrastructure
- …and limits on available PCIe lanes (typically x16 per GPU)
- …multi-GPU nodes do not mix different GPU models
AMD
CDNA …GPU architecture for data center
Microarchitecture | Release | Accelerator | VRAM | Form Factor | FP64 FLOPS | TDP |
---|---|---|---|---|---|---|
CDNA3 | 2024 | MI300 | 128GB | | | |
CDNA2 | 2021 | MI250 | 128GB | OAM | 45.3T | 560W |
CDNA2 | 2022 | MI210 | 64GB | OAM/PCIe | 22.6T | 300W |
CDNA | 2020 | MI100 | 32GB | PCIe | 11.5T | 300W |
GCN5 | 2018 | MI60 | 32GB | PCIe | 7.4T | 300W |
GCN5 | 2017 | MI50 | 16GB | PCIe | 6.6T | 300W |
Infinity Fabric… coherent memory space shared by CPU & GPU
- …unified memory …eliminate redundant memory copies
- …no additional main memory required
- …dynamic memory allocation between CPU & GPU
Drivers
- …supports RHEL, SLES, Ubuntu
- …`amdgpu-*` packages include the kernel-mode driver
- …tools are available from the ROCm stack repository
RHEL | Release | ROCm Drivers |
---|---|---|
7.9 8.6 8.7 9.0 | 2022/11 | 22.20.5 |
7.9 8.6 9.0 | 2022/06 | 22.10.4 |
7.9 8.4 8.5 | 2022/02 | 21.50.2 |
# ...install dependencies
dnf install -y \
  kernel-headers-`uname -r` kernel-devel-`uname -r` dkms \
  autoconf automake m4 perl-Thread-Queue
# install kernel modules and hardware tools
dnf install -y \
  dkms amdgpu-dkms-firmware amdgpu-dkms-headers \
  rocm-core rocm-smi-lib rocminfo hsa-rocr comgr rocm-opencl rocm-ocl-icd
Use `lsmod` to look for the loaded kernel modules…
lsmod | grep amdgpu
amdgpu 9789440 0
amddrm_ttm_helper 16384 1 amdgpu
amdttm 81920 2 amdgpu,amddrm_ttm_helper
iommu_v2 20480 1 amdgpu
amd_sched 40960 1 amdgpu
amdkcl 28672 3 amd_sched,amdttm,amdgpu
i2c_algo_bit 16384 3 igb,ast,amdgpu
drm_kms_helper 270336 5 drm_vram_helper,ast,amdgpu
drm 589824 10 drm_kms_helper,amd_sched,amdttm,drm_vram_helper #...
rocm-smi
SMI (system management interface)…
- …documented at ROCm deployment tools
- …clock and temperature management
- …exposed by the `rocm-smi` command
# ...no flags/options
>>> /opt/rocm/bin/rocm-smi
#...
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 39.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
1 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
2 41.0c 36.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
3 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
4 38.0c 35.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 37.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 39.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
#...
- Utilization per package…
- …`-u` …current GPU use (in percent)
- …`--showmemuse` …memory used (in percent)
- …`--showvoltage` …show voltage
- …`-P` …current average graphics package power consumption in watts
- …`-f` …fan speed
- …`--showtemp` …insight into the system health
- …`edge` temperature …most recently measured temperature
- …`junction` hot-spot temperature …highest temperature value of all sensors
- …`memory` temperature …hottest HBM stack
>>> rocm-smi --showtemp
#....
GPU[0] : Temperature (Sensor edge) (C): 39.0
GPU[0] : Temperature (Sensor junction) (C): 41.0
GPU[0] : Temperature (Sensor memory) (C): 38.0
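Such output can be post-processed with standard tools; a sketch using the sample lines from above (on a live system, pipe `/opt/rocm/bin/rocm-smi --showtemp` instead of the here-document):

```shell
# sample output copied from above
sample() {
cat <<'EOF'
GPU[0] : Temperature (Sensor edge) (C): 39.0
GPU[0] : Temperature (Sensor junction) (C): 41.0
GPU[0] : Temperature (Sensor memory) (C): 38.0
EOF
}
# extract the edge temperature value (third ": "-separated field)
edge=$(sample | awk -F': ' '/Sensor edge/ {print $3}')
echo "$edge"
```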
rvs
RVS (ROCm Validation Suite)…
- …documentation rocmdocs.amd.com
- …source code on GitHub
- …install the `rocm-validation-suite` package from the ROCm repository
- …detect/troubleshoot common problems affecting AMD GPUs
- …tests, benchmarks, and qualification tools
- …each test is implemented in a module
- …each module has a dedicated set of options and a configuration file
Option …`-g` …list GPU devices
# list GPUs
>>> rvs -g
#...
Supported GPUs available:
0000:63:00.0 - GPU[ 2 - 44650] (Device 29580)
0000:43:00.0 - GPU[ 3 - 59802] (Device 29580)
0000:03:00.0 - GPU[ 4 - 23480] (Device 29580)
0000:27:00.0 - GPU[ 5 - 39789] (Device 29580)
0000:E3:00.0 - GPU[ 6 - 51758] (Device 29580)
0000:C3:00.0 - GPU[ 7 - 1375] (Device 29580)
0000:83:00.0 - GPU[ 8 - 30589] (Device 29580)
0000:A3:00.0 - GPU[ 9 - 15436] (Device 29580)
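A sketch to pull just the PCI addresses out of such output (fed here from a copy of two sample lines above; on a live system, pipe `rvs -g` instead):

```shell
# two sample lines copied from the `rvs -g` output above
sample() {
cat <<'EOF'
0000:63:00.0 - GPU[ 2 - 44650] (Device 29580)
0000:43:00.0 - GPU[ 3 - 59802] (Device 29580)
EOF
}
# the first whitespace-separated field is the PCI address
sample | awk '{print $1}'
```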
Option …`-t` …list test modules
- …`gpup` …queries the configuration of a target device
- …`gm` …GPU monitoring tool
- …`gst` …GPU stress test
- …`pesm` …PCIe state monitor
- …`pbqt` …list of all GPUs that support peer-to-peer
- …`peqt` …qualify the PCIe bus on which the GPU is connected
- …`pebb` …PCIe bandwidth benchmark
rdc
ROCm Data Center Tool (RDC)…
- …documented at ROCm deployment tools
- …`rdc` package in the ROCm stack repository
- …telemetry and diagnostics
- …provides Python bindings
- …includes Prometheus and Grafana plugins
- Two operation modes…
- …standalone …`rdcd` (daemon) runs on each compute node
- …embedded …interface for user monitoring agents
Intel
Intel Xeon Phi (Knights Landing) discontinued in 2018
Intel Xe data-center GPUs…
Microarchitecture | Release | Accelerator | VRAM | Form Factor | TDP |
---|---|---|---|---|---|
Rialto Bridge | ? | ? | ? | ? | ? |
Ponte Vecchio | 2023 | Max 1100 | 48GB | PCIe/OAM | 300W |
Ponte Vecchio | 2023 | Max 1350 | 96GB | OAM | 450W |
Ponte Vecchio | 2023 | Max 1550 | 128GB | OAM | 600W |
Nvidia
…data-center GPUs (formerly Tesla)
Microarchitecture | Release | Accelerator | VRAM | Form Factor | GPU/Tensor Cores | TDP |
---|---|---|---|---|---|---|
Hopper | 2023 | H100 | 80GB (HBM2e) | PCIe | 14592/456 | 350W |
Ampere | 2020 | A100 | 80GB (HBM2e) | PCIe | 6912/432 | 300W |
Ampere | 2020 | A100 | 40GB (HBM2) | PCIe | 6912/432 | 250W |
Volta | 2017 | V100 | 32GB (HBM2) | PCIe | 5120/640 | 250W |
Volta | 2017 | V100 | 16GB (HBM2) | PCIe | 5120/640 | 250W |
Pascal | 2016 | P100 | 16GB | PCIe | | 250W |
Kepler | 2014 | K40 | 12GB | PCIe | | 235W |
Kepler | 2012 | K20 | 5GB | PCIe | | 235W |
Notable features…
- RDMA via GPUDirect
- …allows other devices (e.g. InfiniBand adapters) direct access to GPU memory
- …improves MPI latency for send/receive to GPU memory
- NVLink …high-speed interconnect
- …connects GPUs with higher bandwidth than PCIe
- …supports shared memory across GPUs
- …integrates with NVLink-enabled CPUs
Platforms
CUDA (Nvidia)
CUDA…
- …used from C/C++, Fortran, Python, Matlab, Julia, and others
- …large ecosystem of GPU computing libraries that are built on CUDA
- …depends on proprietary drivers …restricted to Nvidia hardware
Access to GPU computing from Python…
ROCm (AMD)
ROCm (Radeon Open Compute) …software development platform for HPC GPU computing
- …documentation at docs.amd.com
- …deprecated rocmdocs.amd.com
- …examples at github.com/amd/rocm-examples
Add the ROCm repository …adjust `baseurl` accordingly…
cat > /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/rocm/rhel8/5.4.3/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
- ROCm stack repository provides binary packages
- …meta packages…
- …`rocm-hip-runtime` …for applications implementing AMD HIP
- …`rocm-hip-sdk` …HIP development environment
- …`rocm-opencl-runtime` …run OpenCL based applications
- …`rocm-opencl-sdk` …OpenCL development environment
- …installs to `/opt/rocm`
Containers
- …containers on DockerHub
- …example Dockerfiles
- …host needs the `rocm-core` package with kernel modules
- …Apptainer support for AMD GPUs & ROCm
Apptainer definition for a container with ROCm SDK…
# vim: ft=bash
BootStrap: docker
From: quay.io/rockylinux/rockylinux:8
%labels
Author Victor Penso
%post
cat > /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/rocm/rhel8/5.4.3/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
dnf install -y wget gawk curl epel-release dnf-plugins-core
dnf config-manager --set-enabled powertools
dnf install -y rocm-hip-sdk rocm-opencl-sdk rocm-validation-suite
dnf clean all
echo 'export PATH=/opt/rocm/bin:$PATH' > /etc/profile.d/rocm.sh
%runscript
if ! [ $# -gt 0 ]
then
/bin/bash --rcfile /etc/profile -l
else
/bin/bash --rcfile /etc/profile -l -c "$*"
fi
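A note on the runscript's argument handling: `bash -c` takes a single command string, so joining the arguments with `"$*"` behaves more predictably than `"$@"`, which would pass the extra words as positional parameters to the new shell. A minimal sketch of the joining behaviour:

```shell
# "$*" joins all arguments into one string before bash -c runs it
run() { /bin/bash -c "$*"; }
out=$(run echo hello world)
echo "$out"
```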
# build the container definition above
export APPTAINER_CONTAINER=$LUSTRE_HOME/containers/rocm-5.4.3.sif
apptainer build $APPTAINER_CONTAINER apptainer.def
# request the allocation of GPU
salloc --partition gpu --gres=gpu:1
# start an interactive container with GPU support...
srun --pty -- apptainer run --rocm $APPTAINER_CONTAINER
Usage
rocminfo
…enumerate GPU agents available on a working ROCm stack
Frameworks
OpenCL
- …open heterogeneous computing standard from the Khronos Group
- …supports GPUs, CPUs and FPGAs …widely used, but less common in HPC
- Support
- …Nvidia OpenCL SDK
- …AMD ROCm OpenCL Runtime
- …Spack `rocm-opencl` package
- …Python binding PyOpenCL
SYCL
SYCL (pronounced ‘sickle’)
- …modern heterogeneous compute standard from the Khronos Group
- …supports simultaneous use of CPUs, GPUs, and FPGAs
- …compiler optimizes code across different architectures
- …similar to the programming models of CUDA & ROCm HIP
- References…
- SYCL Parallel STL in C++ implementing the Khronos SYCL standard
- SYCL Specification
OpenACC
- …standard for compiler pragmas that support offloading to accelerator devices
- …used from C, C++ and Fortran
HIP
HIP (Heterogeneous-Computing Interface for Portability)
- …AMD GPU programming environment …designing high performance kernels on GPUs
- …C++ runtime API …portable code to run on AMD and NVIDIA GPUs
- …layer (or wrapper) that uses the underlying ROCm or CUDA platform
- …HIP similar to CUDA …virtually no performance overhead on Nvidia hardware
Reference
- P|R|K (Parallel Research Kernels)
- PyTorch-Benchmarks
- …compatible with CUDA (Nvidia) and ROCm (AMD)
- https://github.com/aime-team/pytorch-benchmarks