HPC GPU Accelerators

Hardware
Linux
HPC
Published

January 16, 2023

Modified

February 13, 2023

GPUs (graphics processing units) in HPC…

Motivation

GPU performance improves faster than CPU performance…

  • …driven by demand from the video game market
  • …specialised architectures simplify the scaling of transistors
  • Modern GPUs…
    • …highly programmable
    • …mature high-level language support
    • …support for 32/64 bit floating point arithmetic

GPUs vs CPUs

CPU advantages…

  • …large main memory (RAM)
    • …latency optimized by large caches
    • …designed with random access in mind
  • …a small number of threads can run very quickly
  • …features for fast synchronisation on multi-core
  • Disadvantages…
    • …relatively low memory bandwidth
    • …limited number of cores (in comparison)
    • …low performance per watt (compared to GPUs)

GPU advantages…

  • …high throughput on structured data
  • …high bandwidth main memory (HBM)
  • …SIMT execution …scalar per-thread instructions run inherently parallel
  • …significantly more compute resources
  • Disadvantages…
    • …data movement is explicit
    • …smaller memory capacity (compared to main RAM)
    • …low per-thread performance

Applications

Typical computations not suitable for GPUs…

  • …highly serial algorithms …no inherent parallelism
  • …strongly memory bound computations
    • …large data set …small number of operations per data element
    • …consider memory access costs
    • …unless CPU/GPU have shared memory
  • …highly unstructured data
    • …complex flow of the computation
    • …high frequency of data access barriers
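A rough way to judge whether a computation is memory bound is its arithmetic intensity (FLOPs per byte moved between memory and compute units). The sketch below uses illustrative numbers for a float32 vector addition; the function name and threshold interpretation are assumptions for this example.

```python
# Arithmetic intensity: floating point operations per byte moved.
# Low values indicate a memory-bound workload that gains little from
# a GPU unless the data already resides in device memory.
def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

# Example: vector addition c = a + b over 1M float32 elements...
# one addition per element, three 4-byte transfers per element
# (read a, read b, write c).
n = 1_000_000
ai = arithmetic_intensity(n, 3 * n * 4)
print(round(ai, 3))  # 0.083 FLOPs/byte ...strongly memory bound
```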

Terminology

GPGPUs (General Purpose Graphics Processing Units)…

  • …many-core (very many simple cores)
  • …massive parallelism …lots of concurrent threads
  • …a different programming model emerged to use GPUs for data processing
    • …allows software to use GPUs for general purpose processing
  • …contributes to higher energy efficiency

Compute kernel (aka GPU kernel)…

  • …not to be confused with an OS kernel
  • …code compiled for high throughput accelerators
  • …runs on the same execution units as vertex and pixel shaders on GPUs
  • Keywords…
    • thread …single computational task on the GPU kernel
    • thread block …group of threads in the same location on the GPU
    • grid …collection of thread blocks
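The grid/block/thread hierarchy determines which data element each thread works on: a kernel typically derives a global index from its block and thread coordinates. A minimal host-side Python sketch of the CUDA/HIP-style index arithmetic (illustration only, not actual GPU code):

```python
# Global thread index as computed inside a 1D GPU kernel:
# index = blockIdx.x * blockDim.x + threadIdx.x
def global_index(block_idx, block_dim, thread_idx):
    return block_idx * block_dim + thread_idx

# A grid of 4 thread blocks with 256 threads each covers elements 0..1023
indices = [global_index(b, 256, t) for b in range(4) for t in range(256)]
print(indices[0], indices[-1], len(indices))  # 0 1023 1024
```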

HSA (Heterogeneous System Architecture)…

  • …cross-vendor (mostly AMD) …specifications by the HSA Foundation
  • …integrate CPUs & GPUs on single bus (shared memory)
  • …reduce communication latency between CPUs & GPUs
  • …relieves the programmer from moving data between devices

Hardware

Basic GPU architecture

  • …100+ cores on a single GPU chip
  • …each core runs multiple threads of instructions
  • …possible to run 1000+ threads in parallel
  • Memory…
    • …per-thread (private) local memory
    • …per thread-block shared memory
    • …grids of thread-blocks share a global memory (per application context)

Identify a GPU device on a host…

>>> lspci | grep -i display
63:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 #...
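For automated inventory the `lspci` line can be split into bus address and device description; a minimal sketch using the sample line above (the regex covers only the controller classes named in it, an assumption):

```python
import re

# Sample line from `lspci | grep -i display` (see above)
line = ("63:00.0 Display controller: "
        "Advanced Micro Devices, Inc. [AMD/ATI] Vega 20")

def parse_display_device(line):
    """Extract PCI bus address and device description from an lspci line."""
    m = re.match(r"(\S+) (?:Display controller|VGA compatible controller|"
                 r"3D controller): (.+)", line)
    return {"address": m.group(1), "device": m.group(2)} if m else None

dev = parse_display_device(line)
print(dev["address"])  # 63:00.0
```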

Form Factor

  • Dual-slot PCIe card
    • …high-end cards typically full-height, full-length (FHFL)
    • …with axial active-cooling …pushes air to the backside of the chassis
  • OAM (OCP Accelerator Module)…
    • …defines form factor & specifications for a compute accelerator module
    • …contrast with a PCIe add-in card form factor
    • …simplifying interconnecting high-speed communication links among modules
  • SXM …Nvidia proprietary socket module form factor
    • …SXM-modules sit directly on top of a motherboard …use a dedicated connector
    • …higher memory capacity and bandwidth …less latency

Server chassis typically accommodate 2x, 4x, or 8x GPUs

  • …most common are 4x GPU systems
    • …due to thermal and power limits of the infrastructure
    • …limited by the number of required PCIe lanes (typically 16x per GPU)
  • …multi-GPU nodes do not mix different GPU models

AMD

CDNA …GPU architecture for data center

Microarchitecture  Release  Accelerator  VRAM   Form Factor  FP64 FLOPS  TDP
CDNA3              2024     MI300        128GB
CDNA2              2021     MI250        128GB  OAM          45.3T       560W
CDNA2              2022     MI210        64GB   OAM/PCIe     22.6T       300W
CDNA               2020     MI100        32GB   PCIe         11.5T       300W
GCN5               2018     MI60         32GB   PCIe         7.4T        300W
GCN5               2017     MI50         16GB   PCIe         6.6T        300W

Infinity Fabric… coherent memory space shared by CPU & GPU

  • …unified memory …eliminate redundant memory copies
  • …no additional main memory required
  • …dynamic memory allocation between CPU & GPU

Drivers

AMDGPU stack repository

  • …supports RHEL, SLES, Ubuntu
  • amdgpu-* packages include the kernel-mode driver.
  • …tools are available from the ROCm stack repository
RHEL Releases    ROCm     Driver
7.9 8.6 8.7 9.0  2022/11  22.20.5
7.9 8.6 9.0      2022/06  22.10.4
7.9 8.4 8.5      2022/02  21.50.2

# ...install dependencies
dnf install -y \
      kernel-headers-`uname -r` kernel-devel-`uname -r` dkms \
      autoconf automake m4 perl-Thread-Queue

# install kernel modules and hardware tools
dnf install -y \
      amdgpu-dkms-firmware amdgpu-dkms-headers \
      rocm-core rocm-smi-lib rocminfo hsa-rocr \
      comgr rocm-opencl rocm-ocl-icd

Use lsmod to look for the loaded kernel modules…

lsmod | grep amdgpu
amdgpu               9789440  0
amddrm_ttm_helper      16384  1 amdgpu
amdttm                 81920  2 amdgpu,amddrm_ttm_helper
iommu_v2               20480  1 amdgpu
amd_sched              40960  1 amdgpu
amdkcl                 28672  3 amd_sched,amdttm,amdgpu
i2c_algo_bit           16384  3 igb,ast,amdgpu
drm_kms_helper        270336  5 drm_vram_helper,ast,amdgpu
drm                   589824  10 drm_kms_helper,amd_sched,amdttm,drm_vram_helper #...

rocm-smi

SMI (system management interface)…

  • …documented at ROCm deployment tools
  • …clock and temperature management
  • …exposed by the rocm-smi command
# ...no flags/options
>>> /opt/rocm/bin/rocm-smi 
#...
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    39.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
1    40.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
2    41.0c           36.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
3    40.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
4    38.0c           35.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
5    38.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
6    37.0c           39.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
7    39.0c           39.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
#...
  • Utilization per package
    • -u …current GPU use (in percent)
    • --showmemuse …memory used (in percent)
    • --showvoltage …show voltage
  • -P …current average graphics package power consumption in Watts
  • -f …fan speed
  • --showtemp …insight into the system health
    • edge temperature …most recently measured temperature
    • junction hot-spot temperature …highest temperature value of all sensors
    • memory temperature …hottest HBM stack
>>> rocm-smi --showtemp
#....
GPU[0]          : Temperature (Sensor edge) (C): 39.0
GPU[0]          : Temperature (Sensor junction) (C): 41.0
GPU[0]          : Temperature (Sensor memory) (C): 38.0
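For monitoring scripts, the `--showtemp` output can be parsed into a structure; an illustrative parser for the line format shown above (format assumed stable across GPUs, based only on this sample):

```python
import re

# Sample lines from `rocm-smi --showtemp` (see above)
sample = """GPU[0]          : Temperature (Sensor edge) (C): 39.0
GPU[0]          : Temperature (Sensor junction) (C): 41.0
GPU[0]          : Temperature (Sensor memory) (C): 38.0"""

def parse_temps(text):
    """Map GPU id -> {sensor: temperature in C} from rocm-smi --showtemp."""
    temps = {}
    for line in text.splitlines():
        m = re.match(r"GPU\[(\d+)\]\s*: Temperature \(Sensor (\w+)\) "
                     r"\(C\): ([\d.]+)", line)
        if m:
            temps.setdefault(int(m.group(1)), {})[m.group(2)] = float(m.group(3))
    return temps

print(parse_temps(sample))
# {0: {'edge': 39.0, 'junction': 41.0, 'memory': 38.0}}
```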

rvs

RVS (ROCm Validation Suite)…

  • …documentation rocmdocs.amd.com
    • source code on GitHub
    • …install the rocm-validation-suite package from the ROCm repository
  • …detect/troubleshoot common problems affecting AMD GPUs
  • …tests, benchmarks, and qualification tools
    • …each test is implemented in a module
    • …each module has a dedicated set of options and a configuration file

Option …-g lists GPU devices

# list GPUs
>>> rvs -g
#...
Supported GPUs available:
0000:63:00.0 - GPU[ 2 - 44650]  (Device 29580)
0000:43:00.0 - GPU[ 3 - 59802]  (Device 29580)
0000:03:00.0 - GPU[ 4 - 23480]  (Device 29580)
0000:27:00.0 - GPU[ 5 - 39789]  (Device 29580)
0000:E3:00.0 - GPU[ 6 - 51758]  (Device 29580)
0000:C3:00.0 - GPU[ 7 -  1375]  (Device 29580)
0000:83:00.0 - GPU[ 8 - 30589]  (Device 29580)
0000:A3:00.0 - GPU[ 9 - 15436]  (Device 29580)
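The device listing is useful when mapping RVS GPU indexes to PCIe addresses; a sketch matching the line format shown above (field layout assumed from this sample):

```python
import re

# One line from the `rvs -g` listing above
sample = "0000:63:00.0 - GPU[ 2 - 44650]  (Device 29580)"

def parse_rvs_gpu(line):
    """Extract PCIe address, RVS index, GPU id, and device id."""
    m = re.match(r"([0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.\d) - "
                 r"GPU\[\s*(\d+) -\s*(\d+)\]\s*\(Device (\d+)\)", line)
    if not m:
        return None
    return {"pci": m.group(1), "index": int(m.group(2)),
            "gpu_id": int(m.group(3)), "device": int(m.group(4))}

print(parse_rvs_gpu(sample)["pci"])  # 0000:63:00.0
```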

Option …-t lists test modules

  • gpup …queries the configuration of a target device
  • gm …GPU monitoring tool
  • gst …GPU stress test
  • pesm …PCIe state monitor
  • pbqt …list of all GPUs that support peer-2-peer
  • peqt …qualify the PCIe bus on which the GPU is connected
  • pebb …PCIe bandwidth benchmark

rdc

ROCm™ Data Center Tool™ (RDC)…

  • …documented at ROCm deployment tools
  • rdc package in the ROCm stack repository
  • …telemetry and diagnostics
    • …provide Python bindings
    • …includes Prometheus and Grafana plugins
  • Two operation modes…
    • …standalone …rdcd (daemon) runs on each compute node
    • …embedded …interface for user monitoring agents

Intel

Intel Xeon Phi (Knights Landing) …product line discontinued in 2018

Intel Xe data-center GPUs…

Microarchitecture  Release  Accelerator  VRAM   Form Factor  TDP
Rialto Bridge      ?        ?            ?      ?            ?
Ponte Vecchio      2023     Max 1100     48GB   PCIe/OAM     300W
Ponte Vecchio      2023     Max 1350     96GB   OAM          450W
Ponte Vecchio      2023     Max 1550     128GB  OAM          600W

Nvidia

…data-center GPUs (formerly Tesla)

Microarchitecture  Release  Accelerator  VRAM         Form Factor  GPU/Tensor Cores  TDP
Hopper             2023     H100         80GB (HBM2)  PCIe         14592/456         350W
Ampere             2020     A100         80GB (HBM2)  PCIe         6912/512          400W
Ampere             2020     A100         40GB (HBM2)  PCIe         6912/512          400W
Volta              2017     V100         32GB (HBM2)  PCIe         5120/640          350W
Volta              2017     V100         16GB (HBM2)  PCIe         5120/640          300W
Pascal             2016     P100         16GB         PCIe                           300W
Kepler             2014     K40          12GB         PCIe                           235W
Kepler             2012     K20          5GB          PCIe                           235W

Notable features…

  • RDMA via GPUDirect
    • …allows other devices (e.g. InfiniBand adapters) direct access to GPU memory
    • …improves MPI latency for send/receive to GPU memory
  • NVLink …high speed interconnect
    • …connects GPUs with higher bandwidth than PCIe
    • …supports shared memory across GPUs
    • …integrates with NVLink-enabled CPUs

Platforms

CUDA (Nvidia)

CUDA

  • …used from C/C++, Fortran, Python, Matlab, Julia, and others
  • …large ecosystem of GPU computing libraries that are built on CUDA
  • …depends on proprietary drivers …restricted to NVIDIA hardware

Access to GPU computing from Python…

ROCm (AMD)

ROCm (Radeon Open Compute) …software development platform for HPC GPU computing

Add the ROCm repository …adjust baseurl accordingly…

cat > /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/rocm/rhel8/5.4.3/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

  • ROCm stack repository provides binary packages
  • …meta packages…
    • rocm-hip-runtime …for applications implementing AMD HIP
    • rocm-hip-sdk …HIP development environment
    • rocm-opencl-runtime …run OpenCL based applications
    • rocm-opencl-sdk …OpenCL development environment
  • …installs to /opt/rocm

Containers

Apptainer definition for a container with ROCm SDK…

# vim: ft=bash
BootStrap: docker
From: quay.io/rockylinux/rockylinux:8

%labels
Author Victor Penso

%post
cat > /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/rocm/rhel8/5.4.3/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

dnf install -y wget gawk curl epel-release dnf-plugins-core
dnf config-manager --set-enabled powertools
dnf install -y rocm-hip-sdk rocm-opencl-sdk rocm-validation-suite
dnf clean all

echo 'export PATH=/opt/rocm/bin:$PATH' > /etc/profile.d/rocm.sh

%runscript
if ! [ $# -gt 0 ]
then
      /bin/bash --rcfile /etc/profile -l
else
      /bin/bash --rcfile /etc/profile -l -c "$*"
fi
# build the container definition above
export APPTAINER_CONTAINER=$LUSTRE_HOME/containers/rocm-5.4.3.sif
apptainer build $APPTAINER_CONTAINER apptainer.def
# request the allocation of GPU
salloc --partition gpu --gres=gpu:1 
# start an interactive container with GPU support...
srun --pty -- apptainer run --rocm $APPTAINER_CONTAINER

Usage

  • rocminfo …enumerate GPU agents available on a working ROCm stack

Frameworks

OpenCL

OpenCL

SYCL

SYCL (pronounced ‘sickle’)

  • …modern heterogeneous compute standard from the Khronos Group
  • …supports simultaneous use of CPUs, GPUs, and FPGAs
  • …compiler optimizes code across different architectures
  • …similar to the programming models of CUDA & ROCm HIP
  • References…

OpenACC

OpenACC

  • …standard for compiler pragmas that support offloading to accelerator devices
  • …used from C, C++ and Fortran

HIP

HIP (Heterogeneous-Computing Interface for Portability)

  • …AMD GPU programming environment …designing high performance kernels on GPUs
  • …C++ runtime API …portable code to run on AMD and NVIDIA GPUs
  • …layer (or wrapper) that uses the underlying ROCm or CUDA platform
  • …HIP similar to CUDA …virtually no performance overhead on Nvidia hardware
  • References…
    • HIP repository, ROCm developer tools
    • HIPIFY to translate CUDA source code
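Conceptually the translation is source-to-source, largely renaming CUDA API calls to their HIP counterparts. A toy Python sketch of that idea (the mapping entries are a small illustrative subset, not the tool's actual table):

```python
# Toy CUDA-to-HIP renaming, illustrating what a hipify-style translation
# does at its simplest; the real tool also handles headers, kernel launch
# syntax, and many more API entry points.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def hipify(source):
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

print(hipify("cudaMalloc(&ptr, n); cudaFree(ptr);"))
# hipMalloc(&ptr, n); hipFree(ptr);
```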

Reference