Parallel Computing

HPC

Published

November 14, 2021

Modified

July 4, 2023

Number of scientific applications using parallel computing is growing …capabilities range from…

…solving physics calculations (from the smallest to the largest objects in the universe)
…making better drugs for diseases like cancer and COVID-19
…running comprehensive simulation applications at different scales
…incorporating new data science approaches like machine learning (ML) and artificial intelligence (AI)
…data-analysis and visualization of big data sets

What is HPC & HTC?

Compute clusters aka parallel computers called…

…HPC (High-Performance Computer) or Super-Computer

…many compute nodes connected by a fast network
…each node provides multiple CPUs (and GPUs)
…needs application specifically build with parallelism
- …to use multiple CPU cores…
- …across multiple compute nodes

HTC (High-Throughput Computing)…

…workload using HPC computing infrastructure
…run may single-core instances of an application
- …barely parallel …no communication between tasks required
- …goal is to achieve scaling by the number of CPUs

What is Parallel Programming?

Aka performance programming…

Utilize the ability of an algorithm to execute in parallel…
- …task parallelism …break large operations into smaller tasks
- …data parallelism …same operation on multiple data elements
…with the goal to improve performance on modern hardware

Amdahl’s Law

Defines (upper limit on) speedup of code as a function of…

…proportion of code that can be parallelized
…and the number of processors used
…speedup depends only on the parallel content

Divide program in…

…sequential part S (can not be improved through parallelism)
…parallel part P …with S + P = 1
…speedup 1/(S + (P/n))…n number of cores

Most parallel applications do not scale to thousands cores…

Strategies to achieve high scalability

…grow problem size with number of cores…Gustafson’s Law
…increase ratio of computation to communication
…overlap communication with computation
…dynamic load balancing to assign work to idle cores

Processes & Threads

…both independent sequences of execution…

Process…
- …instance of a program
- …runs on dedicated “owned” memory
- …holds state and file descriptors
- …more overhead then thread
- …more flexible…
  - …multiple parallel processes on single computer
  - …or across multiple compute nodes (distributed memory)
Threads…
- …lightweight entity …executes inside process
- …each process has at least on threat
- …shared memory for all threat in a process
- …less overhead then a process
- …less flexible …confined to a single process/node

Parallel Memory Access

…determines how/where to execute code…

Shared memory
- …threads …run on a single node
- …typically used implementing OpenMP (supported by most compilers)
Distributed memory…
- …multiple processes …program instances
- …executed on one or more nodes
- …typically used by implementing MPI (provided by a library)

Both models need to consider how processes/threads are mapped/bound to cores

Hybrid memory…
- …balance between threads and processes
  - …shared memory to keep footprint small
  - …overcome scaling limits…run across multiple node
- …typically involves OpenMP and MPI

Standards-based Parallelism

Three ways to accelerate applications with parallelism…

…programming library …easy to use
…compiler directives …easy to use & portable code
…programming language …most performance …most flexible

MPI …message passing parallelism…

…parametric …coarse-grained outer loop …essentially task parallel
…structured domains …decomposition with local operations
…lots of communication …load-balancing required
…works well for large systems of equations

OpenMP …shared memory parallelism…

…statically scheduled parallel loops
…parallel regions …dynamic load balance

Some applications can take advantage of message passing and threads

OpenACC …data offload to an accelerator (typically a GPU)…

…directives to specify code and data …compiler does the work…
- …initialize device & run-time environment
- …allocate memory on device …move data from the host
- …launch the code (computational kernels) on the device
- …gather results from the device memory
- …deallocate data on device

Algorithm Classes

Abstract application categories…

…13 dwarfs of HPC
…facilitate cross-platform reasoning
…cross-application component reuse

Dense Linear Algebra

…densely-populated matrices (or vectors)
- …typically uncompressed
- …often accessed via strides
…used for linear algebra…
- …matrix multiplication
- …LU decomposition
- …Gauss–Seidel method
- …Jacobi method
…widely supported with libraries
…(still) default measure for HPC performance

Sparse Linear Algebra

…sparsely-populated matrices (or vectors)
- …vast majority of data zeros
- …typically compressed
- …often indirect access by indices
…used for…
- …conjugated gradient
- …ranking algorithms
- …data mining

Spectral Methods

…data in frequency domain …not time & space
…used for…
- …FFT (fast Fourier transform)
- …audio & video signal processing
…usually implemented using butterfly patterns
…often latency limited …global communication …all-to-all
…relies on transposing data efficiently

N-body Methods

Model interactions between discreet moving points

…require dynamic data structures
- …affects load balancing
- …data access latency
…used for…
- …galaxy collision simulation
- …molecular dynamics
- …protein folding

Structured Grids

Model interaction between discreet fixed points…

…grid structure described by patterns
- …topology information easily derived
- …usually high spatial locality
…may be subdivided into finer grid
…used for …computational fluid dynamics
Characteristics…
- …structured nature key aspect
- …predictable memory addresses
- …local communication only
- …adaptive grid possible

Unstructured Grids

Model interactions between discreet fixed points…

…grid described explicitly by individual connections
…irregular geometry & topology
…multiple levels of indirection when accessing data
…used for …computational fluid dynamics
Characteristics…
- …heavy latency bound…indirect data access
- …

Monte Carlo Methods

Models statistical evaluation of repeated random trials

…process data independently…merge the result
…also know as map-reduce
…embarrassingly parallel …multiple copies of sequential method
…used for…
- …numerical integration
- …ray-tracing
- …quantum many-body problem

Combinational Logic

Involves performing simple operations on large amounts of integer data…

…parallelized on multiple levels …bit-level …block-level
…typically no global communication
…used for …computing CRC (cycling redundancy checks)

Graph Traversal

Traverse a number of objects in a graph…

…examine characteristics
- …search
- …sort
- …collision detection
- …decision tress
…usually heavy on data read/lookup
…parallelizable over different paths

Dynamic Programming

Method of computing solutions by solving simpler, overlapping problems…

…optimal result composed of optimal results of sub-problem
…memoization…solve sub-problem once…reuse stored result
…used for…
- …lookup tables
- …matrix-chain-multiplication

Backtrack & Branch+Bound

Search and optimization problems…very large problem spaces…

…incrementally build solution…discard if determined unsuitable
…uses divide & conquer strategy…
- …break down complex problems into smaller sub-problems
- …solve sub-problems in parallel

Graphical Models

Probabilistic models…

…graph consisting of random variables as nodes
…dependencies as edges
…used for…
- …Bayesian networks
- …Hidden Markov models

Finite State Machines

Represent an interconnected set of states to be moved among

…sometimes decomposed into multiple state machines
…used for parsers

AI Algorithms

Use of AI (Artificial Intelligence) and ML (Machine Learning)

…also called… DL (Deep Learning) …DNN (Deep Neural Networks)
Used for…
- …speech recognition and photo manipulation in smart phones
- …camera video processing for car driver support systems
- …video scaling in TVs and video gaming hardware
- …scientific computing in HPC environments
Dedicated accelerator hardware…
- …used in HPC, Cars, Smart TVs/Phones
- …NPU (Neural Processing Unit) or TPU (Tensor Processing Unit) aka Tensor cores

Abbreviation for AI models…

NN (Neural Network) …further divided into…
- ANN (Artificial NN)
- DNN (Deep NN)
- CNN (Convolutional NN)
- RNN (Recurrent NN)
- SNN (Spiking NN)
LSTM (Long Short-Term Memory)
GPT (Generative Pretrained Transformer)
BERT (Bidirectional Encoder Representation from Transformers)

--- title: Parallel Computing categories: - HPC date: 2021/11/14 date-modified: 2023/07/04 --- Number of scientific applications using parallel computing is growing ...capabilities range from... - ...solving physics calculations (from the smallest to the largest objects in the universe) - ...making better drugs for diseases like cancer and COVID-19 - ...running comprehensive simulation applications at different scales - ...incorporating new data science approaches like machine learning (ML) and artificial intelligence (AI) - ...data-analysis and visualization of big data sets # What is HPC & HTC? Compute **clusters** aka parallel computers called... ...**HPC** (High-Performance Computer) or Super-Computer * ...many compute nodes connected by a fast network * ...each node provides multiple CPUs (and GPUs) * ...needs application specifically build with parallelism * ...to use multiple CPU cores... * ...across multiple compute nodes **HTC** (High-Throughput Computing)... * ...workload using HPC computing infrastructure * ...run may single-core instances of an application - ...barely parallel ...no communication between tasks required - ...goal is to achieve scaling by the number of CPUs # What is Parallel Programming? Aka performance programming... - Utilize the ability of an algorithm to execute in parallel... - ...**task parallelism** ...break large operations into smaller tasks - ...**data parallelism** ...same operation on multiple data elements - ...with the goal to improve performance on modern hardware ## Amdahl's Law Defines (upper limit on) speedup of code as a function of... * ...proportion of code that can be parallelized * ...and the number of processors used * ...speedup depends only on the parallel content Divide program in... * ...sequential part `S` (can not be improved through parallelism) * ...parallel part `P` ...with `S + P = 1` * ...speedup `1/(S + (P/n))`...`n` number of cores Most parallel applications do not scale to thousands cores... Strategies to achieve high scalability * ...grow problem size with number of cores...**Gustafson's Law** * ...increase ratio of computation to communication * ...overlap communication with computation * ...dynamic load balancing to assign work to idle cores ## Processes & Threads ...both independent sequences of execution... * Process... * ...instance of a program * ...runs on dedicated "owned" memory * ...holds state and file descriptors * ...more overhead then thread * ...more flexible... * ...multiple parallel processes on single computer * ...or across multiple compute nodes (distributed memory) * Threads... * ...lightweight entity ...executes inside process * ...each process has at least on threat * ...shared memory for all threat in a process * ...less overhead then a process * ...less flexible ...confined to a single process/node ## Parallel Memory Access ...determines how/where to execute code... * **Shared** memory * ...threads ...run on a single node * ...typically used implementing OpenMP (supported by most compilers) * **Distributed** memory... * ...multiple processes ...program instances * ...executed on one or more nodes * ...typically used by implementing MPI (provided by a library) Both models need to consider how processes/threads are mapped/bound to cores * **Hybrid** memory... * ...balance between threads and processes * ...shared memory to keep footprint small * ...overcome scaling limits...run across multiple node * ...typically involves OpenMP and MPI # Standards-based Parallelism Three ways to accelerate applications with parallelism... - ...programming library ...easy to use - ...compiler directives ...easy to use & portable code - ...programming language ...most performance ...most flexible [MPI](https://www.mpi-forum.org/docs) ...**message passing parallelism**... - ...parametric ...coarse-grained outer loop ...essentially task parallel - ...structured domains ...decomposition with local operations - ...lots of communication ...load-balancing required - ...works well for large systems of equations [OpenMP](https://www.openmp.org/specifications) ...**shared memory parallelism**... - ...statically scheduled parallel loops - ...parallel regions ...dynamic load balance Some applications can take advantage of message passing and threads [OpenACC](https://www.openacc.org/specification) ...**data offload to an accelerator** (typically a GPU)... - ...directives to specify code and data ...compiler does the work... - ...initialize device & run-time environment - ...allocate memory on device ...move data from the host - ...launch the code (computational kernels) on the device - ...gather results from the device memory - ...deallocate data on device # Algorithm Classes Abstract application categories... * ...13 dwarfs of HPC * ...facilitate cross-platform reasoning * ...cross-application component reuse ## Dense Linear Algebra * ...densely-populated matrices (or vectors) * ...typically uncompressed * ...often accessed via strides * ...used for linear algebra... * ...matrix multiplication * ...LU decomposition * ...Gauss–Seidel method * ...Jacobi method * ...widely supported with libraries * ...(still) default measure for HPC performance ## Sparse Linear Algebra * ...sparsely-populated matrices (or vectors) * ...vast majority of data zeros * ...typically compressed * ...often indirect access by indices * ...used for... * ...conjugated gradient * ...ranking algorithms * ...data mining ## Spectral Methods * ...data in frequency domain ...not time & space * ...used for... * ...FFT (fast Fourier transform) * ...audio & video signal processing * ...usually implemented using butterfly patterns * ...often latency limited ...global communication ...all-to-all * ...relies on transposing data efficiently ## N-body Methods Model interactions between discreet moving points * ...require dynamic data structures * ...affects load balancing * ...data access latency * ...used for... * ...galaxy collision simulation * ...molecular dynamics * ...protein folding ## Structured Grids Model interaction between discreet fixed points... * ...grid structure described by patterns * ...topology information easily derived * ...usually high spatial locality * ...may be subdivided into finer grid * ...used for ...computational fluid dynamics * Characteristics... * ...structured nature key aspect * ...predictable memory addresses * ...local communication only * ...adaptive grid possible ## Unstructured Grids Model interactions between discreet fixed points... * ...grid described explicitly by individual connections * ...irregular geometry & topology * ...multiple levels of indirection when accessing data * ...used for ...computational fluid dynamics * Characteristics... * ...heavy latency bound...indirect data access * ... ## Monte Carlo Methods Models statistical evaluation of repeated random trials * ...process data independently...merge the result * ...also know as map-reduce * ...embarrassingly parallel ...multiple copies of sequential method * ...used for... * ...numerical integration * ...ray-tracing * ...quantum many-body problem ## Combinational Logic Involves performing simple operations on large amounts of integer data... * ...parallelized on multiple levels ...bit-level ...block-level * ...typically no global communication * ...used for ...computing CRC (cycling redundancy checks) ## Graph Traversal Traverse a number of objects in a graph... * ...examine characteristics * ...search * ...sort * ...collision detection * ...decision tress * ...usually heavy on data read/lookup * ...parallelizable over different paths ## Dynamic Programming Method of computing solutions by solving simpler, overlapping problems... * ...optimal result composed of optimal results of sub-problem * ...memoization...solve sub-problem once...reuse stored result * ...used for... * ...lookup tables * ...matrix-chain-multiplication ## Backtrack & Branch+Bound Search and optimization problems...very large problem spaces... * ...incrementally build solution...discard if determined unsuitable * ...uses divide & conquer strategy... * ...break down complex problems into smaller sub-problems * ...solve sub-problems in parallel ## Graphical Models Probabilistic models... * ...graph consisting of random variables as nodes * ...dependencies as edges * ...used for... * ...Bayesian networks * ...Hidden Markov models ## Finite State Machines Represent an interconnected set of states to be moved among * ...sometimes decomposed into multiple state machines * ...used for parsers # AI Algorithms Use of **AI** (Artificial Intelligence) and **ML** (Machine Learning) - ...also called... DL (Deep Learning) ...DNN (Deep Neural Networks) - Used for... - ...speech recognition and photo manipulation in smart phones - ...camera video processing for car driver support systems - ...video scaling in TVs and video gaming hardware - ...scientific computing in HPC environments - Dedicated accelerator hardware... - ...used in HPC, Cars, Smart TVs/Phones - ...**NPU** (Neural Processing Unit) or **TPU** (Tensor Processing Unit) aka Tensor cores Abbreviation for AI models... - NN (Neural Network) ...further divided into... - ANN (Artificial NN) - DNN (Deep NN) - CNN (Convolutional NN) - RNN (Recurrent NN) - SNN (Spiking NN) - LSTM (Long Short-Term Memory) - GPT (Generative Pretrained Transformer) - BERT (Bidirectional Encoder Representation from Transformers)