Parallel Computing
Number of scientific applications using parallel computing is growing …capabilities range from…
- …solving physics calculations (from the smallest to the largest objects in the universe)
- …making better drugs for diseases like cancer and COVID-19
- …running comprehensive simulation applications at different scales
- …incorporating new data science approaches like machine learning (ML) and artificial intelligence (AI)
- …data-analysis and visualization of big data sets
What is HPC & HTC?
Compute clusters aka parallel computers called…
…HPC (High-Performance Computer) or Super-Computer
- …many compute nodes connected by a fast network
- …each node provides multiple CPUs (and GPUs)
- …needs application specifically build with parallelism
- …to use multiple CPU cores…
- …across multiple compute nodes
HTC (High-Throughput Computing)…
- …workload using HPC computing infrastructure
- …run may single-core instances of an application
- …barely parallel …no communication between tasks required
- …goal is to achieve scaling by the number of CPUs
What is Parallel Programming?
Aka performance programming…
- Utilize the ability of an algorithm to execute in parallel…
- …task parallelism …break large operations into smaller tasks
- …data parallelism …same operation on multiple data elements
- …with the goal to improve performance on modern hardware
Amdahl’s Law
Defines (upper limit on) speedup of code as a function of…
- …proportion of code that can be parallelized
- …and the number of processors used
- …speedup depends only on the parallel content
Divide program in…
- …sequential part
S
(can not be improved through parallelism) - …parallel part
P
…withS + P = 1
- …speedup
1/(S + (P/n))
…n
number of cores
Most parallel applications do not scale to thousands cores…
Strategies to achieve high scalability
- …grow problem size with number of cores…Gustafson’s Law
- …increase ratio of computation to communication
- …overlap communication with computation
- …dynamic load balancing to assign work to idle cores
Processes & Threads
…both independent sequences of execution…
- Process…
- …instance of a program
- …runs on dedicated “owned” memory
- …holds state and file descriptors
- …more overhead then thread
- …more flexible…
- …multiple parallel processes on single computer
- …or across multiple compute nodes (distributed memory)
- Threads…
- …lightweight entity …executes inside process
- …each process has at least on threat
- …shared memory for all threat in a process
- …less overhead then a process
- …less flexible …confined to a single process/node
Parallel Memory Access
…determines how/where to execute code…
- Shared memory
- …threads …run on a single node
- …typically used implementing OpenMP (supported by most compilers)
- Distributed memory…
- …multiple processes …program instances
- …executed on one or more nodes
- …typically used by implementing MPI (provided by a library)
Both models need to consider how processes/threads are mapped/bound to cores
- Hybrid memory…
- …balance between threads and processes
- …shared memory to keep footprint small
- …overcome scaling limits…run across multiple node
- …typically involves OpenMP and MPI
- …balance between threads and processes
Standards-based Parallelism
Three ways to accelerate applications with parallelism…
- …programming library …easy to use
- …compiler directives …easy to use & portable code
- …programming language …most performance …most flexible
MPI …message passing parallelism…
- …parametric …coarse-grained outer loop …essentially task parallel
- …structured domains …decomposition with local operations
- …lots of communication …load-balancing required
- …works well for large systems of equations
OpenMP …shared memory parallelism…
- …statically scheduled parallel loops
- …parallel regions …dynamic load balance
Some applications can take advantage of message passing and threads
OpenACC …data offload to an accelerator (typically a GPU)…
- …directives to specify code and data …compiler does the work…
- …initialize device & run-time environment
- …allocate memory on device …move data from the host
- …launch the code (computational kernels) on the device
- …gather results from the device memory
- …deallocate data on device
Algorithm Classes
Abstract application categories…
- …13 dwarfs of HPC
- …facilitate cross-platform reasoning
- …cross-application component reuse
Dense Linear Algebra
- …densely-populated matrices (or vectors)
- …typically uncompressed
- …often accessed via strides
- …used for linear algebra…
- …matrix multiplication
- …LU decomposition
- …Gauss–Seidel method
- …Jacobi method
- …widely supported with libraries
- …(still) default measure for HPC performance
Sparse Linear Algebra
- …sparsely-populated matrices (or vectors)
- …vast majority of data zeros
- …typically compressed
- …often indirect access by indices
- …used for…
- …conjugated gradient
- …ranking algorithms
- …data mining
Spectral Methods
- …data in frequency domain …not time & space
- …used for…
- …FFT (fast Fourier transform)
- …audio & video signal processing
- …usually implemented using butterfly patterns
- …often latency limited …global communication …all-to-all
- …relies on transposing data efficiently
N-body Methods
Model interactions between discreet moving points
- …require dynamic data structures
- …affects load balancing
- …data access latency
- …used for…
- …galaxy collision simulation
- …molecular dynamics
- …protein folding
Structured Grids
Model interaction between discreet fixed points…
- …grid structure described by patterns
- …topology information easily derived
- …usually high spatial locality
- …may be subdivided into finer grid
- …used for …computational fluid dynamics
- Characteristics…
- …structured nature key aspect
- …predictable memory addresses
- …local communication only
- …adaptive grid possible
Unstructured Grids
Model interactions between discreet fixed points…
- …grid described explicitly by individual connections
- …irregular geometry & topology
- …multiple levels of indirection when accessing data
- …used for …computational fluid dynamics
- Characteristics…
- …heavy latency bound…indirect data access
- …
Monte Carlo Methods
Models statistical evaluation of repeated random trials
- …process data independently…merge the result
- …also know as map-reduce
- …embarrassingly parallel …multiple copies of sequential method
- …used for…
- …numerical integration
- …ray-tracing
- …quantum many-body problem
Combinational Logic
Involves performing simple operations on large amounts of integer data…
- …parallelized on multiple levels …bit-level …block-level
- …typically no global communication
- …used for …computing CRC (cycling redundancy checks)
Graph Traversal
Traverse a number of objects in a graph…
- …examine characteristics
- …search
- …sort
- …collision detection
- …decision tress
- …usually heavy on data read/lookup
- …parallelizable over different paths
Dynamic Programming
Method of computing solutions by solving simpler, overlapping problems…
- …optimal result composed of optimal results of sub-problem
- …memoization…solve sub-problem once…reuse stored result
- …used for…
- …lookup tables
- …matrix-chain-multiplication
Backtrack & Branch+Bound
Search and optimization problems…very large problem spaces…
- …incrementally build solution…discard if determined unsuitable
- …uses divide & conquer strategy…
- …break down complex problems into smaller sub-problems
- …solve sub-problems in parallel
Graphical Models
Probabilistic models…
- …graph consisting of random variables as nodes
- …dependencies as edges
- …used for…
- …Bayesian networks
- …Hidden Markov models
Finite State Machines
Represent an interconnected set of states to be moved among
- …sometimes decomposed into multiple state machines
- …used for parsers
AI Algorithms
Use of AI (Artificial Intelligence) and ML (Machine Learning)
- …also called… DL (Deep Learning) …DNN (Deep Neural Networks)
- Used for…
- …speech recognition and photo manipulation in smart phones
- …camera video processing for car driver support systems
- …video scaling in TVs and video gaming hardware
- …scientific computing in HPC environments
- Dedicated accelerator hardware…
- …used in HPC, Cars, Smart TVs/Phones
- …NPU (Neural Processing Unit) or TPU (Tensor Processing Unit) aka Tensor cores
Abbreviation for AI models…
- NN (Neural Network) …further divided into…
- ANN (Artificial NN)
- DNN (Deep NN)
- CNN (Convolutional NN)
- RNN (Recurrent NN)
- SNN (Spiking NN)
- LSTM (Long Short-Term Memory)
- GPT (Generative Pretrained Transformer)
- BERT (Bidirectional Encoder Representation from Transformers)