Parallel Computing

HPC
Published

November 14, 2021

Modified

July 4, 2023

Number of scientific applications using parallel computing is growing …capabilities range from…

What is HPC & HTC?

Compute clusters aka parallel computers called…

HPC (High-Performance Computer) or Super-Computer

  • …many compute nodes connected by a fast network
  • …each node provides multiple CPUs (and GPUs)
  • …needs application specifically build with parallelism
    • …to use multiple CPU cores…
    • …across multiple compute nodes

HTC (High-Throughput Computing)…

  • …workload using HPC computing infrastructure
  • …run may single-core instances of an application
    • …barely parallel …no communication between tasks required
    • …goal is to achieve scaling by the number of CPUs

What is Parallel Programming?

Aka performance programming…

  • Utilize the ability of an algorithm to execute in parallel…
    • task parallelism …break large operations into smaller tasks
    • data parallelism …same operation on multiple data elements
  • …with the goal to improve performance on modern hardware

Amdahl’s Law

Defines (upper limit on) speedup of code as a function of…

  • …proportion of code that can be parallelized
  • …and the number of processors used
  • …speedup depends only on the parallel content

Divide program in…

  • …sequential part S (can not be improved through parallelism)
  • …parallel part P …with S + P = 1
  • …speedup 1/(S + (P/n))n number of cores

Most parallel applications do not scale to thousands cores…

Strategies to achieve high scalability

  • …grow problem size with number of cores…Gustafson’s Law
  • …increase ratio of computation to communication
  • …overlap communication with computation
  • …dynamic load balancing to assign work to idle cores

Processes & Threads

…both independent sequences of execution…

  • Process…
    • …instance of a program
    • …runs on dedicated “owned” memory
    • …holds state and file descriptors
    • …more overhead then thread
    • …more flexible…
      • …multiple parallel processes on single computer
      • …or across multiple compute nodes (distributed memory)
  • Threads…
    • …lightweight entity …executes inside process
    • …each process has at least on threat
    • …shared memory for all threat in a process
    • …less overhead then a process
    • …less flexible …confined to a single process/node

Parallel Memory Access

…determines how/where to execute code…

  • Shared memory
    • …threads …run on a single node
    • …typically used implementing OpenMP (supported by most compilers)
  • Distributed memory…
    • …multiple processes …program instances
    • …executed on one or more nodes
    • …typically used by implementing MPI (provided by a library)

Both models need to consider how processes/threads are mapped/bound to cores

  • Hybrid memory…
    • …balance between threads and processes
      • …shared memory to keep footprint small
      • …overcome scaling limits…run across multiple node
    • …typically involves OpenMP and MPI

Standards-based Parallelism

Three ways to accelerate applications with parallelism…

  • …programming library …easy to use
  • …compiler directives …easy to use & portable code
  • …programming language …most performance …most flexible

MPImessage passing parallelism

  • …parametric …coarse-grained outer loop …essentially task parallel
  • …structured domains …decomposition with local operations
  • …lots of communication …load-balancing required
  • …works well for large systems of equations

OpenMPshared memory parallelism

  • …statically scheduled parallel loops
  • …parallel regions …dynamic load balance

Some applications can take advantage of message passing and threads

OpenACCdata offload to an accelerator (typically a GPU)…

  • …directives to specify code and data …compiler does the work…
    • …initialize device & run-time environment
    • …allocate memory on device …move data from the host
    • …launch the code (computational kernels) on the device
    • …gather results from the device memory
    • …deallocate data on device

Algorithm Classes

Abstract application categories…

  • …13 dwarfs of HPC
  • …facilitate cross-platform reasoning
  • …cross-application component reuse

Dense Linear Algebra

  • …densely-populated matrices (or vectors)
    • …typically uncompressed
    • …often accessed via strides
  • …used for linear algebra…
    • …matrix multiplication
    • …LU decomposition
    • …Gauss–Seidel method
    • …Jacobi method
  • …widely supported with libraries
  • …(still) default measure for HPC performance

Sparse Linear Algebra

  • …sparsely-populated matrices (or vectors)
    • …vast majority of data zeros
    • …typically compressed
    • …often indirect access by indices
  • …used for…
    • …conjugated gradient
    • …ranking algorithms
    • …data mining

Spectral Methods

  • …data in frequency domain …not time & space
  • …used for…
    • …FFT (fast Fourier transform)
    • …audio & video signal processing
  • …usually implemented using butterfly patterns
  • …often latency limited …global communication …all-to-all
  • …relies on transposing data efficiently

N-body Methods

Model interactions between discreet moving points

  • …require dynamic data structures
    • …affects load balancing
    • …data access latency
  • …used for…
    • …galaxy collision simulation
    • …molecular dynamics
    • …protein folding

Structured Grids

Model interaction between discreet fixed points…

  • …grid structure described by patterns
    • …topology information easily derived
    • …usually high spatial locality
  • …may be subdivided into finer grid
  • …used for …computational fluid dynamics
  • Characteristics…
    • …structured nature key aspect
    • …predictable memory addresses
    • …local communication only
    • …adaptive grid possible

Unstructured Grids

Model interactions between discreet fixed points…

  • …grid described explicitly by individual connections
  • …irregular geometry & topology
  • …multiple levels of indirection when accessing data
  • …used for …computational fluid dynamics
  • Characteristics…
    • …heavy latency bound…indirect data access

Monte Carlo Methods

Models statistical evaluation of repeated random trials

  • …process data independently…merge the result
  • …also know as map-reduce
  • …embarrassingly parallel …multiple copies of sequential method
  • …used for…
    • …numerical integration
    • …ray-tracing
    • …quantum many-body problem

Combinational Logic

Involves performing simple operations on large amounts of integer data…

  • …parallelized on multiple levels …bit-level …block-level
  • …typically no global communication
  • …used for …computing CRC (cycling redundancy checks)

Graph Traversal

Traverse a number of objects in a graph…

  • …examine characteristics
    • …search
    • …sort
    • …collision detection
    • …decision tress
  • …usually heavy on data read/lookup
  • …parallelizable over different paths

Dynamic Programming

Method of computing solutions by solving simpler, overlapping problems…

  • …optimal result composed of optimal results of sub-problem
  • …memoization…solve sub-problem once…reuse stored result
  • …used for…
    • …lookup tables
    • …matrix-chain-multiplication

Backtrack & Branch+Bound

Search and optimization problems…very large problem spaces…

  • …incrementally build solution…discard if determined unsuitable
  • …uses divide & conquer strategy…
    • …break down complex problems into smaller sub-problems
    • …solve sub-problems in parallel

Graphical Models

Probabilistic models…

  • …graph consisting of random variables as nodes
  • …dependencies as edges
  • …used for…
    • …Bayesian networks
    • …Hidden Markov models

Finite State Machines

Represent an interconnected set of states to be moved among

  • …sometimes decomposed into multiple state machines
  • …used for parsers

AI Algorithms

Use of AI (Artificial Intelligence) and ML (Machine Learning)

  • …also called… DL (Deep Learning) …DNN (Deep Neural Networks)
  • Used for…
    • …speech recognition and photo manipulation in smart phones
    • …camera video processing for car driver support systems
    • …video scaling in TVs and video gaming hardware
    • …scientific computing in HPC environments
  • Dedicated accelerator hardware…
    • …used in HPC, Cars, Smart TVs/Phones
    • NPU (Neural Processing Unit) or TPU (Tensor Processing Unit) aka Tensor cores

Abbreviation for AI models…

  • NN (Neural Network) …further divided into…
    • ANN (Artificial NN)
    • DNN (Deep NN)
    • CNN (Convolutional NN)
    • RNN (Recurrent NN)
    • SNN (Spiking NN)
  • LSTM (Long Short-Term Memory)
  • GPT (Generative Pretrained Transformer)
  • BERT (Bidirectional Encoder Representation from Transformers)