MPI Parallel Programing

HPC
Published

April 20, 2023

Modified

April 20, 2023

MPI (Message Passing Interface) …https://www.mpi-forum.org/docs

Standard for a message-passing parallel programming model…

Versions

Standard Date
1.0 1994/05
1.1 1995/06
1.2 1997/07
1.3 2008/05
2.1 2008/09
2.2 2009/09
3.0 2012/09
3.1 2015/06
4.0 2021/06

MPI implementations…

Concepts

Distributed Memory

…model… (opposite of shared memory)

  • …each process has its own address space…
  • …requires send/receive data for communication
  • Communication speed determined by network…
    • …message passing API typically optimized for communication hardware
    • …no code changes required to the application
  • Advantages…
    • …parallelism in an application explicitly identified
    • …scales to large numbers of processors
    • …avoids difficulties with shared memory…cache coherency
  • Disadvantages…
    • …programmer responsible for controlling the data movement
    • …NUMA (Non-Uniform Memory Access)
    • …difficult to map global data structures…access patterns

Communicator

…define which collection of processes may communicate with each other

  • …between processes in a group…between disjoint groups
  • …default communicator
    • …contains all processes
    • …for example MPI_COMM_WORLD in C & Fortran
  • …new communicators can be defined
    • …any number can co-exist
    • …topology can be updated dynamically

…provide following information…

  • …number of processes within this communicator
  • …numerical ranks for each process within the communicator

Rank

…sometimes called task ID…

  • …within a communicator…each process has a unique identifier
  • …integer numbered 0, 1, 2, …, N-1
  • …used to specify the source and destination of messages

Use of rank and size during execution…

  • SPMD (Single Program, Multiple Data)…
    • …each process is a duplicate of MPI executable
    • …run-time environment for each process is not identical
  • …processes work completely independently…
    • …except when communicating
    • …a process can check its own rank

Examples

Cf. Official OMPI Examples

C

Following a simple Open MPI example program hello_world.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include "mpi.h"

int main(int argc,char ** argv )
{
    int rank;               // Rank of the calling process in the communicator group
    int size;               // Total number of processes in the communicator group
    char hostname[1024];    // Hostname
    // MPI_COMM_WORLD is the default communicator setup by MPI_Init()  
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    pid_t pid = getpid();            // Get the process ID from the OS
    gethostname(hostname, 1024);     // Get the hostname
    printf( "Hello world %s.%d [%d/%d]\n",hostname , pid, rank, size);
    MPI_Finalize();
    return 0;
}

Compile the using mpicc an execute it locally with mpirun:

mpicc -o hello_world hello_world.c
mpirun -np 2 hello_world

Python

Example program hello_world.py..

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size() 
print("Rank %d of %d" %(rank, size));

Run the Python program with mpiexec

mpiexec -np 2 python hello_world.py

Communication

Point-to-Point

Sending data from one point (process/task) to another point (process/task)

  • Send …blocking …returns when data send from buffered
  • Receive …blocking …returns when data received from buffered

Synchronous vs. Asynchronous…

  • Synchronous send…
    • …wait until receiver is ready…
    • …handshake …confirm a safe send…
    • …blocking send and receive
  • Buffered (asynchronous) send…
    • …message send to system-controlled block of memory
    • …sender continuous execution
    • …receiver reads message when ready

Blocking vs. non-blocking…

  • Blocking…
    • …return when it is safe to use the buffer again
    • …does not imply that the data was actually received
    • …can be either synchronous or asynchronous (buffered)
  • Non-blocking…
    • …send and receive return immediately without waiting for communication
    • …unsafe to modify the buffer without additional checks
    • …aims for non-blocking calls to overlap computation

Combined send/receive…

  • …two-way communication
  • …executes a blocking send and a blocking receive operation
  • …same communicator…distinct tag arguments
  • …programmer needs to consider deadlocks

Collective

Need for collective communication…

  • …one process sends message to all other processes
  • Considerations & restrictions…
    • …blocking
    • …no use of message tags
    • …sub-sets require grouping and new communicator
    • …supports only MPI predefined data-types

Barrier synchronization …blocks until all processes have called

Data movements…

  • Broadcast …sends data from root to all processes
  • Scatter …distinct messages from a single process to all others
  • Gather …distinct messages from all other process to a single process
  • All gather…
    • …concatenation of data to all processes
    • …each process performs a broadcast operation to the other process
  • All to all…
    • …each process in a group performs a scatter operation
    • …distinct message to all the tasks in order by index

Collective computation…

  • Reduce
    • …reduction operation on all tasks in the communicator
    • …places the result in one task
  • All reduce …reduction operation and places the result in all tasks

OpenMPI (OMPI)

Component Architecture

MCA - Modular Component Architecture…

  • frameworks
    • …manage a single components type at run-time
    • …components implement a framework interface (loaded as plugin at run-time)
    • …modules are run-time instance of a component
  • parameters customize the run-time configuration with simple key-value pairs

Frameworks

Frameworks user care about…

  • coll…MPI collectives
  • io…MPI I/O
  • pml…MPI point-to-point communications
    • ob1…multi-device…multi-rail engine…uses btl
    • cm…engine to match network layer..uses mtl
    • ucx…uses UCX communication library
  • btl…byte transport layer
    • sm…shared memory
    • tcp…over networks
    • Alternate path to…
      • ofi…Libfarbic (OpenFabric Interface)
      • uct…UCX
  • mtl…matching transport layer…
    • …networks natively support MPI-style massage passing
    • …inherently stateful…use only a single MTL
    • …direct path to ofi…Libfarbic
  • ..etc..

Change the behavior of code at run-time in following precedence:

  • …command line arguments mpirun -mca <name> <value>
  • …environment variables export OMPI_MCA_<name>=<value>
  • …configuration files…
    • …for example $HOME/.openmpi/mca‐params.conf
    • …INI-style key-value pairs
# force OB1 BTLs
mpirun --mca pml ob1 -mca btl sm,tcp ...

# force UCX (for InfiniBand)
mpirun --mca pml ucx ...

ORTE Back-end

ORTE back-end run-time environment component frameworks:

  • ess RTE environment-specific services
  • plm process life-cycle management
  • ras resource allocation system
  • schizo OpenRTE personality framework

Install

Packages

Fedora/EPEL RPM package openmpi

sudo dnf install -y openmpi openmpi-devel
source /etc/profile.d/modules.sh
module load mpi/openmpi-x86_64

Build custom RPM packages…

>>> yum groupinstall -y "Development Tools" binutils binutils-devel
# infiniband
>>> yum install -y libibcm-devel libibcommon-devel libibmad-devel libibumad-devel libibverbs-devel librdmacm-devel
# NUMA support, CPU affinity support
>>> yum install -y numactl-devel hwloc-devel
>>> wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0-1.src.rpm
>>> rpmbuild --rebuild openmpi-3.0.0-1.src.rpm
>>> ls ~/rpmbuild/RPMS/x86_64/openmpi*.rpm
/root/rpmbuild/RPMS/x86_64/openmpi-3.0.0-1.el7.centos.x86_64.rpm
# install the package and check the configuration
>>> rpm -i ~/rpmbuild/RPMS/x86_64/openmpi*.rpm
>>> ompi_info --param all all --level 9

Source

Requires two dependencies…libevent & hwloc

  • …available on most platforms…
  • …(still) embedded copies of these dependencies
    • …system (external) version used if available
    • …force with --with-libevent=internal

Sources…

# download the latest archive and extract it
wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
tar -xvf openmpi-3.0.0.tar.gz && cd openmpi-3.0.0

## select a target location for the installation
prefix=$PWD/openmpi/3.0.0 && mkdir -p $prefix

## make sure to have the required compilers (in a given version) in the shell enironment
./configure --prefix=$prefix 2>&1 | tee $prefix/configure.log

## adjust how many cores to use for the compilation with option -j
make -j $(nprocs) 2>&1 | tee $prefix/make.log
make install 2>&1 | tee $prefix/install.log
cp -R examples $prefix

## create a file to source this installation into a shell environment
echo "export PATH=$prefix/bin:$PATH" >> $prefix/source_me.sh
echo "export LD_LIBRARY_PATH=$prefix/lib:$LD_LIBRARY_PATH" >> $prefix/source_me.sh
echo "export MANPATH=$prefix/share/man:$MANPATH" >> $prefix/source_me.sh
source $prefix/source_me.sh

Configure script philosophy…

  • …search for supported optional dependencies
  • …if available…build support for them
  • …skip if optional dependencies are not available
  • …user can set dependencies with options…
    • --with-*…fails if dependency missing
    • --without-*…skip looking for dependency

Compilers…usual GNU autoconf…shell variables…CC, CXX, and FC

# best practics...compiler vars. right of the config token
./configure CC=/path/to/clang CSS=/path/to/clang++ FC=/path/to/gfortran
# ...preserves configuration in config.log

Components build as DSO (dynamic shared objects)…unless --disable-dlopen

ompi_info

…print information about the installation…

# list all frameworks, MCA paramters
ompi_info --param all all
# ^for a specific framework
ompi_info --param $framework all
# ^specific component
ompi_info --param $framework $component

Levels…

ompi_info --param $framework $component --level 9
  • …end user…1 basic…2 detailed…3 all
  • …application tuner…4 basic…5 detailed…6…all
  • …MPI developer…7-9

Communication Libraries

Defaults to use internal TCP and shared memory libraries…

Over network fabrics…

  • …use high level frameworks for network communication
  • …abstract underlying network farbics and hardware devices

…two options available…

  • Libfabric aka OpenFabric Interface…
    • …enabled with option --with-libfabric=
    • …supports communication over…
      • POSIX TCP/UDP
      • InfiniBand ibverbs
  • UCX aka Unified Communication X…
    • …next generation…higher-abstraction InfiniBand
    • …enabled with option --with-ucx=
    • …supports communication over…
      • Shared memory
      • POSIX TCP
      • InfiniBand ibverbs
    • Autodetects Mellanox mlx5 network driver

Launch Orchestration

Pros/cons of PMIx launch orchestration…

  • mpirun…more options…
    • …larger range of PMIx support
    • …dynamic job control & monitoring
    • …historically OpenMPI specific
    • …uses PRRTE with OpenMPI 5+
  • srun (Slurm specific)…work across all MPI implementations