MPI Parallel Programing
MPI (Message Passing Interface) …https://www.mpi-forum.org/docs
- …(def factor) standard specification…
- …for writing message passing programs
- …parallelizing C, C++ and Fortran
- …it’s an API (Application Programming Interface)
- …library …not a programming language
- …defines the calling conventions…
- …to utilize another software module
- …provides a collection of functions…
- …allows IPC (Inter-process Communication(…
- …through an MPI communications “layer”
Standard for a message-passing parallel programming model…
- …send/receive messages between tasks/processes
- …can include data transmission and task synchronisation
- …parallelization on
- …multiple processors (single node)
- …across multiple nodes
Versions
Standard | Date |
---|---|
1.0 | 1994/05 |
1.1 | 1995/06 |
1.2 | 1997/07 |
1.3 | 2008/05 |
2.1 | 2008/09 |
2.2 | 2009/09 |
3.0 | 2012/09 |
3.1 | 2015/06 |
4.0 | 2021/06 |
MPI implementations…
- Open-source…
- OpenMPI… https://www.open-mpi.org
- MPICH… https://www.mpich.org
- MVAPICH… https://mvapich.cse.ohio-state.edu/
- Vendor-supported…
- Intel MPI
Concepts
Distributed Memory
…model… (opposite of shared memory)
- …each process has its own address space…
- …requires send/receive data for communication
- Communication speed determined by network…
- …message passing API typically optimized for communication hardware
- …no code changes required to the application
- Advantages…
- …parallelism in an application explicitly identified
- …scales to large numbers of processors
- …avoids difficulties with shared memory…cache coherency
- Disadvantages…
- …programmer responsible for controlling the data movement
- …NUMA (Non-Uniform Memory Access)
- …difficult to map global data structures…access patterns
Communicator
…define which collection of processes may communicate with each other
- …between processes in a group…between disjoint groups
- …default communicator
- …contains all processes
- …for example
MPI_COMM_WORLD
in C & Fortran
- …new communicators can be defined
- …any number can co-exist
- …topology can be updated dynamically
…provide following information…
- …number of processes within this communicator
- …numerical ranks for each process within the communicator
Rank
…sometimes called task ID…
- …within a communicator…each process has a unique identifier
- …integer numbered 0, 1, 2, …, N-1
- …used to specify the source and destination of messages
Use of rank and size during execution…
- …SPMD (Single Program, Multiple Data)…
- …each process is a duplicate of MPI executable
- …run-time environment for each process is not identical
- …processes work completely independently…
- …except when communicating
- …a process can check its own rank
Examples
C
Following a simple Open MPI example program hello_world.c
:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include "mpi.h"
int main(int argc,char ** argv )
{
int rank; // Rank of the calling process in the communicator group
int size; // Total number of processes in the communicator group
char hostname[1024]; // Hostname
// MPI_COMM_WORLD is the default communicator setup by MPI_Init()
( &argc, &argv );
MPI_Init( MPI_COMM_WORLD, &rank );
MPI_Comm_rank( MPI_COMM_WORLD, &size );
MPI_Comm_size= getpid(); // Get the process ID from the OS
pid_t pid (hostname, 1024); // Get the hostname
gethostname( "Hello world %s.%d [%d/%d]\n",hostname , pid, rank, size);
printf();
MPI_Finalizereturn 0;
}
Compile the using mpicc
an execute it locally with mpirun
:
mpicc -o hello_world hello_world.c
mpirun -np 2 hello_world
Python
Example program hello_world.py
..
from mpi4py import MPI
= MPI.COMM_WORLD
comm = comm.Get_rank()
rank = comm.Get_size()
size print("Rank %d of %d" %(rank, size));
Run the Python program with mpiexec
…
mpiexec -np 2 python hello_world.py
Communication
Point-to-Point
Sending data from one point (process/task) to another point (process/task)
- Send …blocking …returns when data send from buffered
- Receive …blocking …returns when data received from buffered
Synchronous vs. Asynchronous…
- Synchronous send…
- …wait until receiver is ready…
- …handshake …confirm a safe send…
- …blocking send and receive
- Buffered (asynchronous) send…
- …message send to system-controlled block of memory
- …sender continuous execution
- …receiver reads message when ready
Blocking vs. non-blocking…
- Blocking…
- …return when it is safe to use the buffer again
- …does not imply that the data was actually received
- …can be either synchronous or asynchronous (buffered)
- Non-blocking…
- …send and receive return immediately without waiting for communication
- …unsafe to modify the buffer without additional checks
- …aims for non-blocking calls to overlap computation
Combined send/receive…
- …two-way communication
- …executes a blocking send and a blocking receive operation
- …same communicator…distinct tag arguments
- …programmer needs to consider deadlocks
Collective
Need for collective communication…
- …one process sends message to all other processes
- Considerations & restrictions…
- …blocking
- …no use of message tags
- …sub-sets require grouping and new communicator
- …supports only MPI predefined data-types
Barrier synchronization …blocks until all processes have called
Data movements…
- Broadcast …sends data from root to all processes
- Scatter …distinct messages from a single process to all others
- Gather …distinct messages from all other process to a single process
- All gather…
- …concatenation of data to all processes
- …each process performs a broadcast operation to the other process
- All to all…
- …each process in a group performs a scatter operation
- …distinct message to all the tasks in order by index
Collective computation…
- Reduce…
- …reduction operation on all tasks in the communicator
- …places the result in one task
- All reduce …reduction operation and places the result in all tasks
OpenMPI (OMPI)
Component Architecture
MCA - Modular Component Architecture…
- …frameworks…
- …manage a single components type at run-time
- …components implement a framework interface (loaded as plugin at run-time)
- …modules are run-time instance of a component
- …parameters customize the run-time configuration with simple key-value pairs
- …find a comprehensive list of MCA frameworks in the README
- https://github.com/open-mpi/ompi/blob/master/README
Frameworks
Frameworks user care about…
- coll…MPI collectives
- io…MPI I/O
- pml…MPI point-to-point communications
ob1
…multi-device…multi-rail engine…usesbtl
cm
…engine to match network layer..usesmtl
ucx
…uses UCX communication library
- btl…byte transport layer
sm
…shared memorytcp
…over networks- Alternate path to…
ofi
…Libfarbic (OpenFabric Interface)uct
…UCX
- mtl…matching transport layer…
- …networks natively support MPI-style massage passing
- …inherently stateful…use only a single MTL
- …direct path to
ofi
…Libfarbic
- ..etc..
Change the behavior of code at run-time in following precedence:
- …command line arguments
mpirun -mca <name> <value>
- …environment variables
export OMPI_MCA_<name>=<value>
- …configuration files…
- …for example
$HOME/.openmpi/mca‐params.conf
- …INI-style key-value pairs
- …for example
# force OB1 BTLs
mpirun --mca pml ob1 -mca btl sm,tcp ...
# force UCX (for InfiniBand)
mpirun --mca pml ucx ...
ORTE Back-end
ORTE back-end run-time environment component frameworks:
ess
RTE environment-specific servicesplm
process life-cycle managementras
resource allocation systemschizo
OpenRTE personality framework
Install
Packages
Fedora/EPEL RPM package openmpi…
sudo dnf install -y openmpi openmpi-devel
source /etc/profile.d/modules.sh
module load mpi/openmpi-x86_64
Build custom RPM packages…
>>> yum groupinstall -y "Development Tools" binutils binutils-devel
# infiniband
>>> yum install -y libibcm-devel libibcommon-devel libibmad-devel libibumad-devel libibverbs-devel librdmacm-devel
# NUMA support, CPU affinity support
>>> yum install -y numactl-devel hwloc-devel
>>> wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0-1.src.rpm
>>> rpmbuild --rebuild openmpi-3.0.0-1.src.rpm
>>> ls ~/rpmbuild/RPMS/x86_64/openmpi*.rpm
/root/rpmbuild/RPMS/x86_64/openmpi-3.0.0-1.el7.centos.x86_64.rpm
# install the package and check the configuration
>>> rpm -i ~/rpmbuild/RPMS/x86_64/openmpi*.rpm
>>> ompi_info --param all all --level 9
Source
Requires two dependencies…libevent
& hwloc
- …available on most platforms…
- …(still) embedded copies of these dependencies
- …system (external) version used if available
- …force with
--with-libevent=internal
…
Sources…
# download the latest archive and extract it
wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
tar -xvf openmpi-3.0.0.tar.gz && cd openmpi-3.0.0
## select a target location for the installation
prefix=$PWD/openmpi/3.0.0 && mkdir -p $prefix
## make sure to have the required compilers (in a given version) in the shell enironment
./configure --prefix=$prefix 2>&1 | tee $prefix/configure.log
## adjust how many cores to use for the compilation with option -j
make -j $(nprocs) 2>&1 | tee $prefix/make.log
make install 2>&1 | tee $prefix/install.log
cp -R examples $prefix
## create a file to source this installation into a shell environment
echo "export PATH=$prefix/bin:$PATH" >> $prefix/source_me.sh
echo "export LD_LIBRARY_PATH=$prefix/lib:$LD_LIBRARY_PATH" >> $prefix/source_me.sh
echo "export MANPATH=$prefix/share/man:$MANPATH" >> $prefix/source_me.sh
source $prefix/source_me.sh
Configure script philosophy…
- …search for supported optional dependencies
- …if available…build support for them
- …skip if optional dependencies are not available
- …user can set dependencies with options…
--with-*
…fails if dependency missing--without-*
…skip looking for dependency
Compilers…usual GNU autoconf
…shell variables…CC
, CXX
, and FC
# best practics...compiler vars. right of the config token
./configure CC=/path/to/clang CSS=/path/to/clang++ FC=/path/to/gfortran
# ...preserves configuration in config.log
Components build as DSO (dynamic shared objects)…unless --disable-dlopen
ompi_info
…print information about the installation…
# list all frameworks, MCA paramters
ompi_info --param all all
# ^for a specific framework
ompi_info --param $framework all
# ^specific component
ompi_info --param $framework $component
Levels…
ompi_info --param $framework $component --level 9
- …end user…
1
basic…2
detailed…3
all - …application tuner…
4
basic…5
detailed…6
…all - …MPI developer…
7-9
Communication Libraries
Defaults to use internal TCP and shared memory libraries…
Over network fabrics…
- …use high level frameworks for network communication
- …abstract underlying network farbics and hardware devices
…two options available…
- Libfabric aka OpenFabric Interface…
- …enabled with option
--with-libfabric=
- …supports communication over…
- POSIX TCP/UDP
- InfiniBand
ibverbs
- …enabled with option
- UCX aka Unified Communication X…
- …next generation…higher-abstraction InfiniBand
- …enabled with option
--with-ucx=
… - …supports communication over…
- Shared memory
- POSIX TCP
- InfiniBand
ibverbs
- Autodetects Mellanox
mlx5
network driver
Launch Orchestration
Pros/cons of PMIx launch orchestration…
mpirun
…more options…- …larger range of PMIx support
- …dynamic job control & monitoring
- …historically OpenMPI specific
- …uses PRRTE with OpenMPI 5+
srun
(Slurm specific)…work across all MPI implementations