Running Analysis Pipelines - guide to containerized MPI-parallel workflows#

Introduction#

Analysis pipelines for large-scale astronomical data processing can consist of a variety of specialized software tools and libraries. These tools may have been developed under operating systems and software environments that differ from the actual runtime environment, i.e., a high-performance computing (HPC) cluster like Ramses. It can therefore be beneficial to wrap workflows in containers. Containerization is the packaging of all your analysis code together with just the operating system (OS) libraries and dependencies required to run it. A container is essentially a single lightweight executable that runs consistently on any infrastructure.

When the container concept is extended to MPI-parallel applications, some of this independence from the host OS is lost, mainly because MPI applications depend on low-level system libraries. For example, efficient MPI communication often relies on low-latency network hardware like InfiniBand, which requires specific drivers and libraries to be available inside the container at both build and runtime.

This guide describes the creation of workflow containers for MPI-parallel applications, with a focus on employing mpi4py, the MPI bindings for Python, as a high-level programming interface. Details are provided for building an Apptainer/Singularity image that

  • uses the host’s MPI according to Apptainer’s Bind model,

  • compiles mpi4py within Apptainer, and

  • leverages InfiniBand via UCX/UCC.

A prerequisite for the material covered below is some general understanding of container basics and of MPI; both are briefly reviewed in the following two sections.

Container basics and Apptainer/Singularity#

In principle, containerization involves three steps:

  1. Define the container image (OS, dependencies, environment) via a definition file, i.e., what will the container do?

  2. Build the container image from the definition file

  3. Instantiate a container from that image and run it on the host system

Apptainer (formerly Singularity) is a containerization technology designed for HPC environments. It is better suited to HPC systems than Docker, mainly because root privileges are not required to run containers.

The first step in containerization is to create a container image. Container images are built from definition files that specify, among other things, the base OS image. The Dockerfile is Docker’s default name for such a definition file. In Apptainer, the Dockerfile counterpart is called a Singularity Definition File (SDF). SDFs typically have the file extension *.def.

A container image is then created from an SDF, producing a read-only template containing instructions for spawning a container. The container image encapsulates the filesystem, environment, and metadata. In Apptainer, images are mostly stored in a format referred to as SIF (Singularity Image Format), using the extension *.sif. A container, in turn, is an instance of an image that runs as a process on the host system.
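
For illustration, the generic build step that turns a definition file into a SIF image is a single command (file names here are placeholders):

apptainer build myIMAGE.sif myIMAGE.def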

A container is not started the same way as a compiled binary, but rather through the container engine, which is basically a container manager. The engine provides the execution environment for container images and virtualizes the resources for containerized applications. For example, under the Windows OS, Docker Desktop is a container engine. In Apptainer, the container engine is part of the apptainer command-line tool. Hence, a command like

apptainer exec myIMAGE.sif python myscript.py

lets Apptainer instantiate a container from the template myIMAGE.sif and execute the Python script myscript.py, where the latter runs inside the container, i.e., it uses the OS, filesystem, and environment as defined in the image.
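
Besides exec, two other commonly used modes of running an image are an interactive shell and the image’s default run script (again with a placeholder image name):

apptainer shell myIMAGE.sif   # open an interactive shell inside the container
apptainer run myIMAGE.sif     # execute the image's %runscript, if one is defined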

Note that in the following, the words container and apptainer are used synonymously in contexts that distinguish between operations happening inside a container and operations happening outside of it (on the host), i.e., one may read inside/outside Apptainer ….

MPI and mpi4py (MPI for Python)#

Apptainer currently supports two open-source MPI implementations, OpenMPI and MPICH; this document focuses on OpenMPI. MPI stands for Message Passing Interface and is a common standard for communication across parallel computing architectures, be it between compute nodes of a single system or across compute platforms. MPI comes in the form of libraries with a low-level set of commands that enable message passing, parallel I/O, etc. Owing to its low-level nature, MPI is often used in conjunction with compiled HPC-suitable programming languages like C/C++ or Fortran.

As opposed to more low-level scientific programming languages, Python is an ideal candidate for implementing the higher-level parts of compute-intensive applications. MPI for Python (mpi4py) has thus evolved to provide MPI bindings for Python programs. mpi4py builds on the MPI specification and provides an object-oriented interface based on the MPI-2 C++ bindings (see also MPI for Python).
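
To give a flavor of mpi4py, the following minimal sketch (hypothetical file name send_recv.py) passes a Python object from rank 0 to rank 1; it could be launched with mpirun -n 2 python send_recv.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"payload": list(range(4))}
    comm.send(data, dest=1, tag=11)      # pickle-based send of a generic Python object
elif rank == 1:
    data = comm.recv(source=0, tag=11)   # matching receive
    print(f"rank {rank} received: {data}")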

MPI + Apptainer#

According to the Apptainer+MPI documentation, one distinguishes between two ways of combining Apptainer with MPI. The first way is referred to as the Host-MPI model, also called the Hybrid model, which is useful when shipping containerized MPI-parallel applications. This model involves a containerized MPI working in conjunction with the host’s MPI. Therefore, the name Hybrid appears more suitable than Host-MPI, as the latter may suggest involvement of only a host MPI (personal opinion of the author).

The second method, referred to as Bind-model and explained in more detail below, involves mounting the host’s MPI into Apptainer. Hence, the containerized application uses the host MPI directly, that is, without a containerized MPI layer.

Model 1: Hybrid model#

The essence of the Hybrid model is that one executes some containerized MPI program, which in practice is realized as follows:

$ mpirun -n <NRANKS> apptainer exec <IMAGE> </PATH/TO/MPI-PROGRAM/WITHIN/APPTAINER>

where <IMAGE> is the Apptainer image (residing on the host). Here, mpirun is executed on the host and launches the apptainer command itself. Hence, each MPI rank is launched as a separate container process; in other words, the host’s process manager daemon (ORTED in OpenMPI) launches an Apptainer container for each MPI rank. Inside Apptainer, the application (/PATH/TO/MPI-PROGRAM/WITHIN/APPTAINER) loads containerized MPI libraries which connect back to ORTED via PMI (Process Management Interface). This procedure results in MPI communication across container boundaries. This approach is necessary if directly mounting the host MPI into Apptainer is not possible due to security policies. With less strict policies, one arrives at the second model, the Bind model.

Model 2: Bind model#

The essence of the Bind model involves mounting the host’s MPI into Apptainer, again expressed via the mpirun command:

$ export MPI_DIR=<PATH/TO/HOST/MPI/DIRECTORY>
$ mpirun -n <NRANKS> apptainer exec --bind $MPI_DIR <IMAGE> </PATH/TO/MPI-PROGRAM/WITHIN/CONTAINER>

Similar to the Hybrid approach, the Bind approach starts the MPI application by calling the MPI launcher (mpirun) from the host. The major difference, however, is the bind-mount of the host MPI through --bind $MPI_DIR. Figuratively speaking, the host sets the stage by launching ranks, providing (MPI) libraries, etc., while the container just brings its own environment, i.e., Python code and other dependencies, without an MPI layer. This can make such containers more lightweight, as they do not include any MPI implementation.

Comparison Hybrid Model vs Bind Model in Apptainer#

The Hybrid model’s advantage of providing a higher degree of (containerized) autonomy comes with the disadvantage of a compatibility requirement between container MPI and host MPI. The keyword is ABI (Application Binary Interface) compatibility. Basically, an ABI is a low-level specification that defines how compiled programs interact with the system and with each other at the binary level. ABI compatibility of two (separately compiled) binaries ensures matching specifications for data type sizes and alignment, function calling conventions, system call numbers, register usage, stack layout, and exception handling. Incompatibility may manifest itself in elusive runtime errors or segmentation faults despite a seemingly successful compilation.
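
For the Hybrid model, a quick (though not exhaustive) sanity check is to compare the MPI versions reported on the host and inside the image; identical, or at least same-series, OpenMPI versions are the safest choice. The image name below is a placeholder:

mpirun --version | head -n 1                              # host MPI
apptainer exec myIMAGE.sif mpirun --version | head -n 1   # containerized MPI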

| Feature | Hybrid Model | Bind Model |
|---------|--------------|------------|
| MPI inside container? | Containerized MPI installation interacts with host MPI | Container uses host MPI via bind mount |
| MPI launcher location | Host - mpirun launches apptainer (containers) | Host - mpirun launches apptainer |
| MPI process management | OpenMPI daemon (ORTED) launches containers and connects via Process Management Interface (PMI) | No ORTED inside container; host MPI handles everything |
| Compatibility needs | ⚠️ Container MPI must be ABI-compatible with host MPI | ✅ Compatibility given through host MPI |
| Performance tuning | ⚠️ Container MPI must be configured for host hardware, e.g., UCX, verbs | ✅ Host MPI already tuned for hardware |
| Container size | 📦 Larger - includes MPI stack | 📦 Smaller - no MPI inside container |
| Use case | When bind mounts are restricted or the host MPI is not accessible; for portability and full container isolation | When bind mounts are allowed; for simplicity and lightweight containers |

This Howto focuses on the Bind model. As the Bind model’s mpirun template above shows, this requires two steps:

  1. Know where the MPI implementation on the host is installed (MPI_DIR); one way to determine this is sketched after this list

  2. Mount/bind the host MPI into the container in a location where the (container’s) system will be able to find libraries and binaries.
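
Regarding Step 1, one way to determine MPI_DIR on a module-based system like Ramses is to load the MPI-module and derive the installation prefix from the mpirun binary it puts on the PATH (module name as used throughout this guide):

module load mpi/OpenMPI/4.1.5-GCC-12.3.0
MPI_DIR=$(dirname "$(dirname "$(which mpirun)")")
echo $MPI_DIR   # e.g., /projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0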

If one wishes to unlock the full MPI performance of a given HPC system, Step 2 brings up another layer of bind-mounts not yet mentioned, one that applies to probably the majority of HPC systems. On the hardware level, many HPC systems use fast low-latency network technology like InfiniBand. It is desirable for MPI containers to stay on the InfiniBand highway, i.e., not to fall back to a slower network alternative. This requires the software translators that link your MPI programs to the InfiniBand hardware to be available inside the container. These translators are UCX and UCC.

What are UCX and UCC?#

UCX (Unified Communication X) is a high-performance messaging layer that accelerates data transfer across HPC systems. It supports transports like Infiniband, shared memory, TCP, and GPU interconnects. UCX acts as the backbone for MPI implementations, accelerating data movement between nodes and devices.

UCC (Unified Collective Communication) builds on UCX by optimizing collective operations like broadcast, reduce, and barrier, as these are essential for scaling parallel applications. UCC ensures that these operations run efficiently across modern interconnects like Infiniband and NVLink.

Together, UCX + UCC allow MPI libraries and higher-level programs (using mpi4py) to fully exploit hardware acceleration for both individual and group communications. From low-level to high-level, the stack thus reads:
Hardware (i.e., Infiniband, NVLink) ➡️ UCX (Transports) ➡️ UCC (Collectives) ➡️ MPI Library (e.g., OpenMPI) ➡️ Python (mpi4py)

Without going into more detail, the basic rule is: to leverage high-performance messaging layers, that is, to use InfiniBand via UCX/UCC, an MPI-capable container requires the exposure of UCX/UCC-related libraries at both build time and runtime. Here, build time refers to the compilation of mpi4py inside Apptainer; in other words, one cannot use a pre-built mpi4py that was compiled without UCX/UCC support. Without UCX/UCC support, our MPI program (or mpi4py) would fall back to slower, less capable transports like TCP-only, or it may fail to initialize. The following section describes the steps for building a UCX/UCC-capable Apptainer image for the Bind-model approach.
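
To check what the host actually offers before building, the following commands can be helpful; they assume the libibverbs utilities are installed and that UCX is available as a module (the module name is taken from the bind-mount examples below):

ibv_devinfo | grep -E "hca_id|link_layer"   # list InfiniBand HCAs and their link layer
module load UCX/1.14.1-GCCcore-12.3.0
ucx_info -d | grep -i transport             # available UCX transports (look for rc/ud/dc verbs entries)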

Building a sample Bind-model container#

Scripts and supporting files can be found on the Ramses HPC cluster in the directory:
/projects/sw/Apptainer/Build_Examples/build_mpi4py/

Step 1) Build base sandbox from Apptainer definition file#

The starting point is the following SDF (Singularity/Apptainer definition file), to be found under /projects/sw/Apptainer/Build_Examples/build_mpi4py/centos_stream9.def

Bootstrap: docker
From: rockylinux:9

%labels
    Author Michael Commer @ ITCC
    Purpose "Python + mpi4py container using host MPI/RDMA stack on RHEL9"

%environment
    export MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
    export PATH=${MPI_PATH}/bin:/opt/venv/bin:$PATH
    export LD_LIBRARY_PATH=${MPI_PATH}/lib:/opt/lib:/opt/ucc:/opt/ucx:/opt/gcc:\
    /opt/hwloc:/opt/libfabric:/opt/numactl:/opt/gpfs:/opt/binutils:/opt/zlib:\
    /opt/libxml2:/opt/libpciaccess:$LD_LIBRARY_PATH
    export PYTHONPATH=/opt/venv/lib/python3.9/site-packages:$PYTHONPATH

%post
    # Install system dependencies
    dnf install -y --allowerasing \
        python3 python3-pip python3-devel \
        gcc gcc-c++ make \
        util-linux git curl ca-certificates \
        && dnf clean all

    # Create mount points for host MPI and libraries
    export MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
    for d in lib ucc ucx gcc hwloc libfabric numactl gpfs binutils zlib libxml2 libpciaccess; do
      mkdir -p /opt/$d
    done
    mkdir -p /etc/libibverbs.d $MPI_PATH

    # Create and activate Python virtual environment
    python3 -m venv /opt/venv
    source /opt/venv/bin/activate

%files
# copy specific MPI-related/RDMA driver libs into the container
/lib64/libefa.so.1 /opt/lib/libefa.so.1
/lib64/libibverbs.so.1 /opt/lib/libibverbs.so.1
/lib64/libm.so.6 /opt/lib/libm.so.6
/lib64/libnl-3.so.200 /opt/lib/libnl-3.so.200
/lib64/libnl-route-3.so.200 /opt/lib/libnl-route-3.so.200
/lib64/libpmi.so.0 /opt/lib/libpmi.so.0
/lib64/libpmi2.so.0 /opt/lib/libpmi2.so.0
/lib64/librdmacm.so.1 /opt/lib/librdmacm.so.1
/lib64/libresolv.so.2 /opt/lib/libresolv.so.2
/lib64/libuuid.so.1 /opt/lib/libuuid.so.1
/lib64/libz.so.1 /opt/lib/libz.so.1
/usr/lib64/libibverbs/libmlx5-rdmav34.so /opt/lib/libmlx5-rdmav34.so
/usr/lib64/slurm/libslurm_pmi.so /opt/lib/libslurm_pmi.so

Building the sandbox is done as follows:

apptainer -v build --fakeroot --sandbox sandbox_centos_stream9 centos_stream9.def

As a shortcut, you can use the helper-script command ./build.sh bs instead. When done, you will see an unpacked directory tree sandbox_centos_stream9/, referred to as a sandbox (image). Such a sandbox is useful for development and debugging because you can always re-enter it via apptainer shell --writable sandbox_centos_stream9/ and install/fix things inside without rebuilding.
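
Before moving on, a quick sanity check of the fresh sandbox might look like this, verifying that Python is installed and that the mount points from the %post section exist:

apptainer exec sandbox_centos_stream9/ python3 --version
apptainer exec sandbox_centos_stream9/ ls /opt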

Step 2) Compile mpi4py inside sandbox#

Now, we want to rebuild mpi4py against the host OpenMPI+UCX+UCC stack. This step involves

  • loading the appropriate MPI-module:

    module purge && module load mpi/OpenMPI/4.1.5-GCC-12.3.0

  • bind-mounting MPI:

    ... --bind ${MPI_PATH}:${MPI_PATH}:ro

  • bind-mounting UCX/UCC and other MPI-related libraries, for example:

    ... --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro

You can perform these steps via the helper-script command ./build.sh is, which essentially performs all necessary bind-mounts:

# Enter sandbox and manually compile mpi4py
MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
FMOD=mpi/OpenMPI/4.1.5-GCC-12.3.0
module purge && module load $FMOD
# ...
apptainer shell --writable \
  --bind ${MPI_PATH}:${MPI_PATH}:ro \
  --bind /etc/libibverbs.d:/etc/libibverbs.d:ro \
  --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro \
  --bind /projects/sw/eb/arch/zen4/software/UCC/1.2.0-GCCcore-12.3.0/lib:/opt/ucc:ro \
  --bind /projects/sw/eb/arch/zen4/software/GCCcore/12.3.0/lib/../lib64:/opt/gcc:ro \
  --bind /projects/sw/eb/arch/zen4/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/opt/hwloc:ro \
  --bind /projects/sw/eb/arch/zen4/software/libfabric/1.21.0-GCCcore-12.3.0/lib:/opt/libfabric:ro \
  --bind /projects/sw/eb/arch/zen4/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/numactl:ro \
  --bind /usr/lpp/mmfs/lib:/opt/gpfs:ro \
  --bind /projects/sw/eb/arch/zen4/software/binutils/2.40-GCCcore-12.3.0/lib:/opt/binutils:ro \
  --bind /projects/sw/eb/arch/zen4/software/zlib/1.2.13-GCCcore-12.3.0/lib:/opt/zlib:ro \
  --bind /projects/sw/eb/arch/zen4/software/libxml2/2.11.4-GCCcore-12.3.0/lib:/opt/libxml2:ro \
  --bind /projects/sw/eb/arch/zen4/software/libpciaccess/0.17-GCCcore-12.3.0/lib:/opt/libpciaccess:ro \
sandbox_centos_stream9

Once inside the sandbox (at the prompt Apptainer>), launch the mpi4py compilation:

Apptainer> LDFLAGS="-L/opt/ucx -L/opt/ucc -lucp -lucs -luct -lucc" pip install --no-binary=mpi4py --no-cache-dir --force-reinstall mpi4py

Building mpi4py inside Apptainer involves compiling Python bindings that link against the MPI libraries and their underlying (UCX and UCC) communication frameworks. In this case, it was found that the compiler and linker did not find the associated UCX/UCC libraries automatically, despite adequate settings of LD_LIBRARY_PATH, since LD_LIBRARY_PATH is not evaluated at compile time. Hence the additional LDFLAGS="-L/opt/ucx ..." passed to pip install. On other systems, and with other versions of MPI/mpi4py etc., finding the correct linker flags may become a trial-and-error procedure.
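
To check, still inside the sandbox, that the freshly built mpi4py indeed links against the host OpenMPI and the UCX/UCC libraries, one can inspect its build configuration and the shared-library dependencies of its compiled extension module (the exact set of libraries listed depends on how the host OpenMPI was built):

Apptainer> python -c "import mpi4py; print(mpi4py.get_config())"
Apptainer> ldd $(python -c "from mpi4py import MPI; print(MPI.__file__)") | grep -E "mpi|ucp|ucs|ucc"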

Step 2b, optional) Verify UCX/UCC transport inside Apptainer#

To make sure UCX libraries are present and contain required symbols:

Apptainer> nm -D /opt/ucx/libucs.so | grep ucs_mpool_params_reset

which should produce something like 00000000000233b0 T ucs_mpool_params_reset, where the T indicates a symbol defined in the library’s text (code) section. If the symbol is missing, UCX may be outdated, incompatible, or incorrectly linked. Afterwards, check that UCC links against UCX:

Apptainer> ldd /opt/ucc/libucc.so.1 | grep libucs

which should produce something like libucs.so.0 => /opt/ucx/libucs.so.0 (0x00007fa8d88da000). Finally, you could try out mpi4py:

Apptainer> python
# Then run this 2-liner:
>>> from mpi4py import MPI
>>> print(f"Hello from rank {MPI.COMM_WORLD.Get_rank()} of {MPI.COMM_WORLD.Get_size()}")

Step 3) Build final image in Singularity image format (*.sif)#

When happy with the performance of the sandbox, you can build a final immutable image:

apptainer build centos_stream9.sif sandbox_centos_stream9

Again, this step can also be run through the helper-script: ./build.sh bf. Generally, keep the sandbox as long as you are still iterating and expect to configure and debug inside the container. Create the final .sif image once the build works and you want something stable and portable.
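
As a final check, the image metadata (e.g., the %labels section from the definition file) and the image size can be inspected with:

apptainer inspect centos_stream9.sif
ls -lh centos_stream9.sif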

Step 3b, optional) Perform simple mpi4py benchmark#

You can run an interactive slurm job with two nodes in order to test node-to-node communication using the mpi4py.bench module:

salloc --nodes=2 --ntasks-per-node=1 --job-name=n2xt1 --time=00:30:00
# when interactive slurm job has started:
module load mpi/OpenMPI/4.1.5-GCC-12.3.0
./build.sh r # runs "mpi4py.bench pingpong"

Assigning one task per node forces communication across the two nodes. The helper-script command ./build.sh r is a shortcut to the command

m=67108868 # max packet size for benchmark test
appt_tools mpirun -n 2 python -m mpi4py.bench pingpong --max-size $m

where appt_tools is another shortcut to the above mpirun ... apptainer exec --bind ... command. The tool appt_tools is described further below. In the current case, appt_tools eases your life by creating and launching a script mpirun.sh that contains the following lengthy mpirun command:

MPI_DIR=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
FSIF=/projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif # copy of the final sif we build above
FEXE="python -m mpi4py.bench pingpong --max-size 67108868" # command to run inside Apptainer
mpirun -n 2 apptainer exec \
  --bind ${MPI_DIR}:${MPI_DIR}:ro \
  --bind /etc/libibverbs.d:/etc/libibverbs.d:ro \
  --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro \
  --bind /projects/sw/eb/arch/zen4/software/UCC/1.2.0-GCCcore-12.3.0/lib:/opt/ucc:ro \
  --bind /projects/sw/eb/arch/zen4/software/GCCcore/12.3.0/lib/../lib64:/opt/gcc:ro \
  --bind /projects/sw/eb/arch/zen4/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/opt/hwloc:ro \
  --bind /projects/sw/eb/arch/zen4/software/libfabric/1.21.0-GCCcore-12.3.0/lib:/opt/libfabric:ro \
  --bind /projects/sw/eb/arch/zen4/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/numactl:ro \
  --bind /projects/sw/eb/arch/zen4/software/binutils/2.40-GCCcore-12.3.0/lib:/opt/binutils:ro \
  --bind /projects/sw/eb/arch/zen4/software/zlib/1.2.13-GCCcore-12.3.0/lib:/opt/zlib:ro \
  --bind /projects/sw/eb/arch/zen4/software/libxml2/2.11.4-GCCcore-12.3.0/lib:/opt/libxml2:ro \
  --bind /projects/sw/eb/arch/zen4/software/libpciaccess/0.17-GCCcore-12.3.0/lib:/opt/libpciaccess:ro \
  --bind /usr/lpp/mmfs/lib:/opt/gpfs:ro \
$FSIF $FEXE

The pingpong test shows this kind of output:

# MPI PingPong Test
# Size [B]  Bandwidth [MB/s] | Time Mean [s] ± StdDev [s]  Samples
         1              0.62 | 1.6172415e-06 ± 1.6920e-07    10000
         2              1.23 | 1.6257193e-06 ± 1.9910e-07    10000
         4              2.47 | 1.6217059e-06 ± 2.0745e-07    10000
         8              4.85 | 1.6494426e-06 ± 1.5699e-07    10000
       ...
   1048576          11412.46 | 9.1879942e-05 ± 5.1434e-07     1000
   2097152          11831.73 | 1.7724815e-04 ± 4.8272e-07       10
   4194304          12080.45 | 3.4719765e-04 ± 3.8304e-07       10
   8388608          12209.32 | 6.8706595e-04 ± 3.6534e-07       10
  16777216          12272.32 | 1.3670782e-03 ± 4.2741e-07       10
  33554432          12296.29 | 2.7288256e-03 ± 6.3523e-06       10
  67108864          12322.43 | 5.4460724e-03 ± 7.1724e-07       10

where a convergence of the Bandwidth [MB/s] value towards ~12000 MB/s indicates the expected bandwidth with UCX/UCC transport; roughly 12000 MB/s corresponds to about 96 Gbit/s, consistent with a ~100 Gbit/s InfiniBand link (this holds for a modest load on the compute nodes).

A side remark is that one may see the following kind of message upon launching a containerized MPI-parallel application:

Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
-79797552 or higher.

OpenMPI uses PMIx to determine process locality, such as which CPU socket or NUMA domain a process is on. If PMIx cannot provide that info, e.g., due to an older version or missing integration, OpenMPI’s OFI (libfabric) transport layer fails to make smart NIC choices and falls back to round-robin or default behavior. Hence, this is only a performance advisory about seeing multiple NICs (e.g., Infiniband interfaces) that are equally close to the process. There is an internal setting

export FI_PROVIDER=verbs
export FI_VERBS_IFACE=ib0

carried out by appt_tools prior to launching mpirun, which should help avoid the OFI/NIC-related advisory.

appt_tools - a helper tool for containerized/apptainerized mpirun/sbatch jobs#

One of the main purposes of appt_tools is to provide shortcuts to common apptainer command lines that run mpirun via the Bind model, such as

$ mpirun -n <NRANKS> apptainer exec --bind "$MPI_DIR" --bind ... <IMAGE> </PATH/TO/MPI-PROGRAM/WITHIN/CONTAINER>

with potentially numerous --bind entries. While appt_tools has other functionalities, only the mpirun feature is described here. This feature is activated via the mpirun subcommand:

appt_tools [OPTIONS] mpirun <YOUR-MPI-OPTIONS>

which creates and instantly launches a run script mpirun.sh. You can use the option -o <OUTFILE> to choose a different output script name. If you prefer not to launch the script right away, use

appt_tools [OPTIONS] mpirunx <YOUR-MPI-OPTIONS>

which will only create mpirun.sh. The helper-script build.sh contains two examples:

#
# appt_tools - mpirun Example 1: mpi4py.bench
m=67108868 # max packet size for benchmark test
# writes a bash script mpi4py.bench.sh and launches it
appt_tools -V -o mpi4py.bench.sh mpirun -n 2 python -m mpi4py.bench pingpong --max-size $m
#
# appt_tools - mpirun Example 2: simple MPI ring-communication test
# writes a bash script mpirun.sh and launches it
exe=/projects/sw/Apptainer/usr/bin/mpiinitst.4.1.5-GCC-12.3.0.exe
appt_tools -V mpirun -n 2 $exe

which you can run with ./build.sh r1 and ./build.sh r2, respectively.

appt_tools runtime parameters#

In the output of ./build.sh r1 or ./build.sh r2, one notes the settings

MPI-module: mpi/OpenMPI/4.1.5-GCC-12.3.0           (already-loaded MPI-module mpi/* in parent-shell)
MPI_DIR:    /projects/.../OpenMPI/4.1.5-GCC-12.3.0 (module prepend-path PATH)
SIFFILE:    ./centos_stream9+mpi4py.sif            (Apptainer/Singularity-image-format file)

MPI_DIR is required to mount the host’s MPI into Apptainer and is evaluated from the setting for MPI-module. SIFFILE provides the image file (<IMAGE>) for the apptainer exec command. Different options exist for providing these values to appt_tools as outlined in the following.

appt_tools configuration (*.ini) file#

This section describes runtime parameters that are needed when launching apptainerized MPI programs. Different ways exist for providing runtime parameters to appt_tools; all of them can be set either via command-line arguments or via configuration (*.ini) file settings. The appt_tools configuration (*.ini) file format is not exactly the same as the common INI file format; it has the general form

[VAR] # value
<VALUE>

or

[VAR] # list of values
<VALUE_1>
...
<VALUE_N>

[mod]: MPI-module / MPI_DIR#

The host’s MPI-module and its associated bind/mount directory MPI_DIR can be provided in different ways. The following 5 ways are evaluated in the listed order until a setting for MPI-module is found.
  • Command-line option -m / --module: You can provide the MPI-module via the command-line

    appt_tools -m mpi/OpenMPI/4.1.5-GCC-12.3.0 mpirun ...
    

    This will let appt_tools extract MPI_DIR from the PATH information of the specified module, here mpi/OpenMPI/4.1.5-GCC-12.3.0.

  • module load line present in slurm file: Only applicable for runs that involve the sbatch/sbatchx subcommand. See the appt_tools documentation, to be invoked via appt_tools -doc.

  • Setting [mod] in the appt_tools configuration file: The configuration file has the file suffix *.ini. An adequate entry would then be

    [mod] # MPI-module
    mpi/OpenMPI/4.1.5-GCC-12.3.0
    
  • MPI-module already loaded: If the above steps do not produce a setting for an MPI-module, the next attempt consists of figuring out if an adequate MPI-module is already loaded in the current environment. This is done internally via the module list command.

  • Environment variable MOD_MPI: The last option of providing MPI-module information consists of a variable setting like (in bash)

    export MOD_MPI=mpi/OpenMPI/4.1.5-GCC-12.3.0
    

    If you always work with the same MPI-module, setting MOD_MPI in your shell initialization file (e.g., ~/.bashrc) might be useful.

[sif]: IMAGE file (*.sif)#

Setting the Apptainer image file (<IMAGE>) can also be done in different ways, which are evaluated in the following order.
  • Command-line option -s / --sif: You can provide an absolute or relative file path for <IMAGE>:

    appt_tools -s /projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif mpirun ...
    
  • Setting [sif] in appt_tools configuration (*.ini) file. An adequate entry would be

    [sif] # SIFFILE
    /projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif
    
  • Newest *.sif file in the current directory: If the above steps do not provide a setting, the newest *.sif file (if multiple files are present) in the current directory is used.

  • Environment variable APT_SIFFILE: The last option of providing SIFFILE consists of a variable setting like (in bash)

    export APT_SIFFILE=/path/to/your/file.sif
    

    If you repeatedly work with the same SIFFILE, setting APT_SIFFILE in your shell initialization file (e.g., ~/.bashrc) might be useful.
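
Putting the two parameters together, a complete appt_tools configuration (*.ini) file could look as follows (values taken from the examples above; see appt_tools -doc for further details):

[mod] # MPI-module
mpi/OpenMPI/4.1.5-GCC-12.3.0

[sif] # SIFFILE
/projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif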