Running Analysis Pipelines - guide to containerized MPI-parallel workflows
==========================================================================

Introduction
------------

Analysis pipelines for large-scale astronomical data processing can consist of a
variety of specialized software tools and libraries. These tools may have been
developed under operating systems and software environments that differ from the
actual runtime environment, i.e., a high-performance computing (HPC) cluster
like Ramses. Therefore, it may be beneficial to wrap workflows in containers.

Containerization is the packaging of all your analysis code with just the
operating system (OS) libraries and dependencies required to run the code. A
container is essentially a single lightweight executable that runs consistently
on any infrastructure. When the container concept is extended to MPI-parallel
applications, some of this independence from the hosting OS is lost, mainly
because MPI applications depend on low-level system libraries. For example,
efficient MPI communication often relies on low-latency network hardware like
InfiniBand, which requires specific drivers and libraries to be available inside
the container at both build and runtime.

This guide describes the creation of workflow containers for MPI-parallel
applications, with a focus on employing ``mpi4py``, the MPI bindings for Python,
as a high-level programming interface. Details are provided for building an
Apptainer/Singularity image that

* uses the **host's** MPI, according to Apptainer's **Bind model**,
* has ``mpi4py`` compiled within Apptainer,
* leverages InfiniBand via UCX/UCC.

A prerequisite for the material covered below is some general understanding of

* general containerization concepts
* general MPI-parallel concepts
* `Apptainer/Singularity `_ basics
* `Python/mpi4py `_
Container basics and Apptainer/Singularity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In principle, containerization involves three steps:

1. Define the container image (OS, dependencies, environment) via a definition
   file, i.e., what will the container do?
2. Build the container image from the definition file.
3. Instantiate a container from that image and run it on the host system.

Apptainer (formerly Singularity) is a containerization technology designed for
HPC environments. It is better suited to HPC systems than Docker, mainly
because root privileges are not required to run containers.

The first step in containerization is to create a container image. Container
images involve image definition files that specify, among other things, the
base OS image. The ``Dockerfile`` is Docker's default name for such a
definition file. In Apptainer, the Dockerfile counterpart is called a
Singularity Definition File (SDF). SDFs typically have the file extension
``*.def``.

A container image is then created from an SDF, producing a read-only template
containing instructions for spawning a container. The container image
encapsulates the filesystem, environment, and metadata. In Apptainer, images
are mostly stored in a format referred to as SIF (Singularity Image Format),
using the extension ``*.sif``.

Hence, a container is an instance of an image that runs as a process on the
host system. A container is not started the same way as a compiled binary, but
rather through the container engine, which is basically a container manager.
The engine provides the execution environment for container images and
virtualizes the resources for containerized applications. For example, on the
Windows OS, Docker Desktop is a container engine. In Apptainer, the container
engine is part of the ``apptainer`` command-line tool. So a command like
.. code:: bash

   apptainer exec myIMAGE.sif python myscript.py

lets Apptainer instantiate a container from the template ``myIMAGE.sif`` and
execute the Python script ``myscript.py``, where the latter runs inside the
container, i.e., it uses the OS, filesystem, and environment as defined in the
image.

Note that in the following, the words *container* and *apptainer* are used
synonymously in contexts that distinguish between operations that happen
inside or outside (on the host) of a container, i.e., one may read
*inside/outside Apptainer ...*.

MPI and ``mpi4py`` (MPI for Python)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Apptainer currently supports two open-source implementations of MPI: OpenMPI
and MPICH. This document focuses on OpenMPI.

MPI stands for Message Passing Interface and is a common standard for
performing communication across parallel computing architectures, which can be
compute nodes of a single system or span multiple compute platforms. MPI comes
in the form of libraries with a low-level set of commands that enable message
passing, parallel IO, etc. Owing to its low-level nature, MPI is often used in
conjunction with compiled HPC-suitable programming languages like C/C++ or
Fortran.

As opposed to more low-level scientific programming languages, Python is an
ideal candidate for implementing the higher-level parts of compute-intensive
applications. MPI for Python (``mpi4py``) has thus evolved in order to provide
MPI bindings for Python programs. ``mpi4py`` builds on the MPI specification
and provides an object-oriented interface based on the MPI-2 C++ bindings (see
also `MPI for Python `__).

MPI + Apptainer
~~~~~~~~~~~~~~~

According to the `Apptainer+MPI documentation `__, one distinguishes between
two different ways of joining Apptainer with MPI. The first way is referred to
as the *Host-MPI model*, also called the *Hybrid model*, which is useful when
shipping containerized MPI-parallel applications.
This model involves a containerized MPI working in conjunction with the host's
MPI. Therefore, the name *Hybrid* appears more suitable than *Host-MPI*, as
the latter may suggest involvement of only a host MPI (personal opinion of the
author). The second method, referred to as the *Bind model* and explained in
more detail below, involves mounting the host's MPI into Apptainer. Hence, the
containerized application uses the host MPI directly, that is, without a
containerized MPI layer.

Model 1: Hybrid model
^^^^^^^^^^^^^^^^^^^^^

The essence of the Hybrid model is that one executes some containerized MPI
program, which in practice is realized as follows:

::

   $ mpirun -n <NRANKS> apptainer exec <IMAGE> <EXE>

where ``<IMAGE>`` is the Apptainer image (residing on the host). Here,
``mpirun`` is executed on the host and launches the ``apptainer`` command
itself. Hence, each MPI rank is launched as a separate container process; in
other words, the host's process manager daemon (ORTED in OpenMPI) launches an
Apptainer container for each MPI rank. Inside Apptainer, the application
(``<EXE>``, i.e., ``/PATH/TO/MPI-PROGRAM/WITHIN/APPTAINER``) loads
containerized MPI libraries which connect back to ORTED via PMI (Process
Management Interface). This procedure results in MPI communication across
container boundaries. It is the necessary way if directly mounting the host
MPI into Apptainer is not possible due to security policies. The other case,
with less strict policies, leads to the second model, called the *Bind model*.

Model 2: Bind model
^^^^^^^^^^^^^^^^^^^

The essence of the Bind model involves mounting the host's MPI into Apptainer,
again expressed via the ``mpirun`` command:

::

   $ export MPI_DIR=<PATH/TO/HOST/MPI>
   $ mpirun -n <NRANKS> apptainer exec --bind $MPI_DIR <IMAGE> <EXE>

Similar to the hybrid approach, the bind approach starts the MPI application
by calling the MPI launcher (``mpirun``) from the host. However, note that the
major difference is the *bind/mount* of the host MPI through
``--bind $MPI_DIR``.
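On real systems, such a Bind-model invocation typically accumulates many ``--bind`` flags. As a small illustration of how such a command line can be assembled programmatically, here is a hypothetical Python sketch (``build_bind_command`` is an invented helper for this guide, not an actual Apptainer or ``appt_tools`` API):

```python
# Hypothetical sketch: build an "mpirun ... apptainer exec" argument list for
# the Bind model from a mapping of host directories to container mount points.

def build_bind_command(nranks, image, exe, mpi_dir, extra_binds=None):
    """Return the argument list for a Bind-model mpirun invocation."""
    binds = {mpi_dir: mpi_dir}        # host MPI mounted at the same path
    binds.update(extra_binds or {})   # e.g., UCX/UCC libs -> /opt/ucx, ...
    cmd = ["mpirun", "-n", str(nranks), "apptainer", "exec"]
    for host_dir, container_dir in binds.items():
        cmd += ["--bind", f"{host_dir}:{container_dir}:ro"]
    return cmd + [image] + exe.split()

cmd = build_bind_command(
    nranks=2,
    image="centos_stream9.sif",
    exe="python myscript.py",
    mpi_dir="/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0",
    extra_binds={
        "/projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib": "/opt/ucx",
    },
)
print(" ".join(cmd))
```

Mounting read-only (``:ro``) mirrors the commands shown later in this guide; the host MPI tree is mounted at its original path so that paths baked into the host MPI installation keep working inside the container.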
Figuratively speaking, the host sets the stage by launching ranks, providing
(MPI) libraries, etc., while the container just brings its own environment,
i.e., Python code and other dependencies, not including an MPI layer. This can
make such containers more lightweight, as they do not include any MPI
implementation.

Comparison Hybrid Model vs Bind Model in Apptainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Hybrid model's advantage of providing a higher degree of (containerized)
autonomy comes with the disadvantage of a compatibility requirement between
container MPI and host MPI. The keyword is ABI (Application Binary Interface)
compatibility. Basically, an ABI is a low-level specification that defines how
compiled programs interact with the system and with each other at the binary
level. ABI compatibility of two (separately compiled) binaries ensures
matching specifications on data type sizes and alignment, function calling
conventions, system call numbers, register usage, stack layout, and exception
handling. Incompatibility may manifest in elusive runtime errors or
segmentation faults despite a seemingly successful compilation process.
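One rough, illustrative spot check (not a full ABI audit) is to compare the shared-library sonames a binary requires against those the host provides; a differing major version (e.g., ``libmpi.so.12`` vs ``libmpi.so.40``) is a strong hint of incompatibility, while a matching soname does not prove full compatibility. The helper below is a hypothetical sketch for this guide:

```python
# Hypothetical sketch: flag required sonames whose major version does not
# match what the host provides. A matching soname does NOT guarantee full
# ABI compatibility -- it only rules out the most obvious mismatches.

def soname_mismatches(required, provided):
    """required/provided: iterables of sonames like 'libmpi.so.40'."""
    def split(soname):
        name, _, version = soname.partition(".so.")
        return name, version
    available = dict(split(s) for s in provided)
    problems = []
    for soname in required:
        name, version = split(soname)
        if name not in available:
            problems.append(f"{soname}: not provided by host")
        elif available[name] != version:
            problems.append(f"{soname}: host has {name}.so.{available[name]}")
    return problems

# Example: a binary built against an old OpenMPI (libmpi.so.12) on a host
# shipping OpenMPI 4.x (libmpi.so.40):
print(soname_mismatches(["libmpi.so.12"], ["libmpi.so.40", "libucs.so.0"]))
# -> ['libmpi.so.12: host has libmpi.so.40']
```

In practice, the required sonames would come from ``readelf -d`` or ``ldd`` output on the container binary, and the provided ones from the host's library directories.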
.. Table for comparison of Hybrid vs Bind model MPI in Apptainer

+---------------------------+----------------------------------------------------------+------------------------------------------+
| Feature                   | **Hybrid Model**                                         | **Bind Model**                           |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **MPI inside container?** | Containerized MPI installation interacts with host MPI   | Container uses host MPI via bind mount   |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **MPI launcher location** | Host - ``mpirun`` launches apptainer (containers)        | Host - ``mpirun`` launches apptainer     |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **MPI process management**| OpenMPI daemon (ORTED) launches containers and connects  | No ORTED inside container —              |
|                           | via Process Management Interface (PMI)                   | host MPI handles everything              |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Compatibility needs**   | ⚠️ Container MPI must be ABI-compatible with host MPI    | ✅ Compatibility given through host MPI   |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Performance tuning**    | ⚠️ Container MPI must be configured for host hardware,   | ✅ Host MPI already tuned for hardware    |
|                           | e.g., UCX, verbs                                         |                                          |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Container size**        | 📦 Larger - includes MPI stack                           | 📦 Smaller - no MPI inside container     |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Use case**              | When bind-mounts are restricted, host MPI not accessible,| When bind mounts are allowed, for        |
|                           | for portability and full container isolation             | simplicity and lightweight containers    |
+---------------------------+----------------------------------------------------------+------------------------------------------+

This Howto focuses on the Bind model. As the Bind model's ``mpirun`` template
above shows, this requires two steps:

1) Know where the MPI implementation on the host is installed (``MPI_DIR``).
2) Mount/bind the host MPI into the container in a location where the
   (container's) system will be able to find libraries and binaries.

If one wishes to unlock the full MPI performance of a given HPC system, Step 2
brings up another layer of bind-mounts not yet mentioned, one that applies to
probably the majority of HPC systems. On a hardware level, many HPC systems
use fast low-latency network technology like InfiniBand. It is desirable for
MPI containers to stay on the InfiniBand highway, i.e., not to have to revert
to a slower network alternative. This requires the software translators that
link your MPI programs to the InfiniBand hardware to be available inside a
container. Such translators are UCX and UCC.

What is UCX and UCC?
^^^^^^^^^^^^^^^^^^^^

UCX (Unified Communication X) is a high-performance messaging layer that
accelerates data transfer across HPC systems. It supports transports like
InfiniBand, shared memory, TCP, and GPU interconnects. UCX acts as the
backbone for MPI implementations, accelerating data movement between nodes and
devices.

UCC (Unified Collective Communication) builds on UCX by optimizing collective
operations like broadcast, reduce, and barrier, as these are essential for
scaling parallel applications. UCC ensures that these operations run
efficiently across modern interconnects like InfiniBand and NVLink.
| Together, UCX + UCC allow MPI libraries and higher-level programs (using ``mpi4py``) to fully exploit hardware acceleration for both individual and group communications. From low-level to high-level, the stack thus reads:
| Hardware (i.e., InfiniBand, NVLink) ➡️ UCX (Transports) ➡️ UCC (Collectives) ➡️ MPI Library (e.g., OpenMPI) ➡️ Python (``mpi4py``)

Without going into more detail, the basic rule is: to leverage
high-performance messaging layers, that is, to use **InfiniBand via UCX/UCC**,
an MPI-capable container requires the exposure of UCX/UCC-related libraries at
build and runtime. Here, build time refers to the compilation of ``mpi4py``
inside Apptainer; that is, one cannot use a pre-built ``mpi4py`` that was
compiled without UCX/UCC support. Without UCX/UCC support, our MPI program (or
``mpi4py``) would fall back to slower, less capable transports like TCP-only,
or it may fail to initialize.

The following section describes the steps for building a UCX/UCC-capable
Apptainer image for the Bind-model approach.

Building a sample Bind-model container
--------------------------------------

| Scripts and supporting files are on the Ramses HPC cluster, directory:
| ``/projects/sw/Apptainer/Build_Examples/build_mpi4py/``

Step 1) Build base sandbox from Apptainer definition file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The starting point is the following SDF (Singularity/Apptainer definition
file), to be found under
``/projects/sw/Apptainer/Build_Examples/build_mpi4py/centos_stream9.def``
.. code::

   Bootstrap: docker
   From: rockylinux:9

   %labels
       Author Michael Commer @ ITCC
       Purpose "Python + mpi4py container using host MPI/RDMA stack on RHEL9"

   %environment
       export MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
       export PATH=${MPI_PATH}/bin:/opt/venv/bin:$PATH
       export LD_LIBRARY_PATH=${MPI_PATH}/lib:/opt/lib:/opt/ucc:/opt/ucx:/opt/gcc:\
   /opt/hwloc:/opt/libfabric:/opt/numactl:/opt/gpfs:/opt/binutils:/opt/zlib:\
   /opt/libxml2:/opt/libpciaccess:$LD_LIBRARY_PATH
       export PYTHONPATH=/opt/venv/lib/python3.9/site-packages:$PYTHONPATH

   %post
       # Install system dependencies
       dnf install -y --allowerasing \
           python3 python3-pip python3-devel \
           gcc gcc-c++ make \
           util-linux git curl ca-certificates \
           && dnf clean all

       # Create mount points for host MPI and libraries
       export MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
       for d in lib ucc ucx gcc hwloc libfabric numactl gpfs binutils zlib libxml2 libpciaccess; do
           mkdir -p /opt/$d
       done
       mkdir -p /etc/libibverbs.d $MPI_PATH

       # Create and activate Python virtual environment
       python3 -m venv /opt/venv
       source /opt/venv/bin/activate

   %files
       # copy specific MPI-related/RDMA driver libs into the container
       /lib64/libefa.so.1 /opt/lib/libefa.so.1
       /lib64/libibverbs.so.1 /opt/lib/libibverbs.so.1
       /lib64/libm.so.6 /opt/lib/libm.so.6
       /lib64/libnl-3.so.200 /opt/lib/libnl-3.so.200
       /lib64/libnl-route-3.so.200 /opt/lib/libnl-route-3.so.200
       /lib64/libpmi.so.0 /opt/lib/libpmi.so.0
       /lib64/libpmi2.so.0 /opt/lib/libpmi2.so.0
       /lib64/librdmacm.so.1 /opt/lib/librdmacm.so.1
       /lib64/libresolv.so.2 /opt/lib/libresolv.so.2
       /lib64/libuuid.so.1 /opt/lib/libuuid.so.1
       /lib64/libz.so.1 /opt/lib/libz.so.1
       /usr/lib64/libibverbs/libmlx5-rdmav34.so /opt/lib/libmlx5-rdmav34.so
       /usr/lib64/slurm/libslurm_pmi.so /opt/lib/libslurm_pmi.so

Building the sandbox is done as follows:
.. code:: bash

   apptainer -v build --fakeroot --sandbox sandbox_centos_stream9 centos_stream9.def

As a shortcut, you can use the helper-script command ``./build.sh bs``
instead. When done, you will see an unpacked directory tree
``sandbox_centos_stream9/``, referred to as a sandbox (image). Such a sandbox
is useful for development and debugging because you can always re-enter it via
``apptainer shell --writable sandbox_centos_stream9/`` and install/fix things
inside without rebuilding.

Step 2) Compile mpi4py inside sandbox
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, we want to rebuild ``mpi4py`` against the **host OpenMPI+UCX+UCC** stack.
This step involves

* loading the appropriate MPI module:
  ``module purge && module load mpi/OpenMPI/4.1.5-GCC-12.3.0``
* bind-mounting MPI: ``... --bind ${MPI_PATH}:${MPI_PATH}:ro``
* bind-mounting UCX/UCC and other MPI-related libraries, for example:
  ``... --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro``

You can perform these steps via the helper script ``./build.sh is``, which
essentially performs all necessary bind-mounts:

.. code:: bash

   # Enter sandbox and manually compile mpi4py
   MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
   FMOD=mpi/OpenMPI/4.1.5-GCC-12.3.0
   module purge && module load $FMOD
   # ...
.. code:: bash

   apptainer shell --writable \
     --bind ${MPI_PATH}:${MPI_PATH}:ro \
     --bind /etc/libibverbs.d:/etc/libibverbs.d:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCC/1.2.0-GCCcore-12.3.0/lib:/opt/ucc:ro \
     --bind /projects/sw/eb/arch/zen4/software/GCCcore/12.3.0/lib/../lib64:/opt/gcc:ro \
     --bind /projects/sw/eb/arch/zen4/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/opt/hwloc:ro \
     --bind /projects/sw/eb/arch/zen4/software/libfabric/1.21.0-GCCcore-12.3.0/lib:/opt/libfabric:ro \
     --bind /projects/sw/eb/arch/zen4/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/numactl:ro \
     --bind /usr/lpp/mmfs/lib:/opt/gpfs:ro \
     --bind /projects/sw/eb/arch/zen4/software/binutils/2.40-GCCcore-12.3.0/lib:/opt/binutils:ro \
     --bind /projects/sw/eb/arch/zen4/software/zlib/1.2.13-GCCcore-12.3.0/lib:/opt/zlib:ro \
     --bind /projects/sw/eb/arch/zen4/software/libxml2/2.11.4-GCCcore-12.3.0/lib:/opt/libxml2:ro \
     --bind /projects/sw/eb/arch/zen4/software/libpciaccess/0.17-GCCcore-12.3.0/lib:/opt/libpciaccess:ro \
     sandbox_centos_stream9

Once inside the sandbox (at the prompt ``Apptainer>``), launch the ``mpi4py``
compilation:

.. code:: bash

   Apptainer> LDFLAGS="-L/opt/ucx -L/opt/ucc -lucp -lucs -luct -lucc" pip install --no-binary=mpi4py --no-cache-dir --force-reinstall mpi4py

Building ``mpi4py`` inside Apptainer involves compiling Python bindings that
link against the MPI libraries and their underlying (UCX and UCC)
communication frameworks. In this case, it was found that the compiler and
linker did not find the associated UCX/UCC libraries automatically, despite
adequate settings of ``LD_LIBRARY_PATH``, as ``LD_LIBRARY_PATH`` is not
evaluated at compile time. Hence the additional passing of
``LDFLAGS="-L/opt/ucx ..."`` to ``pip install``. On other systems, and with
other versions of MPI/mpi4py etc., it may become a trial-and-error procedure
to pass the correct linking flags.
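When hunting for the libraries that must be exposed (e.g., to assemble an SDF's ``%files`` list or additional ``--bind`` mounts), the output of ``ldd`` on the built binaries can help. The following Python sketch (an illustrative helper written for this guide, not part of the build scripts) parses ``ldd``-style output into host-path/container-path pairs:

```python
# Illustrative sketch: turn "ldd" output lines into (host_path, container_path)
# pairs, e.g., as candidates for an SDF %files section. The sample text below
# mimics the typical "name => path (addr)" format of ldd output.

def ldd_to_files(ldd_output, container_dir="/opt/lib"):
    pairs = []
    for line in ldd_output.splitlines():
        parts = line.split()
        # resolved libraries look like: libfoo.so.1 => /lib64/libfoo.so.1 (0x...)
        if "=>" in parts and len(parts) >= 3 and parts[2].startswith("/"):
            host_path = parts[2]
            name = parts[0]
            pairs.append((host_path, f"{container_dir}/{name}"))
    return pairs

sample = """\
        linux-vdso.so.1 (0x00007ffd2e5f2000)
        libucs.so.0 => /opt/ucx/libucs.so.0 (0x00007fa8d88da000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f1a2c000000)
"""
for host, cont in ldd_to_files(sample):
    print(host, cont)
```

Entries without a resolved path (``linux-vdso.so.1``, ``not found`` lines) are skipped; any list generated this way should still be reviewed by hand before being baked into a definition file.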
Step 2b, optional) Verify UCX/UCC transport inside Apptainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To make sure the UCX libraries are present and contain the required symbols:

.. code:: bash

   Apptainer> nm -D /opt/ucx/libucs.so | grep ucs_mpool_params_reset

which should produce something like
``00000000000233b0 T ucs_mpool_params_reset``, where the ``T`` marks a defined
symbol. If it is missing, UCX may be outdated, incompatible, or incorrectly
linked. Afterwards, check the UCC linkage to UCX:

.. code:: bash

   Apptainer> ldd /opt/ucc/libucc.so.1 | grep libucs

which should produce something like
``libucs.so.0 => /opt/ucx/libucs.so.0 (0x00007fa8d88da000)``. Finally, you
could try out ``mpi4py``:

::

   Apptainer> python
   # Then run this 2-liner:
   >>> from mpi4py import MPI
   >>> print(f"Hello from rank {MPI.COMM_WORLD.Get_rank()} of {MPI.COMM_WORLD.Get_size()}")

Step 3) Build final image in Singularity image format (\*.sif)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When happy with the performance of the sandbox, you can build a final
immutable image:

.. code:: bash

   apptainer build centos_stream9.sif sandbox_centos_stream9

Again, this step can also be run through the helper script: ``./build.sh bf``.
Generally, keep the sandbox as long as you are still iterating and expect to
configure and debug inside the container. Create the final ``.sif`` image once
the build works and you want something stable and portable.

Step 3b, optional) Perform simple ``mpi4py`` benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run an interactive Slurm job with two nodes in order to test
node-to-node communication using the ``mpi4py.bench`` module:

.. code:: bash

   salloc --nodes=2 --ntasks-per-node=1 --job-name=n2xt1 --time=00:30:00
   # when the interactive slurm job has started:
   module load mpi/OpenMPI/4.1.5-GCC-12.3.0
   ./build.sh r   # runs "mpi4py.bench pingpong"

Assigning one task per node forces communication across the two nodes.
The helper-script command ``./build.sh r`` is a shortcut to the command

.. code:: bash

   m=67108868 # max packet size for benchmark test
   appt_tools mpirun -n 2 python -m mpi4py.bench pingpong --max-size $m

where ``appt_tools`` is another shortcut to the above
``mpirun ... apptainer exec --bind ...`` command. The tool ``appt_tools`` is
described further below. In the current case, ``appt_tools`` tries to ease
your life by creating and launching a script ``mpirun.sh`` containing the
following lengthy ``mpirun`` command:

.. code:: bash

   MPI_DIR=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
   FSIF=/projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif # copy of the final sif we built above
   FEXE="python -m mpi4py.bench pingpong --max-size 67108868" # command to run inside Apptainer
   mpirun -n 2 apptainer exec \
     --bind ${MPI_DIR}:${MPI_DIR}:ro \
     --bind /etc/libibverbs.d:/etc/libibverbs.d:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCC/1.2.0-GCCcore-12.3.0/lib:/opt/ucc:ro \
     --bind /projects/sw/eb/arch/zen4/software/GCCcore/12.3.0/lib/../lib64:/opt/gcc:ro \
     --bind /projects/sw/eb/arch/zen4/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/opt/hwloc:ro \
     --bind /projects/sw/eb/arch/zen4/software/libfabric/1.21.0-GCCcore-12.3.0/lib:/opt/libfabric:ro \
     --bind /projects/sw/eb/arch/zen4/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/numactl:ro \
     --bind /projects/sw/eb/arch/zen4/software/binutils/2.40-GCCcore-12.3.0/lib:/opt/binutils:ro \
     --bind /projects/sw/eb/arch/zen4/software/zlib/1.2.13-GCCcore-12.3.0/lib:/opt/zlib:ro \
     --bind /projects/sw/eb/arch/zen4/software/libxml2/2.11.4-GCCcore-12.3.0/lib:/opt/libxml2:ro \
     --bind /projects/sw/eb/arch/zen4/software/libpciaccess/0.17-GCCcore-12.3.0/lib:/opt/libpciaccess:ro \
     --bind /usr/lpp/mmfs/lib:/opt/gpfs:ro \
     $FSIF $FEXE

The pingpong test shows this kind of output:

::

   # MPI PingPong Test
   # Size [B]  Bandwidth [MB/s] | Time Mean [s] ± StdDev [s]     Samples
            1              0.62 | 1.6172415e-06 ± 1.6920e-07       10000
            2              1.23 | 1.6257193e-06 ± 1.9910e-07       10000
            4              2.47 | 1.6217059e-06 ± 2.0745e-07       10000
            8              4.85 | 1.6494426e-06 ± 1.5699e-07       10000
   ...
      1048576          11412.46 | 9.1879942e-05 ± 5.1434e-07        1000
      2097152          11831.73 | 1.7724815e-04 ± 4.8272e-07          10
      4194304          12080.45 | 3.4719765e-04 ± 3.8304e-07          10
      8388608          12209.32 | 6.8706595e-04 ± 3.6534e-07          10
     16777216          12272.32 | 1.3670782e-03 ± 4.2741e-07          10
     33554432          12296.29 | 2.7288256e-03 ± 6.3523e-06          10
     67108864          12322.43 | 5.4460724e-03 ± 7.1724e-07          10

where a convergence of the value *Bandwidth [MB/s]* towards ~12000 MB/s
indicates the expected bandwidth with UCX/UCC transport (this holds for a
modest load on the compute nodes).

A side remark is that one may see the following kind of message upon launching
a containerized MPI-parallel application:

.. code-block:: text

   Open MPI's OFI driver detected multiple equidistant NICs from the current
   process, but had insufficient information to ensure MPI processes fairly
   pick a NIC for use. This may negatively impact performance. A more modern
   PMIx server is necessary to resolve this issue.

   Note: This message is displayed only when the OFI component's verbosity
   level is -79797552 or higher.

OpenMPI uses PMIx to determine process locality, such as which CPU socket or
NUMA domain a process is on. If PMIx cannot provide that info, e.g., due to an
older version or missing integration, OpenMPI's OFI (libfabric) transport
layer fails to make smart NIC choices and falls back to round-robin or default
behavior. Hence, this is only a performance advisory about *seeing* multiple
NICs (e.g., InfiniBand interfaces) that are equally *close* to the process.
There is an internal setting

.. code:: bash

   export FI_PROVIDER=verbs
   export FI_VERBS_IFACE=ib0

carried out by ``appt_tools`` prior to launching ``mpirun``, which should help
avoid the OFI/NIC-related advisory.
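As a sanity check of the pingpong numbers reported above, the bandwidth column follows directly from the size and mean-time columns (bandwidth [MB/s] = size [B] / time [s] / 10\ :sup:`6`); for the largest packet:

```python
# Sanity check on the pingpong output: bandwidth [MB/s] = size [B] / time [s] / 1e6.
# The values are taken from the last row of the benchmark table above.
size_bytes = 67108864
mean_time_s = 5.4460724e-03
bandwidth_mb_s = size_bytes / mean_time_s / 1e6
print(f"{bandwidth_mb_s:.2f} MB/s")  # matches the reported 12322.43 MB/s
```

The same arithmetic applied to the small-message rows shows bandwidth there is latency-dominated (~1.6 µs per message), which is why throughput only saturates for large packets.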
appt_tools - a helper tool for containerized/apptainerized mpirun/sbatch jobs
-----------------------------------------------------------------------------

One of the main purposes of ``appt_tools`` is to provide shortcuts to common
apptainer command lines that involve ``mpirun`` through the Bind model:

::

   $ mpirun -n <NRANKS> apptainer exec --bind "$MPI_DIR" --bind ...

with potentially numerous ``--bind`` entries. While ``appt_tools`` has other
functionalities, here only the mpirun feature is described. This feature is
activated via the ``mpirun`` subcommand:

.. code:: bash

   appt_tools [OPTIONS] mpirun <MPIRUN-ARGUMENTS>

which creates and instantly launches a run script ``mpirun.sh``. You can use
the option ``-o <SCRIPT-NAME>`` to choose a different output script name. If
you prefer not to launch the script right away, use

.. code:: bash

   appt_tools [OPTIONS] mpirunx <MPIRUN-ARGUMENTS>

which will only create ``mpirun.sh``. The helper-script ``build.sh`` contains
two examples:

.. code:: bash

   #
   # appt_tools - mpirun Example 1: mpi4py.bench
   m=67108868 # max packet size for benchmark test
   # writes a bash script mpi4py.bench.sh and launches it
   appt_tools -V -o mpi4py.bench.sh mpirun -n 2 python -m mpi4py.bench pingpong --max-size $m
   #
   # appt_tools - mpirun Example 2: simple MPI ring-communication test
   # writes a bash script mpirun.sh and launches it
   exe=/projects/sw/Apptainer/usr/bin/mpiinitst.4.1.5-GCC-12.3.0.exe
   appt_tools -V mpirun -n 2 $exe

which you can run with ``./build.sh r1`` and ``./build.sh r2``, respectively.

appt_tools runtime parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the output of ``./build.sh r1`` or ``./build.sh r2``, one notes the
settings
.. code-block:: text

   MPI-module: mpi/OpenMPI/4.1.5-GCC-12.3.0 (already-loaded MPI-module mpi/* in parent-shell)
   MPI_DIR:    /projects/.../OpenMPI/4.1.5-GCC-12.3.0 (module prepend-path PATH)
   SIFFILE:    ./centos_stream9+mpi4py.sif (Apptainer/Singularity-image-format file)

``MPI_DIR`` is required to mount the host's MPI into Apptainer and is
evaluated from the setting for ``MPI-module``. ``SIFFILE`` provides the image
file (``<IMAGE>``) for the ``apptainer exec`` command. Different options exist
for providing these values to ``appt_tools``, as outlined in the following.

appt_tools configuration (\*.ini) file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section describes runtime parameters that are needed when launching
*apptainerized* MPI programs. Different ways exist for providing runtime
parameters to ``appt_tools``. All runtime parameters have in common that they
can be set via command-line arguments or configuration (``*.ini``) file
settings. The ``appt_tools`` configuration (``*.ini``) file format is not
exactly the same as the common INI file format, as it has the general format

::

   [VAR] # comment
   value

or

::

   [VAR] # comment
   list of values
   ...

[mod]: MPI-module / MPI_DIR
^^^^^^^^^^^^^^^^^^^^^^^^^^^

| The host's MPI-module and its associated bind/mount directory ``MPI_DIR`` can be provided in different ways. The following 5 ways are evaluated in the listed order until a setting for MPI-module is found.

- Command-line option ``-m`` / ``--module``: You can provide the MPI-module
  via the command line:

  .. code-block:: bash

     appt_tools -m mpi/OpenMPI/4.1.5-GCC-12.3.0 mpirun ...

  This will let ``appt_tools`` extract ``MPI_DIR`` from the PATH information
  of the specified module, here ``mpi/OpenMPI/4.1.5-GCC-12.3.0``.

- ``module load`` line present in **slurm** file: Only applicable for runs
  that involve the ``sbatch``/``sbatchx`` subcommand. See the ``appt_tools``
  documentation, to be invoked via ``appt_tools -doc``.

- Setting ``[mod]`` in the ``appt_tools`` configuration file.
  The configuration file is a file with the file suffix ``ini``. An adequate
  entry would then be

  .. code-block:: text

     [mod] # MPI-module
     mpi/OpenMPI/4.1.5-GCC-12.3.0

- MPI-module already loaded: If the above steps do not produce a setting for
  an MPI-module, the next attempt consists of figuring out whether an adequate
  MPI-module is already loaded in the current environment. This is done
  internally via the ``module list`` command.

- Environment variable ``MOD_MPI``: The last option for providing MPI-module
  information consists of a variable setting like (in bash)

  .. code:: bash

     export MOD_MPI=mpi/OpenMPI/4.1.5-GCC-12.3.0

  If always working with the same MPI-module, setting ``MOD_MPI`` in your
  shell initialization file (e.g., ``~/.bashrc``) might be useful.

[sif]: IMAGE file (\*.sif)
^^^^^^^^^^^^^^^^^^^^^^^^^^

| Setting the Apptainer image file (``<IMAGE>``) can also be done in different ways, which are evaluated in the following order.

- Command-line option ``-s`` / ``--sif``: You can provide an absolute or
  relative file path for ``<IMAGE>``:

  .. code-block:: bash

     appt_tools -s /projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif mpirun ...

- Setting ``[sif]`` in the ``appt_tools`` configuration (``*.ini``) file. An
  adequate entry would be

  .. code-block:: text

     [sif] # SIFFILE
     /projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif

- Find the newest (if multiple files are present) ``*.sif`` file in the
  current directory.

- Environment variable ``APT_SIFFILE``: The last option for providing
  ``SIFFILE`` consists of a variable setting like (in bash)

  .. code-block:: bash

     export APT_SIFFILE=/path/to/your/file.sif

  If repeatedly working with the same ``SIFFILE``, setting ``APT_SIFFILE`` in
  your shell initialization file (e.g., ``~/.bashrc``) might be useful.
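To illustrate the non-standard ``*.ini`` format described above, here is a minimal Python sketch of a parser for it. This is an assumption inferred from the examples shown (a ``[VAR] # comment`` header line followed by one or more value lines); the actual ``appt_tools`` parser may behave differently:

```python
# Minimal sketch of a parser for the appt_tools-style config format shown
# above. This is an illustration inferred from the documented examples; the
# real appt_tools parser may differ.

def parse_appt_ini(text):
    config = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("["):
            # header like "[mod] # MPI-module": keep only the section name
            current = line.split("]", 1)[0].lstrip("[")
            config[current] = []
        elif current is not None:
            config[current].append(line)
    return config

sample = """\
[mod] # MPI-module
mpi/OpenMPI/4.1.5-GCC-12.3.0
[sif] # SIFFILE
/projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif
"""
print(parse_appt_ini(sample))
```

Each section is parsed into a list of values, matching the "list of values" variant of the format; a single-value entry is simply a one-element list.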